Notes from the WM TEG meeting, Oct 25, 2011
Davide Salomoni & Torre Wenaus

==== Attendees

CERN:
Maarten Litmaath - security
Andrey Kiryanov - IT, pilot factory prototype
Pablo Saiz - Alice, Alien
Ulrich Schwickerath - lxcloud, virtualization
Ricardo Graciani - Barcelona, lhcb
Xiaomei Zhang - Beijing, Dirac

EVO:
Torre Wenaus - atlas/panda
Davide Salomoni - infn
Lawrence Field - information systems
Oxana Smirnova - ndgf tier1 and nordic tier2s
Di Qing - certification testing, T1 testing, grid middleware, grid sites
Igor Sfiligoi - cms, glideinwms
Andreas Heiss - kit
Federico Stagni - lhcb, Dirac
Claudio Grandi - CMS services, grid integration, WLCG, middleware engineering
Peter Solagna - egi, ops policy, INSPIRE, middleware reqmts gathering from ops community. EGI ops perspective.
Rod Walker - atlas (production, panda, tier-2)
Marco Cecchi - INFN, gLite WMS, compute area coordinator CREAM and WMS
Steffen Schreiner - security
Jhen-Wei Huang - Taipei, ASGC T1 service mgr, ATLAS
Ricardo Silva - CERN site, batch systems, CE

Phone bridge:
Burt Holzman - Fermilab, OSG info services, US CMS grid services, T1 facility mgr, glideinWMS project mgr

==== Meeting notes

On the TEG mandate (Davide):

The reason for these TEGs is, after the successful start up of LHC data taking and of the related software frameworks, the desire to come up with a WLCG long-term strategy. A few considerations worth mentioning:

- some middleware components were seen as not suitable for LCG needs, so experiments developed their own solutions. A lot of the commonality has disappeared. This is typically an important source of problems for multi-experiment sites.
- pilot job frameworks simplified the requirements placed on components such as the WMS and to some extent the CE
- failures have been shown to be more at the service or infrastructure level, rather than at the middleware level. Can we identify where the source of most of these problems is?
- sustainability.
  We should remove dependence on a single provider and ensure that there will be a path for the future. We need to clearly state to software providers what our needs are. Can we use (or use more) standard, existing solutions rather than inventing our own?

Very important: any changes brought about by a reassessment of the current status must first and foremost not disrupt the existing situation.

Try to bring out some common themes, in particular:

- commonalities between experiments
- commonalities between grid infrastructures
- commonalities between site requirements
- keep in mind operational, deployment and security issues.

Deliverables we are asked to produce:

1) an assessment of the current situation. This should cover:
   -- middleware
   -- operation
   -- support
2) a strategy document defining
   -- the needs
   -- a plan to satisfy these needs for the next 2 to 5 years.

Torre: the task is to look 2 to 5 years in the future. However, many components are around us now, and very likely the components that will be production quality in 2 to 5 years are visible now? How specific should we be on tools and technologies? (e.g. glideinwms, condor - others)

Davide: perhaps the scope of our deliverable is not so much to give a precise definition of what will or will not make it (but if we agree that something will be in, by all means let's mention it, or request it from the technology providers), but rather to look at what we have now and (also based on that) deliver a vision of our needs for the next X years.

==== Discussions on topics:

*pilots and frameworks*

There was agreement to present at the f2f meeting the key points of each framework. This should not focus on the actual implementations but rather on the perspective gained on WM from experience with the framework, e.g.
- why it was necessary to create such frameworks and what did not really work with the middleware (Davide's note: Ian's document mentioned "real or perceived weaknesses" - we should try to understand what was a real and what was a perceived weakness, and why it was perceived as such)
- how are pilots submitted? Through WMS or something else? What are the real needs in terms of a WMS?
- how is scheduling done at a site?
- how does an internal workload management system work?
- what is the interaction with information systems? What are they used for? What is missing?
- how is security dealt with?
- what are the cons of the framework? From an experiment perspective? From a site perspective? (the latter should come up in the discussion)
- what is the use of / the need for a Computing Element?
- where is there potential for commonality?

Torre, Igor - It's important that frameworks improve transparency to the site and make information from the frameworks available to the site.

Claudio - We should understand the philosophy behind the choices made in developing the frameworks. Are pilots the right way, or the expedient way for the current infrastructure and boundary conditions? Extract the reasons behind the choices.

Andrey - The grid pipeline is too expensive for experiments. Trying to work out a common way of doing pilot injection. The experiment just sees a pilot job running on the WN.

* resource allocations and resource management *

Mostly covered by the above. In general, we'd need to define what we mean by "CE". For example, a CE is the standard interface to site resources used by customers. At the site boundary, the CE deals with authorization and accepting workloads. We will continue to need this. Today a CE is bound to the concept of "job", but this may not be the case tomorrow. Does this change the needs / requirements of a CE (whether one calls it a CE or something else)?

How do pilots deal with more complex jobs, e.g. parallel jobs? What is the connection to the underlying CE and WMS?

* Use of information services *

See above.
They should be covered in the frameworks section.

* Security models *

What is the responsibility? How is it shared between experiment and sites? Now that resource allocation and job execution are separated, what are the implications for security? We take many things for granted, e.g. we assume there is delegation. This may not be the case in the future. Do we really rely on / need delegation? Also, which authorization mechanisms do we want to use? We now rely on VOMS, but this is a single implementation, not a vision (and also very much bound to Grid, so far). Where is the decision taken? On the CEs or elsewhere? Why?

Igor - Proxy and payload are generally independent for the pilot frameworks. No strong correlation between them. We should expect multiple types of credentials, e.g. Amazon, Rackspace etc. Some experiments are already doing this, especially small experiments.

For the security part, we are not asked to define a full model; we are rather asked to analyze which security characteristics we (experiments, sites) would like to support. The full definition is probably best left to the security TEG.

* New computing models *

For GPUs, we would eventually need to facilitate access to them. Is this an application issue only, or should anything be done at the site or middleware levels as well? How do we define access policies?

Are there WM issues in using many-core queues? Sites have a difficult optimization problem in servicing many-core jobs while preserving full utilization. Pilots help with late binding and can accommodate heterogeneous resources. They discover the configuration when they land and are dispatched an appropriate workload. Policy on how the resource is offered has a big impact.

What is the expected use of cloud computing interfaces? How does this integrate with the existing frameworks?

==== Logistic issues

Given the tight timescale, a bi-weekly meeting at least before the holidays might be in order. We'll discuss at the f2f.

Second workshop: when? Mid-January?
Let's schedule this in advance.

For the f2f, we expect 10 to 20 people. We have a room booked on Thu from 11am to 3pm, and another from 3.30pm to 8pm. Pablo will check whether there's a room available for Thursday (early) morning. This would mean starting the f2f at e.g. 9-10 am, discussing less critical things like new computing models (use of virtualization, cloud computing) and perhaps security in the morning, and dedicating the afternoon to the discussion on frameworks. We could then start again on Friday morning and leave by mid-afternoon so that EU people may have a chance to return home on Friday night.