Present: Jean-Jacques Blaising, German
Cancio (secretary), Wisla Carena, Matthias Kasemann (chair), Eric Lançon,
Gerhard Raven (via VRVS), Les Robertson, Albert de Roeck, Jim Shank (via VRVS)
Invited: Frederic Hemmer, Massimo Lamanna
Organisational matters
News from the PEB
Middleware Area Review
gLite Middleware Status (Frederic Hemmer)
ARDA (Massimo Lamanna)
AOB
- The previous minutes (link)
were approved.
- The next SC2 meetings are on Friday
November 25 (Q3 Status Report review) and on Friday December 16 (Fabric
Area review).
- For the Q3 status report review, the usual procedure applies: after reviewing the quarterly status report, all SC2 members should raise questions and concerns (in particular on their assigned godparent sections) and send them to the area manager, with Cc to Les and Matthias.
- The
last POB meeting was held on September 19. The
slides presented by Matthias can be found here.
- Les reports that the management structure for LCG Phase II is now being set up. A first Management Board (MB) meeting is planned for October 23. The MB will supersede the PEB. In addition to the LCG Project Leader and the LCG Area Managers, the MB will include the Technical Tier-1 Center Managers and the Experiment Computing Coordinators. The exact composition can be found here. In order to ease decision making, it is planned to hold weekly meetings, both face-to-face and by phone.
- A new Working Group will be created to address reporting, monitoring and internal reviewing in LCG. It will look at how to improve reporting procedures, in particular by taking the experiments and regional centers into account. Its members are Lothar Bauerdick, Dominique Boutigny, Dario Barberis, and David Britton, an external consultant from GridPP. The WG will be coordinated by Alberto Aimar, who replaces Juergen Knobloch as LCG Planning Officer. He will contact the SC2 members to collect input and ideas.
- Service Challenges: All experiments have now started in SC3. LHCb is running smoothly. CMS experienced problems with CASTOR2, which are now fixed apart from two exceptions. The rate of problem fixing in CASTOR2 is therefore considered good. However, CASTOR2 has not yet been sufficiently tested by the experiments: the migration of production activities to CASTOR2 is behind schedule, since the experiments cannot find the necessary time. Until the beginning of September, the only serious tests were those performed by the ALICE Data Challenges.
- A joint EGEE/LCG/OSG Operations Workshop was held at the end of September in Abingdon (link). One result of this workshop was a basic agreement on how to measure availability and reliability. Service availability monitoring will be implemented as part of the Site Functional Tests (SFT) framework, which allows tests to be run automatically and their results to be fed into the BDII information system. Experiments will be able to add their own tests to SFT. The next step is to define first which services need to be run by the sites, and then the tests and algorithms that turn test results into availability figures. The aim is to finish this work within the next month and to reach an agreement at the Grid Deployment Board. The resulting availability numbers will be compared with those initially defined in the MoU. A dashboard will be created to visualize the status of each participating site. Experiments will also be able to use availability/reliability information within their own frameworks for site selection and job scheduling.
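The planned computation sketched above (per-site test results reduced to an availability figure, to be compared with MoU targets) could look roughly like the following. The test names, the set of "critical" tests, and the data layout are illustrative assumptions, not the actual SFT/GDB definitions.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """One SFT-style test outcome; fields are illustrative assumptions."""
    site: str
    test: str       # e.g. "job-submit", "replica-copy"
    passed: bool

# Hypothetical set of tests counted towards availability.
CRITICAL_TESTS = {"job-submit", "replica-copy", "bdii-query"}

def site_availability(results, site):
    """Fraction of critical-test runs that passed for one site."""
    relevant = [r for r in results if r.site == site and r.test in CRITICAL_TESTS]
    if not relevant:
        return 0.0  # no data: conservatively count the site as unavailable
    return sum(r.passed for r in relevant) / len(relevant)
```

A figure computed this way could then be checked against an MoU-style target, e.g. `site_availability(results, "CERN") >= 0.95`.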
- Answering a question by Matthias, Les explains that standard service test failures reported by SFT will be dealt with by the operations team. Failure analysis becomes more complicated for application jobs, in particular when no errors are reported for the standard services. However, test jobs should act as a reasonable benchmark. SFT tests will be run at both Tier-1 and Tier-2 centres.
- New milestones are
being drafted for Phase-II (link).
These milestones should be visible to the LHCC referees. Also, service
success criteria (e.g. for Service Challenge 4) are being defined as sets
of targets that must be met by a number of sites.
- The Architects Forum and the PEB (and therefore all the experiments) agreed on the Phase-II plan for the Applications Area.
- The next LHC Computing Comprehensive
Review will take place on November 14-15 (link).
gLite Middleware Status (Frederic Hemmer)
Frederic’s
slides on the gLite status can be found here.
Discussion/Comments:
- Slide 3:
- Deployment and Operation activities
are outside the gLite process and therefore in a different color.
- The EMT (Engineering Management Team)
consists of the JRA1 management and representatives from each development
cluster.
- Slide 7: The rows in grey denote
completed milestones. Light grey stands for milestones under review.
- Slide 8: The releases in light yellow
(1.1.1 and 1.1.2) were internal. Frederic points out the significant
amount of non-automatable manual work for producing the documentation of
each release: installation notes, known issues, bugfixes,
per-package documentation including changes, dependencies, open/closed
bugs etc.
- Pre-Production Service (PPS): The PPS currently runs a mixture of gLite 1.2 and 1.3; release 1.4 is being certified. It takes considerable time to test, certify and deploy new releases onto all instances of the PPS. Massimo explains that end-users have only recently gained access to the PPS, and the number of users is still small (of the order of tens). Frederic adds that the idea behind the PPS is to provide an exercise facility not only for end-users but also for site administrators. The time to install the PPS varies significantly from site to site (from one hour to ten days).
- Milestones (slides 9-13): Frederic points out that the work-plan milestones are very detailed and of a technical nature. Adding performance milestones would be a possibility, but it does not always make sense, e.g. for security-related work. In addition to the release number, Frederic agrees to add a short description of the functionality to the milestones provided in the Quarterly Report.
- Baseline Services (slide 14):
- Even though Fireman will not be used by the LHC experiments, it is required by non-HEP communities such as Biomed or the DILIGENT project. Answering a question by Jean-Jacques, Frederic explains that only a small amount of manpower is invested in Fireman.
- Frederic highlights
that OSG is planning to use the Condor-G based CE, which is developed by the gLite team.
- There is an incremental process for including gLite modules in the list of Baseline Services. Modules must first pass tests on the Pre-Production Service and can then be integrated into the LCG distribution. The gLite CE will become part of LCG-2 once its compatibility with the LCG Worker Node is verified.
- Other services (slide 15): The File Placement Service is a layer on top of the FTS and the Catalogue. The G-Pbox policy engine allows fine-grained ACLs and priorities to be defined within a given Virtual Organisation.
- Testing/certification plans (slide 19):
Currently, the plans are formally defined only within the lifetime of EGEE phase I. The final release (1.5) is scheduled for the end of December. Bug fixing, integration and testing activities will continue during the transition phase between EGEE-I and EGEE-II.
- Integration/testing (slide 20): The
Integration/Testing team consists of 4 testers
and 4 integrators, all based at CERN. Replying to a question by
Jean-Jacques, Frederic states that there has
been a significant improvement in software quality over the last year.
Improvements in EGEE-II will be the merging of LCG certification and EGEE
integration/testing activities on one hand, and building on work done by
the ETICS project (link) on
the other.
- EGEE-II management changes (slide 28): Frederic points out that even though the JRA1 leadership will be transferred to INFN, the new JRA1 leader will spend most of his time at CERN.
- Concerns and risks (slides 31-33):
- The existence of two incompatible versions of RFIO is a particular source of frustration, because it complicates switching between production data on CASTOR and the pre-production service using DPM. Moreover, it obliges developers/users to set up two different instances of gLite on two servers.
- Les explains that the VOMS-related problems were discussed in detail during the last EGEE conference. INFN will address this issue at the beginning of next year. From a service perspective, any split of VOMS would be very unfortunate for LCG.
- The mitigation plan for the different concerns and risks varies from case to case. Issues around VOMS and RFIO have been escalated to the appropriate management levels in SA3 and GD/FIO, respectively. The different mechanisms used for software configuration (Yaim in LCG, gLite in EGEE) will be discussed at the next EGEE conference in Pisa. For the integration and testing process, the focus is on ensuring that the process always stays in place and that any attempt to bypass it is rejected.
- The newly appointed Technical Director
(E. Laure), who was previously the deputy JRA1
Manager, is aware of the managerial shortcomings pointed out by Frederic.
- Frederic’s replies to Jim and Tony’s
comments regarding the LCG 2nd Quarterly report are found on slides 34-37. A preview of the
contributions for the next Quarterly Report is found
on slides 38-39.
- During November/December, Frederic
will gradually step down from his function as JRA1 Manager in
order to take over his new role as CERN-IT Deputy Department Head while
ensuring a smooth transition in the EGEE management.
ARDA (Massimo Lamanna)
Massimo’s slides on the ARDA status can be found here.
Discussion/Comments:
- Slide 7: Answering a question by Matthias, Massimo explains that in many cases, components of the development testbed are tested even before the official JRA1 integration/testing phase starts. This is supported by the JRA1 management, since it allows early feedback to be provided on the usability of new developments.
- Slide 9: The Pre-Production Service is distributed
over several sites. The main ones are CERN and a number of Tier-1/2 sites
including CNAF and PIC. (Details can be found on this slide).
- Slide 12: Les explains that the GAG (Grid Applications Group, link) was stopped at the beginning of December and replaced by the Baseline Services Working Group.
- Slide 14: Answering a question by Jean-Jacques on AMGA (ARDA Metadata Catalogue Project), Massimo explains that AMGA incorporates ideas from several sources, but is a new product providing previously missing features, such as a common, standardized interface for all applications. The allocated manpower is one team member plus an externally funded student. Massimo reports that LHCb is actively using AMGA, which is also used internally by GANGA. Frederic adds that the biomed community is using AMGA as well.
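As a rough illustration of what a metadata catalogue with a common interface offers, here is a minimal in-memory sketch; the method names loosely echo the style of AMGA-like command sets but are our own assumptions, not AMGA's actual API.

```python
class MetadataCatalogue:
    """Minimal in-memory sketch of a metadata-catalogue interface.

    Logical entries (e.g. file names) carry key/value attributes that
    applications can set and query through one common interface.
    """

    def __init__(self):
        self._entries = {}  # logical name -> {attribute: value}

    def addentry(self, name):
        self._entries.setdefault(name, {})

    def setattr(self, name, attr, value):
        self._entries[name][attr] = value

    def getattr(self, name, attr):
        return self._entries[name][attr]

    def query(self, attr, value):
        """Return all entry names whose attribute equals the given value."""
        return sorted(n for n, attrs in self._entries.items()
                      if attrs.get(attr) == value)
```

The point of such a layer is that every application (here, any caller) talks to the same small interface instead of each inventing its own metadata scheme.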
- Slide 21: Massimo explains that GANGA is a front-end for specifying and controlling
jobs. It supports a variety of back-ends (local computer, batch systems,
grid systems). GANGA takes care of
translating job descriptions, splitting and merging jobs, and status
monitoring. AMGA is used internally for
storing job status information in the so-called job repository.
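The front-end idea described above (one job description, interchangeable back-ends, job splitting) can be sketched roughly as follows; all class and method names are illustrative assumptions, not GANGA's actual API.

```python
class LocalBackend:
    def submit(self, executable, args):
        # A real back-end would fork a process or talk to a batch/grid
        # system; here we just record what would run.
        return f"local:{executable} {' '.join(args)}"

class BatchBackend:
    def submit(self, executable, args):
        return f"batch:{executable} {' '.join(args)}"

class Job:
    """A job description that is independent of where it will run."""

    def __init__(self, executable, inputs, backend):
        self.executable = executable
        self.inputs = inputs    # input files, passed as arguments on submit
        self.backend = backend  # any object with a submit() method

    def split(self, n):
        """Split one job into up to n subjobs over disjoint input slices."""
        chunks = [self.inputs[i::n] for i in range(n)]
        return [Job(self.executable, c, self.backend) for c in chunks if c]

    def submit(self):
        # The back-end, not the job, decides how submission happens.
        return self.backend.submit(self.executable, self.inputs)
```

Swapping `LocalBackend()` for `BatchBackend()` changes where the job runs without touching the job description, which is the design point the slide makes.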
- Slide 25: The Access Library is a
generic library developed by the ARDA team that caches user authentication
tokens. As this avoids repeated re-authentications to different services
(using a shared secret between a proxy client and server), performance is
improved.
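The caching scheme described above can be sketched as follows; the interface and the TTL-based expiry are assumptions for illustration, not the Access Library's actual design.

```python
import time

class TokenCache:
    """Authenticate once per service, then reuse the token until it expires.

    The `authenticate` callback stands in for the expensive handshake
    (e.g. the shared-secret exchange between proxy client and server).
    """

    def __init__(self, authenticate, ttl=600.0):
        self._authenticate = authenticate  # service name -> fresh token
        self._ttl = ttl                    # token lifetime in seconds
        self._cache = {}                   # service -> (token, expiry time)

    def token_for(self, service, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(service)
        if entry and entry[1] > now:
            return entry[0]                # cache hit: no re-authentication
        token = self._authenticate(service)
        self._cache[service] = (token, now + self._ttl)
        return token
```

The performance gain comes precisely from the cache-hit branch: repeated calls to the same service within the token lifetime never touch the authentication machinery again.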
- Slide 27: The depicted catalogue is the
one belonging to AliEn, but an interface to LFC exists as well.
- Slide 28: Massimo points out that the
CMS dashboard helped to locate data leaks in R-GMA, since
status/monitoring information is first collected from
multiple sources and then compared.
- Slide 33: The ARDA project is defined
until the end of phase I of EGEE. Milestones for phase-II are not yet
formalized.
- EGEE-I review: Frederic and Massimo
explain that the project review will focus on Middleware, Applications and
Grid Operations.