Present: Jean-Jacques Blaising, German
Cancio (secretary), Wisla Carena, Matthias Kasemann (chair), Eric Lançon,
Gerhard Raven (via VRVS), Les Robertson, Albert de Roeck, Jim Shank (via VRVS)
Invited: Frederic Hemmer, Massimo Lamanna
Organisational matters
News from the PEB
Middleware Area Review
gLite Middleware Status (Frederic Hemmer)
ARDA (Massimo Lamanna)
AOB
- The previous minutes (link)
were approved.
- The next SC2 meetings are on Friday
November 25 (Q3 Status Report review) and on Friday December 16 (Fabric
Area review).
- For the Q3 status report review, the usual procedure applies: after reviewing the quarterly status report, all SC2 members should raise questions and concerns (in particular on their assigned godparent sections) and send them to the area manager, with Cc to Les and Matthias.
- The
last POB meeting was held on September 19. The
slides presented by Matthias can be found here.
- Les reports that the management structure for LCG Phase II is now being set up. A first Management Board (MB) meeting is planned for October 23. The MB will supersede the PEB. In addition to the LCG Project Leader and the LCG Area Managers, the MB will include the Technical Tier-1 Center Managers and the Experiment Computing Coordinators. The exact composition can be found here. In order to ease decision making, it is planned to hold weekly meetings, both face-to-face and by phone.
- A new Working Group will be created to address reporting, monitoring and internal reviewing in LCG. It will look at how to improve reporting procedures, in particular by taking the experiments and regional centers into account. Its members are Lothar Bauerdick, Dominique Boutigny, Dario Barberis, and David Britton, an external consultant from GridPP. The WG will be coordinated by Alberto Aimar, who replaces Juergen Knobloch as LCG Planning Officer. He will contact the SC2 members to collect input and ideas.
- Service Challenges: All experiments have now started in SC3. LHCb is running smoothly. CMS experienced problems with CASTOR2, which are now fixed apart from two exceptions. The rate of problem fixing in CASTOR2 is therefore considered good. However, CASTOR2 has not yet been sufficiently tested by the experiments: the migration of production activities to CASTOR2 is behind schedule, since the experiments cannot find the necessary time. Until the beginning of September, the only serious tests were those performed by the ALICE Data Challenges.
- A joint EGEE/LCG/OSG Operations Workshop was held at the end of September in Abingdon (link). One result of this workshop was a basic agreement on how to measure availability and reliability. Service availability monitoring will be implemented as part of the Site Functional Tests (SFT) framework, which allows tests to be run automatically and their results to be fed into the BDII information system. Experiments will be able to add their own tests to SFT. The next step is to define first which services need to be run by the sites, and then the tests and algorithms that turn test results into availability figures. The aim is to finish this work within the next month and to reach an agreement at the Grid Deployment Board. The resulting availability numbers will be compared with those initially defined in the MoU. A dashboard will be created to visualize the status of each participating site. Experiments will also be able to use availability/reliability information within their own frameworks for site selection and job scheduling.
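The planned computation sketched above (per-site test results reduced to an availability figure, to be compared with MoU targets) could look roughly like the following. The test names, the set of "critical" tests, and the data layout are illustrative assumptions, not the actual SFT/GDB definitions.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    """One SFT-style test outcome; fields are illustrative assumptions."""
    site: str
    test: str       # e.g. "job-submit", "replica-copy"
    passed: bool

# Hypothetical set of tests counted towards availability.
CRITICAL_TESTS = {"job-submit", "replica-copy", "bdii-query"}

def site_availability(results, site):
    """Fraction of critical-test runs that passed for one site."""
    relevant = [r for r in results if r.site == site and r.test in CRITICAL_TESTS]
    if not relevant:
        return 0.0  # no data: conservatively count the site as unavailable
    return sum(r.passed for r in relevant) / len(relevant)
```

A figure computed this way could then be checked against an MoU-style target, e.g. `site_availability(results, "CERN") >= 0.95`.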
- Answering a question by Matthias, Les explains that standard service test failures reported by SFT will be dealt with by the operations team. Failure analysis becomes more complicated for application jobs, in particular when no errors are reported for the standard services. However, test jobs should act as a reasonable benchmark. SFT tests will be run at both Tier-1 and Tier-2 centres.
- New milestones are
being drafted for Phase-II (link).
These milestones should be visible to the LHCC referees. Also, service
success criteria (e.g. for Service Challenge 4) are being defined as sets
of targets that must be met by a number of sites.
- The Architects Forum and the PEB (and therefore all the experiments) agreed on the Phase-II plan for the Applications Area.
- The next LHC Computing Comprehensive
Review will take place on November 14-15 (link).
gLite Middleware Status (Frederic Hemmer)
Frederic’s
slides on the gLite status can be found here.
Discussion/Comments:
- Slide 3:
- Deployment and Operation activities
are outside the gLite process and therefore in a different color.
- The EMT (Engineering Management Team)
consists of the JRA1 management and representatives from each development
cluster.
- Slide 7: The rows in grey denote
completed milestones. Light grey stands for milestones under review.
- Slide 8: The releases in light yellow
(1.1.1 and 1.1.2) were internal. Frederic points out the significant
amount of non-automatable manual work for producing the documentation of
each release: installation notes, known issues, bugfixes,
per-package documentation including changes, dependencies, open/closed
bugs etc.
- Pre-Production Service (PPS): The PPS currently runs a mixture of gLite 1.2 and 1.3; release 1.4 is being certified. It takes considerable time to test, certify and deploy new releases onto all instances of the PPS. Massimo explains that end-users have only recently gained access to the PPS, and the number of users is still small (of the order of tens). Frederic adds that the idea behind the PPS is to provide an exercise facility not only for end-users but also for site administrators. The time to install the PPS varies significantly from site to site (from one hour to ten days).
- Milestones (slides 9-13): Frederic points out that the work-plan milestones are very detailed and of a technical nature. Adding performance milestones would be a possibility, but it does not always make sense, e.g. for security-related work. In addition to the release number, Frederic agrees to add a short description of the functionality to the milestones provided in the Quarterly Report.
- Baseline Services (slide 14):
- Even though Fireman will not be used by the LHC experiments, it is required by non-HEP communities such as Biomed or the DILIGENT project. Answering a question by Jean-Jacques, Frederic explains that only a small amount of manpower is invested in Fireman.
- Frederic highlights
that OSG is planning to use the Condor-G based CE, which is developed by the gLite team.
- There is an incremental process for including gLite modules in the list of Baseline Services. Modules must first pass tests on the Pre-Production Service and can then be integrated into the LCG distribution. The gLite CE will become part of LCG-2 once its compatibility with the LCG Worker Node is verified.
- Other services (slide 15): The File Placement Service is a layer on top of the FTS and the Catalogue. The G-Pbox policy engine allows fine-grained ACLs and priorities to be defined within a given Virtual Organisation.
- Testing/certification plans (slide 19):
Currently, the plans are formally defined only within the lifetime of EGEE phase I. The final release (1.5) is scheduled for the end of December. Bug fixing, integration and testing activities will continue during the transition phase between EGEE-I and EGEE-II.
- Integration/testing (slide 20): The
Integration/Testing team consists of 4 testers
and 4 integrators, all based at CERN. Replying to a question by
Jean-Jacques, Frederic states that there has
been a significant improvement in software quality over the last year.
Improvements in EGEE-II will be the merging of LCG certification and EGEE
integration/testing activities on one hand, and building on work done by
the ETICS project (link) on
the other.
- EGEE-II management changes (slide 28): Frederic points out that even though the JRA1 leadership will be transferred to INFN, the new JRA1 leader will spend most of his time at CERN.
- Concerns and risks (slides 31-33):
- The existence of two incompatible versions of RFIO is a particular source of frustration, because it complicates switching between production data on CASTOR and the pre-production service using DPM. Moreover, it obliges developers/users to set up two different instances of gLite on two servers.
- Les explains that the VOMS-related problems were discussed in detail during the last EGEE conference. INFN will address this issue at the beginning of next year. From a service perspective, any split of VOMS would be very unfortunate for LCG.
- The mitigation plan for the different concerns and risks varies from case to case. Issues around VOMS and RFIO have been escalated to the appropriate management levels in SA3 and GD/FIO, respectively. The different mechanisms used for software configuration (Yaim in LCG, gLite in EGEE) will be discussed at the next EGEE conference in Pisa. For the integration and testing process, the focus is on ensuring that the process always stays in place and that any attempt to bypass it is rejected.
- The newly appointed Technical Director
(E. Laure), who was previously the deputy JRA1
Manager, is aware of the managerial shortcomings pointed out by Frederic.
- Frederic’s replies to Jim and Tony’s
comments regarding the LCG 2nd Quarterly report are found on slides 34-37. A preview of the
contributions for the next Quarterly Report is found
on slides 38-39.
- During November/December, Frederic
will gradually step down from his function as JRA1 Manager in
order to take over his new role as CERN-IT Deputy Department Head while
ensuring a smooth transition in the EGEE management.
ARDA (Massimo Lamanna)
Massimo’s slides on the ARDA status can be found here.
Discussion/Comments:
- Slide 7: Answering a question by Matthias, Massimo explains that in many cases, components of the development testbed are tested even before the official JRA1 integration/testing phase starts. This is supported by the JRA1 management, since it allows early feedback to be provided on the usability of new developments.
- Slide 9: The Pre-Production Service is distributed
over several sites. The main ones are CERN and a number of Tier-1/2 sites
including CNAF and PIC. (Details can be found on this slide).
- Slide 12: Les explains that the GAG (Grid Applications Group, link) was stopped at the beginning of December and replaced by the Baseline Services Working Group.
- Slide 14: Answering a question by Jean-Jacques on AMGA (ARDA Metadata Catalogue Project), Massimo explains that AMGA incorporates ideas from several sources, but is a new product providing previously missing features, such as a common, standardized interface for all applications. The allocated manpower is one team member plus an externally funded student. Massimo reports that LHCb is actively using AMGA, which is also used internally by GANGA. Frederic adds that the biomed community is using AMGA as well.
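As a rough illustration of what a metadata catalogue with a common interface offers, here is a minimal in-memory sketch; the method names loosely echo the style of AMGA-like command sets but are our own assumptions, not AMGA's actual API.

```python
class MetadataCatalogue:
    """Minimal in-memory sketch of a metadata-catalogue interface.

    Logical entries (e.g. file names) carry key/value attributes that
    applications can set and query through one common interface.
    """

    def __init__(self):
        self._entries = {}  # logical name -> {attribute: value}

    def addentry(self, name):
        self._entries.setdefault(name, {})

    def setattr(self, name, attr, value):
        self._entries[name][attr] = value

    def getattr(self, name, attr):
        return self._entries[name][attr]

    def query(self, attr, value):
        """Return all entry names whose attribute equals the given value."""
        return sorted(n for n, attrs in self._entries.items()
                      if attrs.get(attr) == value)
```

The point of such a layer is that every application (here, any caller) talks to the same small interface instead of each inventing its own metadata scheme.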
- Slide 21: Massimo explains that GANGA is a front-end for specifying and controlling
jobs. It supports a variety of back-ends (local computer, batch systems,
grid systems). GANGA takes care of
translating job descriptions, splitting and merging jobs, and status
monitoring. AMGA is used internally for
storing job status information in the so-called job repository.
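The front-end idea described above (one job description, interchangeable back-ends, job splitting) can be sketched roughly as follows; all class and method names are illustrative assumptions, not GANGA's actual API.

```python
class LocalBackend:
    def submit(self, executable, args):
        # A real back-end would fork a process or talk to a batch/grid
        # system; here we just record what would run.
        return f"local:{executable} {' '.join(args)}"

class BatchBackend:
    def submit(self, executable, args):
        return f"batch:{executable} {' '.join(args)}"

class Job:
    """A job description that is independent of where it will run."""

    def __init__(self, executable, inputs, backend):
        self.executable = executable
        self.inputs = inputs    # input files, passed as arguments on submit
        self.backend = backend  # any object with a submit() method

    def split(self, n):
        """Split one job into up to n subjobs over disjoint input slices."""
        chunks = [self.inputs[i::n] for i in range(n)]
        return [Job(self.executable, c, self.backend) for c in chunks if c]

    def submit(self):
        # The back-end, not the job, decides how submission happens.
        return self.backend.submit(self.executable, self.inputs)
```

Swapping `LocalBackend()` for `BatchBackend()` changes where the job runs without touching the job description, which is the design point the slide makes.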
- Slide 25: The Access Library is a
generic library developed by the ARDA team that caches user authentication
tokens. As this avoids repeated re-authentications to different services
(using a shared secret between a proxy client and server), performance is
improved.
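The caching scheme described above can be sketched as follows; the interface and the TTL-based expiry are assumptions for illustration, not the Access Library's actual design.

```python
import time

class TokenCache:
    """Authenticate once per service, then reuse the token until it expires.

    The `authenticate` callback stands in for the expensive handshake
    (e.g. the shared-secret exchange between proxy client and server).
    """

    def __init__(self, authenticate, ttl=600.0):
        self._authenticate = authenticate  # service name -> fresh token
        self._ttl = ttl                    # token lifetime in seconds
        self._cache = {}                   # service -> (token, expiry time)

    def token_for(self, service, now=None):
        now = time.monotonic() if now is None else now
        entry = self._cache.get(service)
        if entry and entry[1] > now:
            return entry[0]                # cache hit: no re-authentication
        token = self._authenticate(service)
        self._cache[service] = (token, now + self._ttl)
        return token
```

The performance gain comes precisely from the cache-hit branch: repeated calls to the same service within the token lifetime never touch the authentication machinery again.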
- Slide 27: The depicted catalogue is the
one belonging to AliEn, but an interface to LFC exists as well.
- Slide 28: Massimo points out that the
CMS dashboard helped to locate data leaks in R-GMA, since
status/monitoring information is first collected from
multiple sources and then compared.
- Slide 33: The ARDA project is defined
until the end of phase I of EGEE. Milestones for phase-II are not yet
formalized.
- EGEE-I review: Frederic and Massimo
explain that the project review will focus on Middleware, Applications and
Grid Operations.