Present: Ian Bird, Jean-Jacques Blaising, German
Cancio (secretary), Wisla Carena, Matthias Kasemann (chair), Marcel Kunze, Eric
Lançon, Albert de Roeck, Les Robertson, Jim Shank
(via VRVS)
Apologies: Rick Cavanaugh
- The previous minutes (link)
were approved.
- The POB meeting where SC2 will report is
on June 20th.
Added after
the meeting: The SC2 slides for the POB meeting are available on the Agenda
page (link).
- The next meeting (Middleware
Area review focus) is confirmed to be on July 1st
(Agenda page).
Ian’s
slides can be found here.
- A sequence of Service Challenges (SC’s) has been scheduled for 2005-2006. The goal is to
prepare for LHC Service operation, which will start in September 2006, six
months prior to the first collisions. Service Challenge 2 (SC2) focused on data
throughput, whereas SC3 and SC4 focus on sustained operations.
- In SC3 and SC4, the results achieved in
a first throughput phase should be sustained in a
second, service-oriented phase. In SC3, experiments will be involved by
running production jobs. SC4 will include end-user analysis and should
demonstrate that all elements of the real service that will start
thereafter are understood and ready for operation.
- SC2 conclusions: Although the throughput
targets for SC2 were exceeded (>600 MB/s daily average sustained for 10 days,
against a goal of 500 MB/s), there was no automated outage recovery in place
and the data transfer service cannot yet be labeled as reliable. Setting
up the infrastructure and achieving reliable transfers requires considerable
technical work and coordination. Dedicated manpower is lacking: the people who
run the day-to-day services are often the same people who run the SC’s. A
further, non-technical problem is the multiplicity of partners, sites and time
zones involved.
- SC3: In the throughput phase, the
primary goals are to achieve sustained T0-to-T1 disk-to-disk and disk-to-tape
transfers (at 150 MB/s and 60 MB/s, respectively; the sketch below translates
these rates into daily volumes). A number of T2’s will
participate in the T2->T1 upstream transfers. The service phase will
start in September and will involve the four experiments testing their offline
computing use cases (except for analysis). All identified T1’s will
participate in managed transfers (but not necessarily during the
throughput phase).
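Added for illustration: a back-of-the-envelope conversion of the quoted SC3 throughput targets into daily volumes per site. Decimal units (1 TB = 10^6 MB) are assumed; this is only a rough sketch, not part of the SC3 planning documents.

```python
# Daily volumes implied by the SC3 throughput targets quoted above
# (150 MB/s disk-to-disk, 60 MB/s disk-to-tape); 1 TB = 10^6 MB assumed.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400 s

targets_mb_per_s = {
    "T0->T1 disk-to-disk": 150,
    "T0->T1 disk-to-tape": 60,
}

for name, rate in targets_mb_per_s.items():
    tb_per_day = rate * SECONDS_PER_DAY / 1e6
    print(f"{name}: {rate} MB/s  ~= {tb_per_day:.1f} TB/day per site")
```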
- Tier-2 sites will be involved for the
first time in Service Challenge 3. Coordination will
be handled via national organizations (e.g. GridPP). Data transfer
will be bidirectional, as Tier-2 sites will produce simulation data
that is uploaded to their corresponding T1 centres. In
principle, T2’s will be linked with their
national T1; T2’s without a national T1 need to identify which T1 they should
be linked to.
- In terms of network connectivity, SC3
will use dedicated links. T1 sites are expected to provide 10 Gb links to CERN. However, many sites are still at 1 Gb, and one site is on 600 Mb (see the comparison sketched below).
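Added for illustration: a minimal sketch comparing the nominal capacities of the links mentioned above with the 150 MB/s disk-to-disk target, assuming a fully dedicated link and ignoring protocol overhead. Even at nominal capacity, a 1 Gb link (~125 MB/s) falls short of the target.

```python
# Nominal link capacity vs. the 150 MB/s disk-to-disk target.
# Assumes a fully dedicated link and ignores protocol overhead.

TARGET_MB_PER_S = 150  # SC3 T0->T1 disk-to-disk target

links_gbit_per_s = {"10 Gb": 10.0, "1 Gb": 1.0, "600 Mb": 0.6}

for name, gbit in links_gbit_per_s.items():
    mb_per_s = gbit * 1000 / 8  # Gb/s -> MB/s (decimal units)
    ratio = mb_per_s / TARGET_MB_PER_S
    print(f"{name}: ~{mb_per_s:.0f} MB/s nominal ({ratio:.1f}x the target)")
```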
- The basic components required
for the setup phase include: a) SRM 1.1 at CERN/T0 and at all T1’s, and b) a File
Transfer Service at the T0. In the case of CMS, PhEDEx will
be used for data transfers between FNAL and PIC.
- The LCG File Catalog (LFC) will be exposed to experiment tests. The LFC is considered a basic component since most experiments
and sites will presumably prefer to offer common services.
- SRM status: Most sites have an SRM set
up in front of their production mass storage systems (mostly CASTOR and
dCache). It is understood that the SRM 1.1
specification is not sufficient; however, the full SRM 2.1 is not required either.
The “LCG-required” functionality set was agreed by the Baseline Services
Working Group and includes SRM 1.1 plus some of the 2.1 features, in
particular space management (e.g. space reservation, file pinning). For SC3,
the v1.1 functionality is sufficient, whereas for SC4 the complete LCG
functionality set is required (an illustrative sketch follows this item).
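Added for illustration: a hypothetical abstraction of the relationship between the v1.1-level operations sufficient for SC3 and the additional 2.1-style space management features required for SC4. The class and method names below are invented for this sketch and do not correspond to the actual SRM web-service interface.

```python
# Hypothetical abstraction only: names are invented and do not reflect the
# real SRM web-service interface.
from abc import ABC, abstractmethod


class SRMv11Style(ABC):
    """Baseline operations in the spirit of SRM 1.1 (sufficient for SC3)."""

    @abstractmethod
    def prepare_to_put(self, surl: str) -> str:
        """Negotiate a transfer URL for writing a file."""

    @abstractmethod
    def prepare_to_get(self, surl: str) -> str:
        """Negotiate a transfer URL for reading a file."""


class LCGRequiredSRM(SRMv11Style):
    """Adds the SRM 2.1-style space management features needed for SC4."""

    @abstractmethod
    def reserve_space(self, size_bytes: int, lifetime_s: int) -> str:
        """Reserve storage space of a given size and lifetime."""

    @abstractmethod
    def pin_file(self, surl: str, lifetime_s: int) -> None:
        """Pin a file on disk so it is not garbage-collected during use."""
```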
- Baseline services: The Baseline
Services WG has reached an understanding on what should
be regarded as baseline services. These include SRM-based storage
management, GridFTP, a file transfer service, catalogues, workload and VO
management, grid monitoring, applications software installation, and “VO
agent” frameworks for experiment-specific long-lived processes. File
Placement Services are not currently on the list but are
provided by the experiment frameworks.
Answering a question from Matthias, Ian clarifies that most of these
services will be tested in SC3 in order to
achieve stability for SC4.
- On top of the baseline services, a
number of additional components need to be provided.
These include Applications Area and experiment-specific software and
services. Sites involved in the SC have to understand that these
additional components need to be run as a service
in addition to the Grid middleware.
- File Catalogue status: ALICE will use its
own catalogue; ATLAS and CMS require local catalogues at all sites, and LHCb
requires a central catalogue with 1-2 read-only copies. An LFC system is
being set up at CERN, but the deployment model(s) may change in the future.
Many sites are likely to run LFC on MySQL or Oracle.
- Tier-2 centres:
The roles and services of Tier-2 centres have been discussed and clarified, and a simple model
has been agreed (a minimal sketch of the transfer logic follows this item). T2’s are configured to upload
generated MC data to, and download data from, a given T1. If the T1 is
not available, the T2 will wait and retry the data upload, or use an
alternate T1 for data download. T2 sites need to provide services for managed
storage and reliable file transfer. LCG will supply documentation for the
needed services such as DPM, FTS and LFC. It is expected that Tier-1 sites
will assist their Tier-2’s in establishing these services.
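Added for illustration: a minimal sketch of the agreed Tier-2 transfer model. The upload_to and download_from callables are hypothetical placeholders standing in for the T2's managed storage and file transfer services; they are assumed to return True on success and False otherwise.

```python
# Minimal sketch of the Tier-2 transfer model described above.
import time


def upload_mc_data(files, home_t1, upload_to, retry_wait_s=600):
    """Upload generated MC data to the associated T1; if it is unavailable,
    wait and retry (uploads are never rerouted to another T1)."""
    for f in files:
        while not upload_to(home_t1, f):
            time.sleep(retry_wait_s)


def download_data(files, home_t1, alternate_t1s, download_from):
    """Download data from the associated T1, falling back to an alternate T1
    that holds the data if the home T1 is unavailable."""
    for f in files:
        if download_from(home_t1, f):
            continue
        for t1 in alternate_t1s:
            if download_from(t1, f):
                break
```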
- Experiment goals for SC3: The main focus will be on the demonstration of the
infrastructure’s robustness. Experiments aim to run their offline
frameworks during the service phase similar to what is
done during a regular Data Challenge. Experiments will be invited
to involve as many users as possible in order to make the service phase as
realistic as possible.
- A SC3 preparation workshop (link) will
focus on the technical planning of the whole SC3 exercise, with
appropriate input from the experiments.
- SC4: SC4 starts in April 2006 and will
end with the deployment of the full production service. It will add
further complexity over SC3 in terms of services, service levels, use cases,
and the number of sites involved. In particular, the Analysis Use Case needs
to be better understood and classified, as it
covers many diverse activities ranging from AOD generation to interactive
ROOT analysis.
Discussion:
- Answering a question from Matthias and
Jean-Jacques, Ian explains that the gLite integration will take place
progressively as modules are released, verified and
tested. All required middleware needs to be available before the end
of 2005.
- Ian points out that
the experiments have been informed by LCG in good time about the upcoming
Service Challenges and that they are preparing for them.
- Les emphasizes that CERN management is
aware of the importance of allocating sufficient resources to the Service
Challenges.
LCG Status Report Review
1. Grid Deployment Area
- An important milestone achievement is
the demonstration of sustained file transfers between T0 and T1 sites
(Service Challenge 2).
- Replying to a question from Eric, Ian
clarifies that a large fraction of the failures reported in the ATLAS section
of the Quarterly Report (p. 28) are
due to CASTOR errors. These errors, which have been observed at CERN and
PIC, have been understood and will be fixed with the deployment of the new
CASTOR version (see the CERN Fabric Area section below). Jean-Jacques
highlights the importance of having error recovery mechanisms (e.g.
retries or job resubmission). Fault-tolerant functionality such as managed
storage and a reliable file transfer service will appear with the deployment
of services for SC3.
- The purpose of adding status reports from
other grid infrastructure projects to the LCG Quarterly Reports is
discussed. It is considered important to
demonstrate that the LCG project and the underlying grid infrastructure
projects (EGEE, OSG and NorduGrid) work in
coordination, particularly in the context of the Service Challenges.
SC2 endorses the proposal to include status reports on the US and NorduGrid infrastructure projects, including progress
towards reaching the required service levels. The reporting, which
should be grouped by country, should be collected by the Service Challenge
manager (Jamie Shiers).
- Les clarifies that, in order to avoid
confusion, the Grid projects providing infrastructure to LCG (EGEE, OSG
and NorduGrid) will be referred to as
“operational units”. The updated terminology will be reflected in the LCG
TDR.
2. CERN Fabric Area
- CASTOR deployment: Les explains that
the rollout plan for the new CASTOR release will be
presented at the next LCG PEB meeting. Deployment will be
ongoing, with the target of completing it within a year. It is foreseen to migrate experiment production first, before
end users; the impact on the Service Challenges still needs to be
assessed. Wisla and Marcel request that, in order to distinguish CASTOR versions
and for future reference, a release number be
quoted in the report.
SC2 considers it important to define a schedule for the switchover of all
experiments to the new CASTOR version.
- The above-mentioned ATLAS problems are
due to a limitation in the CASTOR stager catalogue implementation, which has been fixed in the new release. Until the new
release is deployed, it is recommended to reduce
the number of catalogue entries by avoiding large quantities of small files.
The inefficiency of handling small files applies to all mass storage
systems and remains an open question. One solution being investigated is grouping files
into containers (a simple illustration follows this item). However, it is not yet clear at which level grouping is more
beneficial: at the application level or at the mass storage system level.
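Added for illustration: a simple sketch of grouping small files into larger containers before they reach the mass storage system. The greedy packing strategy and the 1 GB container size are assumptions made for this example, not values discussed at the meeting.

```python
# Sketch: pack many small files into a few large containers so the stager
# catalogue sees containers instead of individual files.

def group_into_containers(files, target_bytes=1_000_000_000):
    """Greedily pack (name, size_in_bytes) pairs into containers of roughly
    target_bytes each, returning a list of lists of file names."""
    containers, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            containers.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        containers.append(current)
    return containers


# Example: 10,000 files of 1 MB each collapse into 10 containers, cutting the
# number of stager catalogue entries by three orders of magnitude.
small_files = [(f"file{i}.root", 1_000_000) for i in range(10_000)]
print(len(group_into_containers(small_files)))  # -> 10
```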
- While the achievement of the ALICE DC6
milestone and the good progress made towards the new CASTOR release are
acknowledged, it is observed that the CASTOR project has an accumulated (but stable) delay of about 3 months.
- Wisla requests that the milestone for
ALICE DC 7 be split into IT-specific and ALICE-specific
milestones, as previously agreed. She would like to see this
split in the next report, even if not all dates have
been defined.
- Answering a question from Wisla and
Marcel, Les explains that due to a recent reorganization in CERN-IT, the
CASTOR development and deployment teams have been merged
and are now part of the Fabric Infrastructure and Operations group. This
reorganization should affect neither the current plans nor the milestones.
3. Middleware Area
Jim reports that a meeting with the
Middleware Area coordinators has been scheduled and will
be held before the next POB.
- Jim welcomes the decision of the
management task force to limit gLite developments to a single software
stack.
- Relationship between the Baseline
Services Working Group (BSWG) and gLite: Ian clarifies that even though there
is no formal tie between the BSWG and the gLite team, close contacts have been established. Jim and Matthias suggest that
ties between gLite and the
BSWG (which will be restarted after the summer) should be tightened.
- Jim reiterates that the Middleware Area
milestones are too high-level and should be broken down. This would allow
a better description of the current status
of components and functionality.
- A pie chart classifying bugs by type
(crashes, feature requests, performance issues, etc.) would be very
helpful (a minimal mock-up is sketched below). In addition, reports on performance
measurement tests would help in better defining validation
milestones.
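Added for illustration: a mock-up of the suggested bug-classification pie chart. The categories are those mentioned above; the counts are invented placeholders used purely to show the intended presentation, not real bug statistics.

```python
# Mock-up of the suggested bug-classification pie chart (illustrative only).
import matplotlib.pyplot as plt

bug_counts = {
    "Crashes": 12,             # placeholder value
    "Feature requests": 30,    # placeholder value
    "Performance issues": 8,   # placeholder value
    "Other": 15,               # placeholder value
}

plt.pie(list(bug_counts.values()), labels=list(bug_counts.keys()), autopct="%1.0f%%")
plt.title("Middleware bugs by type (illustrative data)")
plt.savefig("bug_types_mockup.png")
```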
4. Applications Area
- The Applications Area has been excluded from the Status Report review because
it was recently the subject of an internal review and of an SC2 focus meeting.
- Albert points out that there is a large
shift in one of the Simulation milestones (Comparison of LHC calorimeters
for EM shower development, p. 3). The reasons are that the workload was
much higher than anticipated and that the experiments could not provide all
the required input.
- Jean-Jacques confirms that, under Pere’s coordination, the reorganization of the Applications Area is progressing smoothly.
General
remarks:
- History and evolution of milestones:
Marcel and Wisla point out that some of the remarks made in previous
reviews regarding milestone management have not yet been
taken into account (e.g. historical tracking of milestones,
distribution of delays across open milestones). Les reports that this will be implemented for the next Quarterly Report.
- Following a comment from Wisla on
milestone 1.4.1.12 (ATLAS validation of LCG-2, p. 29), it is pointed out that some of the validation milestones
have been weakly defined. This may lead to measurement difficulties and
hence to discrepancies when judging whether a milestone has been achieved. In this light, SC2 recommends
that upcoming milestones be defined as quantitatively as possible.
AOB