Minutes of the LCG SC2 meeting, 3 June 2005

Present: Ian Bird, Jean-Jacques Blaising, German Cancio (secretary), Wisla Carena, Matthias Kasemann (chair), Marcel Kunze, Eric Lançon, Albert de Roeck, Les Robertson, Jim Shank (via VRVS)
Apologies: Rick Cavanaugh

Organisational matters

  • The previous minutes (link) were approved.
  • The POB meeting where SC2 will report is on June 20th.
    Added after the meeting: The SC2 slides for the POB meeting are available on the Agenda page (link).
  • The next meeting (Middleware Area review focus) is confirmed to be on July 1st (Agenda page).

Planning for SC3 (Ian Bird)

Ian’s slides can be found here.

  • A sequence of Service Challenges (SC’s) has been scheduled for 2005-2006. The goal is to prepare for LHC Service operation, which will start in September 2006, six months before the first collisions. Service Challenge 2 (SC2) focused on data throughput, whereas SC3 and SC4 focus on sustained operations.
  • In SC3 and SC4, the results achieved in a first throughput phase should be sustained in a second service-oriented phase. In SC3, experiments will be involved by running production jobs. SC4 will include end-user analysis and should demonstrate that all elements for the real service that will start thereafter are understood and ready for operation.
  • SC2 conclusions: Although the throughput targets for SC2 were exceeded (>600 MB/s daily average sustained for 10 days, against a goal of 500 MB/s), there was no automated outage recovery in place and the data transfer service cannot yet be called reliable. Setting up the infrastructure and achieving reliable transfers requires a great deal of technical work and coordination. Dedicated manpower is lacking: the people who run the day-to-day services are often the same people who run the SC’s. A further non-technical problem is the multiplicity of partners, sites and time zones involved.
  • SC3: In the throughput phase, the primary goals are to achieve sustained T0-to-T1 disk-to-disk and disk-to-tape transfers (at 150 MB/s and 60 MB/s, respectively). A number of T2’s will participate in the T2->T1 upstream transfers. The service phase will start in September and will involve the four experiments testing their offline computing use cases (except for analysis). All identified T1’s will participate in managed transfers (though not necessarily during the throughput phase).
  • Tier-2 sites will be involved for the first time in Service Challenge 3. Coordination will be done via national organizations (e.g. GridPP). Data transfer will be bidirectional, as Tier-2 sites will produce simulation data to be uploaded to their corresponding T1 centres. In principle, T2’s will be linked with their national T1; T2’s without a national T1 need to identify which T1 they should be linked to.
  • In terms of network connectivity, SC3 will use dedicated links. T1 sites are expected to provide 10 Gb/s links to CERN. However, many sites are still at 1 Gb/s, and one site is at 600 Mb/s (see the worked figures after this list).
  • The basic components required for the setup phase include: a) SRM 1.1 at the T0 (CERN) and at all T1’s, and b) a File Transfer Service at the T0. In the case of CMS, PhEDEx will be used for data transfers between FNAL and PIC.
  • The LCG File Catalog (LFC) will be exposed to experiment tests. The LFC is considered a basic component since most experiments and sites will presumably prefer to offer common services.
  • SRM status: Most sites have an SRM set up in front of their production mass storage systems (mostly CASTOR and dCache). It is understood that the SRM 1.1 specification is not sufficient; however, the full SRM 2.1 specification is not required either. The “LCG-required” functionality set was agreed by the Baseline Services Working Group and includes SRM 1.1 plus some of the 2.1 features, in particular space management (e.g. reservations, file pinning). For SC3, the v1.1 functionality is sufficient, whereas SC4 requires the complete LCG functionality set.
  • Baseline services: The Baseline Services WG has reached an understanding on what should be regarded as baseline services. This includes SRM-based storage management, GridFTP, file transfer service, catalogues, workload and VO management, grid monitoring, applications software installation, and “VO agent” frameworks for experiment-specific long-lived processes. File Placement Services are not currently in the list but are provided by the experiment frameworks.
    Answering a question from Matthias, Ian clarifies that most of these services will be tested in SC3 in order to achieve stability for SC4.
  • On top of the baseline services, a number of additional components need to be provided. These include Applications Area and experiment-specific software and services. Sites involved in the SC have to understand that these additional components need to be run as a service in addition to the Grid middleware.
  • File Catalogue status: ALICE will use their own catalogue; ATLAS and CMS require local catalogues at all sites, and LHCb requires a central catalogue with 1-2 read-only copies. An LFC system is being set up at CERN, but the deployment model(s) may change in the future. Many sites are likely to run LFC on MySQL or Oracle.
  • Tier-2 centres: The roles and services of Tier-2 centres have been discussed and clarified, and a simple model has been agreed. T2’s upload generated MC data to, and download data from, a given T1. If that T1 is not available, the T2 will wait and retry for data upload, or use an alternate T1 for data download (see the sketch after this list). T2 sites need to provide services for managed storage and reliable file transfer. LCG will supply documentation for the needed services such as DPM, FTS and LFC. It is expected that Tier-1 sites will assist their Tier-2’s in establishing these services.
  • Experiment goals for SC3: The main focus will be on demonstrating the robustness of the infrastructure. The experiments aim to run their offline frameworks during the service phase in much the same way as during a regular Data Challenge. Experiments will be invited to involve as many users as possible in order to make the service phase as realistic as possible.
  • A SC3 preparation workshop (link) will focus on the technical planning of the whole SC3 exercise, with appropriate input from the experiments.
  • SC4: SC4 starts in April 2006 and will end with the deployment of the full production service. It will add further complexity over SC3 in terms of services, service levels, use cases, and number of involved sites. In particular, the Analysis Use Case needs to be better understood and classified, as it covers many diverse activities ranging from AOD generation to interactive ROOT analysis.
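
For orientation only: the throughput figures quoted above translate into the following approximate daily volumes and link capacities. This is back-of-the-envelope arithmetic based solely on the numbers reported in Ian’s slides; it is not an additional target.

    # Rough conversion of the quoted SC2/SC3 rates into daily volumes and
    # link capacities (illustrative arithmetic only).

    SECONDS_PER_DAY = 86_400

    def daily_volume_tb(rate_mb_per_s: float) -> float:
        """Data volume in TB moved in one day at a sustained rate in MB/s."""
        return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000

    def rate_gbit(rate_mb_per_s: float) -> float:
        """Sustained rate expressed in Gb/s (1 MB/s = 8 Mb/s)."""
        return rate_mb_per_s * 8 / 1_000

    print(daily_volume_tb(600))  # SC2 achieved:      ~51.8 TB/day
    print(daily_volume_tb(500))  # SC2 goal:          ~43.2 TB/day
    print(daily_volume_tb(150))  # SC3 disk-to-disk:  ~13.0 TB/day per T1
    print(rate_gbit(150))        # 150 MB/s = 1.2 Gb/s: marginal on a 1 Gb/s
                                 # link, comfortable on the requested 10 Gb/s
    print(rate_gbit(60))         # 60 MB/s disk-to-tape = 0.48 Gb/s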
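
The Tier-2 model described above (wait and retry uploads to the associated T1; fall back to an alternate T1 for downloads) can be summarised by the sketch below. It is purely illustrative: the upload and download helpers are hypothetical placeholders for whatever transfer tool a site actually uses, and the sketch is not part of any LCG component.

    import time

    # Illustrative sketch of the agreed T2 behaviour, not an LCG component.
    # `upload` and `download` are hypothetical callables standing in for the
    # site's actual transfer mechanism; assume they raise ConnectionError on
    # failure.

    def upload_mc_data(upload, data, associated_t1, retry_interval_s=600):
        """MC data produced at a T2 goes to its associated T1; if that T1 is
        unavailable, wait and retry (uploads do not switch to another T1)."""
        while True:
            try:
                return upload(data, associated_t1)
            except ConnectionError:
                time.sleep(retry_interval_s)

    def download_data(download, dataset, associated_t1, alternate_t1s=()):
        """Downloads prefer the associated T1 but may use an alternate T1."""
        for t1 in (associated_t1, *alternate_t1s):
            try:
                return download(dataset, t1)
            except ConnectionError:
                continue
        raise RuntimeError("no T1 currently reachable for download")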

Discussion:

  • Answering a question from Matthias and Jean-Jacques, Ian explains that the gLite integration will take place progressively as modules are released, verified and tested. All required middleware needs to be available before the end of 2005.
  • Ian points out that the experiments were informed by LCG well in advance of the upcoming Service Challenges and that they are preparing for them.
  • Les emphasizes that CERN management is aware of the importance of allocating sufficient resources to the Service Challenges.

LCG Status Report Review

1. Grid Deployment Area

  • An important milestone achievement is the demonstration of sustained file transfers between T0 and T1 sites (Service Challenge 2).
  • Replying to a question from Eric, Ian clarifies that a large fraction of the failures reported in the ATLAS section of the Quarterly Report (p. 28) are due to CASTOR errors. These errors, which have been observed at CERN and PIC, have been understood and will be fixed with the deployment of the new CASTOR version (see the CERN Fabric Area section below). Jean-Jacques highlights the importance of having error recovery mechanisms (e.g. retries or job resubmission). Fault-tolerant functionality such as managed storage and a reliable file transfer service will appear with the deployment of the SC3 services.
  • The purpose of having status reports of other grid infrastructure projects added to the LCG Quarterly Reports is discussed. It is considered important to demonstrate that the LCG project and the underlying grid infrastructure projects (EGEE, OSG and NorduGrid) work in coordination, particularly in the context of the Service Challenges.
    SC2 endorses the proposal of including status reports on the US and NorduGrid infrastructure projects, including progress towards reaching the required service levels. The reporting, which should be grouped by country, should be collected by the Service Challenge manager (Jamie Shiers).
  • Les clarifies that, to avoid confusion, the Grid projects providing infrastructure to LCG (EGEE, OSG and NorduGrid) will be referred to as “operational units”. The updated terminology will be reflected in the LCG TDR.

2. CERN Fabric Area

  • CASTOR deployment: Les explains that the rollout plan for the new CASTOR release will be presented at the next LCG PEB meeting. Deployment will proceed progressively, with the target of completing within about a year. It is foreseen to migrate experiment production first, before end users; the impact on the Service Challenges still needs to be assessed. Wisla and Marcel request that, in order to distinguish CASTOR versions and for future reference, a release number be quoted in the report.
    SC2 considers it important to define a schedule for the switchover of all experiments to the new CASTOR version.
  • The above-mentioned ATLAS problems are due to a limitation in the CASTOR stager catalogue implementation, which has been fixed in the new release. Until the new release is deployed, it is recommended to reduce the number of catalogue entries by avoiding the use of large quantities of small files. The inefficiency of handling small files applies to all mass storage systems and remains an open question. A solution being investigated is the possibility of grouping files into containers. However, it is not yet clear at which level grouping is more beneficial: the application level or the mass storage system level (see the sketch after this list).
  • While the achievement of the ALICE DC6 milestone and the good progress made towards the new CASTOR release are acknowledged, it is observed that the CASTOR project has an accumulated (but stable) delay of about 3 months.
  • Wisla requests that the milestone for ALICE DC7 be split into IT-specific and ALICE-specific milestones, as previously agreed. She would like to see this split in the next report, even if not all dates have been defined.
  • Answering a question from Wisla and Marcel, Les explains that due to a recent reorganization in CERN-IT, the CASTOR development and deployment teams have been merged and are now part of the Fabric Infrastructure and Operations group. This reorganization should affect neither the current plans nor the milestones.
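
As an illustration of the “application level” option mentioned above, many small output files could be bundled into a single container (for example a tar archive) before being written to the mass storage system, so that the stager catalogue holds one entry per container instead of one per file. This is only a sketch of the idea under discussion, not an agreed or existing mechanism, and the file names used are invented.

    import tarfile
    from pathlib import Path

    # Sketch of application-level grouping: pack many small files into one
    # container before copying it to CASTOR/dCache, so the stager catalogue
    # sees a single entry. Paths and names are purely illustrative.

    def bundle_small_files(output_dir: str, container: str) -> str:
        """Pack every regular file in `output_dir` into one tar container."""
        with tarfile.open(container, "w") as tar:
            for f in sorted(Path(output_dir).iterdir()):
                if f.is_file():
                    tar.add(f, arcname=f.name)
        return container

    # e.g. bundle_small_files("run0123_output", "run0123.tar") would then be
    # transferred to mass storage as one file instead of thousands.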

3. Middleware Area

Jim reports that a meeting with the Middleware Area coordinators has been planned and will be held before the next POB.

  • Jim welcomes the decision of the management task force to limit gLite developments to a single software stack.
  • Relationship between the Baseline Services Working Group (BSWG) and gLite: Ian clarifies that even though there is no formal tie between the BSWG and the gLite team, close contacts have been established. Jim and Matthias suggest that ties between gLite and the BSWG (which will be restarted after the summer) should be tightened.
  • Jim reiterates that the Middleware Area milestones are too high level and should be broken down. This would give a better picture of the current status of the components and their functionality.
  • A pie chart classifying bugs by type (crashes, feature requests, performance issues, etc.) would be very helpful. Adding reports on performance measurement tests would also help define the validation milestones more precisely.

4. Applications Area

  • The Applications Area has been excluded from the Status Report review, because it was recently the subject of an internal review and of an SC2 focus meeting.
  • Albert points out that there is a large shift in one of the Simulation milestones (Comparison of LHC calorimeters for EM shower development, p. 3). The reasons are that the workload was much higher than anticipated and experiments could not provide all required input.
  • Jean-Jacques confirms that, under Pere’s coordination, the Applications Area reorganization is progressing smoothly.

General remarks:

  • History and evolution of milestones: Marcel and Wisla point out that some of the remarks made in previous reviews regarding milestone management have not yet been taken into account (e.g. historical tracking of milestones, the distribution of delays for open milestones). Les informs that this will be implemented for the next Quarterly Report.
  • Following a comment from Wisla on milestone 1.4.1.12 (ATLAS validation of LCG-2, p. 29), it is pointed out that some of the validation milestones have been weakly defined. This may lead to measurement difficulties and therefore to discrepancies when judging whether a milestone has been achieved or not. In this light, SC2 recommends that upcoming milestones be defined as quantitatively as possible.

AOB

  • None.