PEB Meeting 4 May, 2004

Present: Juergen Knobloch, John Harvey, Alberto Masoni, Bernd Panzer, Ian Bird, Philippe Charpentier, Massimo Lamanna, Federico Carminati, Chris Eck, Frederic Hemmer, Dario Barberis, Torre Wenaus, Vincenzo Innocente

Phone: Nick Brook

Apologies: Les Robertson, David Foster, David Stickland, Bob Jones

Organizational matters

  • Minutes of the last meeting approved (27-April-2004)
  • Next meeting: May 11th

Reports from meetings

  • Architects Forum (Torre):
    • Considered the new Level 2 milestones and identified one (a Geant4 L2 milestone due at the end of 2004) as a Level 1 validation milestone – this will be proposed to the PEB.
    • Agreed to develop a plan to integrate ROOT v4 into POOL – a step-by-step integration of functionality, driven by the needs of the experiments, as some of the changes are not backward compatible.  Dirk will develop the detailed plan.
    • Porting to CEL3 as the reference platform is complete (PI to be finished this week).  MacOS – there is clearly a need for it as a laptop platform, but the case for it as a farm platform is not yet clear; it is not even clear whether it would be cost-effective compared to AMD for a farm.  Accepted as a development platform and portability-verification platform.  This discussion will be brought to the PEB soon.
    • Conditions database – no clear commitment of effort from the experiments; it was reported that there is a concern that DB effort might be moved elsewhere without this commitment.  Federico commented that this should be discussed first, since this project has been presented at project reviews.
    • AA and GDA need to reach agreement on several things.  Ian will go to the AA soon.
    • The selection of a new leader for the simulation project is in progress.

Data Challenges Status

  • ALICE (Federico):
    • Running happily (700 jobs concurrently); 70% of the simulation phase is finished.  LCG is relatively stable and has improved since the start; AliEn and LCG are working well together – in fact more has been run on LCG than on AliEn.  Bringing up the first IA64 cluster, but cannot bring up many more resources as this would fill the available stage space at CERN too quickly.  Running out of space at the moment – the 30 TB requested are still not available.  Planning of the 2nd phase is ongoing: populate outside sites with data from CERN, followed by distributed reconstruction, then register the output locally and copy it back to CERN.  The 3rd phase will be analysis of those reconstructed events using ARDA.  Juergen asked whether the reconstructed data are smaller.  Yes, significantly; however, every event will be used 5 times, so in total the output data are not much smaller.
  • ATLAS (Dario):
    • Started (nominally) yesterday.  Still dealing with initial problems – this week is expected to be used for testing software distribution/installation and the production system.  Frequent and constructive coordination meetings are being held with GDA to resolve technical issues.
  • CMS (Vincenzo):
    • DC04 finished officially last week.  Next week is the CMS software and computing week, where the post-mortem analysis of DC04 will start; it will continue until the summer, in order to understand the planning needed for DC05.  Planning to start MC production to complete the data sets for physics; a new cycle of software development is starting.
  • LHCb (Nick):
    • Serious production starts tomorrow.  The final pre-production of DIRAC was done successfully last weekend.  Two thirds of the DIRAC sites are ready for the roll-out tomorrow; the LCG DIRAC agent is not quite ready yet – it should be ready next week.  No big showstoppers have been found yet for the production phase.
  • GDA (Ian):
    • Resource contention is a potential issue very soon, as 3 experiments will be running on LCG.  The behaviour of the system will be watched.  Discussions with the ROOT team have led to the start of GFAL integration with ROOT, and hence POOL.  Dario asked what the policy would be for deploying upgrades during production, as ATLAS needs some tools that are only in the new release.  Ian replied that for those tools specifically there is no problem, as they exist as stand-alone RPMs.  In general, the plan for deploying new releases will be agreed in the GDA weekly meetings.  John asked whether the results from the DC experience would be made generally available.  Ian commented that a joint GDA-experiment document for each experiment has been proposed as a milestone.

Computing Resources for remainder of 2004 (Bernd)

  • See slides attached to agenda.
  • Disk space: Ongoing problems with disk server hardware are slowly being resolved – but this is very slow because CERN has to ship the systems to Elonex, and extra space is needed as a buffer during this process.  New systems have arrived, but they needed very strict testing, so the process of adding new space has been slow.  The latest tender only went out in April – delivery is expected for August, but the actual amount that can be purchased depends on the budget situation.  The moves in the computer room also need extra disks (buffer space, re-pack of tapes, etc.).  It is also necessary to redistribute the existing 63 TB between the LHC and FT experiments.  Last year Bernd expected CERN to have 100 TB, but the realistic estimate is now only 50 TB for the 4 experiments for the DC disk buffers.  Bernd commented that this is a pessimistic scenario – it might get better.   Alberto said that this is a change of a factor of 3 relative to the original plan.  Bernd: 160 TB was not a real number, but it is certainly a reduction with respect to the 100 TB expectation.  These numbers were presented at a previous C-RRB.  Hopefully the situation will improve at the end of May.  Federico commented that in this scenario they cannot avoid thrashing on/off tape.  Bernd: as the DCs enter a more prolonged phase, exchange between experiments will become harder, and nothing can be regained from the FT experiments after their runs.  Currently only ALICE is not entirely happy with the amount of disk space available to them during the data challenges; for the other experiments reasonable solutions were found in discussion.
  • CPU servers: A lot of resources are in testbeds etc., and these are taken from what could potentially be in the batch system.  The move and renumbering of nodes also causes problems with availability.  Now 500 nodes have the LCG software installed and have outgoing access.  Ian raised the point of adding these new nodes to the grid queues and ensuring that the relative priorities with local jobs are appropriate.  The 340 nodes given to ATLAS online are now back in lxbatch and the LCG prototype.  Note that IT must also provide the FT experiments with the resources they need.  Around 1300 kSI2000 are currently available with 90% efficiency.  This is consistent with the C-RRB numbers.  The capacity is OK for the DCs, except for 10 days in July when ATLAS is doing its T0 reconstruction at CERN.
  • Issues: There are very many constraints – it is very hard to shuffle resources.  The DCs are productions and are moving into continuous mode.  There must be a balance between production farms and testbeds, and between resource dedication and resource sharing.
  • Bernd: in future CERN could consider providing less CPU and more storage.  Alberto commented that he would support this view, as so far there seems to be no problem obtaining CPU resources outside CERN.  He also sees that the plans for remote sites include relatively little disk.   Dario: this has historically been true – the blocking factor has always been disk space.
  • The issue of disk space was emphasized by all the experiments.  The GDB should ensure that sites really provide enough disk with their farms for the current data challenges, as this might already be a problem.  Juergen asked what the situation is at the other sites in the rest of LCG.  The disk/CPU ratio varies wildly between sites; this should be followed up in the GDB.

Status of AA Personnel planning (John Harvey)

  • See slides attached to the agenda.
  • Funding assumptions:
    • CERN manpower estimates are straightforward to get using contract start/end dates from the HR database.  Includes staff from:
      • LCG AA,
      • PH (experiments’ contributed effort) – contributions to projects are typically large fractions of an FTE (typically 1).  A strong fall-off is assumed in the effort contributed by the experiments;
      • PH (SFT) – assuming continuity of LD staff, Fellows and Associates into the future;
      • IT – with a similar assumption, using the department’s long-term planning numbers.
    • Non-CERN manpower estimates are based on numbers from the coordinators – these are probably less precise, and contributions per person are typically smaller (smaller FTE fractions).  Continuity is assumed in 2004, with a fall-off in 2005.
    • ARDA resources and plans are removed from this planning as of May 1.
    • LCG FTE weightings (0.5-1.2) are used.
  • The result is that by the beginning of 2006 the fall-off that is starting now will be complete (slide 4), due to LCG staff leaving and the fall-off in experiment contributions (as above); the situation is then stable, but this is based on the assumption that the IT, PH and experiment contributions have some continuity.
  • Requirements: assume the current AA work program is completed in 2004-2007; the estimate is based on the existing work plans as recently presented to the SC2 and PEB.
    • The effort required exceeds the funded effort everywhere, but the worst shortfall is in the years 2005-2007.
    • Tasks left uncovered by the missing effort:
      • SPI: 1 FTE missing for QA and the testing framework in the long term, and there is an open question over the French EGEE contribution.
      • SEAL: 2 FTEs missing after 2005; 1 FTE missing in the long term.
      • PI: assumed to be assimilated into other projects by this time.
      • SIMU: after 2005, 1 FTE missing for hadronic physics and event generator coordination; the same problem in the long term.
      • ROOT: 2 FTEs missing for the GUI builder & documentation and for PROOF; in the long term, 1 FTE missing for the GUI builder and documentation.
      • POOL: relies heavily on LCG effort; after 2005, 3 FTEs missing for the storage manager and conditions DB; in the long term, 1 FTE missing for the storage manager.
  • There are opportunities for consolidating common activities between the LCG, ROOT and Geant4 teams (e.g. the underlying infrastructure tools are currently different): consolidating the math libraries, support infrastructure, etc., and building a common set of expertise.
  • The requirement estimates are probably very optimistic.  There is a good case for an AA component in LCG phase 2 if the current work plans are to be completed.
  • Dario was worried that the bulk of the missing effort is in POOL and partly SEAL, and that this would be ATLAS’ biggest concern.  Federico was also concerned about the missing ROOT effort, because without this nothing on top of it (e.g. POOL) will be working.  Vincenzo commented that there could be opportunities for other labs to take responsibility for some of the projects with missing effort; Torre added that these should also be opportunities for constructive collaboration!  Juergen concluded that there is clearly a problem, and this could mean that several of these projects might need to be descoped.  Chris added that the missing 20 MCHF, if obtained, would certainly cover this, but it is not certain that this will happen.
  • Juergen asked if the meeting could agree on this list of missing effort (without priorities).  Federico responded that ALICE could only agree with the proviso that, apart from ROOT as stated above, ALICE does not see these missing efforts as real problems for them, since they do not use the other projects.  It was agreed to continue the discussion next week with feedback from the experiments.

AOB

  • Juergen: proposes to have a phase 2 strategy discussion at the next meeting; the AA resource discussion will also continue, giving people some time to think about these matters.