PEB Meeting 4 May, 2004
Present: Juergen Knobloch, John Harvey, Alberto Masoni, Bernd Panzer, Ian Bird, Philippe Charpentier, Massimo Lamanna, Federico Carminati, Chris Eck, Frederic Hemmer, Dario Barberis, Torre Wenaus, Vincenzo Innocente
Phone: Nick Brook
Apologies: Les Robertson, David Foster, David Stickland, Bob Jones
Organizational matters
- Minutes of the last meeting approved (27-April-2004)
- Next meeting: May 11th
Reports from meetings
- Architects forum (Torre):
- Considered the new Level 2 milestones and identified one (a Geant4 L2 milestone due at the end of 2004) as a Level 1 validation milestone – this will be proposed to the PEB.
- Agreed to develop a plan to integrate ROOT v4 into POOL – a step-by-step integration of functionality, to be driven by the needs of the experiments, as some of the changes are not backward compatible. Dirk will develop the detailed plan.
- Porting to CEL3 as the reference platform is complete (PI to be finished this week). MacOS – there is clearly a need for it as a laptop platform, but the need as a farm platform is not yet clear; it is not even clear whether it would be cost-effective compared to AMD for a farm. Accepted as a development platform and portability-verification platform. This discussion will be taken up in the PEB soon.
- Conditions database – no clear commitment of effort from the experiments; it was reported that there is a concern that DB effort might be moved elsewhere without this commitment. Federico commented that this should be discussed first, since this project has been presented at project reviews.
- AA and GDA need to agree on several things; Ian will go to AA soon.
- A new lead for the simulation project is in the works.
Data Challenges Status
- ALICE (Federico):
- Running happily (700 jobs concurrently); 70% of the simulation phase is finished. LCG is relatively stable and has improved since the start; AliEn and LCG are working together well – in fact more has been run on LCG than on AliEn. Bringing up the first IA64 cluster, but cannot bring up too many more resources as this would fill the available stage space at CERN too quickly. Running out of space at the moment – still do not have the 30 TB requested. Planning of the 2nd phase is ongoing – populate outside sites with data from CERN, followed by distributed reconstruction, then register the output locally and copy it back to CERN. The 3rd phase will then be analysis of those reconstructed events using ARDA. Juergen asked whether the reconstructed data is smaller. Yes, significantly; however, all events will be used 5 times, so the total output data is not much smaller.
- ATLAS (Dario):
- Started (nominally) yesterday. Still dealing with initial problems – expect to use this week for testing software distribution/installation and the production system. Frequent and constructive coordination meetings are being held with GDA to resolve technical issues.
- CMS (Vincenzo):
- DC04 finished officially last week. Next week is the CMS software and computing week, where the post-mortem analysis of DC04 will start; it will continue until the summer, in order to understand the planning needed for DC05. Planning to start MC production to complete the data sets for physics; a new cycle of software development is starting.
- LHCb (Nick):
- Serious production starts tomorrow. The final pre-production of DIRAC was done successfully last weekend. Two-thirds of the DIRAC sites are ready to roll out tomorrow; the LCG DIRAC agent is not quite ready yet – hope to have it ready next week. No big showstoppers found yet for the production phase.
- GDA (Ian):
- Resource contention is a potential issue very soon, as 3 experiments will be running on LCG. Will watch the behaviour of the system. Discussions with the ROOT team have led to the start of GFAL integration with ROOT, and hence POOL. Dario asked what the policy would be for deploying upgrades during production, as ATLAS needed some tools available only in the new release. Ian replied that for those tools specifically there was no problem, as they exist as stand-alone RPMs. In general, the plan for deploying new releases will be agreed in the GDA weekly meetings. John asked whether the results from the DC experience would be made generally available. Ian commented that a joint GDA-experiment document for each experiment has been proposed as a milestone.
Computing Resources for remainder of 2004 (Bernd)
- See slides attached to agenda.
- Disk space: Ongoing problems with disk server hardware are being resolved – but this is very slow because CERN has to ship the systems to Elonex. Extra space is needed as a buffer during this process. New systems have arrived, but they needed very strict testing, so the process of adding new space has been slow. The latest tender only went out in April – delivery is expected for August, but the actual amount that can be purchased depends on the budget situation. The moves in the computer room also need extra disks (buffer space, space for re-packing tapes, etc.). It is also necessary to redistribute the existing 63 TB between the LHC and FT experiments. Last year Bernd expected CERN to have 100 TB, but the realistic estimate now is only 50 TB for the 4 experiments for the DC disk buffers. Bernd commented that this is a pessimistic scenario – it might get better. Alberto said that this is a change of a factor 3 relative to the original plan. Bernd: 160 TB was not a real number, but this is certainly a reduction with respect to the 100 TB expectation. These numbers were presented at a previous C-RRB. Hopefully the situation will get better at the end of May. Federico commented that in this scenario they cannot avoid thrashing on/off tape. Bernd: as the DCs enter a more prolonged phase, exchanging space between experiments will become harder; nothing can be regained from the FT experiments after their runs. Currently only ALICE is not entirely happy with the amount of disk space available to them during the data challenges; for the other experiments, reasonable solutions were found in discussion.
- CPU servers: A lot of resources are in testbeds etc., and these are taken from what could potentially be in the batch system. The move and renumbering of nodes also causes problems with availability. 500 nodes now have the LCG software installed and have outgoing access. Ian raised the point of adding these new nodes to grid queues and ensuring that the relative priorities with local jobs are appropriate. The 340 nodes given to ATLAS online are now back in lxbatch and the LCG prototype. Note that IT must also provide the FT experiments with the resources they need. Around 1300 kSI2000 are currently available at 90% efficiency. This is consistent with the C-RRB numbers. The capacity is OK for the DCs, except for 10 days in July when ATLAS is doing its T0 reconstruction at CERN.
- Issues: There are very many constraints – it is very hard to shuffle resources. DCs are productions and are moving into continuous mode. There must be a balance between production farms and testbeds, and a balance between resource dedication and resource sharing.
- Bernd: in future one could consider providing less CPU at CERN and more storage. Alberto commented that he would support this view, as there seems to have been no problem so far getting CPU resources outside CERN. He also noted that the plans for remote sites include relatively little disk. Dario: this has historically been true – always blocked by disk space.
- The issue of disk space was emphasized by all the experiments. The GDB should ensure that sites really do provide enough disk with their farms for the current data challenges, as this might already be a problem. Juergen asked what the situation was for the other sites in the rest of LCG. The disk/CPU ratio varies wildly between sites. This should be followed up in the GDB.
Status of AA Personnel planning (John Harvey)
- See slides attached to the agenda.
- Funding assumptions:
- CERN manpower estimates are straightforward to obtain using contract start/end dates from the HR database. They include staff from:
- LCG AA;
- PH (experiments' contributed effort): contributions to projects are typically large fractions of an FTE (typically 1); a strong fall-off is assumed in the effort contributed by the experiments;
- PH (SFT): assuming continuity for LD, Fellows, and Associates into the future;
- IT: with a similar assumption, using department long-term planning numbers.
- Non-CERN manpower estimates are based on numbers from the coordinators – probably less precise, and typically smaller contributions per person (smaller FTE fractions). Continuity is assumed in 2004, with a fall-off in 2005.
- ARDA resources and plans are removed from this planning as of May 1.
- LCG FTE weightings (0.5-1.2) are used.
- The result is that by the beginning of 2006 the fall-off that is starting now is complete (slide 4), due to LCG staff leaving and the fall-off in experiment contributions (as above) – then a stable situation follows from then on, but this is based on the assumption that the IT, PH and experiment contributions have some continuity.
- Requirements: assume the current AA work programme is completed in 2004-2007; the estimate is based on the existing work plans as presented to SC2 and the PEB recently.
- The effort required exceeds the effort that is funded everywhere, but the shortfall is worst in the years 2005-2007.
- Tasks left uncovered by the missing effort:
- SPI: 1 FTE missing for QA and the testing framework in the long term, and an open question over the French EGEE contribution.
- SEAL: 2 FTEs missing after 2005; 1 FTE missing in the long term.
- PI: assumed to be assimilated into other projects by this time.
- SIMU: after 2005, 1 FTE missing on hadronics and event generator coordination; the same problem in the long term.
- ROOT: 2 FTEs missing for the GUI builder & documentation and for PROOF; in the long term, 1 FTE missing for the GUI builder and documentation.
- POOL: relies heavily on LCG effort; after 2005, 3 FTEs missing on the storage manager and conditions DB; in the long term, 1 FTE missing on the storage manager.
- There are opportunities for consolidating common activities between the LCG, ROOT and Geant4 teams (e.g. the underlying infrastructure tools are currently different): consolidate math libraries, support infrastructure, etc., and build a common set of expertise.
- The requirements are probably very optimistic. There is a good case for an AA component in LCG phase 2 if the current work plans are to be completed.
- Dario was worried that the bulk of the missing effort is in POOL and partly SEAL, and this would be ATLAS' biggest concern. Federico was also concerned about the missing ROOT effort, because without it nothing on top (e.g. POOL) will work. Vincenzo commented that there could be opportunities for other labs to take responsibility for some of the projects with missing effort; Torre added that these should also be opportunities for constructive collaboration! Juergen concluded that there is clearly a problem, and this could mean that several of these projects might need to be descoped. Chris added that the missing 20M CHF, if obtained, would certainly cover this, but it is not certain that this would happen.
- Juergen asked if the meeting could agree on this list of missing effort (without priorities). Federico responded that for ALICE this could only be with the proviso that ALICE does not see all of these missing efforts as real problems for them – apart from ROOT, as stated above – since they do not use any of the other projects. It was agreed to continue the discussion next week with feedback from the experiments.
AOB
- Juergen: for the next meeting, he proposes a phase 2 strategy discussion – this would also continue the AA resource discussion and give people some time to think about these issues.