Executive Summary of CCRC08 Face to Face meeting of 5 Feb 2008
The agenda with attached documents is to be found at:
http://indico.cern.ch/conferenceDisplay.py?confId=26922
The meeting was chaired by J.Shiers with notes taken by H.Renshall.
Attendance: Representatives of all experiments and most Tier-1 sites were
present in person or by teleconference.
The chairman started by pointing out that the agenda is deliberately
loose to have lots of time for discussions.
Summary of January F2F Meeting:
-------------------------------
Reviewing the minutes of the meeting of Jan 10, J.Shiers reminded attendees
that there are 3 sets of metrics to be monitored in CCRC'08 and that the
experiments should be continuously monitoring theirs. Some included a 30
minute problem resolution time, which was felt to be unrealistic. He showed a
draft paper (attached to the agenda) for presentation to the MB
where the target for an operator response to an alarm or call to CERN
central operations (75011) was that 99% of them should receive such
a response (i.e. acknowledge receipt of the problem) within 30 minutes.
He is proposing that on failure to meet a target a post-mortem should be
launched. He admitted the numbers are currently arbitrary but we will
measure what actually happens. He pointed out that CERN has now started
24 by 7 support rotas for FTS, LFC and CASTOR services but not yet for
the physics databases.
M.Kasemann asked if these targets applied to all services or just critical
ones. J.Shiers said it was written for all but the operational procedure
for a less critical service could well be to leave it down till the next
day. He recalled that individual servers are given an importance rating
where a value of 50 or more will trigger a piquet call. M.Kasemann said
that CMS should take another look at their online buffering to see if it
matches these times. J.Shiers thought similar tables should be made
for Tier 1 and Tier 2 sites and noted that the LCGServiceChallenges
Twiki includes a Tier 1 Contacts list which shows, for example, a 24 by 7
phone number for TRIUMF.
Communications (paper attached to agenda):
------------------------------------------
J.Shiers suggested we need regional Tier 2 coordinators and he has already
created a mailing list for them. They should attend these F2F meetings and
the GDB and follow MB minutes to communicate in both directions what
matters to their communities. He is asking for volunteers and will follow
this up at future MB meetings. He also suggested regional coordinators
for the physics databases. M.Kasemann said that CMS already have Tier2
coordinators while N.Brook said LHCb have no specific Tier 2 sites so
for them the interest should come from the Tier 2 level.
Storage Solutions Group:
------------------------
J.Shiers announced this group is now at work and had a good meeting on 4
Feb focussing on dCache issues. He would like all storage solutions to
join this series, to follow up problems seen in the February CCRC'08 run
and be fully ready for the May run. He is suggesting a weekly phone conference.
J.Templon asked why this is not under the GSSD SRM production deployment
series. J.Shiers replied this group is to fix specific problems and then
dissolve. P.Charpentier said that SRM production deployment is not finished
so you are just replacing one meeting by another.
CCRC'08 Calendar:
-----------------
P.Mendez showed the Twiki calendar she has prepared at
https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08Calendar
It has open editing so sites and experiments may enter items or send
any requests to her and the intention is for this to be a master high
level view of CCRC activities. H.Renshall said that this was probably
now a more appropriate view than the one he had maintained under the
SC4ExperimentPlans Twiki correlating activities to individual Tier 1 sites.
Baseline Middleware:
--------------------
O.Keeble presented his slides. R.Santinelli (LHCb) asked if there is any
plan to port the RB middleware to SL4, to which the answer was no since
it is replaced by the WMS/LB middleware. P.Charpentier pointed out a
problem with the -m option of lcg_utils. In response to another question
O.Keeble said SA3 had started integrating the AMGA metadata catalog
on top of Oracle and it should be ready in 4 weeks. J.Shiers asked if it
should be a metric that sites deploy the approved baseline middleware
versions. B.Koblitz said it was very difficult for ATLAS to work out
what versions sites are running - it should be in the information system.
CASTOR/ CASTOR SRM:
-------------------
S.Ponce presented his slides. N.Brook asked if file checksums on disk
were rechecked before migration to tape and were they available to users.
The answer was not currently but there is also a tape checksum and the
two will be correlated and made available to users in the next release.
They have one remaining problem in CASTOR 2.1.6 namely the performance
of garbage collection for ATLAS. S.Ponce then moved to CASTOR SRM saying
that the minimum requirement for CCRC sites was version 1.3-10 though
this did not support srm_copy. CERN is using version 1.3-11. They are
aiming for a next release in March where any delay would be in testing.
J.Shiers reminded attendees of the intention that the April F2F meeting finalise
the baseline versions to be used in May then asked about migration of
ALICE and LHCb to CASTOR 2.1.6. M.dos Santos said this was up to the
experiments and suggested mid-February. He agreed that ALICE could
trigger them at short notice.
dCache:
-------
P.Fuhrmann presented his slides and took questions. He said US-CMS is
not using space tokens so their version of dCache does not matter. K.Bos
said that for ATLAS the 'possible' ACLs were a definite requirement.
Disk Pool Manager:
------------------
J-P.Baud presented his slides. He said sites should be running DPM 1.6.7
but he found many sites on 1.6.5. He will skip releasing 1.6.8, because
of the time to certify, and 1.6.9 will be mandatory for gLite software.
They are now finalising 1.6.10 and should release 1.7.0 in early April for
deployment for the May CCRC. It will support spaces for a single user or
a VOMS FQAN, so not a real ACL. They would look at supporting ACLs on pools
if there was an agreement with other storage systems.
StoRM:
------
L.Magnoni presented his slides and took questions. Their deployment of
T1D1 storage is as a TSM backup so they will check with IBM the best way
to trigger a recall from tape. N.Brook was worried how this will work
for LHCb in the February run and P.Charpentier said that a T1D1 class
with no tape recall was useless to them. L.del Agnello of CNAF promised
they would manage the endpoint for LHCb and added that there were no
plans to move T1D0 class data out of CASTOR at CNAF.
Concluding the morning session J.Shiers remarked that he thought we were
better prepared for the February run than in previous challenges and
that for May he hoped to have a very solid middleware base fully deployed.
Site Readiness
--------------
H.Renshall presented his slides concluding that the cpu situation for
the February run is much improved. For May most sites will have their
full 2008 resources though several will acquire tape and disk
incrementally as demand grows. NL-T1 will not get their 2008 resources
till November and, when asked, said they expected it to come in a single
acquisition.
N.Brook said that for LHCb the available resources for February in the
referenced spreadsheet were much too low and H.Renshall replied that these
were the steady state 2007/8 resource requirements not those for the CCRC.
On disk and tape cleaning it was agreed experiments would delete their
files leaving the sites to recover tapes. Sites wanted to separate out
temporary from permanent tape data for the experiments that required this.
ATLAS thought this to be a site issue and said they would want to use
SRM bulk deletion methods.
ALICE Readiness:
----------------
L.Betev presented his slides pointing out the partial overlap of the
February CCRC with their detector commissioning. They were planning to
run at 50% of the standard p+p data rate, so compatible with the expected
2008 accelerator efficiency, and requiring a total of 13 TB of disk space
and 60 TB of tape space over their 6 Tier 1 sites. He was asked if the
GSI plugin for the ALICE security model was specific for ALICE. J.van Eldik
replied that it could be used by any experiment and added that CERN is
preparing a cookbook for CASTOR-xrootd deployment.
ATLAS:
------
S.Campana said that ATLAS are in the phase of testing what they needed
for CCRC'08. For the first week they will be performing a Tier 0 full
scale dress rehearsal. They have asked sites to create 4 space tokens
of which 2, DATADISK and DATATAPE, are the important ones. FZK, RAL and
TRIUMF have tested OK. ASGC is currently down; CNAF CASTOR is ok but
not StoRM, and the remaining sites have not been tested. They will want
the new LFC middleware to exercise bulk deletes.
LHCb:
-----
N.Brook presented his slides where they have updated the site resource
numbers following new Tier 1 ratios (from RAL and NL-T1). They would
really appreciate feedback from the Tier 1 on what resources they will
have for LHCb in February. They plan to have a new version of Dirac but
the timescale for testing is very tight. They will not be using the
conditions database in February. G.Merino pointed out that the
requirements for PIC have doubled which is unfortunate given their cpu
problems. J.Templon asked how to interpret their (NL-T1) 12 KSi2K cpu
days? N.Brook said that for a two week run one should just divide by 14
to get the continuous cpu requirement.
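As a worked example of the conversion N.Brook described (using the 12
KSi2K cpu-days figure quoted above; the resulting sustained value is our
arithmetic, not a number stated at the meeting):

```latex
\frac{12\ \mathrm{KSi2K\cdot days}}{14\ \mathrm{days}} \approx 0.86\ \mathrm{KSi2K\ (sustained\ over\ the\ two\ week\ run)}
```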
CMS:
----
D.Bonacorsi explained that the CMS February exercise is made up of
functional blocks of which some, e.g. the Tier 0 component, have already
started. They are reviewing reprocessing now and will start with prestaging.
They are trying to perform T0 to T1 exports then will start T1 to T2 and
T1 to T1. All functional blocks should be running together in the last
week. They need to know the status of SRMv2.2 at their Tier 2 sites.
Tracking the challenge:
-----------------------
J.Casey demonstrated the CCRC'08 electronic log books at:
https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/
L.Betev asked if entries were linked to GGUS tickets and this is in fact
done as a simple text entry. S.Campana asked how ATLAS shifters would get
a problem to a site after hours? J.Casey then showed a prototype of
an experiment critical services gridmap. These might be displayed per
experiment in the forthcoming grid control room. He asked for feedback on
how useful these tools are. J.Templon said he only wanted to look in one
place to see how his multiple VOs are performing. J.Shiers thought this
presentation addressed the 3 metrics we want to observe coherently
in CCRC'08 and which we have agreed to report on. K.Bos said he would
prefer to see maps with the ATLAS critical tests broken down by site
and J.Casey thought we could probably do that.
The chairman, J.Shiers, concluded the meeting by repeating that he thought
we were much better prepared than we have been before and looked forward
to seeing the attendees again in a month's time for the next F2F review.