CCRC'08 F2F Meeting

Europe/Zurich
40-S2-B01 morning, 160-1-009 afternoon (CERN)

40-S2-B01 until 15:00; 160-1-009 from 12:00
Description
Mailing list for CCRC08

EVO booking: WLCG CCRC'08 F2F

Meeting Access Information:

EVO Phone Bridge Telephone Numbers:
  • USA (Caltech, Pasadena, CA) +1 626 395 2112
  • Switzerland (CERN, Geneva) +41 22 76 71400
  • Slovakia (UPJS, Kosice) +421 55 234 2420
  • Italy (INFN, several cities) Enter '4000' to access the EVO bridge
  • Germany (DESY, Hamburg) +49 40 8998 1340

Dial-in numbers: +41227676000 (Main)
Access codes: 0102968 (Leader)
0112941 (Participant)
Leader site: https://audioconf.cern.ch/call/0102968
Participant site: https://audioconf.cern.ch/call/0112941

    • 09:30 09:35
      Introduction, Minutes of the meeting of 5 February & House-keeping 5m
      Speaker: Jamie Shiers
      Executive Summary of CCRC08 Face to Face meeting of 5 Feb 2008

      The agenda with attached documents is to be found at: http://indico.cern.ch/conferenceDisplay.py?confId=26922
      The meeting was chaired by J.Shiers with notes taken by H.Renshall.
      Attendance: Representatives of all experiments and most Tier-1 sites were present in person or by teleconference.
      The chairman started by pointing out that the agenda is deliberately loose to have lots of time for discussions.

      Summary of January F2F Meeting:
      -------------------------------
      Reviewing the minutes of the meeting of Jan 10, J.Shiers reminded attendees that there are 3 sets of metrics to be monitored in CCRC'08 and that the experiments should be continuously monitoring theirs. Some included a 30 minute problem resolution time and this is felt to be unrealistic. He showed a draft paper (attached to the agenda) for presentation to the MB where the target for an operator response to an alarm or call to CERN central operations (75011) was that 99% of them should receive such a response (i.e. acknowledge receipt of the problem) within 30 minutes. He is proposing that on failure to meet a target a post-mortem should be launched. He admitted the numbers are currently arbitrary but we will measure what actually happens. He pointed out that CERN has now started 24 by 7 support rotas for FTS, LFC and CASTOR services but not yet for the physics databases.
      M.Kasemann asked if these targets applied to all services or just critical ones. J.Shiers said it was written for all but the operational procedure for a less critical service could well be to leave it down till the next day. He reminded attendees that individual servers are given an importance rating where a value of 50 or more will raise a piquet call. M.Kasemann said that CMS should take another look at their online buffering to see if it matches these times.
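The response target discussed above (99% of alarms or calls acknowledged within 30 minutes) can be checked with a few lines of code. The following is only a minimal sketch: the function name `response_target_met` and the input format (a list of acknowledgement delays) are assumptions made for illustration and do not describe any tool used by CERN operations.

```python
from datetime import timedelta

# Target values quoted in the draft paper; both are described as arbitrary for now.
TARGET_DELAY = timedelta(minutes=30)
TARGET_FRACTION = 0.99


def response_target_met(delays, target_delay=TARGET_DELAY,
                        target_fraction=TARGET_FRACTION):
    """Return (fraction_within_target, met) for a list of acknowledgement delays.

    `delays` is a hypothetical input: one timedelta per alarm or call, measuring
    the time until an operator acknowledged receipt of the problem.
    """
    if not delays:
        raise ValueError("no alarms recorded for this period")
    within = sum(1 for d in delays if d <= target_delay)
    fraction = within / len(delays)
    return fraction, fraction >= target_fraction
```

Under the proposal in the minutes, a period for which `met` is false would be a candidate for a post-mortem.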
      J.Shiers thought similar tables should be made for Tier 1 and Tier 2 sites and reminded attendees that the LCGServiceChallenges Twiki includes a Tier 1 Contacts list which shows, for example, a 24 by 7 phone number for TRIUMF.

      Communications (paper attached to agenda):
      ------------------------------------------
      J.Shiers suggested we need regional Tier 2 coordinators and he has already created a mailing list for them. They should attend these F2F meetings and the GDB and follow MB minutes to communicate in both directions what matters to their communities. He is asking for volunteers and will follow this up at future MB meetings. He also suggested regional coordinators for the physics databases. M.Kasemann said that CMS already have Tier 2 coordinators while N.Brook said LHCb have no specific Tier 2 sites so for them the interest should come from the Tier 2 level.

      Storage Solutions Group:
      ------------------------
      J.Shiers announced this group is now at work and had a good meeting on 4 Feb focussing on dcache issues. He would like all storage solutions to join this series to follow up problems seen in the Feb CCRC08 to be fully ready for the May run. He is suggesting a weekly phone conference. J.Templon asked why this is not under the GSSD SRM production deployment series. J.Shiers replied this group is to fix specific problems and then dissolve. P.Charpentier said that SRM production deployment is not finished so you are just replacing one meeting by another.

      CCRC'08 Calendar:
      -----------------
      P.Mendez showed the Twiki calendar she has prepared at https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08Calendar It has open editing so sites and experiments may enter items or send any requests to her, and the intention is for this to be a master high level view of CCRC activities. H.Renshall said that this was probably now a more appropriate view than the one he has maintained under the SC4ExperimentPlans Twiki correlating activities to individual Tier 1 sites.

      Baseline Middleware:
      --------------------
      O.Keeble presented his slides. R.Santinelli (LHCb) asked if there is any plan to port the RB middleware to SL4, to which the answer was no since it is replaced by the WMS/LB middleware. P.Charpentier pointed out a problem with the -m option of lcg_utils. In response to another question O.Keeble said SA3 had started integrating the AMGA metadata catalog on top of Oracle and it should be ready in 4 weeks. J.Shiers asked if it should be a metric that sites deploy the approved baseline middleware versions. B.Koblitz said it was very difficult for ATLAS to work out what versions sites are running - it should be in the information system.

      CASTOR/ CASTOR SRM:
      -------------------
      S.Ponce presented his slides. N.Brook asked if file checksums on disk were rechecked before migration to tape and whether they were available to users. The answer was not currently, but there is also a tape checksum and the two will be correlated and made available to users in the next release. They have one remaining problem in CASTOR 2.1.6, namely the performance of garbage collection for ATLAS. S.Ponce then moved to CASTOR SRM saying that the minimum requirement for CCRC sites was version 1.3-10 though this did not support srm_copy. CERN is using version 1.3-11. They are aiming for a next release in March where any delay would be in testing.
      J.Shiers reminded attendees of the intention that the April F2F meeting finalise the baseline versions to be used in May, then asked about migration of ALICE and LHCb to CASTOR 1.6.7. M.dos Santos said this was up to the experiments and suggested mid-February. He agreed that ALICE could trigger them at short notice.

      Dcache:
      -------
      P.Fuhrmann presented his slides and took questions. He said US-CMS is not using space tokens so their version of dcache does not matter. K.Bos said that for ATLAS the 'possible' ACLs were a definite requirement.

      Disk Pool Manager:
      ------------------
      J-P.Baud presented his slides.
      He said sites should be running dpm 1.6.7 but he found many sites on 1.6.5. He will skip releasing 1.6.8 because of the time to certify, then 1.6.9 will be mandatory for gLite software. They are now finalising 1.6.10 and should release 1.7.0 early April for deployment for the May CCRC. It will support spaces for a single user or a Voms FQAN, so not a real ACL. They would look at supporting ACLs on pools if there was an agreement with other storage systems.

      STORM:
      ------
      L.Magnoni presented his slides and took questions. Their deployment of T1D1 storage is as a TSM backup so they will check with IBM the best way to trigger a recall from tape. N.Brook was worried how this will work for LHCb in the February run and P.Charpentier said that a T1D1 class with no tape recall was useless to them. L.del Agnello of CNAF promised they would manage the endpoint for LHCb and added that there were no plans to move T1D0 class data out of CASTOR at CNAF.
      Concluding the morning session J.Shiers remarked that he thought we were better prepared for the February run than in previous challenges and that for May he hoped to have a very solid middleware base fully deployed.

      Site Readiness
      --------------
      H.Renshall presented his slides concluding that the cpu situation for the February run is much improved. For May most sites will have their full 2008 resources though several will acquire tape and disk incrementally as demand grows. NL-T1 will not get their 2008 resources till November and, when asked, thought it would be in one acquisition. N.Brook said that for LHCb the available resources for February in the referenced spreadsheet were much too low and H.Renshall replied that these were the steady state 2007/8 resource requirements, not those for the CCRC.
      On disk and tape cleaning it was agreed experiments would delete their files leaving the sites to recover tapes. Sites wanted to separate out temporary from permanent tape data for the experiments that required this.
      ATLAS thought this to be a site issue and said they would want to use SRM bulk deletion methods.

      ALICE Readiness:
      ----------------
      L.Betev presented his slides pointing out the partial overlap of the February CCRC with their detector commissioning. They were planning to run at 50% of the standard p+p data rate, so compatible with the expected 2008 accelerator efficiency, and requiring a total of 13 TB of disk space and 60 TB of tape space over their 6 Tier 1 sites. He was asked if the GSI plugin for the ALICE security model was specific to ALICE. J.van Eldik replied that it could be used by any experiment and added that CERN is preparing a cookbook for CASTOR-xrootd deployment.

      ATLAS:
      ------
      S.Campana said that ATLAS are in the phase of testing what they needed for CCRC'08. For the first week they will be performing a Tier 0 full scale dress rehearsal. They have asked sites to create 4 space tokens of which 2, DATADISK and DATATAPE, are the important ones. FZK, RAL and TRIUMF have tested OK. ASGC is currently down, CNAF CASTOR is ok but not STORM, and the remaining sites have not been tested. They will want the new LFC middleware to exercise bulk deletes.

      LHCb:
      -----
      N.Brook presented his slides where they have updated the site resource numbers following new Tier 1 ratios (from RAL and NL-T1). They would really appreciate feedback from the Tier 1 sites on what resources they will have for LHCb in February. They plan to have a new version of Dirac but the timescale for testing is very tight. They will not be using the conditions database in February. G.Merino pointed out that the requirements for PIC have doubled, which is unfortunate given their cpu problems. J.Templon asked how to interpret their (NL-T1) 12 KSi2K cpu days? N.Brook said for a two week run just divide by 14 to get the continuous cpu requirement.

      CMS:
      ----
      D.Bonacorsi explained that the CMS February exercise is made up of functional blocks of which some, e.g. the Tier 0 component, have already started.
      They are reviewing reprocessing now and will start with prestaging. They are trying to perform T0 to T1 exports, then will start T1 to T2 and T1 to T1. All functional blocks should be running together in the last week. They need to know the status of SRMv2.2 at their Tier 2 sites.

      Tracking the challenge:
      -----------------------
      J.Casey demonstrated the CCRC'08 electronic log books at: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/ L.Betev asked if entries were linked to GGUS tickets and this is in fact done as a simple text entry. S.Campana asked how ATLAS shifters would get a problem to a site after hours?
      J.Casey then showed a prototype of an experiment critical services gridmap. These might be displayed per experiment in the forthcoming grid control room. He asked for feedback on how useful these tools are. J.Templon said he only wanted to look in one place to see how his multiple VOs are performing. J.Shiers thought this presentation is addressing the 3 metrics we want to observe coherently in CCRC08 and which we have agreed to report on. K.Bos said he would prefer to see maps with the ATLAS critical tests broken down by site and J.Casey thought we could probably do that.
      The chairman, J.Shiers, concluded the meeting by repeating that he thought we were much better prepared than we have been before and looked forward to seeing the attendees again in a month's time for the next F2F review.
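As a worked illustration of the rule given in the LHCb discussion in the minutes (divide the quoted KSi2K cpu days by the 14 days of a two week run to obtain the continuous cpu requirement), a minimal sketch follows. The function name `continuous_cpu` and its default run length are assumptions for illustration only; the minutes state just the division rule.

```python
def continuous_cpu(total_ksi2k_days, run_days=14):
    """Convert a total in KSi2K cpu days into a continuous KSi2K requirement.

    Follows the rule quoted in the minutes: for a two week run, divide the
    total by 14. `run_days` is exposed only so other run lengths can be tried.
    """
    if run_days <= 0:
        raise ValueError("run_days must be positive")
    return total_ksi2k_days / run_days


# The NL-T1 figure quoted in the minutes: 12 KSi2K cpu days over two weeks.
nl_t1_continuous = continuous_cpu(12)
print(f"Continuous requirement: {nl_t1_continuous:.2f} KSi2K")
```

Read this way, the NL-T1 figure corresponds to a little under 1 KSi2K held continuously for the two weeks.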
    • 09:35 10:05
      Middleware Review - Problems Encountered & Roadmap / Schedule 30m
      Speaker: Markus Schulz
      Slides
    • 10:05 10:25
      FTS Review 20m
      Speaker: Gavin McCance (CERN)
      Slides
    • 10:25 10:40
      Review of Tape Usage at CERN 15m
      Speaker: Tim Bell (CERN)
      more information
      Slides
    • 10:40 11:10
      Service / Operations Review & Summary 30m
      Including problem reporting / tracking, problem escalation, out-of-hours issues, monitoring, logging & reporting
      Speakers: Harry Renshall (CERN), Julia Andreeva (CERN), Maria Dimou (CERN)
    • 11:30 12:15
      CCRC'08 Review from the Perspective of (a) Site(s) 45m
      Speaker: Derek Ross (RAL)
      Slides
    • 12:15 13:10
      lunch break 55m
    • 13:10 13:40
      CMS Review 30m
      Speaker: Daniele Bonacorsi (INFN)
      Slides
    • 13:40 14:10
      ATLAS Review 30m
      Speaker: Kors Bos (NIKHEF, CERN, ATLAS)
      Slides
    • 14:10 14:40
      ALICE review 30m
      Speaker: Latchezar Betev (CERN)
      Slides
    • 14:40 15:10
      LHCb Review 30m
      Speaker: Stuart Paterson (CERN)
      Slides
    • 15:10 15:40
      Storage-ware Review: Problems Encountered & Roadmap 30m
      Speakers: Flavia Donno (CERN), Storage-ware providers
      Slides
    • 15:40 16:00
      CCRC'08 - Monthly Review of Calendar 20m
      Speaker: Patricia Mendez Lorenzo (CERN)
      Calendar