CCRC'08 F2F Meeting

Europe/Zurich
32-1-A24 morning, 160-1-009 afternoon (CERN)

32-1-A24 until 13:00, 160-1-009 from 13:00
Description
Mailing list for CCRC08

EVO booking: WLCG CCRC'08 F2F

Meeting Access Information:

EVO Phone Bridge Telephone Numbers:
  • USA (Caltech, Pasadena, CA) +1 626 395 2112
  • Switzerland (CERN, Geneva) +41 22 76 71400
  • Slovakia (UPJS, Kosice) +421 55 234 2420
  • Italy (INFN, several cities) Enter '4000' to access the EVO bridge
  • Germany (DESY, Hamburg) +49 40 8998 1340
    • 09:30 09:35
      Minutes of the 4 March Meeting 5m
      Speaker: H.Renshall
      Notes of LCG CCRC'08 Face to Face meeting of 4 March 2008
      *********************************************************

      The agenda with attached material is at http://indico.cern.ch/conferenceDisplay.py?confId=29170

      Middleware Review - M.Schulz
      ----------------------------
      Points raised during the presentation included: DPM, for smaller sites, has better functionality than the classic SE except that it cannot be NFS mounted. Access is supported via GridFTP and HTTP. No HEP VOs were affected by the long VO FQN names bug. The FTS proxy delegation bug (race condition) has been in the system for a long time but not much prior use had been made of this functionality. FTS has suffered from BDII cache volatility so a static cache (e.g. of 4 hours) mechanism will be introduced.

      J.Gordon asked how many PPS sites were running the SL4 version of the WMS. MS admitted to not knowing but said that for testing the important thing was not the number of servers but the number of jobs. He thought it might reach production by May but that not many sites would deploy it. JG stated that it was disappointing that we did not achieve the planned week of software stability before starting CCRC'08 phase 1.

      FTS Review - G.McCance
      ----------------------
      The presentation reviewed the issues uncovered during the February CCRC'08. On problem tracking F.Donno supported the use of GGUS, which GM agreed was useful. J.Gordon thought that, to aid tracking, Tier 1 should be able to submit tickets that do not get bounced back to themselves. M.dos Santos asked what would be the most important improvement in FTS monitoring. GM replied there is a lot of internal information to be exposed and that classification of failures should be better.

      Review of Tape Usage at CERN during Feb CCRC'08 - T.Bell
      --------------------------------------------------------
      During the first 2 weeks aggregate writing speed reached 4 GB/sec.
      The CMS reading load stayed high during the whole period and dominated the reading. We know, in CASTOR, how to tune the writing but the reading is user driven and much harder to optimize. S.Lin asked about mount times and was told 3 minutes for a mount and 2 minutes for an unmount. K.Bos did not understand the time distribution of ATLAS tape writing and TB explained that migration to tape was triggered after 4 hours for security when the volume threshold of a complete tape had not been met, and that in addition the small ATLAS file sizes are inefficient for tape writing. J.Gordon asked if migration policies were the same for all experiments, the answer being yes when CASTOR 2.1.6 is used.

      TB explained that tape mounts are dominated by non-production reads: they had observed about 400 CMS users getting 2-3 files per mount, and it was general CMS data. TB continued that they need to understand this individual usage (e.g. why is the needed data not on disk) and must ensure that production reading gets what it needs. P.Charpentier remarked that they see files deleted from the LHCb disk cache immediately after migration even when there is space left in the buffer. M.dos Santos said they are implementing a better overview of the disk cache and recommended that users flush files from the cache when they no longer need them. TB announced they intended to improve their tape usage metrics for the May run. J.Shiers concluded by offering to send lists of the busy individual tape readers to the experiments for follow-up on what they are doing.

      Service/Operations Review and Summary
      -------------------------------------
      Following the presentation by M.Dimou on the pluralism of problem reporting there was a lively discussion. She pointed out that at the moment the CERN operator mailing list for trusted CMS users has an open subscription.
      She also highlighted the need to convince GGUS to support trusted users so they can bypass ROC filtering and go straight to sites. J.Shiers said we must advance on the CERN experts lists from the experiments. They should not be open but be controlled and modifiable by CERN and the experiments, and they should allow a dialogue between the problem submitter and the called-out staff. P.Charpentier thought we should not treat CERN any differently and so should go via GGUS tickets. He said ATLAS had requested the Tier 0 type callout be extended to the Tier 1. T.Cass reminded the meeting that anyone can call the operator at 5011 and that we also need to allow for people like the experiment shifters at the pits. On this K.Bos said ATLAS prefer to filter these at the moment and L.Betev said the same for ALICE. M.Dimou said she is getting conflicting information on this from others in ATLAS and ALICE. J.Gordon said there will be several ROC-dependent ways of working and that ROCs should not block problem passing but do need to be informed of them.

      H.Renshall presented a review of the February run of CCRC'08, including that few Tier 1 sites participated in the daily meeting. J.Templon said he thought the meeting could be made more useful and was immediately appointed to chair a session on site/experiment communications at the next F2F meeting in early April.

      J.Andreeva presented the status and plans for monitoring in CCRC'08. She presented possible gridmap displays of experiment workflows. J.Templon said sites need to know if these are working for them and that he wants to send test results to a site's own monitoring and alarm systems. J.Andreeva said this is not excluded by their work. J.Gordon asked when to expect to see site views and the answer was soon. J.Andreeva said they were looking at calculating transfer speeds of Ganga workflow to a Tier 1, with measurements taken every 5 minutes and averages being presented in several ways.

      CCRC'08 Review from RAL Perspective
      -----------------------------------
      ALICE was not supported in the February run but will be for May. There is still a lot of work to do to get all the services in production state by May (they are expecting delivery of 180 disk servers of 9 TB in April). They were checking the CERN elogger (J.Templon said Nikhef were not) and did see delays in GGUS tickets arriving.

      CMS Review
      ----------
      Questions/comments: L.Betev asked how CMS knew the number of tapes written at CNAF, the answer being from a local database. Not so many SRMv2 functionalities were tested. The main area of concern is the ability of the local MSS to guarantee they can recover data. They would like to test T1 to T1 transfers overlapping with ATLAS.

      ATLAS Review
      ------------
      They regretted the interference between their FDR and the CCRC activities. Up to the May run they will be performing first the M6 cosmics run, then alternating functionality and throughput tests which will include oversubscription of T0 to T1 transfers.

      Questions/comments: The logic to complete partially transferred data sets will go in tomorrow. SRMv2 was only deployed the day before their FDR. They want their physicists to analyse derived physics datasets to exercise another storage class. They cannot run an FDR and throughput tests at the same time. J.Gordon said that if they do not stress test in May they will only find the problems in August. M.dos Santos, referring to early transfer performance problems where disk-to-disk copying in CASTOR was happening, said that although the copies were efficient the target pool was very busy and was also performing garbage collection. D.Bonacorsi encouraged ATLAS to do some concurrent T1-T1 testing together with CMS. Finally J.Gordon asked ATLAS to let sites know their May storage requirements as soon as possible.

      ALICE Review
      ------------
      There were no additional questions/comments.

      LHCb Review
      -----------
      Questions/comments: At NIKHEF the job queue time limit was too short for LHCb. They had problems with dCache not returning space for deleted files. J.Templon asked if they were considering using xrootd for read-write, to which the answer was for read only, to replace gsidcap, rfio and rootd. They would also consider testing copying their input files to local disk on the WNs.

      Storage Review
      --------------
      F.Donno reviewed the problems seen during the February run. On the gsidcap timeout of 2 hours J.Templon said that it was on inactivity and that SARA had now removed it. J.Shiers said he thought that basically storage management worked and that we are in a much better situation than we had thought we would be.

      Calendar
      --------
      P.Mendez presented the calendar of planned activities up to the May run. M.Jouvin remarked that if DPM 1.6.10 is to be used in May it must be available much earlier. J.Shiers said that in the April F2F meeting we will discuss and agree the baseline middleware versions and that we are aiming for installation stability from the start of the April workshop on the 21st. K.Bos noted from the calendar that CNAF plans a 1-week shutdown during the ATLAS T1-T1 tests. J.Gordon pointed out that RAL will not have finished deploying new disk space by April and J.Templon reminded the meeting that information on site-deployed resources is coordinated by H.Renshall. J.Shiers concluded by saying the calendar is intended to be a synthesis to stop us from having to wade through hundreds of presentations.

      To conclude, J.Shiers thanked those attending and looked forward to seeing them again at the next F2F on 1 April.
    • 09:35 13:35
      CCRC'08 - Site Focussed Session 32/1-A24
      • 09:35
        Communication between sites and experiments 30m
        Action: { Need to set up the equivalent of atlas-grid-alarm@cern.ch for each T1. Action on: Jamie Shiers }
        Speaker: Ronald Starink (NIKHEF)
        Slides
      • 10:05
        Distributed DB Service View on Communications 15m
        Speaker: Maria Girone (CERN)
        Slides
      • 10:20
        Communication with Network teams / providers - discussion at LCG OPN 15m
        Speakers: Edoardo Martelli (CERN) , James Casey (CERN)
        Slides
      • 10:35
        French T2s experience and communication challenge 30m
        Speaker: Michel Jouvin (LAL)
        Slides
      • 11:05
        Site / Experiment Panel & Round-table on Communication 30m
        Jeff "short (make it four minutes) introductory two-slide partial presentation on LHCb 'increase my wall time limit' discussion" Templon
        Speaker: Sites, Experiments
      • 11:35
        FTS setup at RAL
        Speaker: Derek Ross (RAL)
        Slides
      • 11:40
        FTS channel configuration jamboree 25m
        Speaker: Gavin McCance (CERN)
        Slides
      • 12:05
        LFC backend requirements at WLCG Tier1s for May 5m
        Speaker: Jamie Shiers (CERN)
        Slides
      • 12:10
        Central deletions through srmv2 in Atlas 20m
        Speaker: Vincent Garonne (ATLAS)
        Slides
    • 12:30 13:30
      lunch and relocation break 1h
    • 13:30 16:00
      CCRC'08 - Experiment / Service Focussed Session 160/1-009
      • 13:30
        Middleware status 20m
        Speakers: Markus Schulz (CERN) , Oliver Keeble (CERN)
        Slides
      • 13:50
        Storage-ware status 30m
        Speaker: Flavia Donno (CERN)
        Slides
      • 14:20
        DB Service Status: migration of services to 2008 h/w, Oracle version(s) for 2008 and readiness for May challenge 15m
        Speaker: Maria Girone (CERN)
        Slides
      • 14:35
        Experiment outlook (i.e. for May and beyond) 1h
        • ATLAS 15m
          Speaker: Kors Bos (NIKHEF, ATLAS, CERN)
          Slides
        • CMS 15m
          Speaker: Christoph Paus (MIT)
          Slides
        • ALICE 15m
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
          Slides
        • LHCb 15m
          Speaker: Nicholas Brook (University of Bristol)
          Slides
      • 15:35
        Future CCRC'08 F2F meetings(?) 10m
        There is a HEPiX meeting at CERN May 5-9 and a GDB on May 14.

        Do we need a CCRC'08 F2F in May? (expecting the answer no)

        In June there is a 2-day "post-mortem" workshop (12th - 13th).

        (Not to mention the WLCG Collaboration workshop 21 - 25 April!)

        CCRC'08 June post-mortem workshop
        GDB Indico category
        HEPiX
        WLCG Collaboration workshop
      • 15:45
        Wrap-up 5m
    • 15:50 16:00
      end of meeting / relocation break 10m