WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Chair: Nick Thackray
Contact: grid-operations-meeting@cern.ch

Description
Weekly OSG, EGEE and WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR join via the web interface (please specify your name & affiliation in the web interface).

    Minutes of all meetings and the List of Actions are available via the links on the meeting page.

      • 1. EGEE Items
        • a) Grid-Operator-on-Duty handover
          From: Nordic Federation and France
          To: UK/Ireland and Italy


          Report from Nordic Federation:
          • No special remarks at this point.

          Report from France:
          • Nothing to report, except a site to suspend.
          Candidate sites for suspension:
          • Site Name: SN-UCAD (ROC France); GGUS Ticket number(s): 44443, 44987, 42668
            https://gus.fzk.de/ws/ticket_search.php?ticket=44443
            https://gus.fzk.de/ws/ticket_search.php?ticket=44987
            https://gus.fzk.de/ws/ticket_search.php?ticket=42668
            Reason for escalation: no answer from the site for one month, and still failing SRMv2-get-SURLs (GGUS #44443), sBDII-performance (GGUS #44987) and APEL-pub (GGUS #42668).
        • b) PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          2009-01-09: In order to improve the roll-out procedure of the BDII service and to minimise the overall risk of service disruption, we are looking for a production site running a top-level BDII to join the release test process. Upon a new update of the BDII software in production, the site would be asked to be the first to upgrade its BDII and to confirm that the updated service works as expected.
          More info about the release test procedures: https://twiki.cern.ch/twiki/bin/view/LCG/PPS_Release_Testing Contact: pps-support@cern.ch
          2008-11-09: Pilot service of SLC5 WN at CERN: in progress.
          • LHCb tests on the pilot pointed out some issues with the gssklog mechanism when submitting from DIRAC3. The issue apparently arises with the newer version of VDT distributed in the WN. Under investigation.
          • In accordance with the plans, two production CEs are being reconverted to use SLC5. They will be made available for production next week (19th of January).
          • Details about the pilot (including planning, layout and technical info) can be found at https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotSLC5
          • Details about the individual tasks can be found in the tracker at http://www.cern.ch/pps/index.php?dir=./ActivityManagement/SA1DeploymentTaskTracking, specifically in the subtasks of TASK:8350.
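          As an illustration of the kind of check the volunteer site would run after upgrading, here is a minimal sketch (not part of the official procedure) that queries a top-level BDII over LDAP and counts the GlueService entries it publishes. The host name is a placeholder and the python-ldap module is assumed to be available; port 2170 and base DN o=grid are the standard values for a GLUE 1.x top-level BDII.

            # Minimal post-upgrade sanity check for a top-level BDII (sketch only).
            # "top-bdii.example.org" is a placeholder host; python-ldap is assumed.
            import ldap

            BDII_URI = "ldap://top-bdii.example.org:2170"   # standard top-level BDII port
            BASE_DN = "o=grid"                              # GLUE 1.x information tree

            conn = ldap.initialize(BDII_URI)
            conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 30)   # fail fast if the service is down
            conn.simple_bind_s()                            # anonymous bind

            # Count the published GlueService entries; a plausible number returned
            # without LDAP errors is a first sign that the upgraded BDII is healthy.
            entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                    "(objectClass=GlueService)", ["GlueServiceUniqueID"])
            print("GlueService entries published: %d" % len(entries))
            conn.unbind_s()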
        • c) gLite Release News
        • d) CREAM-CE (for Alice)
          Tier-1 sites in particular are encouraged to install one or more CREAM CEs.
        • e) Sites supporting the BioMed VO: Please update GFAL
          Could all ROCs please contact their sites that support the BioMed VO and ask them to update their WNs with the latest version of GFAL.
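          As a purely illustrative aid (not an agreed procedure), the sketch below shows a check a site admin might run on a WN to report the installed GFAL version before and after the update. The RPM package name "GFAL-client" is an assumption and may differ between gLite releases.

            # Sketch: report the GFAL version installed on this WN.
            # The package name "GFAL-client" is an assumption; adjust to the local release.
            import subprocess

            def installed_version(package="GFAL-client"):
                """Return the installed version-release of the given RPM, or None if absent."""
                result = subprocess.run(
                    ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", package],
                    capture_output=True, text=True)
                return result.stdout.strip() if result.returncode == 0 else None

            version = installed_version()
            if version is None:
                print("GFAL client RPM not found on this WN")
            else:
                print("Installed GFAL version: %s" % version)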
        • f) EGEE issues coming from ROC reports
          • Central Europe: Two cases concerning the lack of a procedure for how a site should set its default SE:
            1. The EGEE SLA allows a site to be CE-only (Section 8: a site must provide at least one CE OR SE). Not having an SE prevents the site from passing the RM SAM tests, because those tests use the site's closest (default) SE (see the lookup sketch after this list). Also, setting up a site in this situation is not possible, because YAIM requires an SE.
              Comment: maybe this is a problem with our interpretation of Section 8 of the SLA. Does this section say that a site can have "a CE OR an SE", or "a CE with >=8 CPUs or an SE with >=1 TB"? If the second option is meant, then Section 8 of the SLA can be misleading.
            2. When putting an SE into scheduled downtime, a site also has to put its CE into downtime (otherwise it will not pass the RM tests) or choose another SE from another site, for which there is no procedure.
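            To illustrate where the "closest SE" used by the RM tests comes from, here is a minimal sketch (not an official procedure) that queries a top-level BDII for the GlueCESEBindGroup entries binding a CE to its close SE(s). The BDII host and the CE host name are placeholders, and the python-ldap module is assumed to be available; a CE-only site with no such binding published is exactly the situation described above.

              # Sketch: look up which SE(s) the information system binds to a given CE
              # (this is where the "closest SE" used by the RM tests comes from).
              # The BDII and CE host names are placeholders; python-ldap is assumed.
              import ldap

              BDII_URI = "ldap://top-bdii.example.org:2170"   # standard top-level BDII port
              BASE_DN = "o=grid"                              # GLUE 1.x information tree
              CE_HOST = "ce01.example-site.org"               # hypothetical CE of a CE-only site

              conn = ldap.initialize(BDII_URI)
              conn.simple_bind_s()    # anonymous bind

              # GlueCESEBindGroup entries list the close SE(s) associated with a CE.
              filt = ("(&(objectClass=GlueCESEBindGroup)"
                      "(GlueCESEBindGroupCEUniqueID=%s*))" % CE_HOST)
              entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, filt,
                                      ["GlueCESEBindGroupCEUniqueID",
                                       "GlueCESEBindGroupSEUniqueID"])
              conn.unbind_s()

              if not entries:
                  print("No close SE published for %s" % CE_HOST)
              for dn, attrs in entries:
                  ses = [v.decode() for v in attrs.get("GlueCESEBindGroupSEUniqueID", [])]
                  print("%s -> close SE(s): %s" % (dn, ", ".join(ses) or "none"))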

          • France: [Information] Answer from the French ROC to GGUS #42668 (fifth escalation step for the SN-UCAD site):
            As this site has never reached a sustainable production level since certification, the French ROC has decided, with the agreement of the site, to restart the whole certification process from the beginning. Consequently, the site has been put into "uncertified" status and is now out of production.

          • DECH: GSI-LCG2 is down because of bugs in the 64-bit WN package; see GGUS ticket 38013. How can this situation be de-escalated?

          • SouthWest Europe: SWE will have a new site, RedIRIS, which will only host central services (top-level BDII, WMS, LFC, MyProxy, etc.). This configuration will cause problems in GSTAT because some required variables will not be defined. Will this configuration be supported in the future? Is there a workaround for this type of site?
      • 2. WLCG Items
        • a) WLCG issues coming from ROC reports
          1. None this week.
        • b) WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.

          Time at WLCG T0 and T1 sites.

        • c) WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • d) Alice
        • e) ATLAS
        • f) CMS
          (Due to constant meeting clashes in 2009, I may already have left this call by the time you get to this point. If so, please find a summary below, and mail me any questions.)
          1. Tier-0 === The DataOps team kept the T0 resources mostly saturated throughout the winter break. This made it possible to repack and prompt-reconstruct all of CRAFT twice, and CRUZET + BeamCommissioning three times (with a fourth pass running last week). Results were written to disk-only pools and promptly recycled as needed. Main issues: 1. some issues in the CMS T0 code (--> FIXED); 2. CERN resources behaved well, except for some LSF failures over the weekend of Jan 3-4 (--> FIXED); 3. some lessons learned in handling very large datasets at the T0 (CMS-specific lessons) (--> being addressed).
          2. MC production === Summer08 phase: physics requests account for 253 M events produced (GEN-SIM-RAW, CMSSW_2_1_7) and 208 M events reconstructed (CMSSW_2_1_8). --- Fall08 phase: MadGraph requests with CMSSW_2_1_17: 15.6 M events produced and reconstructed, plus 1 RAW workflow and 1 RECO workflow still running (only some problems with one workflow, which does not work yet even with a patched version of ProdAgent PA_0.12.9). --- Winter09 phase: FastSim requests with CMSSW_2_2_3; 45 requests were assigned to be run during the Christmas break, 44/45 DONE, the remaining 1 simply skipped by DataOps. Total: 342 M events produced. --- Summary of issues (site issues only): just a couple of T2 sites had temporary issues, all fixed or bypassed.
          3. Reprocessing at T1 sites === 1) CRAFT activities: CRAFT data AlCaRECO and skims ran at IN2P3, FZK and PIC, of the order of ~50k jobs per workflow. IN2P3 had storage/SRM-related issues over the Christmas break. Many issues with the glideins: some were solved by the DataOps submitters; some jobs ran, but at a somewhat limited rate: not easy. In addition, regarding the skims: 5 workflows had problems with the RECO-RAW output, and a fix to DBS was needed to sort it out. 2) re-digi and re-reco: we also tried to move from glideins to gLite, and also had problems: jobs ran for a while, then turned out to give errors, etc. In this case, AFAICT, they were mostly identified as a site issue at PIC. Unfortunately, no tickets were opened by the operators (much to improve here).
          4. Transfer system === No major issues with the transfer system. A total of 175.11 TB was transferred over the winter holidays. Just one incident: the PhEDEx Castor-related export agents at CERN were not responsive on a Friday morning (I recall it being Jan 2nd); the problem resolved itself.
          Speaker: Daniele Bonacorsi
        • g) LHCb
        • h) Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      • 3. OSG Items
        Speaker: Rob Quick (OSG - Indiana University)
        • a) Discussion of open tickets for OSG
          1. https://gus.fzk.de/ws/ticket_info.php?ticket=44104: "A Nebraska site publishes the GlueSite object twice with 2 different base DNs"
          2. https://gus.fzk.de/ws/ticket_info.php?ticket=44140: "The site BU_ATLAS_Tier2 publishes information which is not Glue v1.3 compliant"
          3. https://gus.fzk.de/ws/ticket_info.php?ticket=44837: "'lsm-get failed' error occurred at site 'HarvardU' under BNL"
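          As an aside on ticket 44104, duplication of that kind can be spotted with a simple LDAP query. The sketch below (illustrative only) groups the GlueSite entries published by a top-level BDII by GlueSiteUniqueID and flags IDs that appear under more than one DN. The BDII host is a placeholder and the python-ldap module is assumed to be available.

            # Sketch: detect GlueSite objects published more than once (cf. GGUS #44104).
            # "top-bdii.example.org" is a placeholder; python-ldap is assumed.
            import ldap
            from collections import defaultdict

            conn = ldap.initialize("ldap://top-bdii.example.org:2170")
            conn.simple_bind_s()    # anonymous bind

            entries = conn.search_s("o=grid", ldap.SCOPE_SUBTREE,
                                    "(objectClass=GlueSite)", ["GlueSiteUniqueID"])
            conn.unbind_s()

            # Group the DNs under which each GlueSiteUniqueID appears.
            dns_by_site = defaultdict(list)
            for dn, attrs in entries:
                for uid in attrs.get("GlueSiteUniqueID", []):
                    dns_by_site[uid.decode()].append(dn)

            for site_id, dns in sorted(dns_by_site.items()):
                if len(dns) > 1:
                    print("Site %s is published %d times:" % (site_id, len(dns)))
                    for dn in dns:
                        print("    " + dn)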
      • 4. Review of action items
      • 5. AOB