WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    OR click HERE

    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: Russia / Italy
          To: SEE / SWE

          Report from Russian COD:
          1. No issues this week.

          Report from Italian COD:
          1. No issues this week.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

        • gLite Release News
        • EGEE issues coming from ROC reports
          1. None this week.
        • WN distribution mechanism
          SA3 put forward a proposal for a centralized distribution mechanism for the gLite clients (WN). Several responses have been received so far and are attached here.
          Speaker: Oliver Keeble
          more information
        • NEW: Broadcasting of downtimes of Operations Tools (GOC DB, CIC portal, etc.)
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. [SWE ROC]: CMS opened a ticket against the site LIP-Coimbra reporting that the disk space for CMS is full. Would it not be better to assign this kind of ticket to the VO rather than to the site, provided that the site fulfils the capacities agreed in an MoU or similar?
        • End points for FTM service at tier-1 sites
          The list of FTM end-points we have so far is:
          • ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
          • BNL: ???
          • CERN: https://ftsmon.cern.ch/transfer-monitor-report/
          • FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
          • FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
          • IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
          • INFN: https://tier1.cnaf.infn.it/ftmmonitor/
          • NDGF: Being installed.
          • PIC: http://ftm.pic.es/transfer-monitor-report/
          • RAL: No endpoint in production yet.
          • SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
          • TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
        • FTS SL4 - required by the experiments or tier-1 sites?
          • Alice: Neutral (as long as there is no disruption to the service).
          • ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
          • CMS: Priority is stability during data-taking days. Anything that is scheduled in advance *and* allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (discussed with Gavin).
          • LHCb: Neutral (as long as there is no disruption to the service).
          • ASGC: ???
          • BNL: Need to migrate (Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.)
          • FNAL: Need to migrate (Hardware is ageing fast. There may be issues with maintenance.)
          • FZK: Prefer to wait (to include the patch for SRM1 requests issued by FTM).
          • IN2P3: Can wait until next shutdown.
          • INFN: ???
          • NDGF: Prefer to wait until next shutdown.
          • PIC: ???
          • RAL: ???
          • SARA/Nikhef: ???
          • TRIUMF: Can wait until next shutdown.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
        • Atlas report
        • CMS report

          • General:
            Global Run data taking with the magnet at 3T over some part of the weekend.
          • CERN-IT and T0 workflows:
            The migration of data into the local CAF-DBS instance (for public information and access) slowed down because of an issue that was debugged over the weekend and is now understood. There are 11k blocks to go, which may take up to 3 days to digest. No action is worthwhile; the backlog can simply be left to drain, since insertion of CAF-urgent datasets can be (and already has been, successfully) forced manually, so CERN-local analysis access is not affected.
          • Distributed sites issues:
            • T1_ES_PIC failures in CMS-specific SAM analysis test (missing input dataset: already fixed, thanks to Pepe Flix)
            • T1_DE_FZK failures in CMS-specific SAM analysis test (missing input dataset)
            • T2_CH_CSCS: No JobRobot jobs assigned (bdII ok?) + CMS-specific js and jsprod tests fail ("no compatible resources")
            • T2_US_NEBRASKA: No JobRobot jobs assigned (bdII ok?)
            • T2_UK_London_Brunel: Aborted JobRobot jobs ("Job got an error while in the CondorG queue")
            • T2_US_Wisconsin: No JobRobot jobs assigned (bdII ok?)
            • T2_ES_CIEMAT: CMS-specific SAM errors in analysis and js tests (timeout executing tests)
            • T2_PT_LIP_Coimbra : CMS-specific SAM CE errors in jsprod + dCache "No space left on device" (acknowledged)
            • T2_US_MIT: CMS-specific SAM Frontier error ("Error ping from t2bat0080.cmsaf.mit.edu to squid.cmsaf.mit.edu": the latter is down.)
            • T2_US_Wisconsin: CMS-specific SAM tests not running since 8/29 (some problems in bdII? JobRobot is not running either)
          Speaker: Daniele Bonacorsi
        • LHCb report
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

          Note that the recommended dCache version has been updated to 1.8.0-15p11.
        • Storage services: this week's updates
          Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
          1. ATLAS asks sites to set up USER and GROUP space tokens, with specific ACLs to protect access to those areas. ATLAS has also asked for specific ACLs on the directories used to access those spaces. This request cannot currently be fulfilled by either DPM or dCache installations. Sites are therefore asked simply to set up the space tokens, allowing generic ATLAS users access to both files and directories.
          2. The next release of dCache (1.8.0-16) will support ACLs on directories. This will allow site administrators to set up correctly what ATLAS has asked for.
          3. For DPM, the release that allows multiple ACLs to be set on spaces is still in the hands of the developers.
          4. dCache release 1.8.0-15p12, which is expected to come out this week, includes a fix for ATLAS Tier-1s: the pin specified on BringOnline will start once the file has been brought onto disk, rather than at the time the request was issued, as before. After this patch release, the dCache team will concentrate on release 1.8.0-16. No further patch releases to 1.8.0-15 will therefore be made available unless very critical bugs are reported.
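          The change in pin timing described in item 4 can be illustrated with a small sketch. This is not dCache code: the function name `pin_expiry`, its parameters, and the timestamps are all hypothetical, chosen only to show why starting the pin at request time can leave a file unpinned by the time it reaches disk.

```python
from datetime import datetime, timedelta


def pin_expiry(requested_at, online_at, pin_lifetime, start_at_online=True):
    """Return the time at which a pin would expire (illustrative model only).

    start_at_online=False models the earlier behaviour described above:
    the pin lifetime is counted from the time the request was issued.
    start_at_online=True models the behaviour after the fix: the lifetime
    is counted from the time the file has been brought onto disk.
    """
    start = online_at if start_at_online else requested_at
    return start + pin_lifetime


# Hypothetical numbers: staging from tape takes 3 hours, the pin lasts 2 hours.
requested = datetime(2008, 1, 1, 12, 0)
online = requested + timedelta(hours=3)
lifetime = timedelta(hours=2)

old_expiry = pin_expiry(requested, online, lifetime, start_at_online=False)
new_expiry = pin_expiry(requested, online, lifetime, start_at_online=True)

# Under the old model the pin has already expired when the file arrives on disk.
print("old model, pin expired before file online:", old_expiry < online)
# Under the new model the file stays pinned for the full lifetime once on disk.
print("new model, time pinned on disk:", new_expiry - online)
```

          With these assumed numbers, the old model leaves the pin expired an hour before the file is available, while the new model keeps the file pinned for the full two hours after staging.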
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          • None this week.
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35