WLCG Operations Coordination

31/S-023 (CERN)



Maria Dimou (CERN)
WLCG Operations coordination twiki
    • 15:30 - 15:40
      Tier-1 Grid services 10m
      Speaker: Nicolo Magini (CERN)
    • 15:40 - 15:45
      Middleware news and baseline versions 5m
      Speaker: Nicolo Magini (CERN)
    • 15:45 - 16:15
      Task force reports 30m
      • Squid monitoring task force 5m
    • 16:15 - 16:35
      Experiment operations review and plans 20m
      • ALICE 5m
      • ATLAS 5m
        Reprocessing of 2012 data is ongoing and close to the end, with some more merging jobs and re-runs of failed jobs.
        • avoided using data on FZK tape as input
        • affected by "data loss" at T1s (FZK and NDGF: disk server incidents; RAL: power cut)
          • ATLAS finally started exercising recovery of a single job's output by re-running just that job, rather than re-running a whole task (a long-wished-for function in our prodsys)
        Another round of processing of special-stream data from tape is to be defined soon (this month).

        Follow-up within ATLAS on the Frontier issue raised last week: "Frontier: to avoid default TCP timeout in case of service down" (WLCGDailyMeetingsWeek121119#Wednesday)
        • The Frontier client is configured with a 10-second TCP timeout and quickly tries the next Frontier server on the list if the primary one is down
        • i.e. it is OK to keep a failed node down rather than reboot it, since a rebooted node might answer with a "keep alive", potentially making the time to fail and retry longer.
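A minimal sketch of the failover behaviour described above: try each Frontier server in order with a short TCP connect timeout, moving on quickly when one does not respond. The hostnames and the `first_responsive` helper are illustrative assumptions, not the actual ATLAS Frontier configuration.

```python
import socket

# Hypothetical server list; real deployments configure this in the
# Frontier client, not in application code.
SERVERS = [
    ("frontier1.example.org", 8000),  # primary (placeholder)
    ("frontier2.example.org", 8000),  # backup (placeholder)
]

def first_responsive(servers, timeout=10.0):
    """Return the first (host, port) accepting a TCP connection,
    or None if none respond within `timeout` seconds each."""
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # server down or unreachable: try the next one
    return None
```

With a short timeout, a dead primary costs only a few seconds before the client falls over to the backup, which is why keeping a failed node fully down is preferable to a half-alive one.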

        Points raised at the last meetings, to be followed up:
        • GOCDB: WLCG-ops should review the fall-back system / procedure
        • FTS: outages largely affect T2 activities. ATLAS requests WLCG-ops to address a fall-back solution
        • VOMS-GGUS synchronization for /atlas/team (WLCGDailyMeetingsWeek121029#Friday)
        • Need for alert when OPN switches to backup (WLCGDailyMeetingsWeek121105#Monday)
        • Twiki WLCGCriticalServices to be updated

        ATLAS Distributed Computing Tier-1/Tier-2/Tier-3 Jamboree (10-11 December 2012 CERN)
        • https://indico.cern.ch/conferenceDisplay.py?confId=196649
      • CMS 5m




        * Improving CVMFS support

           * all software deployment team members are now watching the cvmfs savannah squad: cmscompinfrasup-cvmfs

           * questions and support requests about CVMFS should go through this squad

           * documentation is maintained here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CernVMFS4cms and here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCVMFS, to be consolidated


        * Improving savannah ticket support

           * merged categories: "Data Operations" and "Facilities Operations" into "Facilities"

           * created a squad as a catch-all place, cmscompinfrasup-comp_ops: sites are asked to reassign a ticket to this squad if they need a reply from central operations and do not have a specific squad to communicate with


        * Support for /store/himc and /store/hidata

           * All T1 sites except FNAL and all T2 sites are asked to support /store/hidata and /store/himc for production use

           * These top level directories are used to cleanly separate Heavy Ion collision files from proton-proton collision files

           * Not all sites allow directory creation in /store, hence this announcement


        * T2s using DPM: workarounds are needed to run CMSSW

           * Reminder that workarounds have to be re-applied and updated after upgrading middleware or OS

           * Documentation is kept up to date on TWiki: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsT2DPMInstructions





        * ARCHIVE Castor Service Class at CERN ready: 10 TB per user

           * Documentation: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsCERNArchive

           * users can store and recall files from tape but cannot run CMSSW jobs against the files





        * IN2P3 and other dCache T2s are suffering from an SRM problem:

           * known problem: the jGlobus issue with long proxies (fixed in jGlobus 2, but dCache uses jGlobus 1) requires frequent restarts of the frozen SRM

           * Currently IN2P3 uses a cron job to check for responsiveness and restart the SRM, while waiting for a fix from the dCache developers (IN2P3 hasn't received any reply in a long time)

           * The best approach is to find the problematic DN and stop the user from refreshing the proxy in a way that creates longer and longer proxies (problematic, as the DN is not in the log files: the freeze happens before the logs are written)
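As a rough illustration of the cron-based workaround mentioned above, a watchdog along these lines could probe the SRM endpoint and restart it when frozen. The host, port, and restart command are placeholders for illustration, not IN2P3's actual setup.

```python
import socket
import subprocess

def srm_responsive(host, port, timeout=30.0):
    """True if the SRM endpoint accepts a TCP connection in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # connection refused, timed out, or unreachable

def check_and_restart(host="srm.example.org", port=8443):
    """Run periodically (e.g. from cron); restart the SRM if frozen."""
    if not srm_responsive(host, port):
        # Hypothetical restart command; a real site would invoke its
        # own dCache service manager here.
        subprocess.run(["systemctl", "restart", "dcache-srm"], check=False)
```

A TCP probe only detects a fully frozen listener; a stricter check could issue an actual SRM request, at the cost of needing grid credentials in the cron environment.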


        * Not all downtimes declared in OIM for OSG sites are propagated to the Dashboard, due to a change in the OIM feed

           * OSG sites can be improperly marked as "error" during a downtime

           * Follow up in SAV:134221


        * some instabilities with CERN CreamCEs:

           * CEs: HammerCloud test jobs aborting on CERN CREAM CEs ce206 and ce208 with reason "the endpoint is blacklisted", IN PROGRESS GGUS:89124

           * CEs: low level job submission problem (< 5%), IN PROGRESS GGUS:88573




           * UI recommendation: gLite is still recommended but no longer supported

              * How do we proceed?

           * Also related to the UI: to make progress with the UI migration towards EMI, we _urgently_ need a relocatable "TAR distribution". This was always given lower priority by EMI due to person-power limitations. There was, though, an effort started by CERN IT and GridPP to provide this. Can we have an update on the status of the tar distribution?

      • LHCb 5m
    • 16:35 - 16:45
      GGUS tickets 10m
      No tickets were submitted this time by experiments or sites.
      Speaker: Maria Dimou (CERN)
    • 16:45 - 16:50
      News from other Working Groups 5m