WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    OR click HERE

    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: France / Central Europe
          To: DECH / CERN

          Report from France COD:
          1. For information: A problem with Gstat caused many false alarms related to BDII tests at every ROC. The failed BDII tests were caused by a transient ASGC network outage lasting 20 minutes, from 06:05 to 06:25 on 11-Aug-2008.


          Report from Central Europe COD:
          1. No issues this week.
        • PPS Report & Issues
          1. None this week.
        • gLite Release News
          Now in Production
          • gLite3.1 Update 28. The release contains:
            • glite-CONDOR_utils for lcg-CE (PATCH:1856)
            • New version of gsoap plugin with a vulnerability fix (affecting LB, WMS, UI, WN, VOBOX, CE) (PATCH:1846)
            • Several bug fixes on WMS and clients (PATCH:1780)
            • New Short Lived Credential Service (SLCS), which issues short-lived personal certificates based on a Shibboleth AAI identity (PATCH:1693)
            • MyProxy version 1.6.1-7 (fixes build issue related to globus flavour, already deployed in production) (PATCH:1978)
            • Various improvements on lcg-extra-jobmanagers (CE) (PATCH:1942)
            • GFAL and lcg_util update with new function gfal_removedir and several bug fixes
            • FTS SL4 release (32 and 64 bit). This version has a critical bug and should not be installed. The RPMs have been removed from the repository.


            Now in PPS
            • No new updates since last week.


            Soon in Production
            • gLite3.1 Update 29 in preparation. The release contains:
              • DPM & LFC 1.6.11 : R3.1/SLC4/i386 (PATCH:1988)
              • DPM & LFC 1.6.11 : R3.1/SLC4/x86_64 (PATCH:1987)
        • EGEE issues coming from ROC reports
          1. No items.
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. None.
        • End points for FTM service at tier-1 sites
          The list of FTM end-points we have so far is:
          • ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
          • BNL: ???
          • CERN: https://ftsmon.cern.ch/transfer-monitor-report/
          • FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
            https://cmsfts3.fnal.gov:8443/transfer-monitor-gridvie
          • FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
          • IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
          • INFN: https://tier1.cnaf.infn.it/ftmmonitor/
          • NDGF: ???
          • PIC: http://ftm.pic.es/transfer-monitor-report/
          • RAL: no endpoint in production yet
          • SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
            http://ftm.grid.sara.nl/transfer-monitor-gridview
          • TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
        • FTS SL4 - required by the experiments?
          At the Tier-1s? At the Tier-2s?
        • Correction of availability metrics due to incorrect setting of LFC write test to Critical
          The Gridview team are recalculating the metrics and the correct data should be available within 1-2 days.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. CNAF [OUTAGE]: CASTOR upgrade. From Tuesday, 19 August, 09:00 UTC+2 to Wednesday, 20 August, 20:00 UTC+2. Affected nodes:
            • castorgrid.cr.cnaf.infn.it
            • srm-v2.cr.cnaf.infn.it
            • srm-v2-cms.cr.cnaf.infn.it
            • castorsrm.cr.cnaf.infn.it

          2. DESY [at risk]: One poolnode will move its location. Some files in dq2 and user directories will not be available. From: Tuesday, 19 August, 10:00 UTC+2 to Thursday 21 August 21:00 UTC+2. Affected nodes:
            • dcache-se-atlas.desy.de

          3. CSCS [OUTAGE]: Replacement of a faulty DIMM on storage pool node. From: Tuesday 19 August, 11:30 UTC+2; To: Tuesday 19 August, 13:30 UTC+2. Affected nodes:
            • storage01.lcg.cscs.ch
            • ce01.lcg.cscs.ch

          4. GRIF [OUTAGE]: electrical maintenance. From: Thursday 21 August 23:11 UTC+2; To: Wednesday 27 August 21:117:30 UTC+2. Affected nodes:
            • apcse01.in2p3.fr
            • apcce01.in2p3.fr
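
The length of each intervention window above can be worked out from its start and end times. The following is a minimal sketch, assuming the dates fall in 2008 (the agenda does not state the year on these entries) and using the UTC+2 offset given in the list. The names `INTERVENTIONS` and `duration_hours` are illustrative only. The GRIF entry is omitted because its end time is ambiguous as listed.

```python
from datetime import datetime, timedelta, timezone

# All times in the intervention list are given as UTC+2.
UTC_PLUS_2 = timezone(timedelta(hours=2))

# Assumption: the year is not stated in the list; 2008 is assumed here.
YEAR = 2008

# (site, (month, day, hour, minute) start, (month, day, hour, minute) end)
INTERVENTIONS = [
    ("CNAF", (8, 19, 9, 0), (8, 20, 20, 0)),
    ("DESY", (8, 19, 10, 0), (8, 21, 21, 0)),
    ("CSCS", (8, 19, 11, 30), (8, 19, 13, 30)),
]


def duration_hours(start, end, year=YEAR):
    """Return the length of an intervention window in hours."""
    begin = datetime(year, *start, tzinfo=UTC_PLUS_2)
    finish = datetime(year, *end, tzinfo=UTC_PLUS_2)
    return (finish - begin).total_seconds() / 3600


if __name__ == "__main__":
    for site, start, end in INTERVENTIONS:
        print(f"{site}: {duration_hours(start, end):.1f} h")
```

Under these assumptions the CNAF window is 35 hours, the DESY window 59 hours and the CSCS window 2 hours.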

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
        • Atlas report
        • CMS report

          1. CRUZET-4:
            CRUZET-4 is off to a slow start (today is day 1). HCAL and DT are in; Tracker may join in the afternoon. DAQ is currently addressing some issues. From the computing standpoint, regular data operations shifts are in place and operational, focusing mostly on T0 workflows. We are using the CRUZET-4 exercise to implement the general computing shift design put in place recently, which is meant to complement and integrate the DataOps approach and extend it to monitoring the overall infrastructure, interfacing with Grid Ops and the distributed facilities.
          2. Summer08 production:
            More details will follow from the DataOps team. The information most urgently needed by the T1 sites was already provided to them at the end of last week (they need it to prepare tape families on their MSS systems). Current storage needs are estimated as follows:
            ASGC: 27.0 TB (RAW) + 13.5 TB (RECO) = 40.5 TB
            CNAF: 26.5 TB (RAW) + 13.25 TB (RECO) = 39.75 TB
            FNAL: 64.6 TB (RAW) + 32.3 TB (RECO) = 96.9 TB
            FZK: 58.8 TB (RAW) + 29.4 TB (RECO) = 88.2 TB
            IN2P3: 22.0 TB (RAW) + 11.0 TB (RECO) = 33.0 TB
            PIC: 8.4 TB (RAW) + 4.2 TB (RECO) = 12.6 TB
            RAL: 23.9 TB (RAW) + 11.95 TB (RECO) = 35.85 TB
          Speaker: Daniele Bonacorsi
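
The Summer08 storage estimates in the CMS report above can be cross-checked with a few lines of arithmetic. The sketch below copies the RAW and RECO figures from the report and recomputes the per-site sums, plus an overall total that the report itself does not state. The names `ESTIMATES_TB`, `site_totals` and `grand_total` are illustrative and not part of any CMS tool.

```python
# Storage estimates in TB, copied from the CMS report: site -> (RAW, RECO).
ESTIMATES_TB = {
    "ASGC": (27.0, 13.5),
    "CNAF": (26.5, 13.25),
    "FNAL": (64.6, 32.3),
    "FZK": (58.8, 29.4),
    "IN2P3": (22.0, 11.0),
    "PIC": (8.4, 4.2),
    "RAL": (23.9, 11.95),
}


def site_totals(estimates):
    """Return {site: RAW + RECO} in TB, rounded to two decimals."""
    return {
        site: round(raw + reco, 2) for site, (raw, reco) in estimates.items()
    }


def grand_total(estimates):
    """Return the sum of RAW + RECO over all sites in TB, rounded to two decimals."""
    return round(sum(raw + reco for raw, reco in estimates.values()), 2)


if __name__ == "__main__":
    for site, total in site_totals(ESTIMATES_TB).items():
        print(f"{site}: {total} TB")
    print(f"All T1 sites: {grand_total(ESTIMATES_TB)} TB")
```

The per-site sums agree with the figures in the report, and the seven sites together come to 346.8 TB.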
        • LHCb report
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
        • Storage services: this week's updates
          • dCache announced version 1.8.0-16. It will most probably be available in one month. It contains several improvements:
            1. New Information Providers in accordance with the decisions taken by the "Dynamic Megatable" working group
            2. Improved version of Pin Manager. It allows pins to be released per VO.
            3. Better performing srmLs
            4. New Pool System with no overcommitted space
            5. Improved srm clients with better handling of command line options
            The CCRC08 branch will continue to be supported.
          • New CASTOR information providers, compliant with the decisions taken by the "Dynamic Megatable" working group, are in validation.
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB