WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    Recording of the meeting
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: DECH / CERN
          To: Russia / Italy

          Report from DECH COD:
          1. Quiet week. Two items (sites) to mention here:
            1. INFN-NAPOLI (GGUS #39631). No response for over 10 days -> Step 3: operations meeting. The site was set to downtime by the Italian ROC.
            2. INFN-LECCE (GGUS #39533). Also no answers, but it seems that the site now has the status "uncertified". The next COD should follow up with the ROC about the intended status of this site.


          Report from CERN COD:
          1. Very simple week; the COD dashboard is much faster than it ever was.
            There was a short outage on Thursday of the xSQL interface that the CIC portal uses to query SAM. Judit fixed it immediately, and the problem is understood.
        • PPS Report & Issues
          1. .
        • gLite Release News
          Now in Production
          • -
            • -


          Now in PPS
          • -


          Soon in Production
          • -
            • -
        • EGEE issues coming from ROC reports
          1. None this week.
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. None this week.
        • End points for FTM service at tier-1 sites
          The list of FTM end-points we have so far is:
          • ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
          • BNL: ???
          • CERN: https://ftsmon.cern.ch/transfer-monitor-report/
          • FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
            https://cmsfts3.fnal.gov:8443/transfer-monitor-gridvie
          • FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
          • IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
          • INFN: https://tier1.cnaf.infn.it/ftmmonitor/
          • NDGF: Being installed.
          • PIC: http://ftm.pic.es/transfer-monitor-report/
          • RAL: No endpoint in production yet.
          • SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
            http://ftm.grid.sara.nl/transfer-monitor-gridview
          • TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
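          For anyone tracking these report pages, a minimal Python sketch along the following lines could poll them; the endpoint map is copied from the list above (BNL, NDGF and RAL are omitted, as they have no production endpoint yet), while the function name and timeout value are illustrative assumptions, not part of any FTM tooling.

```python
# Hypothetical availability check for the FTM report pages listed above.
# The URLs come from the minutes; everything else is illustrative.
import urllib.request

FTM_ENDPOINTS = {
    "ASGC":   "http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/",
    "CERN":   "https://ftsmon.cern.ch/transfer-monitor-report/",
    "FNAL":   "https://cmsfts3.fnal.gov:8443/transfer-monitor-report/",
    "FZK":    "http://ftm-fzk.gridka.de/transfer-monitor-report/",
    "IN2P3":  "http://cclcgftmli01.in2p3.fr/transfer-monitor-report/",
    "INFN":   "https://tier1.cnaf.infn.it/ftmmonitor/",
    "PIC":    "http://ftm.pic.es/transfer-monitor-report/",
    "SARA":   "http://ftm.grid.sara.nl/transfer-monitor-report",
    "TRIUMF": "http://ftm.triumf.ca/transfer-monitor-report/",
}

def check_endpoint(url: str, timeout: float = 10.0) -> str:
    """Fetch the page and return 'OK <status>' or 'FAIL <reason>'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK {resp.status}"
    except Exception as exc:  # DNS errors, HTTP errors, timeouts, ...
        return f"FAIL {exc}"

# Example usage (not run here, since it needs network access):
# for site, url in sorted(FTM_ENDPOINTS.items()):
#     print(site, check_endpoint(url))
```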
        • FTS SL4 - required by the experiments or tier-1 sites?
          • Alice: Neutral (as long as there is no disruption to the service).
          • ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
          • CMS: Priority is stability during data-taking days. Whatever is scheduled in advance *and* allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (talked to Gavin).
          • LHCb: Neutral (as long as there is no disruption to the service).
          • ASGC: -
          • BNL: Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.
          • CERN: -
          • FNAL: Hardware is dating fast. There may be issues with maintenance.
          • FZK: -
          • IN2P3: -
          • INFN: -
          • NDGF: -
          • PIC: -
          • RAL: -
          • SARA/Nikhef: -
          • TRIUMF: -
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. NDGF-T1 [at risk]: dCache upgrade on the CSC pools. Some CMS and ALICE data unavailable.
            From: Tuesday 2008-08-26, 06:00:00 UTC;
            To: Tuesday 2008-08-26, 09:00:00 UTC
            Affected nodes:
            • srm.ndgf.org

          2. RAL [OUTAGE]: Atlas and LHCB LFC downtime for upgrade.
            From: Tuesday 2008-08-26, 12:00:00 UTC;
            To: Tuesday 2008-08-26, 13:00:00 UTC
            Affected nodes:
            • lcglfc0377.gridpp.rl.ac.uk
            • lfc0448.gridpp.rl.ac.uk

          3. CERN [OUTAGE]: CASTORPUBLIC 2.1.7-16 upgrade.
            From: Wednesday 2008-08-27, 12:00:00 UTC;
            To: Wednesday 2008-08-27, 13:30:00 UTC
            Affected nodes:
            • srm-dteam.cern.ch
            • castorsrm.cern.ch
            • srm.cern.ch
            • srm-v2.cern.ch
            • srm-public.cern.ch

          4. NDGF-T1 [at risk]: Optical cable maintenance work on the IJS-NDGF network connection.
            From: Wednesday 2008-08-27, 22:00:00 UTC;
            To: Thursday 2008-08-28, 03:00:00 UTC
            Affected nodes:
            • srm.ndgf.org

          Time at WLCG T0 and T1 sites.
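          To cross-check the UTC windows above against local site times, a short Python sketch; the intervention time is taken from item 3 above, while the site/timezone pairings are illustrative assumptions using standard IANA zone names.

```python
# Convert the CASTORPUBLIC upgrade start time (intervention 3 above,
# given in UTC) to local time at a few sites.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

start_utc = datetime(2008, 8, 27, 12, 0, tzinfo=timezone.utc)
for site, zone in [("CERN", "Europe/Zurich"),
                   ("FNAL", "America/Chicago"),
                   ("ASGC", "Asia/Taipei")]:
    local = start_utc.astimezone(ZoneInfo(zone))
    print(f"{site}: {local:%Y-%m-%d %H:%M %Z}")
# CERN is UTC+2 (CEST) on that date, so the upgrade starts at 14:00 local.
```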

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
        • Atlas report
        • CMS report

          1. General news on CRUZET-4 and T0 workflows:
            CRUZET-4 ended at ~8 am this morning; ~38 million events were collected during the exercise. The most interesting part ran from Thursday on, with >25 million events in the last weekend alone. Plenty of precious information and feedback from a real-life exercise. The CRUZET Jamboree is on Wednesday afternoon. CRUZET-like activities will restart, with magnetic field, at the end of the week. Separately, SLS reported "CMS Online databases" at 0% availability due to a CMS DB intervention in the Online; this is now over and the status is OK.
          2. Distributed Data Transfers:
            We see (1) issues with the stager agent (experts are aware and investigating); (2) some Castor issues causing problems for the CAF (2 tickets to CERN-IT still pending over the weekend, see [$1] and [$2]); and (3) an issue with download agents at at least 2 T1 sites. Overall this causes the PhEDEx service to be labelled as 'degraded' in SLS. These issues are being addressed/closed right now, according to news from the WLCG daily call.
            [$1] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546182&email=stephen.gowdy@cern.ch
            [$2] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546181&email=peter.kreuzer@cern.ch
          3. Tier-2 workflows:
            The high-profile Summer'08 production is on-going, still ramping up to full speed though.
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. LHCb wonders (and wants this concern to be taken seriously) whether it is valid that any downtime announced less than 24 hours in advance must be considered unscheduled rather than scheduled (with the obvious implications for the site reliability computation).

          2. LHCb wants to remind all sites that the Shared Area is also a critical service and that sites must guarantee the required QoS. The problem at CNAF shows that this is important. How can this message be conveyed efficiently to all sites, and the quality improved by adopting/writing adequate fabric sensors?

          3. Last week's SAM sensors (http://lblogbook.cern.ch/Operations/375) pointed out a mismatch between the SAM critical services (used by the GridView algorithms to compute reliability) and the services effectively used by the VOs. On 20 August, StoRM at CNAF stopped being published as an SRM sensor (it is now only an SRMv2 sensor in the SAM dictionary), so the SAM clients fail to publish results. The net effect is that, for the still-critical SRM service, no results are available for CNAF since then. A GGUS ticket was opened for the GridView team: https://gus.fzk.de/pages/ticket_details.php?ticket=40087
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

          Note that the recommended dCache version has been updated to 1.8.0-15p11.
        • Storage services: this week's updates
          Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
          • Version 1.8.0-15p12 of dCache will be available soon. Installation scripts and improvements for sites using Chimera are available. Sites that do not use Chimera should not upgrade to this version.
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          • https://gus.fzk.de/ws/ticket_info.php?ticket=37948
            Should be set to solved.
          • https://gus.fzk.de/ws/ticket_info.php?ticket=38087
            Looks like user error. Can it be closed?
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB