WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: France and Italy
          To: UK/I and Russia


          Report from France:

          Report from Italy:
          List of unresponsive sites (First Ops meeting):
          1. SITE NAME: ru-Moscow-GCRAS-LCG2
          2. ROC NAME: ROC_Russia
          3. GGUS TICKET NUMBER: #45457, #45039, #44739
          4. Reason for escalation: according to https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures#7_6_Suspending_a_site, a ROC must suspend a site that has been in downtime for more than one month. The site has been in SD (at risk) almost continuously since 2008-12-24 (https://goc.gridops.org/downtime/list?id=15555346), and the last SD ends on 2009-04-16 (https://goc.gridops.org/downtime/list?id=20105380).
            Also, the GGUS tickets are not being updated in a timely manner.
          Problems encountered during shift:
          1. Some temporary problems with the COD dashboard on 24 March due to the GOCDB outage.
          2. Because of the GGUS update on 25 March, the COD dashboard was unavailable for a couple of hours.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          Highlights: Nothing to report this week.

        • gLite Release News
          Please find gLite release news in:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

          Highlights

            Now in Production
          • gLite 3.1 Update 42 was released to production:
            • BDII: The starting cache size has been reduced from 1 GB to 50 MB.
            • VDT 1.6.1 Release 9: This version fixes a bug in Globus that was causing trouble for 32-bit programs using Globus and running on 64-bit machines.
            • gLite3.1/FTS 2.1
            • gLite3.1 lcg-vomscerts-5.4.0 adds next cert for lcg-voms.cern.ch

            Now in PPS
          • Nothing to report.

            Soon in Production
          • Release of gLite 3.1 Update 43 to production in preparation (approx. 6 April):
            • YAIM clients: to enable configuration of Service Discovery
            • VOMS: fixes for FQAN order, short FQANs.... 64-bit version. A dependency on mysql-server has also been added.
            • SGE: New info dynamic plugin + YAIM utils
          • Release of gLite 3.1 Update 44 to production in preparation (approx. 14 April):
            • CREAM CE: Updates to CE + YAIM
            • WMS: Update to ICE + YAIM
          This set of patches includes the versions tried out over several weeks in a PPS pilot and is known to fix a number of performance issues previously affecting the ICE --> CREAM submission chain.
        • EGEE issues coming from ROC reports
          • None this week.
        • Grid Service Interventions
          SARA: OUTAGE: From 02:00 4 April to 02:00 5 April. Service: dCache SE.
          SARA: OUTAGE: From 09:30 30 March to 21:00 30 March. Service: srm.grid.sara.nl.
          SARA: OUTAGE: From 15:13 27 March to 02:00 31 March. Service: celisa.grid.sara.nl. Fileserver malfunction.
          CERN: At Risk: From 11:00 31 March to 12:00 31 March. Service: VOMS (lcg-voms.cern.ch).
          FZK: OUTAGE: From 14:21 30 March to 20:00 30 March. Service: fts-fzk.gridka.de
          INFN-CNAF: OUTAGE: From 02:00 28 March to 19:00 3 April. Service: ENTIRE SITE.
          INFN-T1: OUTAGE: From 16:00 27 March to 17:00 3 April. Service: ENTIRE SITE.
          NDGF-T1: At risk: From 12:31 27 March to 16:31 30 March. Service: srm.ndgf.org (ATLAS).
          NDGF-T1: At risk: From 12:31 27 March to 13:27 31 March. Service: ce01.titan.uio.no.

          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Please consult the URLs above for details.

        • Update of CERN VOMS server certificate
          The certificate of the VOMS server lcg-voms.cern.ch will be replaced on 31 March. By that date, all sites should have updated their services to the version of lcg-vomscerts deployed with gLite 3.1 Update 42
          ( https://cic.gridops.org/index.php?section=roc&page=broadcastretrieval&step=2&typeb=C&idbroadcast=39901 ).
        • Retirement of gLite 3.0
          As previously announced, it is planned that all remaining gLite 3.0 services will be retired by the end of April. At this point, all support for these services will cease.
          All sites should ensure that they are running up-to-date versions of their services.
          If any site sees a need to keep a gLite 3.0 service in the middleware stack, please submit a GGUS ticket as soon as possible.
      • 16:30 17:00
        WLCG Items 30m
        • Removal of the WLCG-specific section of the meeting
          From now on, at the request of WLCG, there will be no WLCG-specific section at this meeting. Note that the WLCG experiments will still take part in the general meeting.
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          OSG
          • GGUS #46647: This ticket has been discussed for the last 2 weeks, and Rob had an action to check it. The ticket is now assigned to Rob. The action required is described in the 2009-03-24 comment by MariaDZ.
          • GGUS #46988: Comment added today to this urgent ATLAS ticket, stalled since 2009-03-09:
            Tim and other OSG colleagues,
            my understanding from https://savannah.cern.ch/support/index.php?107511#comment3 is that
            had you chosen status 'customer' in OIM, 
            the ggus ticket would have gone to status 'waiting for reply' and the submitter would have been prompted 
            to react. Please do so now.
            yours
            maria
            
          • Ticket GGUS #47032: should have been in status 'solved'. Assigned to GGUS dev. for investigation.
          • Ticket GGUS #47061: Same as above. It should have been marked 'solved'.
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB

      • This week: LHC experiment VOs to perform an ALARM ticket test (full round from opening to ticket closing) to Tier1s. [savannah ticket #107452] and [testing rules]. Summary reports must be sent to wlcg-operations@cern.ch by April 3rd at the latest! (MariaDZ)
      • Very important strategic USAG meeting this Thursday at 09:30 CEST. All 1st line user support (TPM in GGUS terms) is subject to change. (MariaDZ)