WLCG-OSG-EGEE Operations meeting

CERN conferencing service (joining details below)

Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
        EGEE Items
        • Grid-Operator-on-Duty handover
          "Old" COD: South East Europe (SEE) => Germany/Switzerland (DECH)

          Report from "old style" COD:
          No unresponsive sites. Nothing to raise.

          cCOD: Central Europe (CE) => North Europe (NE)

          Report from cCOD:
          Nothing to raise.

        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:
        • gLite Release News
        • EGEE issues coming from ROC reports
          • France ROC: IN2P3-CC has been down since Sunday 3rd May 19:00, due to an air cooling failure. Most of the grid services were restarted this morning (May 4th).
            An unscheduled downtime remains active until tomorrow afternoon for CEs and SEs.

          • SEE ROC: Are there any developments, plans or ideas from the development team towards a high-availability mechanism for the LFC service?
            From the developers: LFC can be deployed in an HA setup, as it holds no internal state apart from the database. One can deploy multiple front-ends pointing to the same database back-end, in a load-balanced or fail-over configuration.
            Of course in this case the database is a single point of failure, which one can mitigate by deploying on Oracle RAC or having a multi-tier LFC setup: https://twiki.cern.ch/twiki/bin/view/LCG/LfcConceptDeploymentUsage
            In this case there is one master LFC service, which updates read-only replicas via Oracle streams database level replication.
            In theory one can also think about MySQL based replication (tested for VOMS, but not for LFC) and also about multi-master Postgres DB, depending on the actual requirements coming from the sites.

          • SEE ROC: Will glite-3.2 also support the 32-bit architecture? Based on https://twiki.cern.ch/twiki/bin/view/EGEE/Glite32RNProd it seems that it will. [Nick: Really???]

          • SEE ROC: What is the current status of the top-BDIIs? Tests we made within the HellasGrid infrastructure showed us that many of the problems in the current version of top-BDII are solved in the top-BDIIs.
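
The fail-over option mentioned in the developers' reply on LFC high availability (several stateless front-ends sharing one database back-end) can be illustrated from the client side. The following is only a minimal sketch under stated assumptions: the function `query_with_failover`, the `query` callable and the host names in the usage note are hypothetical and are not part of LFC or any gLite client library.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(front_ends: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each front-end in order and return the first successful result.

    Hypothetical helper: `query` stands in for whatever client call talks
    to one front-end and raises ConnectionError when that host is down.
    """
    errors: List[str] = []
    for host in front_ends:
        try:
            # All front-ends share the same database back-end, so any of
            # them can answer the request.
            return query(host)
        except ConnectionError as exc:
            # Remember the failure and move on to the next front-end.
            errors.append(f"{host}: {exc}")
    raise ConnectionError("all front-ends failed: " + "; ".join(errors))
```

A caller would pass an ordered list such as `["lfc1.example.org", "lfc2.example.org"]` (placeholder names) together with its own lookup function. This sketch does not address the database itself, which remains the single point of failure discussed above.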

        • Grid Service Interventions

          ALL TIMES IN UTC+2

          Downtimes affecting the WLCG Tier-1 sites:

          CERN-PROD: OUTAGE: 14:00 06 May - 18:00 06 May. Services: srm-public.cern.ch; srm-atlas.cern.ch

          CERN-PROD: At risk: 08:30 06 May - 11:30 06 May. Services: srm-pps.cern.ch ; srm-public.cern.ch ; srm-cms.cern.ch ; srm-alice.cern.ch ; srm-lhcb.cern.ch ; srm-dteam.cern.ch ; srm-atlas.cern.ch

          CERN-PROD: OUTAGE: 09:00 07 May - 13:00 07 May. Services: srm-public.cern.ch ; srm-alice.cern.ch

          CERN-PROD: OUTAGE: 09:00 05 May - 13:00 05 May. Services: srm-public.cern.ch ; srm-alice.cern.ch

          SCAI: OUTAGE: 14:00 07 May - 18:00 07 May. Services: rb.scai.fraunhofer.de

          SARA-MATRIX: OUTAGE: 10:00 06 May - 11:00 06 May. Services: ui.grid.sara.nl; voms.grid.sara.nl ; lfc-atlas.grid.sara.nl ; bdii.grid.sara.nl ; fts.grid.sara.nl ; rgmamon.grid.sara.nl

          RAL: OUTAGE: 09:00 06 May - 18:00 06 May. Services: lfc.gridpp.rl.ac.uk

          RAL: OUTAGE: 08:00 06 May - 18:00 06 May. Services: lcgfts.gridpp.rl.ac.uk

          RAL: OUTAGE: 11:00 05 May - 19:00 08 May. Services: lcgce04.gridpp.rl.ac.uk

          RAL: OUTAGE: 10:00 05 May - 13:00 05 May. Services: lcglb02.gridpp.rl.ac.uk

          SARA-MATRIX: At risk: 09:00 06 May - 12:00 06 May. Services: srm.grid.sara.nl

          NDGF-T1: At risk: 13:00 05 May - 17:00 05 May. Services: srm.ndgf.org

          NDGF-T1: OUTAGE: 09:00 05 May - 12:00 05 May. Services: db1tier1.ndgf.org

          IN2P3-CC: OUTAGE: 11:55 04 May - 19:00 05 May. Services: cclcgceli02.in2p3.fr ; ccsrm.in2p3.fr ; cclcgceli01.in2p3.fr ; cclcgceli03.in2p3.fr ; cclcgceli04.in2p3.fr ; ccsrm02.in2p3.fr

          and many more...
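
The downtime entries above follow a regular pattern (site, type, start, end, services), so they can be read mechanically, for example to compute durations or convert to UTC. The following is a rough sketch only: the names `Downtime` and `parse_downtime` are made up for illustration, and the year 2009 is an assumption, since the entries themselves print no year.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import List, NamedTuple

LOCAL = timezone(timedelta(hours=2))  # the agenda states "ALL TIMES IN UTC+2"
ASSUMED_YEAR = 2009                   # assumption: no year is printed in the entries
MONTHS = {name: i + 1 for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split())}

ENTRY = re.compile(
    r"^(?P<site>[\w-]+): (?P<kind>OUTAGE|At risk): "
    r"(?P<start>\d{2}:\d{2} \d{2} \w{3}) - (?P<end>\d{2}:\d{2} \d{2} \w{3})\. "
    r"Services: (?P<services>.+)$"
)


class Downtime(NamedTuple):
    site: str
    kind: str
    start: datetime
    end: datetime
    services: List[str]


def _parse_time(text: str) -> datetime:
    """Turn 'HH:MM DD Mon' into a timezone-aware datetime in UTC+2."""
    clock, day, month = text.split()
    hour, minute = (int(part) for part in clock.split(":"))
    return datetime(ASSUMED_YEAR, MONTHS[month], int(day), hour, minute, tzinfo=LOCAL)


def parse_downtime(line: str) -> Downtime:
    """Parse one agenda line; raise ValueError if it does not match the pattern."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised downtime entry: {line!r}")
    # Services are separated by ';' with inconsistent spacing in the agenda.
    services = [s.strip() for s in match["services"].split(";") if s.strip()]
    return Downtime(match["site"], match["kind"],
                    _parse_time(match["start"]), _parse_time(match["end"]), services)
```

With this, `parse_downtime(line).end - parse_downtime(line).start` gives the length of an intervention, and `start.astimezone(timezone.utc)` converts the local agenda time to UTC.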

        OSG Items
        Speakers: Maria Dimou, Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          Information on GOCdb-to-OIM migration for USA WLCG T1 sites
          • ggus #44104. This ticket is waiting on the OSG GOC to roll out changes to their production BDII that will publish entries by their OSG resource group, not the OSG resource name. This will remove this issue before it gets to the BDII. Next action deadline in OIM is in Feb 2010. Should we close as unsolved to free the escalation reports?
          • ggus #46988. Site concerned is AGLT2. The ticket has been urgent since early March. Nothing has happened since my comment of 2009-03-30.
          • ggus #47786. Site concerned is Nebraska. Urgent. Submitted 2009-04-08! Some OSG reminders remain unanswered by the site (?)
        Review of action items
