WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 1
        Feedback on last meeting's minutes
      • 2
        EGEE Items
        • a) Grid-Operator-on-Duty handover
          From: UK/I / South East Europe
          To: Central Europe / North Europe

          Report from UK/I COD:
          1. Item 1
          Report from SEE COD:
          1. GGUS could not send emails on 21 July. The new release of the GGUS portal is now available.
          2. SAM BDII failed and fake SRM alarms appeared on 22 July.
          3. Many SAM tests failed around 13:20 UTC on 25 July, probably due to the error "available CRL has expired".
          4. The site RO-02-NIPNE (ticket id#8348) may need to be suspended; this should be discussed at the meeting.
        • b) PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

        • c) gLite Release News
        • d) EGEE issues coming from ROC reports
          1. None.
        • e) Upcoming SRM V2 tests for SAM
          Speaker: Konstantin Skaburskas
        • f) Experience of countries/regions with the WMS?
          In the UK we are still trying to understand when to move to relying on the WMS and how many we require. What are the experiences of other countries/regions?
          Here is some background from a GridPP meeting today:

          "The RAL WMS lcgwms01 (SL3 host with gLite-WMS-2.4.9-0 and glite-LB-2.3.5-0) became heavily loaded on 22nd and user throughput suffered as a result. The underlying problem was not understood as the service returned to normal without a clear intervention required. This prompted SL to comment on WMS and RB availability in the UK. He noted 5 RBs (3 RAL; 1 Glasgow and 1 IC). He was only aware of the 1 WMS instance at RAL. As of today, the default server in Glasgow is a gLite 3.1 WMS instance (RB to be removed at the end of July and possibly replaced with another WMS). RAL maintains one test instance on SL4 – to be moved to production after further testing. IC has PPS-glite-WMS.i386 3.1.8-1. This WMS is stable with 20-30,000 jobs a day not causing a problem. NGS has an unadvertised WMS hosted at RAL. Grid Ireland run a WMS and has seen “quite a few issues” while working with users to get their apps working via it. Throughput performance of the WMS is good.

          Stephen recently noticed that YAIM will soon be configuring UIs to work with service discovery (WMS and LBs will be discoverable through the information system using appropriate UI commands): https://savannah.cern.ch/bugs/?31211."
      • 3
        WLCG Items
        • a) WLCG issues coming from ROC reports
          1. None.
        • b) Endpoints for the FTM service at Tier-1 sites
          There is a request to know the FTM endpoints at the Tier-1 sites.
          We can collect these manually now, but how should the list be kept up to date?
        • c) WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. PIC will have a scheduled downtime on 5-Aug, from 8:00 to 20:00 CEST (UTC+2). The SRM and CE services will be down for a dCache upgrade and a PBS master migration, respectively. The LHCB-DIRAC2 (lhcb.pic.es) server will also be stopped from 9:00 to 10:00 (UTC+2) for a cold backup of the MySQL DB.
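
Since sites in other time zones need to plan around this intervention, the sketch below converts the announced windows to UTC. It is only an illustration under stated assumptions: the year is not given in the announcement, so `ASSUMED_YEAR` is a placeholder, the LHCb backup is assumed to fall on the same day as the downtime, and the helper names (`to_utc`, `format_window`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The announcement states CEST as UTC+2, so a fixed offset is sufficient here.
CEST = timezone(timedelta(hours=2), name="CEST")

# Assumption: the announcement gives no year; this value is a placeholder.
ASSUMED_YEAR = 2008


def to_utc(day, month, hour, minute=0, year=ASSUMED_YEAR):
    """Interpret the given wall-clock time as CEST and return it in UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=CEST)
    return local.astimezone(timezone.utc)


def format_window(start, end):
    """Render a (start, end) pair of UTC datetimes as HH:MM-HH:MM UTC."""
    return f"{start:%H:%M}-{end:%H:%M} UTC"


# Windows quoted in the announcement, as (start, end) converted from CEST.
# Assumption: the backup takes place on 5-Aug, the day of the downtime.
windows = {
    "PIC SRM and CE downtime": (to_utc(5, 8, 8), to_utc(5, 8, 20)),
    "lhcb.pic.es cold backup": (to_utc(5, 8, 9), to_utc(5, 8, 10)),
}

for name, (start, end) in windows.items():
    print(f"{name}: {format_window(start, end)}")
```

With the fixed UTC+2 offset, the main downtime corresponds to 06:00-18:00 UTC and the backup to 07:00-08:00 UTC.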

          Time at WLCG T0 and T1 sites.

        • d) WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • e) ALICE report
        • f) ATLAS report
        • g) CMS report
          Speaker: Daniele Bonacorsi
        • h) LHCb report
        • i) Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
        • j) Storage services: this week's updates
          • dCache announced version 1.8.0-16. It will most probably be available in one month. It contains several improvements:
            1. New Information Providers in accordance with the decisions taken by the "Dynamic Megatable" working group
            2. Improved version of Pin Manager. It allows pins to be released per VO.
            3. Better performing srmLs
            4. New Pool System with no overcommitted space
            5. Improved srm clients with better handling of command line options
            The CCRC08 branch will continue to be supported.
          • New CASTOR information providers, compliant with the decisions taken by the "Dynamic Megatable" working group, are in validation.
      • 4
        OSG Items
        Speaker: Rob Quick (OSG - Indiana University)
        • a) Discussion of open tickets for OSG
      • 5
        Review of action items
      • 6