WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: CERN
  • VOs:
      • 4:00 PM - 4:00 PM
        Feedback on last meeting's minutes
        Previous minutes
      • 4:01 PM - 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: UK/I / CE
          To: AsiaPacific / SouthWest Europe


          Report from CE COD:
          1. Nothing to report.
          Report from UK/I COD:
          1. Nothing to report.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
        • gLite Release News

          Release News:
          Please find gLite release news in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
        • EGEE issues coming from ROC reports
          1. ROC France: the CERN Trusted Certification Authority published an old CRL twice during the week, due to a CERN CA web-site update. Such a problem is quite tricky because failures appear everywhere (SAM tests fail, some users complain, ATLAS DDM production went wrong); to identify the cause you have to cross-reference those data. So, perhaps:
            1. It would be interesting to monitor the CRL validity of the official CAs, and keep a history of it.
            2. Ask CA administrators to check the CRL validity after each update.
            Reminder: when a SAM critical test is failing, it is obviously good to solve the problem quickly, but please don't forget to announce the problem ASAP to prevent people from wasting time on misleading alarms.

          2. Several of our sites had problems with site availability this week following the CA root certificate update. Things have now returned to normal, but we are carrying out a post-mortem to understand where things could have worked better.
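          The CRL monitoring suggested in item 1 could be sketched as below. This is a minimal sketch, not an existing tool: the function names and the history-record format are hypothetical. It flags a CRL whose nextUpdate time has passed and appends one record per check, giving the history the report asks for.

          ```python
          from datetime import datetime, timedelta

          def crl_is_stale(next_update: datetime, now: datetime,
                           grace: timedelta = timedelta()) -> bool:
              """A CRL is stale once its nextUpdate has passed; an optional
              'grace' window raises the flag early, before actual expiry."""
              return now >= next_update - grace

          def record_check(history: list, ca: str,
                           next_update: datetime, now: datetime) -> dict:
              """Append one record per check, so regressions (e.g. an old CRL
              republished after a CA web-site update) can be spotted later."""
              entry = {
                  "ca": ca,
                  "checked_at": now.isoformat(),
                  "next_update": next_update.isoformat(),
                  "stale": crl_is_stale(next_update, now),
              }
              history.append(entry)
              return entry
          ```

          The nextUpdate timestamp itself would have to be extracted from each CA's published CRL (for example with `openssl crl -noout -nextupdate`); that fetching step is deliberately left out of the sketch.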
      • 4:30 PM - 5:00 PM
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. No items this week.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

        • GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate any data still needed by your VO before the site is disabled.
          The site currently supports the following VOs: alice, atlas, lhcb, cms, biomed, dteam, and ops.

        • CYFRONET-IA64: We are going to shut down CYFRONET-IA64 completely at the end of May 2008.
          Please take care of any data you may have on our classic SE: ares03.cyf-kr.edu.pl.

        • INFN-FIRENZE: The classic SE grid002.fi.infn.it is planned to be removed from production on 15 June. Please back up your data before that date.


          Time at WLCG T0 and T1 sites.

  • Items from Alice
  • Items from ATLAS
  • Items from CMS
    • One of the main focuses of last week was T1 workflows. The first was stable, with some Castor issues (mostly GC-related) addressed and fixed. On the second, we had reprocessing and skimming jobs running at T1 sites.
      ASGC: pretty impressive performance; no problems found.
      CNAF: also OK, running up to ~600 skim jobs in parallel.
      FNAL: has been running all kinds of processing jobs for days, up to 3.4K in parallel.
      FZK: running processing with some issues reading the input RAW data; skim jobs are also somewhat slower than at other T1's.
      IN2P3: clearing out its backlog; more processing jobs will come.
      PIC: still has much processing to go; the total number of slots used is driven by the number of running skim jobs; some issues there are being investigated with input/help from other T1's.
      RAL: using all queues; some jobs were killed by some CEs for exceeding their maximum CPU time. Jobs are now restricted to the long queue, and things got better.
    • An interesting week for transfers, especially T0-T1 together with other VO's, and T1-T2. Data transfer tests were stable on the T0-T1 routes and will continue until the end of the challenge. T1-T2 is going on with link rotation; details at https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-DataTransfers . Tests are ramping down on the T1-T1 routes.
    • At the T0, repacker tests are in progress. More info will be collected during this week.
    Speaker: Daniele Bonacorsi
  • Items from LHCb
  • 5:00 PM - 5:30 PM
    OSG Items 30m
    None received.
    Speaker: Rob Quick (OSG - Indiana University)
  • 5:30 PM - 5:35 PM
    Review of action items 5m
    list of actions
  • 5:35 PM - 5:35 PM
    AOB