WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 1
        EGEE Items
        • a) Grid-Operator-on-Duty handover
          "Old" COD: Germany/Switzerland => Russia

          Report from "old style" COD:
          Nothing to report.

          cCOD: Northern Europe (NE) => Asia Pacific (AP)

          Report from cCOD:
          Quiet week: nothing to report.

  • b) PPS Report & Issues
    Please find Issues from EGEE ROCs and general info in:
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
  • c) gLite Release News
  • d) EGEE issues coming from ROC reports
    1. France: Since 01/06/2009, one of the regional Top BDIIs, hosted at GRIF, has had problems, initially due to an air cooling system problem. The GRIF WMS consequently had problems because it was linked to this Top BDII.
    2. France: At IN2P3-CC, the MSS software update ended successfully on Friday. The dCache SE is now fully available.
    3. DECH: We needed to ban some users for various reasons: completely filling /tmp (VOs icecube and biomed) and running hundreds of jobs that were killed by the CPU time limit (ATLAS). The first two cases were quickly fixed via GGUS. The ATLAS case has been open for almost two weeks:
      https://gus.fzk.de/ws/ticket_info.php?ticket=49052 (Assigned to VOsupport)
      How should sites react in cases where users have been banned? LHC VOs have alarm tickets to sites; how should sites approach the VOs?
    4. SWE: During the migration of 32-bit workers to 64-bit, PIC faced many problems related to the dependencies of LHC software on 32/64-bit libraries. We are not happy with the situation of having production releases that are poorly tested against the software of the experiments (at least the LHC ones). For reference, see e.g.:
      - thread in LCG-ROLLOUT: "libstdc++-devel.i386 and libstdc++-devel.x86_64"
      • Reply from Integration and Certification: we are working with the Applications Area to produce a meta-rpm that pulls in the OS libraries needed by the HEP VOs.
  • e) Grid Service Interventions


    ALL TIMES IN UTC+2

    Downtimes affecting the WLCG Tier-1 sites:

    NDGF-T1: At risk: 08:00 9 Jun - 00:00 11 Jun. Services: Bergen will update the fimm cluster and the Tier1 machines (compute nodes, dcache machines, grid middleware servers) to Rocks 5.1 with CentOS 5.3 at UiB. Will degrade services a bit.

    RAL-LCG2: OUTAGE: 10:00 8 Jun - 10:00 15 Jun. Services: Relocation to new machine room [IN PROGRESS].

    NDGF-T1: OUTAGE: 00:15 8 Jun - 04:15 8 Jun. Services: GEANT's circuit provider will be performing maintenance on the dark fibre route COP-FRA.

    NDGF-T1: At risk: 07:30 5 Jun - 15:00 8 Jun. Services: Some dCache pools crashed this morning. Some ATLAS and ALICE files will be unavailable until the pools have been brought online again. Most pools are back, but two are still giving us problems. Investigation in progress. [IN PROGRESS]

    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    Please consult the URLs above for details.

  • f) Update on downtimes in the GOCDB
    Speaker: Gilles Mathieu
  • 2
    OSG Items
    Speakers: Maria Dimou, Rob Quick (OSG - Indiana University)
    • a) Discussion of open tickets for OSG
      It is now urgent to get an OSG answer on the site email, as per https://savannah.cern.ch/support/?107531
      Ticket analysis done today by Guenter Grein:
      1. GGUS Ticket #49049 (OSG #6926): the ticket is "in progress" in GGUS but closed in OSG. Reason: GGUS received the "Closing" mail before the update mails, which then made the mail parser set the GGUS ticket to "in progress". Conclusion: the mail parser works correctly, but problems occur when mails are delayed, especially if more than one update mail is sent in a short time slot. -> I closed this ticket manually.
      2. GGUS Ticket #48962 (OSG #6924): both tickets open -> ok
      3. GGUS Ticket #48737 (OSG #6922): both tickets open -> ok
      4. GGUS Ticket #37059 (OSG #6926): both tickets open -> ok
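The out-of-order mail problem described for GGUS Ticket #49049 can be illustrated with a small sketch. This is not the GGUS mail parser: the class `UpdateMail`, both functions, the status strings and the timestamps are hypothetical, chosen only to show how applying mails in arrival order can leave a ticket "in progress" after it was closed, and how comparing send times would avoid that.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UpdateMail:
    """Hypothetical model of one status mail sent between ticket systems."""
    sent_at: datetime   # when the remote system sent the mail
    new_status: str     # status the mail asks the parser to set


def apply_in_arrival_order(mails):
    """Apply every mail as it arrives; the last mail to arrive wins."""
    status = "open"
    for mail in mails:
        status = mail.new_status
    return status


def apply_ignoring_stale(mails):
    """Skip any mail that was sent before the newest mail already applied."""
    status = "open"
    latest_sent = None
    for mail in mails:
        if latest_sent is not None and mail.sent_at <= latest_sent:
            continue  # stale update that was delayed in transit
        status = mail.new_status
        latest_sent = mail.sent_at
    return status


# Assumed timestamps: the "closing" mail was sent last but arrived first.
arrived = [
    UpdateMail(datetime(2009, 6, 8, 10, 0, 5), "closed"),
    UpdateMail(datetime(2009, 6, 8, 10, 0, 1), "in progress"),
]

print(apply_in_arrival_order(arrived))
print(apply_ignoring_stale(arrived))
```

With the assumed arrival order, the first function ends on "in progress" and the second on "closed". Whether the real mails carry a usable send time is an open question that this sketch does not answer.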
  • 3
    Review of action items
  • 4
    AOB