WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

  • b) PPS Report & Issues
    Please find Issues from EGEE ROCs and general info in:
  • c) gLite Release News
  • d) EGEE issues coming from ROC reports
    1. France: Since 01/06/2009, one of the regional Top BDIIs, hosted at GRIF, has had problems, initially caused by an air-cooling system failure. The GRIF WMS consequently had problems as well, because it was linked to this Top BDII.
    2. France: At IN2P3-CC, the MSS software update ended successfully on Friday. The dCache SE is now fully available.
    3. DECH: We needed to ban some users for various reasons: completely filling /tmp (VOs icecube and biomed) and running hundreds of jobs that were killed by the CPU time limit (ATLAS). The first two cases were quickly fixed via GGUS. The ATLAS case has been open for almost two weeks:
      https://gus.fzk.de/ws/ticket_info.php?ticket=49052 (Assigned to VOsupport)
      How should sites react when users have been banned? The LHC VOs have alarm tickets to reach sites, but how should sites approach the VOs?
    4. SWE: During the migration of 32-bit worker nodes to 64-bit, PIC faced many problems related to the dependencies of LHC software on 32/64-bit libraries. We are not happy with the situation of having production releases that are poorly tested against the experiments' software (at least LHC). For reference, see e.g.:
      - thread in LCG-ROLLOUT: "libstdc++-devel.i386 and libstdc++-devel.x86_64"
      • Reply from Integration and Certification: we are working with the Applications Area to produce a meta-rpm that pulls in the OS libraries needed by the HEP VOs.
  • e) Grid Service Interventions


    Downtimes affecting the WLCG Tier-1 sites:

    NDGF-T1: At Risk: 08:00 9 Jun - 00:00 11 Jun. Services: Bergen will update the fimm cluster and the Tier-1 machines (compute nodes, dCache machines, grid middleware servers) to Rocks 5.1 with CentOS 5.3 at UiB. Services will be slightly degraded.

    RAL-LCG2: OUTAGE: 10:00 8 Jun - 10:00 15 Jun. Services: Relocation to new machine room [IN PROGRESS].

    NDGF-T1: OUTAGE: 00:15 8 Jun - 04:15 8 Jun. Services: GEANT's circuit provider will be performing maintenance on the dark fibre route COP-FRA.

    NDGF-T1: At Risk: 7:30 5 Jun - 15:00 8 Jun. Services: Some dCache pools crashed this morning. Some ATLAS and ALICE files will be unavailable until the pools have been brought online again. Most pools are back online, but two are still giving us problems. Investigation in progress. [IN PROGRESS]

    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    Please consult the URLs above for details.

  • f) Update on downtimes in the GOCDB
    Speaker: Gilles Mathieu
  • 2
    OSG Items
    Speakers: Maria Dimou, Rob Quick (OSG - Indiana University)
  • 3
    Review of action items
  • 4