WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2008-07-14T16:00:00+02:00
End: 2008-07-14T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 14 Jul 2008, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Steve Traylen (CERN)

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768

NB: Reports were not received in advance of the meeting from:

ROCs: France, Russia

VOs: No VO reports received

- 16:00 → 16:01
  
  Feedback on last meeting's minutes 1m
- 16:01 → 16:30
  EGEE Items 29m
  - Grid-Operator-on-Duty handover
    
    From: CERN / Italy
    To: DE-CH/Russia
    
    CERN: The number of alarms was quite low. No tickets to be escalated to the operations meeting. All COD tools were remarkably reliable and fast.
    Site escalation to the operations meeting: IL-IUCC, ticket 36262 (the site also has ticket 37110).
  - PPS Report & Issues
    
    Please find Issues from EGEE ROCs and general info in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
  - gLite Release News
    
    Please find gLite release news in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
  - EGEE issues coming from ROC reports
    
    [ROC DECH]: The SE for CMS, dcache-se-cms.desy.de, will be down for an upgrade next Thursday (17 July) from 09:00 to 15:00 (local time). The CMS SW tags on the CE will be removed the day before to avoid matching of CMS jobs.
    dCache (gridka-dCache.fzk.de): software update of the dCache SE on 2008-07-16 from 07:00 to 09:00 (UTC).
    LFC (lfc-2-fzk.gridka.de) and FTS (fts2-fzk.gridka.de): databases down for an Oracle upgrade on 2008-07-24 from 07:30 to 11:30 (UTC).
    [ROC Italy]: Issue from INFN-T1:
    During the past weekend an incident affecting connectivity to the INFN Tier-1 occurred. On Saturday July 5, 2008 at 03:29 (all times are local) a 10 Gigabit/s interface on one of the Tier-1 core switches started flapping. This interface is part of a bundle of four 10GE interfaces. Although the flapping should not in itself have caused much disturbance to the network infrastructure, the effect was intermittent connectivity to various sets of computers across the Tier-1.
    Troubleshooting started immediately, with system specialists looking for possible causes already on the morning of Saturday July 5, 2008. The fault was not immediately obvious because no traces of the flapping were recorded in the log files of the core switch actually exhibiting the problem. On Sunday night an official EGEE broadcast message was issued.
    On Monday July 7, 2008 the replacement of the faulty network card was escalated with the switch vendor to the highest possible priority. At 17:00 the faulty network card was replaced, and the network was operational again. As a fallout of the network problem, several systems were stuck and had to be rebooted.
    On Tuesday July 8, 2008 at 11:00, during certification of all INFN Tier-1 subsystems and services, some other network problems were detected. At 12:30 the cause of these problems was identified through log messages as a faulty core switch management card, which caused, among other problems, random packet loss. Another ticket was opened with the switch vendor, and in the afternoon a replacement management card was received. The replacement of the card and a related operating system upgrade of the core switches finished at 22:00.
    On the morning of Wednesday July 9, 2008 all network, storage and farm subsystems and services were checked and certified as ready for operation. But since the INFN Tier-1 was still in downtime, the decision was taken to replace the broken component of the electrical switch mentioned above on Thursday July 10, 2008.
    Now all services (network, farming and storage) are up and running.
    
    [Northern]: Estonia: For most of the period the site was in downtime because of odd job submission errors that we were unable to understand. There is an open GGUS ticket, number 38270, but the responses we have received have not been useful. This is not the first time that GGUS has proven to be useless for days or weeks on non-trivial errors, and this is causing a serious reduction in reliability for the site.
  - Documents for Review 15m
    
    Comments on the draft document about security command line tools, requested by Christoph Witzig (broadcast sent on 07 Jul, "Feedback request: EGEE/OSG joint document about security tools").
    Comments on the multi-platform support document edited by SA3 and TMB (broadcast sent on 09 Jul, "Feedback request: TMB Proposal on gLite Multi Platform Support").
    I'll add the links to documents in the minutes.
- 16:30 → 17:00
  WLCG Items 30m
  - Verification of alarm workflow for Tier-1 centres 15m
    
    We are doing the first service verification on Thursday 17 July,
    so I can report on the results during the operations meeting next Monday.
    
    Speaker: Guenter Grein (Unknown)
  - WLCG issues coming from ROC reports
    
    none
  - WLCG Service Interventions (with dates / times where known)
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    CERN will switch the remainder of 2500 nodes tomorrow, 15 July. All services should have been moved.
    See the DE-CH reports above.
    
    Time at WLCG T0 and T1 sites.
  - WLCG Operational Review
    
    Speaker: Harry Renshall / Jamie Shiers
  - ALICE report
  - ATLAS report
  - CMS report
    
    Speaker: Daniele Bonacorsi
  - LHCb report
  - Recommended base versions for storage services:
- 17:00 → 17:30
  OSG Items 30m
  
  Speaker: Rob Quick (OSG - Indiana University)
  - Discussion of open tickets for OSG
- 17:30 → 17:35
  
  Review of action items 5m
- 17:35 → 17:36
  
  AOB 1m