28-R-15 (CERN conferencing service (joining details below))
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
The SE for CMS, dcache-se-cms.desy.de, will be down for an upgrade next Thursday (17.7.) from 9:00 to 15:00 (local time). The CMS SW tags on the CE will be removed the day before to avoid matching of CMS jobs.
Software update of the dCache SE
on 2008-07-16 from 07:00 to 09:00 (UTC)
databases down for Oracle upgrade
on 2008-07-24 from 07:30 to 11:30 (UTC)
Issue from INFN-T1:
During the past weekend an incident occurred involving connectivity to the INFN Tier-1.
On Saturday July 5, 2008 at 03:29 (all times are local) a 10 Gigabit/s interface on one of the Tier-1 core switches started flapping. This interface is part of a bundle of four 10GE interfaces. Although the flapping should not in itself have caused much disturbance to the network infrastructure, the effect was intermittent connectivity to various sets of computers across the Tier-1.
Troubleshooting started immediately, with system specialists looking for possible causes of the problem already in the morning of Saturday July 5, 2008. The fault was not immediately obvious because no traces of the flapping were recorded in the log files of the core switch actually exhibiting the problem. On Sunday night an official EGEE broadcast message was issued.
On Monday July 7, 2008 the priority for the replacement of the faulty network card was escalated to the highest possible level with the switch vendor. At 17:00 the faulty network card was replaced, and the network was operational again. As a fallout of the network problem, several systems were stuck and had to be rebooted.
On Tuesday July 8, 2008 at 11:00, during certification of all INFN Tier-1 subsystems and services, some other network problems were detected. At 12:30 the cause of these problems was identified through log messages in a faulty core switch management card, which caused among other problems random packet loss. Another ticket was opened with the switch vendor, and in the afternoon a replacement management card was received. The replacement of the card and a related operating system upgrade to the core switches finished at 22:00.
In the morning of Wednesday July 9, 2008 all network, storage and farm subsystems and services were checked and certified as ready for operation. But since the INFN Tier-1 was still in downtime, the decision was taken to replace the broken component of the electrical switch mentioned above on Thursday July 10, 2008.
Now all services (network, farming and storage) are up and running.
[Northern]: Estonia: For the whole period the site was mostly in downtime because of odd job submission errors that we were unable to understand. A GGUS ticket, number 38270, is open; however, the responses we have received have not been useful. This is not the first time that GGUS has proven useless for days or weeks on non-trivial errors, and this is seriously reducing the reliability of the site.
Documents for Review (15m)
Comments on the draft document about security command line tools, requested by Christoph Witzig (broadcast sent on 07 Jul, "Feedback request: EGEE/OSG joint document about security tools")
Comments on the multi platform support document edited by SA3 and TMB (broadcast sent on 09 Jul, "Feedback request: TMB Proposal on gLite Multi Platform Support")
I'll add the links to documents in the minutes.
Verification of alarm workflow for Tier-1 centres (15m)
We are doing the first service verification on Thursday, 17th, so I can report on the results during the operations meeting next Monday.
WLCG issues coming from ROC reports
WLCG Service Interventions (with dates / times where known)