WLCG-OSG-EGEE Operations meeting
Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))
Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
OSG operations team
EGEE operations team
EGEE ROC managers
WLCG coordination representatives
WLCG Tier-1 representatives
other site representatives (optional)
GGUS representatives
VO representatives
To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768
OR click HERE
-
-
16:00
→
16:01
Feedback on last meeting's minutes 1m
-
16:01
→
16:30
EGEE Items 29m
-
<big> Grid-Operator-on-Duty handover </big>
From: DECH / CERN
To: Russia / Italy
Report from DECH COD:
- Quiet week. Two items (sites) to mention here:
- INFN-NAPOLI (GGUS #39631). No response for over 10 days -> Step 3: operations meeting. The site was set to downtime by the Italian ROC.
- INFN-LECCE (GGUS #39533). Also no answers, but it seems that the site now has the status "uncertified". The next COD should follow up with the ROC about the intended status of this site.
Report from CERN COD:
- Very simple week; the COD dashboard is much faster than it ever was.
- A short outage on Thursday of the xSQL interface that the CIC portal uses to query SAM. Judit fixed it immediately; the problem is understood.
-
<big> PPS Report & Issues </big>
- .
-
<big> gLite Release News</big>
Now in Production
- -
Now in PPS
- -
Soon in Production
- -
-
<big> EGEE issues coming from ROC reports </big>
- None this week.
-
-
16:30
→
17:00
WLCG Items 30m
-
<big> WLCG issues coming from ROC reports </big>
- None this week.
-
<big> End points for FTM service at tier-1 sites </big>
The list of FTM end-points we have so far is:
- ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
- BNL: ???
- CERN: https://ftsmon.cern.ch/transfer-monitor-report/
- FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
https://cmsfts3.fnal.gov:8443/transfer-monitor-gridvie
- FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
- IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
- INFN: https://tier1.cnaf.infn.it/ftmmonitor/
- NDGF: Being installed.
- PIC: http://ftm.pic.es/transfer-monitor-report/
- RAL: No endpoint in production yet.
- SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
http://ftm.grid.sara.nl/transfer-monitor-gridview
- TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
-
<big>FTS SL4 - required by the experiments or tier-1 sites?</big>
Alice: Neutral (as long as there is no disruption to the service).
ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
CMS: Priority is stability for data-taking days. Whatever is scheduled in advance *and* allows some pre-testing can be negotiated, though. For the CERN migration, instead, the PhEDEx /Prod vs /Debug instances can be used to allow testing before going into production (talked to Gavin).
LHCb: Neutral (as long as there is no disruption to the service).
ASGC:
BNL: Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.
CERN:
FNAL: Hardware is dating fast. May be issues with maintenance.
FZK:
IN2P3:
INFN:
NDGF:
PIC:
RAL:
SARA/Nikhef:
TRIUMF:
-
<big>WLCG Service Interventions (with dates / times where known) </big>
Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
- NDGF-T1 [at risk]: dCache upgrade on the CSC pools. Some CMS and ALICE data unavailable.
From: Tuesday 2008-08-26, 06:00:00 UTC;
To: Tuesday 2008-08-26, 09:00:00 UTC
Affected nodes: srm.ndgf.org
- RAL [OUTAGE]: ATLAS and LHCb LFC downtime for upgrade.
From: Tuesday 2008-08-26, 12:00:00 UTC;
To: Tuesday 2008-08-26, 13:00:00 UTC
Affected nodes: lcglfc0377.gridpp.rl.ac.uk, lfc0448.gridpp.rl.ac.uk
- CERN [OUTAGE]: CASTORPUBLIC 2.1.7-16 upgrade.
From: Wednesday 2008-08-27, 12:00:00 UTC;
To: Wednesday 2008-08-27, 13:30:00 UTC
Affected nodes: srm-dteam.cern.ch, castorsrm.cern.ch, srm.cern.ch, srm-v2.cern.ch, srm-public.cern.ch
- NDGF-T1 [at risk]: Optical cable maintenance work on the IJS-NDGF network connection.
From: Wednesday 2008-08-27, 22:00:00 UTC;
To: Thursday 2008-08-28, 03:00:00 UTC
Affected nodes: srm.ndgf.org
Time at WLCG T0 and T1 sites.
-
<big> WLCG Operational Review </big>
Speaker: Harry Renshall / Jamie Shiers
-
<big> Alice report </big>
-
<big> Atlas report </big>
-
<big> CMS report </big>
Speaker: Daniele Bonacorsi
- General on CRUZET-4 and T0 workflows:
CRUZET-4 ended at ~8am; ~38 million events were collected during the exercise. The most interesting part was from Thursday on, with >25 million events in the last weekend alone. Plenty of precious information and feedback from a real-life exercise. CRUZET Jamboree on Wednesday afternoon. CRUZET-like activities will restart with magnetic field at the end of the week. SLS reported "CMS Online databases" at 0% availability, due to a CMS DB intervention in the Online; this is now over and the status is OK.
- Distributed Data Transfers:
We see 1) issues with the stager agent (experts aware and investigating), 2) some Castor issues causing problems to the CAF (2 tickets to CERN-IT still pending over the weekend, see [$1] and [$2]), and 3) an issue with download agents in at least 2 T1 sites. Overall, this causes the PhEDEx service to be labelled as 'degraded' in SLS. These are being addressed/closed right now, according to news from the WLCG daily call.
[$1] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546182&email=stephen.gowdy@cern.ch
[$2] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546181&email=peter.kreuzer@cern.ch
- Tier-2 workflows:
The high-profile Summer'08 production is ongoing, though still ramping up to full speed.
-
<big> LHCb report </big>
- LHCb is asking (and wants this to be taken seriously into account) whether it is valid that any downtime announced less than 24 hours in advance must be considered unscheduled rather than scheduled (with obviously different implications for the site reliability computation).
- LHCb wants to remind all sites that the Shared Area is also a critical service and that sites must guarantee the adequate QoS required. The problem at CNAF teaches us that this is important. How can this message be conveyed efficiently to all sites, and the quality improved by adopting/writing adequate fabric sensors?
- Last week's SAM sensors report (http://lblogbook.cern.ch/Operations/375) pointed out a mismatch between the SAM critical services (used by the Gridview algorithms to compute reliability) and the services effectively used by the VOs. On 20 August, StoRM at CNAF stopped being published as an SRM sensor (it is now only an SRMv2 sensor in the SAM dictionary), and SAM clients then fail to publish results. The net effect is that, for the still-critical SRM service, no results have been available for CNAF since then. Opened a GGUS ticket for the GridView team: https://gus.fzk.de/pages/ticket_details.php?ticket=40087
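For reference, the 24-hour rule that LHCb is questioning can be sketched as below. This is only an illustration of the rule as worded in the item above, not the actual GOCDB or Gridview logic: the function name, the constant and the example timestamps are assumptions made for the sketch.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, taken from the "less than 24 hours" wording above.
MIN_NOTICE = timedelta(hours=24)

def classify_downtime(announced_at, starts_at, min_notice=MIN_NOTICE):
    """Hypothetical helper: a downtime counts as "scheduled" only if it was
    announced at least min_notice before its start, else "unscheduled"."""
    if starts_at - announced_at >= min_notice:
        return "scheduled"
    return "unscheduled"

# Made-up example: a downtime starting at 12:00 UTC on 2008-08-26.
start = datetime(2008, 8, 26, 12, 0, tzinfo=timezone.utc)
announced_early = classify_downtime(start - timedelta(hours=48), start)
announced_late = classify_downtime(start - timedelta(hours=6), start)
print(announced_early, announced_late)
```

Under this reading, a downtime announced exactly 24 hours ahead still counts as scheduled; whether that boundary matches the real policy is an open point of the discussion.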
-
<big> Storage services: Recommended base versions </big>
The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
Note that the recommended dCache version has been updated to 1.8.0-15p11.
-
<big> Storage services: this week's updates </big>
Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
- Version 1.8.0-15p12 of dCache will soon be available. Installation scripts and improvements for sites using Chimera are available. Sites that do not use Chimera should not upgrade to this version.
-
-
17:00
→
17:30
OSG Items 30m
Speaker: Rob Quick (OSG - Indiana University)
-
Discussion of open tickets for OSG
- https://gus.fzk.de/ws/ticket_info.php?ticket=37948
Should be set to solved.
- https://gus.fzk.de/ws/ticket_info.php?ticket=38087
Looks like user error. Can it be closed?
-
-
17:30
→
17:35
Review of action items 5m
-
17:35
→
17:36
AOB 1m
-