WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2008-08-25T16:00:00+02:00
End: 2008-08-25T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 25 Aug 2008, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Nick Thackray

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768

OR click HERE

Click here for minutes of all meetings

Click here for the List of Actions

- 16:00 → 16:01
  
  Feedback on last meeting's minutes 1m
- 16:01 → 16:30
  EGEE Items 29m
  - Grid-Operator-on-Duty handover
    
    From: DECH / CERN
    To: Russia / Italy
    Report from DECH COD:
    
    Quiet week. Two items (sites) to mention here:
    INFN-NAPOLI (GGUS #39631). No response over 10 days -> Step 3: operations meeting. Site was set to downtime by Italian ROC.
    INFN-LECCE (GGUS #39533). Also no answers, but it seems that the site now has the status "uncertified". The next COD should follow up with the ROC about the intended status of this site.
    
    Report from CERN COD:
    
    Very simple week; the COD dashboard is much faster than it ever was.
    A short outage on Thursday of the xSQL interface that the CIC portal uses to query SAM. Judit fixed it immediately; the problem is understood.
  - PPS Report & Issues
    
    .
  - gLite Release News
    
    Now in Production
    -
    
    -
    
    Now in PPS
    -
    
    Soon in Production
    -
    
    -
  - EGEE issues coming from ROC reports
    
    None this week.
- 16:30 → 17:00
  WLCG Items 30m
  - WLCG issues coming from ROC reports
    
    None this week.
  - End points for FTM service at tier-1 sites
    
    The latest list of FTM end-points we have so far is:
    
    ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
    BNL: ???
    CERN: https://ftsmon.cern.ch/transfer-monitor-report/
    FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
    https://cmsfts3.fnal.gov:8443/transfer-monitor-gridvie
    FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
    IN2P3: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
    INFN: https://tier1.cnaf.infn.it/ftmmonitor/
    NDGF: Being installed.
    PIC: http://ftm.pic.es/transfer-monitor-report/
    RAL: No endpoint in production yet.
    SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
    http://ftm.grid.sara.nl/transfer-monitor-gridview
    TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
  - FTS SL4 - required by the experiments or tier-1 sites?
    
    Alice: Neutral (as long as there is no disruption to the service).
    ATLAS: Prefer not to, to avoid introducing problems this close to data taking.
    CMS: Priority is stability during data-taking days. Anything that is scheduled in advance *and* allows some pre-testing can be negotiated, though. For the CERN migration, the PhEDEx /Prod and /Debug instances can be used to allow testing before going into production (discussed with Gavin).
    LHCb: Neutral (as long as there is no disruption to the service).
    
    ASGC:
    BNL: Has a fairly pressing need to move to SL/RHEL4 because of our site security situation. If it is made available in production soon, we would definitely switch over.
    CERN:
    FNAL: Hardware is ageing fast. There may be issues with maintenance.
    FZK:
    IN2P3:
    INFN:
    NDGF:
    PIC:
    RAL:
    SARA/Nikhef:
    TRIUMF:
  - WLCG Service Interventions (with dates / times where known)
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    NDGF-T1 [at risk]: dCache upgrade on the CSC pools. Some CMS and ALICE data unavailable.
    From: Tuesday 2008-08-26, 06:00:00 UTC;
    To: Tuesday 2008-08-26, 09:00:00 UTC
    Affected nodes:
    srm.ndgf.org
    
    RAL [OUTAGE]: ATLAS and LHCb LFC downtime for upgrade.
    From: Tuesday 2008-08-26, 12:00:00 UTC;
    To: Tuesday 2008-08-26, 13:00:00 UTC
    Affected nodes:
    
    lcglfc0377.gridpp.rl.ac.uk
    lfc0448.gridpp.rl.ac.uk
    
    CERN [OUTAGE]: CASTORPUBLIC 2.1.7-16 upgrade.
    From: Wednesday 2008-08-27, 12:00:00 UTC;
    To: Wednesday 2008-08-27, 13:30:00 UTC
    Affected nodes:
    
    srm-dteam.cern.ch
    castorsrm.cern.ch
    srm.cern.ch
    srm-v2.cern.ch
    srm-public.cern.ch
    
    NDGF-T1 [at risk]: Optical cable maintenance work on the IJS-NDGF network connection.
    From: Wednesday 2008-08-27, 22:00:00 UTC;
    To: Thursday 2008-08-28, 03:00:00 UTC
    Affected nodes:
    
    srm.ndgf.org
    
    Time at WLCG T0 and T1 sites.
  - WLCG Operational Review
    
    https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek080728
    
    Speaker: Harry Renshall / Jamie Shiers
  - Alice report
  - Atlas report
  - CMS report
    
    General on CRUZET-4 and T0 workflows:
    CRUZET-4 finished at ~8am, with ~38 million events collected during the exercise. The most interesting part was from Thursday on, with >25 million events in the last weekend alone. Plenty of precious information and feedback from a real-life exercise. CRUZET Jamboree on Wednesday afternoon. CRUZET-like activities will restart with magnetic field at the end of the week.
    SLS reported "CMS Online databases" at 0% availability, due to a CMS DB intervention in the Online; this is now over and the status is OK.
    
    Distributed Data Transfers:
    We see 1) issues with the stager agent (experts aware and investigating), 2) some Castor issues causing problems to the CAF (2 tickets to CERN-IT still pending over the weekend, see [$1] and [$2]), and 3) an issue with download agents in at least 2 T1 sites. Overall, this causes the PhEDEx service to be labelled as 'degraded' in SLS. According to news from the WLCG daily call, these are being addressed/closed right now.
    [$1] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546182&email=stephen.gowdy@cern.ch
    [$2] http://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000546181&email=peter.kreuzer@cern.ch
    
    Tier-2 workflows:
    The high-profile Summer'08 production is on-going, still ramping up to full speed though.
    
    Speaker: Daniele Bonacorsi
  - LHCb report
    
    LHCb asks (and wants the question to be taken seriously) whether it is correct that any downtime announced less than 24 hours in advance must be considered unscheduled rather than scheduled (with obviously different implications for the site reliability computation).
    
    LHCb wants to remind all sites that the Shared Area is also a critical service and sites must guarantee the adequate QoS required. The problem at CNAF teaches us that this is important. How can this message be conveyed efficiently to all sites and the quality improved by adopting/writing adequate fabric sensors?
    
    Last week's SAM sensors (http://lblogbook.cern.ch/Operations/375) pointed out a problem concerning the SAM critical services (used by the Gridview algorithms to compute reliability) and the services actually used by the VOs. On 20 August, StoRM at CNAF stopped being published as an SRM sensor (it is now only an SRMv2 sensor in the SAM dictionary), and SAM clients have since failed to publish results. The net effect is that, for the still-critical SRM service, no results have been available for CNAF since then. GGUS ticket opened for the Gridview team: https://gus.fzk.de/pages/ticket_details.php?ticket=40087
  - Storage services: Recommended base versions
    
    The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
    
    Note that the recommended dCache version has been updated to 1.8.0-15p11.
  - Storage services: this week's updates
    
    Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
    
    Version 1.8.0-15p12 of dCache will soon be available. Installation scripts and improvements for sites using Chimera are available. Sites that do not use Chimera should not upgrade to this version.
- 17:00 → 17:30
  OSG Items 30m
  
  Speaker: Rob Quick (OSG - Indiana University)
  - Discussion of open tickets for OSG
    
    https://gus.fzk.de/ws/ticket_info.php?ticket=37948
    Should be set to solved.
    https://gus.fzk.de/ws/ticket_info.php?ticket=38087
    Looks like user error. Can it be closed?
- 17:30 → 17:35
  
  Review of action items 5m
- 17:35 → 17:36
  
  AOB 1m
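
The 24-hour notice rule that LHCb asks about in its report could be expressed as a simple check. This is only a minimal sketch in Python, assuming the rule is "announced at least 24 hours before the start counts as scheduled". The function name `classify_downtime`, the constant `MIN_NOTICE`, and the announcement times are hypothetical and chosen for illustration. Only the start time is taken from the RAL LFC intervention listed above. It does not describe how GOCDB or Gridview actually classify downtimes.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: the 24-hour notice period raised in the LHCb report.
MIN_NOTICE = timedelta(hours=24)


def classify_downtime(announced_at: datetime, starts_at: datetime) -> str:
    """Return 'scheduled' if the downtime was announced at least
    MIN_NOTICE before it starts, otherwise 'unscheduled'."""
    if starts_at - announced_at >= MIN_NOTICE:
        return "scheduled"
    return "unscheduled"


# Start time of the RAL LFC intervention from the agenda (UTC).
start = datetime(2008, 8, 26, 12, 0, tzinfo=timezone.utc)

# Hypothetical announcement times, for illustration only.
early = datetime(2008, 8, 25, 9, 0, tzinfo=timezone.utc)  # 27 hours of notice
late = datetime(2008, 8, 26, 6, 0, tzinfo=timezone.utc)   # 6 hours of notice

print(classify_downtime(early, start))
print(classify_downtime(late, start))
```

Under this assumed rule, only the second case would count against the site in a reliability computation that excludes scheduled downtimes.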