WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2008-10-13T16:00:00+02:00
End: 2008-10-13T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 13 Oct 2008, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Nick Thackray

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768

OR click HERE
(Please specify your name & affiliation in the web-interface)

Click here for minutes of all meetings

Click here for the List of Actions

- 16:00 → 16:01
  
  Feedback on last meeting's minutes 1m
- 16:01 → 16:30
  EGEE Items 29m
  - <big> Grid-Operator-on-Duty handover </big>
    
    From: UKI and Russia
    To: Taiwan and CE
    Report from Russia:
    Russian COD as a Backup team:
    opened: 37
    closed: 26
    2nd mail: 10
    extended: 21
    total: 94
    
    Report from UKI:
  - <big> PPS Report & Issues </big>
    
    Please find Issues from EGEE ROCs and general info in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
  - <big> gLite Release News</big>
    
    Please find gLite release news in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
    
    Now in Production:
    
    Now in PPS:
    
    Soon in Production:
  - <big> EGEE issues coming from ROC reports </big>
    
    UKI: No data available in ROC (or site) report(s) for the failures from SAM framework section.
  - <big> gLite 3.1 update 33, BDII</big> 10m
    
    Details on the changes of gLite 3.1 update 33 for the BDII
    Dear colleagues, the status of gLite 3.1 Update 33 is as follows:
    
    The glite-BDII (top-level BDII) meta-rpm for Update 33 was removed on Friday. At the same time the previous meta-rpm was changed to require exactly the previous version (3.9.1-5) of the bdii rpm. Sites that already upgraded their top-level BDIIs before these changes may want to downgrade (but see below). Resource and site BDIIs were not seen to display the instabilities described in Savannah bug #42727, therefore the meta-rpms for other node types have not been changed. The top-level BDII instability is being looked into with high priority.
    The "chown" problem reported by Michel Jouvin does not affect sites that use YAIM for their configurations. A fix for this problem has been coded and a new bdii version is being certified. It is expected to be released to the production system this week.
    
    Speaker: Mr Laurence Field (CERN)
  - <big>gLite 3.0 services to be obsoleted</big> 5m
    
    glite-SE_classic
    glite-VOBOX
    glite-WMS
    glite-PX
    glite-MON
    
    An announcement of this retirement is already on the gLite 3.0 page:
    http://glite.web.cern.ch/glite/packages/R3.0/
    This follows the procedure (until we have a new one) that was discussed in the ops meeting in Feb 08:
    https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMinutes2008x02x25#Support_for_gLite_3_0_services
    Please let us know of any objections by next week!
  - <big> Proposed process for removing SA1 support for old gLite services </big>
    
    Attached is a proposed process for removing support for obsolete gLite services and out-of-date versions of services. Please read and comment as soon as possible.
    
    document
- 16:30 → 17:00
  WLCG Items 30m
  - <big> WLCG issues coming from ROC reports </big>
    
    France: TEAM/ALARM tickets for T1s: how do the LHC experiments choose between these two types of ticket?
    ATLAS:
    -- ALARM tickets are for problems concerning the T0 (mainly a problem at a T1 blocking data acceptance from the T0).
    -- TEAM tickets are for all other problems of importance (mainly T1<->T2 transfers for the moment).
    Currently under discussion: if the problem is not acknowledged by the site before 2 PM the following day, an ALARM ticket is sent.
    Could CMS, ALICE and LHCb clarify when they use each type of ticket?
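
The ATLAS escalation rule under discussion above can be sketched as follows. This is only one possible reading of the rule, not an agreed procedure: the function names are hypothetical, all timestamps are assumed to be naive datetimes in the site's local time, and "acknowledged before 2 PM" is taken as strictly earlier than 14:00.

```python
from datetime import datetime, time, timedelta


def alarm_deadline(team_ticket_opened):
    """Return 14:00 on the day after the TEAM ticket was opened (assumed local time)."""
    next_day = team_ticket_opened.date() + timedelta(days=1)
    return datetime.combine(next_day, time(14, 0))


def should_escalate(team_ticket_opened, acknowledged_at, now):
    """Decide whether a TEAM ticket should be followed by an ALARM ticket.

    acknowledged_at is None while the site has not acknowledged the ticket.
    """
    deadline = alarm_deadline(team_ticket_opened)
    # An acknowledgement before the deadline stops the escalation.
    if acknowledged_at is not None and acknowledged_at < deadline:
        return False
    # Otherwise escalate once the deadline has been reached.
    return now >= deadline


opened = datetime(2008, 10, 13, 16, 30)
print(alarm_deadline(opened))
print(should_escalate(opened, None, datetime(2008, 10, 14, 14, 5)))
```

With the values shown, the deadline falls on 14 October 2008 at 14:00, so an unacknowledged ticket checked at 14:05 that day would be escalated.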
  - <big>Status of the WMS for Alice</big> 15m
    
    Alice wants to fully replace the RBs and use only the WMS in production at all sites. In Alice's computing model it is recommended (not mandatory) that sites provide a local WMS, though Alice understands that this can be very difficult for some T2 sites. Alice would like to ask the T1 sites, and in general all sites providing RBs to Alice, to migrate to the WMS. The first target sites are NIKHEF and CCIN2P3.
    
    NIKHEF: provides 2 RBs but no WMS yet.
    IN2P3: no WMS there supports Alice. In France there are only 2, both at T2 sites: datagrid.cea.fr and lal.in2p3.fr. Alice would like to ask IN2P3 to provide one as well.
  - <big> CREAM CE for Alice (& PPS pilot service) </big>
    
    Alice would like to start using the CREAM CE in production. To do this, Alice has the following requirements on sites:
    Keep current LCG CE and install CREAM CE on another box.
    Install a 2nd VOBOX to point to the CREAM CE. The VOBOX can be in a virtual machine if the site is short of boxes.
    Point the CREAM CE to the standard Alice production queue.
    Need a GridFTP server somewhere on the site.
    This request also presents another opportunity: any sites that wish to support Alice with the CREAM CE could also support the testing of the new ICE-enabled WMS, simply by installing the latest version of the CREAM CE (available in the PPS repositories) rather than the version currently in the production repositories. Sites wishing to do this would also need to configure CMS as a VO on their site; no other action is needed on the part of the site.
    
    Any sites that are interested should contact occ-grid-support@cern.ch. Installation instructions for the CREAM CE will be provided.
    
    Alice would like to ask that all LCG tier-1s (which support the Alice VO) contribute to this task. Alice would also like to invite as many tier-2 sites as possible to join in.
  - <big>WLCG Service Interventions (with dates / times where known) </big>
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    Many interventions scheduled this week. Please consult the URLs above for details.
    
    Time at WLCG T0 and T1 sites.
  - <big> WLCG Operational Review </big>
    
    https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek081013
    
    Speaker: Harry Renshall / Jamie Shiers
  - <big> Alice report </big>
  - <big> Atlas report </big>
    
    The site LPNHE (part of GRIF) is in downtime:
    https://goc.gridops.org/downtime/list?id=10455542
    but no RSS item has been sent about it in the feed:
    feed://cic.gridops.org/index_rssflow.php?service=downtime_vo&vo=atlas
    This case could help the CIC people tune the RSS feed, which is how the experiments retrieve information about downtimes.
  - <big> CMS report </big>
    
    None.
    
    Speaker: Daniele Bonacorsi
  - <big> LHCb report </big>
    
    Any comments from sites concerning last week's request about the gridmap file for LHCb? If not, I will proceed by formulating an EGEE broadcast asking all sites to implement this "safe" mapping in case of VOMS mapping failure.
    
    EGEE downtime announcement procedure:
    1. Announcement of a scheduled downtime with an "Announcement" mail at least 24h in advance, as in the MoU.
    2. Start of downtime (scheduled and unscheduled), at the time it starts, with a "Start" mail (with the correct time!)
    3. End of downtime: "End" mail (with the correct time)
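
The three-step announcement procedure above could be checked mechanically along these lines. This is a minimal sketch, not an existing EGEE or GOCDB tool: the function names and the plain "Announcement"/"Start"/"End" labels are assumptions taken from the mail names in the list, and timestamps are assumed to share one time zone.

```python
from datetime import datetime, timedelta

# Minimum notice for a scheduled downtime, as stated in the procedure (24h).
MIN_NOTICE = timedelta(hours=24)


def announcement_is_timely(announced_at, downtime_start):
    """True if the "Announcement" mail went out at least 24h before the start."""
    return downtime_start - announced_at >= MIN_NOTICE


def expected_mails(scheduled):
    """List the mails a site is expected to send for one downtime, in order."""
    mails = ["Start", "End"]
    if scheduled:
        # Only scheduled downtimes can be announced in advance.
        mails.insert(0, "Announcement")
    return mails


start = datetime(2008, 10, 15, 8, 0)
print(announcement_is_timely(datetime(2008, 10, 14, 7, 0), start))
print(expected_mails(scheduled=False))
```

An unscheduled downtime only yields the "Start" and "End" mails, since by definition it cannot be announced 24h ahead.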
    
    (From Philippe) Over the last couple of days we have been receiving update notifications from GGUS for tickets that, according to the web page, were not updated at all (e.g. #41707: the last update was on October 3rd, but we have also received mails recently). Why does this happen?
  - <big> Storage services: Recommended base versions </big>
    
    The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
  - <big> Storage services: this week's updates </big>
    
    Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
- 17:00 → 17:30
  OSG Items 30m
  
  Speaker: Rob Quick (OSG - Indiana University)
  - Discussion of open tickets for OSG
- 17:30 → 17:35
  
  Review of action items 5m
- 17:35 → 17:36
  
  AOB 1m