WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2008-05-26T16:00:00+02:00
End: 2008-05-26T18:00:00+02:00
Location: CERN conferencing service (joining details below)

Monday 26 May 2008, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))


Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0140768


NB: Reports were not received in advance of the meeting from:

ROCs: CERN

VOs:

- 16:00 → 16:01
  
  Feedback on last meeting's minutes 1m
  
  Previous minutes
- 16:01 → 16:30
  EGEE Items 29m
  - <big> Grid-Operator-on-Duty handover </big>
    
    From: UK/I / CE
    To: AsiaPacific / SouthWest Europe
    
    Report from CE COD:
    
    Nothing to report.
    Report from UK/I COD:
    
    Nothing to report.
  - <big> PPS Report & Issues </big>
    
    Please find Issues from EGEE ROCs and general info in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
  - <big> gLite Release News</big>
    
    Release News:
    Please find gLite release news in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
  - <big> EGEE issues coming from ROC reports </big>
    
    ROC France: The CERN Trusted Certification Authority published an old CRL twice during the week, due to an update of the CERN CA web site. Such a problem is tricky to diagnose because failures appear in many places at once (SAM tests failed, some users complained, ATLAS DDM production went wrong), and finding the cause requires cross-checking those data. So, perhaps:
    
    It would be interesting to monitor the CRL validity of official CAs and keep a history of it.
    Ask CA administrators to check the CRL validity after each update.
    Reminder: when a SAM critical test is failing, it is obviously good to solve the problem quickly, but please don't forget to announce the problem as soon as possible, to prevent people from wasting time on misleading alarms.
    
    Several of our sites had problems with site availability this week following the CA root certificate update. Things have now returned to normal but we are carrying out a post mortem to understand where things could have worked better.
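ROC France's suggestion above (monitor the CRL validity of official CAs and keep a history) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the status labels and the 24-hour warning window are hypothetical, and fetching and parsing the CRL itself is left out; the sketch starts from the "thisUpdate" and "nextUpdate" timestamps already extracted from each published CRL.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CrlObservation:
    """One check of a CA's published CRL (hypothetical record layout)."""
    ca_name: str
    checked_at: datetime
    this_update: datetime  # CRL "thisUpdate" field
    next_update: datetime  # CRL "nextUpdate" field
    status: str


@dataclass
class CrlMonitor:
    """Classifies each CRL check and keeps the full history of checks."""
    warn_before: timedelta = timedelta(hours=24)  # assumed warning window
    history: list = field(default_factory=list)

    def _latest(self, ca_name):
        # Most recent earlier observation for this CA, or None.
        for obs in reversed(self.history):
            if obs.ca_name == ca_name:
                return obs
        return None

    def record(self, ca_name, checked_at, this_update, next_update):
        previous = self._latest(ca_name)
        if next_update <= checked_at:
            status = "expired"
        elif previous is not None and this_update < previous.this_update:
            # An older CRL than the one seen before has been published.
            status = "regressed"
        elif next_update - checked_at <= self.warn_before:
            status = "expiring"
        else:
            status = "ok"
        obs = CrlObservation(ca_name, checked_at, this_update, next_update, status)
        self.history.append(obs)
        return obs
```

Calling `record(...)` once per CA at each check gives both the current status and, through `history`, a trail that can be cross-checked against SAM failures and user reports when several symptoms appear at the same time.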
- 16:30 → 17:00
  WLCG Items 30m
  - <big> WLCG issues coming from ROC reports </big>
    
    No items this week.
  - <big>WLCG Service Interventions (with dates / times where known) </big>
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
  - GOG-Singapore would like to decommission their site by June 2, 2008. The hardware and services at the site will be shut down permanently. Please migrate any data still needed by your VO before the site is disabled.
    The site currently supports the following VOs: alice, atlas, lhcb, cms, biomed, dteam and ops.
  - CYFRONET-IA64: We are going to shut down CYFRONET-IA64 completely at the end of May 2008.
    Please take care of any data you may have on our classic SE: ares03.cyf-kr.edu.pl.
  - INFN-FIRENZE: Classic SE grid002.fi.infn.it is planned to be removed from production on 15 June. Please back up your data before that date.
    
    Time at WLCG T0 and T1 sites.

<big> Items from Alice </big>

<big> Items from ATLAS </big>

<big> Items from CMS </big>

One of the main focuses of last week was T1 workflows. The first was stable, with some Castor issues (mostly GC-related) addressed and fixed. On the second, we had reprocessing and skimming jobs running at T1 sites.
ASGC: very impressive performance; no problems found.
CNAF also OK, running up to ~600 skim jobs in parallel.
FNAL has been running all kinds of processing jobs for days, with up to 3.4K running in parallel.
FZK is running processing with some issues reading the input RAW data; skim jobs are also somewhat slower than at other T1s.
IN2P3 is clearing out its backlog; more processing jobs will come.
PIC has much processing to go; the total number of slots used is driven by the number of running skim jobs. Some issues there are being investigated with input and help from other T1s.
RAL is using all queues. Some jobs were killed by some CEs for exceeding the maximum CPU time on them; jobs are now restricted to the long queue, and things have improved.
An interesting week for transfers, especially T0-T1 together with other VOs, and T1-T2. Data transfer tests are stable on the T0-T1 routes and will continue until the end of the challenge. T1-T2 tests are going on with link rotation; details at https://twiki.cern.ch/twiki/bin/view/CMS/CCRC08-DataTransfers . Tests are ramping down on T1-T1 routes.
At the T0, repacker tests are in progress. More info will be collected during this week.

Speaker: Daniele Bonacorsi

<big> Items from LHCb </big>