WLCG-OSG-EGEE Operations meeting

Name: WLCG-OSG-EGEE Operations meeting
Start: 2008-11-24T16:00:00+01:00
End: 2008-11-24T18:00:00+01:00
Location: CERN conferencing service (joining details below)

Monday 24 Nov 2008, 16:00 → 18:00 Europe/Zurich

28-R-15 (CERN conferencing service (joining details below))

Nick Thackray

Description

grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
Attendees:

OSG operations team

EGEE operations team

EGEE ROC managers

WLCG coordination representatives

WLCG Tier-1 representatives

other site representatives (optional)

GGUS representatives

VO representatives

To dial in to the conference:
a. Dial +41227676000
b. Enter access code 0148141

OR click HERE
(Please specify your name & affiliation in the web-interface)

Click here for minutes of all meetings

Click here for the List of Actions

- 16:00 → 16:01
  
  Feedback on last meeting's minutes 1m
- 16:01 → 16:30
  EGEE Items 29m
  - Grid-Operator-on-Duty handover
    
    From: Central Europe and AsiaPacific
    To: DECH and SouthEast Europe
    
    Report from CE COD:
    
    Nothing to report.
    
    Report from Asia Pacific COD:
    
    GGUS Ticket-ID 42124 against site WEIZMANN-LCG2. The APEL problem is not solved yet, and there has been no response since Nov. 7th.
  - PPS Report & Issues
    
    Please find Issues from EGEE ROCs and general info in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
  - gLite Release News
    
    Please find gLite release news in:
    
    https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
  - EGEE issues coming from ROC reports
    
    ROC France: INFORMATION IN2P3-CC: The central LFC for the Biomed VO is currently overloaded due to growth in Biomed activity. Although the hardware was upgraded as an emergency measure on Friday, the problem persists. It might be due to limitations in the number of simultaneous connections between the LFC and the Oracle DB. We will contact LFC support to find a good (and scalable) solution. Sorry for the inconvenience.
    
    ROC UK/I: A Biomed user's activity has caused site instabilities by repeatedly transferring the same 2.8 GB file to WNs across EGEE from a single UK site SE. After being ticketed, the user produced more replicas, but there is still concern about this data distribution model and the bandwidth stress it causes. For a related GGUS ticket see: https://gus.fzk.de/ws/ticket_info.php?ticket=43489. The user responded quickly. We may be seeing signs of the limit of the standard submission approach/model: "We are submitting theses jobs with the native EGEE command glite-wms-job-submit. These grid jobs are then accessing the 2.8GB data file through the command lcg-cp. So we didn't decide neither where the jobs are scheduled nor which file-replicate is used by these jobs. The EGEE middleware is deciding." Because of the I/O limitations, the Biomed jobs are often quite inefficient.
    
    ROC UK/I: UKI-NORTHGRID-LANCS-HEP saw a problem with a recent WN update: GGUS 43473. The ticket seems to be bouncing around without anybody really knowing how to help. The point to note is that this is likely a site problem, but the site and ROC have struggled to understand it because it appears to require middleware expert help. The site will try a reinstall with 64-bit gLite to remove the 64/32-bit incompatibilities, but the problem itself is still not understood.
    
    ROC UK/I: Site availability does not take SRM v2 systems into account. As a result, the overall RAL availability depends on a dCache service which is no longer considered a front-line service. The absence of SRM v2 from the overall availability figures is a problem with the monitoring, not the site.
    Update: The WLCG Management Board decided on Tuesday to use SRMv2 in the availability calculations as of December (in lieu of the SRMv1 tests). This will be discussed with the EGEE ROC Managers, who will be asked to ratify it.
    
    ROC UK/I: On the topic of SAM, has there been any progress on centrally identifying common problems seen in SAM? On 19th November from 18:00-21:00 UK time a number of sites saw the same (top-level BDII?) problem. It would save much time if these errors could be automatically flagged as possibly due to an offsite problem.
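
The kind of automatic flagging asked about here could look roughly like the sketch below. It is not how SAM works; the input format (tuples of site, test name and timestamp), the three-hour window and the three-site threshold are all assumptions chosen for illustration. The idea is simply that a test failing at several distinct sites within a short period is more likely to have an offsite cause.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_possible_offsite(failures, window=timedelta(hours=3), min_sites=3):
    """Flag tests that failed at many distinct sites close together in time.

    failures: iterable of (site, test_name, timestamp) tuples.
              This record layout is hypothetical, not a SAM export format.
    Returns a dict mapping test_name -> sorted list of affected sites, for
    every test that failed at `min_sites` or more sites within one `window`.
    """
    by_test = defaultdict(list)
    for site, test, ts in failures:
        by_test[test].append((ts, site))

    flagged = {}
    for test, events in by_test.items():
        events.sort()  # order by timestamp
        start = 0
        for end in range(len(events)):
            # Slide the left edge until the span fits inside `window`.
            while events[end][0] - events[start][0] > window:
                start += 1
            sites = {site for _, site in events[start:end + 1]}
            if len(sites) >= min_sites:
                flagged.setdefault(test, set()).update(sites)
    return {test: sorted(sites) for test, sites in flagged.items()}
```

A failure flagged this way would still need a human check; the sketch only narrows down which errors are worth looking at centrally before each site investigates on its own.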
    
    Plots for Biomed activity
  - Java Bouncy Castle problems
    
    Extract from broadcast:
    A few days ago jpackage updated bouncycastle to version 1.41. This version causes problems for several glite nodes as it places the jars in a new directory. The glite developers are currently working on patches to solve this issue. For the time being please make sure that your site DOES NOT UPGRADE to bouncycastle 1.41.
    Node types affected by this problem:
    
    glite-UI
    glite-MON
    glite-CREAM
    glite-FTS_oracle
    glite-WN
    glite-TORQUE_utils
    glite-LSF_utils
    glite-CONDOR_utils
    glite-VOMS_mysql
    glite-VOMS_oracle
    glite-VOBOX
    lcg-CE
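
As an illustration of the advice in the broadcast, a site could screen a list of pending package updates before applying them. The sketch below is only an assumption about how such a check might be written: the (name, version) input format and the helper names are hypothetical, and treating every version from 1.41 onwards as unsafe goes beyond what the broadcast states, which names only 1.41.

```python
def parse_version(text):
    """Turn a version such as '1.41' or '1.41-2.jpp5' into a tuple of ints.

    Anything after the first '-' (a packaging release suffix) is ignored.
    """
    upstream = text.split("-")[0]
    return tuple(int(part) for part in upstream.split("."))


# Version named in the broadcast as causing problems.
BLOCKED_FROM = parse_version("1.41")


def is_safe_bouncycastle(version):
    """True if `version` is older than 1.41 (assumption: later ones are unsafe too)."""
    return parse_version(version) < BLOCKED_FROM


def unsafe_updates(candidates):
    """Return the (name, version) pairs that would pull in a blocked bouncycastle."""
    return [
        (name, version)
        for name, version in candidates
        if name == "bouncycastle" and not is_safe_bouncycastle(version)
    ]
```

For example, `unsafe_updates([("bouncycastle", "1.41"), ("glite-UI", "3.1.0")])` returns only the bouncycastle entry, which a site could then hold back until the gLite patches are available.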
- 16:30 → 17:00
  WLCG Items 30m
  - WLCG issues coming from ROC reports
    
    Nothing this week.
  - WLCG Service Interventions (with dates / times where known)
    
    Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
    
    Please consult the URLs above for details. Summary of downtimes during the next 7 days:
    
    UKI-LT2-RHUL; Power outage; OUTAGE
    RAL-LCG2; Castor stager instance for ALICE, MINOS, ILC and MICE to be upgraded; OUTAGE
    GOCDB; Rollout of GOCDB release 3.1.2; OUTAGE
    NIKHEF-ELPROD; Maintenance window; AT_RISK
    NDGF-T1; ATLAS pool restarts; AT_RISK
    INFN-NAPOLI-ATLAS; HW intervention; AT_RISK
    UKI-SOUTHGRID-BRIS-HEP; LCG CE upgrade to SL4; OUTAGE
    RAL-LCG2; Upgrade to Castor LHCb stager instance; OUTAGE
    INFN-GENOVA; Hardware problem; OUTAGE
    WEIZMANN-LCG2; Testing of new SE; OUTAGE
    BMEGrid; Shared software area fix; OUTAGE
    T2_Estonia; Investigation in CE problems; AT_RISK
    INFN-CS; Delay on solving problems; OUTAGE
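
The summary lines above follow a simple `site; description; severity` layout, so they can be tallied mechanically. The sketch below assumes exactly that layout as it appears on this page; it is not a GOCDB interface, and the function names are made up for illustration.

```python
from collections import Counter


def parse_downtime(line):
    """Split 'SITE; description; SEVERITY' into (site, description, severity).

    The first field is taken as the site and the last as the severity;
    anything in between is joined back into the description.
    """
    fields = [field.strip() for field in line.split(";")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 ';'-separated fields: {line!r}")
    site, severity = fields[0], fields[-1]
    description = "; ".join(fields[1:-1])
    return site, description, severity


def count_by_severity(lines):
    """Count downtimes per severity label (e.g. OUTAGE, AT_RISK), skipping blank lines."""
    return Counter(parse_downtime(line)[2] for line in lines if line.strip())
```

Fed with the entries listed above, `count_by_severity` gives a quick count of how many full outages and how many at-risk periods are scheduled for the week.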
    
    Time at WLCG T0 and T1 sites.
  - WLCG Operational Review
    
    https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek081013
    
    Speaker: Harry Renshall / Jamie Shiers
  - ALICE report
    
    Item
  - ATLAS report
    
    Item
  - CMS report
    
    Item
    
    Speaker: Daniele Bonacorsi
  - LHCb report
    
    Item
  - Storage services: Recommended base versions
    
    The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
  - Storage services: this week's updates
    
    Refer to the wiki page here: https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08StorageStatus
- 17:00 → 17:30
  OSG Items 30m
  
  Speaker: Rob Quick (OSG - Indiana University)
  - Discussion of open tickets for OSG
- 17:30 → 17:35
  
  Review of action items 5m
- 17:35 → 17:36
  
  AOB 1m
