WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

    Recording of the meeting
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: Central Europe and Asia Pacific
          To: DECH and SouthEast Europe

          Report from CE COD:
          • Nothing to report.

          Report from Asia Pacific COD:
          • GGUS Ticket-ID 42124 against site WEIZMANN-LCG2. The APEL problem is not solved yet, and there has been no response since Nov. 7th.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

        • gLite Release News
        • EGEE issues coming from ROC reports
          • ROC France: INFORMATION IN2P3-CC: The central LFC for the Biomed VO is currently overloaded due to growth in Biomed activity. Although the hardware was upgraded as an emergency measure on Friday, the problem is still there. The problem might be due to a limit on the number of simultaneous connections between the LFC and the Oracle DB. We will contact LFC support to find a good (and scalable) solution. Sorry for the inconvenience.

          • ROC UK/I: A Biomed user's activity has caused site instabilities by repeatedly transferring the same 2.8GB file to WNs across EGEE from a single UK site SE. After being ticketed, the user produced more replicas, but there is still concern about this data distribution model and the bandwidth stress it causes. For a related GGUS ticket see: https://gus.fzk.de/ws/ticket_info.php?ticket=43489. The user responded quickly. We may be seeing signs of the limit of the standard submission approach/model: "We are submitting theses jobs with the native EGEE command glite-wms-job-submit. These grid jobs are then accessing the 2.8GB data file through the command lcg-cp. So we didn't decide neither where the jobs are scheduled nor which file-replicate is used by these jobs. The EGEE middleware is deciding." Because of the I/O limitations the Biomed jobs are often quite inefficient.

          • ROC UK/I: UKI-NORTHGRID-LANCS-HEP saw a problem with a recent WN update: GGUS 43473. The ticket seems to bounce around without anybody really knowing how to help! The point to note is that this is likely a site problem, but the site/ROC has struggled to understand it because it appears to require middleware expert help. The site will try a reinstall with 64-bit gLite to try to remove the 64/32-bit incompatibilities, but the problem itself is still not understood.

          • ROC UK/I: Site availability does not take into account SRM V2 systems. As a result the overall RAL availability is dependent on a dCache service which is no longer considered a front line service. SRM V2 not being in the overall availability figures is a problem with the monitoring, not the site.
            Update: The WLCG Management Board decided on Tuesday to use SRMv2 in the availability calculations as of December (in lieu of the SRMv1 tests). This will be discussed with the EGEE ROC Managers, who will be asked to ratify it.

          • ROC UK/I: On the topic of SAM, has there been any progress on centrally identifying common problems seen in SAM? On 19th November from 18:00-21:00 UK time a number of sites saw the same (top-level BDII?) problem. It would save much time if these errors could be automatically flagged as possibly due to an offsite problem.
          Plots for Biomed activity
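The SAM question above (flagging errors that several sites see at the same time as possibly due to an offsite problem) could be approached roughly as follows. This is only a minimal sketch, not how SAM works: the input format, the function name `flag_common_failures`, the three-hour window and the three-site threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import timedelta


def flag_common_failures(failures, window=timedelta(hours=3), min_sites=3):
    """Flag error signatures seen at several distinct sites within one window.

    failures: iterable of (site, signature, timestamp) tuples (hypothetical format).
    Returns {signature: sorted list of sites} for every signature that hit
    at least min_sites distinct sites inside a single time window.
    """
    by_signature = defaultdict(list)
    for site, signature, when in failures:
        by_signature[signature].append((when, site))

    flagged = {}
    for signature, events in by_signature.items():
        events.sort()  # order by timestamp
        start = 0
        best = set()
        for end in range(len(events)):
            # Advance the left edge until the span fits inside the window.
            while events[end][0] - events[start][0] > window:
                start += 1
            sites = {site for _, site in events[start:end + 1]}
            if len(sites) > len(best):
                best = sites
        if len(best) >= min_sites:
            flagged[signature] = sorted(best)
    return flagged
```

A signature returned by this function would be a candidate for an "offsite problem?" annotation rather than a per-site ticket. Both the threshold and the window would need tuning against real SAM results.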
        • Java Bouncy Castle problems
          Extract from broadcast:
          A few days ago JPackage updated bouncycastle to version 1.41. This version causes problems for several gLite nodes as it places the jars in a new directory. The gLite developers are currently working on patches to solve this issue. For the time being please make sure that your site DOES NOT UPGRADE to bouncycastle 1.41.
          Node types affected by this problem:
          • glite-UI
          • glite-MON
          • glite-CREAM
          • glite-FTS_oracle
          • glite-WN
          • glite-TORQUE_utils
          • glite-LSF_utils
          • glite-CONDOR_utils
          • glite-VOMS_mysql
          • glite-VOMS_oracle
          • glite-VOBOX
          • lcg-CE
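A site that wants to check the broadcast's advice automatically could compare a candidate bouncycastle version against 1.41 before letting an update through. The sketch below is purely illustrative: the helper names are hypothetical, and treating every version at or above 1.41 as affected is an assumption, since the broadcast only names 1.41 itself.

```python
import re

# From the broadcast: bouncycastle 1.41 places its jars in a new directory.
PROBLEM_VERSION = (1, 41)


def parse_version(text):
    """Turn a version string such as '1.41' into a tuple of ints, e.g. (1, 41)."""
    parts = re.findall(r"\d+", text)
    if not parts:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(p) for p in parts)


def must_hold_back(candidate_version):
    """Return True if an upgrade to candidate_version should be refused.

    Assumption: anything at or above 1.41 uses the new jar directory.
    """
    return parse_version(candidate_version) >= PROBLEM_VERSION
```

For example, `must_hold_back("1.38")` is false while `must_hold_back("1.41")` is true. In practice a site would more likely exclude the package from updates in its package manager configuration until the gLite patches are released.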
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. Nothing this week.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Please consult the URLs above for details. Summary of downtimes during the next 7 days:
          1. UKI-LT2-RHUL; Power outage; OUTAGE
          2. RAL-LCG2; Castor stager instance for Alice, Minos, ILC and MICE to be upgraded; OUTAGE
          3. GOCDB; Rollout of GOCDB release 3.1.2; OUTAGE
          4. NIKHEF-ELPROD; Maintenance window; AT_RISK
          5. NDGF-T1; ATLAS pool restarts; AT_RISK
          6. INFN-NAPOLI-ATLAS; HW intervention; AT_RISK
          7. UKI-SOUTHGRID-BRIS-HEP; LCG CE upgrade to SL4; OUTAGE
          8. RAL-LCG2; Upgrade to Castor LHCb stager instance; OUTAGE
          9. INFN-GENOVA; Hardware problem; OUTAGE
          10. WEIZMANN-LCG2; Testing of new SE; OUTAGE
          11. BMEGrid; Shared software area fix; OUTAGE
          12. T2_Estonia; Investigation of CE problems; AT_RISK
          13. INFN-CS; Delay in solving problems; OUTAGE

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
          1. Item
        • Atlas report
          1. Item
        • CMS report
          1. Item
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. Item
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

        • Storage services: this week's updates
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35