WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: UKI and Italy
          To: SEE and Russia


          Report from Italy:
          • Report encountered problems with grid core services
          • Any Savannah/GGUS tickets that need the attention of a wider audience?
            Top-BDII problem (discussed on lcg-rollout under the subject "TOPBDII: result: 80 Internal (implementation specific) error"):
            https://gus.fzk.de/ws/ticket_info.php?ticket=43230&from=search
            A solution is provided (a bdii package update, currently under certification).
          Candidate sites for suspension (from UKI):
          • first Ops meeting (OCC involved)
            SITE NAME : ENEA-INFO
            ROC NAME : Italy
            GGUS TICKET NUMBER : 44997
            Reason for escalation: no response to 1st or 2nd mails. However, a response has now been received.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
        • gLite Release News
        • EGEE issues coming from ROC reports
          • ROC Italy: top-bdii servers have been updated to resolve the problem reported in GGUS ticket #43230.
          • ROC Italy: At least two sites in Italy (INFN-CATANIA and INFN-PADOVA) are fighting with the problem of thousands of "globus-gma [defunct]" processes. There is a GGUS ticket about this problem (https://gus.fzk.de/ws/ticket_info.php?ticket=42981), but it is not clear what the real cause of the problem is or whether there is a solution.
          • ROC Italy: ENEA-INFO: ticket 44997 (APEL failure on egce.frascati.enea.it (ENEA-INFO)) has been solved.
            The problem was due to the change of the queue names following the farm migration to SL4. This change was not propagated to the accounting server (based on DGAS).
            Several support units in the Italian ticketing system were involved, but nobody remembered to push partial updates to the GGUS ticket (a gap in the Italian support procedures).
            Apologies for that; adequate countermeasures have been taken.
          • ROC SEE: GR-01-AUTH raised the issue of the stability of the BDII and how it affects day-to-day users in their data operations. More info can be found at https://savannah.cern.ch/bugs/?45455.
            The main problems with the BDII are caused by frequent and incompatible updates that may add functionality but do not improve reliability.
            I believe that our current trend to overload the BDII with more info does not help either.
            GR-01-AUTH also submitted the following GGUS tickets, which are still pending:
            https://gus.fzk.de/ws/ticket_info.php?ticket=43230
            https://gus.fzk.de/ws/ticket_info.php?ticket=43578
            ANSWER: Bug 45455 is not related to the stability of the BDII. It is a request for additional functionality which would improve the robustness of the information system.
            No recent BDII updates were incompatible or added any functionality. The main change in update 34 was the move to the bdb backend, as the ldbm backend is now obsolete. Also, no additional information has been added. One addition was a new subtree in the database containing a few long-term development-related items, but there have been no reported problems relating to this.
            Both GGUS tickets were addressed in a timely manner and the patch was submitted on 3rd Dec.
            This bug (45455) is not related to BDII problems and is not critical at all (it is in "remind" state). Moreover, there has been a BDII failover mechanism in GFAL since 1.10.6 (tagged in Dec. 2007).
            A local-file BDII cache is on the to-do list for GFAL (no timeline). This functionality will certainly help in case of BDII instability, as proposed in the bug by Akos.
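
For illustration only, the failover-plus-local-cache idea discussed above can be sketched in a few lines of Python. This is not GFAL code and does not reflect how GFAL implements its failover. The function name `query_with_failover`, the caller-supplied `fetch` callable, and the JSON cache file are all hypothetical names chosen for this sketch.

```python
import json
import os
import tempfile
from typing import Callable, Iterable


def query_with_failover(endpoints: Iterable[str],
                        fetch: Callable[[str], dict],
                        cache_path: str) -> dict:
    """Try each endpoint in order; fall back to a local file cache.

    `fetch` is a hypothetical caller-supplied function that queries one
    information-system endpoint and returns a JSON-serialisable dict.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            result = fetch(endpoint)
        except Exception as exc:  # any failure moves on to the next endpoint
            last_error = exc
            continue
        # Write via a temp file plus rename so a crash never leaves
        # a partially written cache behind.
        directory = os.path.dirname(os.path.abspath(cache_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as handle:
            json.dump(result, handle)
        os.replace(tmp_path, cache_path)
        return result
    # Every endpoint failed: serve the last good answer if there is one.
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            return json.load(handle)
    raise RuntimeError(f"all endpoints failed and no cache: {last_error}")
```

In this sketch the cache is refreshed on every successful query, so during an outage clients see the most recent good answer rather than an error. A real client would probably also want to bound the age of the cached data.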
      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. ROC ???: Item
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
          1. Item
        • Atlas report
          1. Item
        • CMS report
          (Due to constant meeting clashes on Mondays in 2009, I may be out of this call when you get to this point. If so, please find below a summary, and mail me any questions).
          • Activity is running smoothly. Some stage-out errors were seen at PIC in reprocessing activities, due to the infamous dCache feature of creating directories owned by root:root. This was fixed by hand; it affected just 1 dataset and lasted a couple of (working CERN) hours, so it was almost invisible to CMS (thanks PIC).
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. Item
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      • 5:00 PM 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 5:30 PM 5:35 PM
        Review of action items 5m
      • 5:35 PM 5:35 PM
        AOB