WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: AP, Italy
  • VOs: Atlas, Alice, BioMed, CMS, LHCb
  • Minutes
      • 4:00 PM - 4:00 PM
        Feedback on last meeting's minutes
        Minutes
      • 4:01 PM - 4:30 PM
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: CE / CERN
          To: SWE / Italy


          Report from CERN COD:

          1. Site ITPA-LCG2 was failing GSTAT. It is publishing ScientificSL 5.0, which is not in the OS list used by GSTAT. In such a case it is the responsibility of the site/ROC to send a request to the mailing list roc-dev@lists.grid.sinica.edu.tw to have the required OS version added to the list.
            http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name
          2. *Please note rotation calendar for this week:*
            Lead team: SouthWesternEurope
            Backup team: Italy
          Report from CE COD:
          1. Tickets
            opened: 45
            closed: 13
            2nd mail: 6
            extended: 23
            total: 87
          2. Political instances:
            GGUS Ticket #26634 - YerPhI
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          AP, CE, IT, RU


          Issues from EGEE ROCs:
          1. None found. However, the pages displayed were completely empty, so I suspect a malfunction of the portal (reported).

          Release News:
          1. gLite3.0.2 PPS Update47 was released to PPS today and is now in the pre-deployment testing phase.
            The update contains, among others:
            • FTA update (PATCH:1740). gridFTP session handling was changed: now copy and getFileSize are done in the same session: fix for BUG:33528

          2. gLite3.1.0 PPS Update22 was released to PPS last Friday and it is now in the pre-deployment testing phase.
            The release contains, among others, an update of yaim-core, so, technically, all services are concerned. The full list of patches deployed is:
            • 1219 fix for DENY tags to lcg-info-dynamic-scheduler
            • 1645 R3.1/SLC4/x86_64: GFAL/lcg_util update
            • 1663 lcg-infosites (patch 1646 revisited)
            • 1680 R3.1/SLC4/x86_64: GFAL 1.10.8
            • 1709 [ YAIM ] yaim core and yaim lcg-ce 4.0.4 series
            • 1728 [ YAIM ] glite-yaim-clients 4.0.3 series
            • 1730 new lcg-ManageVOTag version (solving bug 34245)
            • 1738 R3.1/SLC4/i386: GFAL & lcg-util update
            • 1712 R-GMA fix for forwards compatibility

          3. gLite3.1.0 PPS Update23 is in preparation:
            The release will introduce the WMS and LB services for SL4.
        • EGEE issues coming from ROC reports
          1. (ROC France): Some site administrators complained because their e-mail addresses were added to a VO mailing list without their agreement. The VO has been contacted and the problem is being solved, but the incident raises the more general problem of spam generated by the project itself. Could we agree on a sound administration policy for mailing lists? At a minimum, except for some obvious and mandatory mailing lists, a subscriber should be able to unsubscribe from any mailing list by him/herself. The way to unsubscribe should be made clear by the mailing list.

          2. (ROC DECH): Please reopen action item 150. The problem is still present. See GGUS #33850.

          3. (ROC SEE): https://gus.fzk.de/ws/ticket_info.php?ticket=33697 is long overdue, please put some pressure on the corresponding support unit to respond.

          4. (ROC SEE): LCG-TAU still has some problems, so it is now in downtime for the next 7 days in order to upgrade to the latest gLite release.

          5. (ROC SWE): We would like an update on Top-BDII failover awareness in the gLite client tools. Is it possible to configure several BDIIs as a list with yaim?

          6. (ROC UKI): GGUS should respond whether the UKI-SOUTHGRID-CAM-HEP problem of 100 mails for the same ticket is a bug.

          7. (ROC UKI): There have been many complaints in UKI about the new requirement to complete the site reports every day. Site admins often fill out the report for the week in one go, and this seems a sensible approach; at the very least they should be able to choose. Several sites have indicated that they will stop filling out the reports in this new format. On the positive side, the new interface seems better, with the graphical representation of downtime etc. However, it would be very welcome if the colours used were consistent between tools. Previously grey represented downtime and red a failure... now we have black. Sites would also like to see the past history for the report so they can cross-reference previous failures, a feature lost in this upgrade.

          8. (ROC UKI): The move to validating every use of a certificate on a site is becoming tedious. Is this a feature of the browser settings, or does everyone get greeted with constant requests to use their certificate? Is it possible to have a compact view and a detailed view of site problems? I cannot see correlations between sites anymore.
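
          To illustrate what "failover awareness" in item 5 (ROC SWE) could mean on the client side, here is a minimal Python sketch. It is not how the gLite client tools or yaim behave, and it does not answer the question raised. The host names are hypothetical, the comma-separated list format is an assumption, and port 2170 is assumed as the usual BDII LDAP port.

```python
import socket


def parse_bdii_list(value, default_port=2170):
    """Parse 'host1:2170,host2' into [(host1, 2170), (host2, 2170)].

    The comma-separated format and the default port are assumptions
    made for this sketch, not a documented yaim setting.
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else default_port))
    return endpoints


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(endpoints, probe=tcp_probe):
    """Return the first endpoint accepted by probe, or None if all fail."""
    for host, port in endpoints:
        if probe(host, port):
            return host, port
    return None


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    bdiis = parse_bdii_list("bdii1.example.org:2170, bdii2.example.org")
    print(first_reachable(bdiis))
```

          The probe is passed in as a parameter so that the ordering logic can be exercised without any network access.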
        • gLite Release News
          1. gLite 3.1 Update18, announced for last week, is being released to production right now.
            We apologise for the delay, due to a technical issue in the preparation of the release.
            The update contains:
            • NEW: glite-MON for SL4
            • DPM 1.6.7-4
              • fix for bug #33769: incorrect pool free space after dpm-drain
              • improved ACL management for srmMkdir command
            • UI/WN/VOBOX
              • lcg-tags no longer produces Globus warnings (now suppressed)
              • voms-admin client 2.0.6-1 providing ACL support on command line
            • vdt_globus_essentials (affecting several services and notably the CE)
              • bug fix to prevent globus-job-manager processes from piling up on a CE (bug observed at CERN after SAM WMS/RB tests were enabled)
            • voms-admin server (VOMS)
              • Refactored voms-admin-ping script
              • ACL management web service (compatible with client >= 2.0.6-1)
              • Registration web service.
              • many bug fixes
      • 4:30 PM - 5:00 PM
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. None this week.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board


          Time at WLCG T0 and T1 sites.

        • CCRC'08 Operational Review
          • There will be a test of the Tier0 to Tier1 Optical Private Networks backup links from 15.00 to 19.00 CEST (13.00 to 17.00 UTC) on Wednesday 9 April.
            The plan of the test is here:
            https://twiki.cern.ch/twiki/bin/view/LHCOPN/BackupTest
            RAL will be unreachable for 15-20 minutes between 16:45 and 17:15 CEST.
            PIC will be unreachable for 15-20 minutes between 17:15 and 17:45 CEST.
            The goal of the maintenance is to verify that all the backup solutions work as expected. The T1s with a backup link should be up all the time, but at the moment we cannot guarantee that it will be the case and there may be outages at any time for any Tier1. It would be appreciated if experiments could particularly exercise their links as much as possible during the test period.
          Speaker: Harry Renshall / Jamie Shiers
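
          For sites that plan in UTC, the test and outage windows above can be converted with a short Python sketch. It uses the fixed CEST = UTC+2 offset implied by the announcement (15.00 CEST = 13.00 UTC). The year 2008 is an assumption inferred from "CCRC'08", and the window labels are ours.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2, consistent with "15.00 CEST (13.00 UTC)" in the announcement.
CEST = timezone(timedelta(hours=2), "CEST")


def cest_to_utc(hhmm, year=2008, month=4, day=9):
    """Convert an 'HH:MM' CEST time on the test day to an aware UTC datetime.

    The default year (2008) is an assumption; the minutes only say
    "Wednesday 9 April".
    """
    hour, minute = map(int, hhmm.split(":"))
    local = datetime(year, month, day, hour, minute, tzinfo=CEST)
    return local.astimezone(timezone.utc)


# Windows as announced, in CEST. Labels are illustrative.
WINDOWS = {
    "OPN backup test": ("15:00", "19:00"),
    "RAL outage window": ("16:45", "17:15"),
    "PIC outage window": ("17:15", "17:45"),
}


def windows_in_utc(windows=WINDOWS):
    """Return the same windows as ('HH:MM', 'HH:MM') tuples in UTC."""
    return {
        name: tuple(cest_to_utc(t).strftime("%H:%M") for t in span)
        for name, span in windows.items()
    }


if __name__ == "__main__":
    for name, (start, end) in windows_in_utc().items():
        print(f"{name}: {start}-{end} UTC")
```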
        • Alice report
          No report received before the meeting.
        • Atlas report
          1. New version of DPM:
            In today's meeting I would like to know the status of the DPM server version fixing the ACL problem. In particular, I would like to know whether it has been released to production and when (it should have been last Wednesday). I would also like to know the exact version and patch number so that I can refer to it properly.
            ATLAS T2s need the patch to start running production on SRMv2 and we would like to push the deployment of the patch ASAP.
            Could you make sure someone can provide the information mentioned above at today's meeting? Both Alessandro and I will be present.

            ANSWER: The version is DPM 1.6.7-4; it is in PPS and will be released to production today.

          2. ATLAS sites with lcg-utils for SRM2:
            We have developed a SAM test to see which version of lcg-utils has been installed on the WNs of the ATLAS-supporting sites.
            The results can be seen on the SAM web page by selecting ATLAS VO, CE, CE-sft-lcg-version.
            SAM link
            The sites that give ERROR in this test have not upgraded to the SRM2-compatible version of lcg-utils.
            We hope this helps in following up the action of having the WNs upgraded to SRM2 at all ATLAS-supporting sites.
        • CMS report
          No report received before the meeting.
          Speaker: Daniele Bonacorsi
        • LHCb report
          No report received before the meeting.
      • 5:00 PM - 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          The only outstanding ticket is: https://gus.fzk.de/ws/ticket_info.php?ticket=31037
      • 5:30 PM - 5:35 PM
        Review of action items 5m
        list of actions
      • 5:35 PM - 5:35 PM
        AOB