WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives

  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141
    Or join via the web interface (please specify your name & affiliation there).


      • 1
        Feedback on last meeting's minutes
      • 2
        EGEE Items
        • a) <big> Grid-Operator-on-Duty handover </big>
          From: DECH and Southeast Europe
          To: Southwest Europe and North Europe


          Report from DECH:
          • Network problems between CERN and the Canadian sites, as well as the Taiwan sites.

          Report from Southeast Europe COD: list of unresponsive sites:
          • Site: IMCSUL (ROC North), GGUS TICKET NUMBER: 43141
            The ticket was extended because of a scheduled downtime.
          • Site: WEIZMANN-LCG2 (ROC SouthEast), GGUS TICKET NUMBER: 42124
            The ticket was extended for 3 days as agreed in the previous OPS meeting. The new deadline has passed. For such tickets we believe that a definitive, longer deadline should be set, and that the team working on the problem should at least update the ticket when the new deadline has passed. In the absence of such a deadline, the usual procedure results in new escalations.
          • Site: UKI-LT2-UCL-CENTRAL (ROC UK/Ireland), GGUS TICKET NUMBER: 40596
            The ticket was escalated because there was no visible response. After the warning, the site replied that they are working to fix the problem on this new cluster. We propose de-escalating this ticket.
          • Problems encountered during the shift:
            After the new GGUS release on 26 Nov 2008, the following actions on tickets from the Dashboard could not be carried out:
            • GGUS ticket #43130 had to be extended because of a scheduled downtime (until 12/15/08 - Gstat), but this action could not be performed. The ticket has expired and is still on the Dashboard. This has been reported as a technical problem with the Dashboard.
            • GGUS tickets #44001 and #44003 were closed as unsolvable. The appropriate action would be "Closed by Site OK", but when this action is selected only the masked alarms are freed and the ticket is not actually closed. This technical problem has also been reported.
        • b) <big> PPS Report & Issues </big>
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          2008-11-26: Minor changes to the release procedures: https://twiki.cern.ch/twiki/bin/view/LCG/PPSReleaseProcedures
          1. The obsolete concept of a fixed two-week stage in PPS has been cleaned up.
          2. The GT-PPS ticket is now opened by the PPS release team as soon as the release is out, rather than on a fixed date.


          2008-11-25: Pilot service of the CREAM CE: in progress
          1. 6th check-point meeting held.
          2. Progress on the Nagios monitoring set-up.
          3. Pilot end date confirmed as 16 December.
          4. Minutes in http://indico.cern.ch/conferenceDisplay.py?confId=45264
          5. Details about the pilot (planning, layout, technical info) can be found in the page https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotCream
          6. Details about the individual tasks can be found in the tracker http://www.cern.ch/pps/index.php?dir=./ActivityManagement/SA1DeploymentTaskTracking, specifically the subtasks of TASK:7981.
          A sketch of a direct test submission to a CREAM CE is given below.
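
          For background only, a minimal, illustrative sketch of a direct test submission to a CREAM CE, assuming a gLite UI with the CREAM client tools (glite-ce-job-submit) and a valid VOMS proxy. The CE endpoint, queue and JDL content below are placeholders, not the actual pilot resources.

          # Illustrative sketch: placeholder CREAM CE endpoint and a trivial test JDL.
          import subprocess
          import tempfile

          CE_ID = "cream-ce.example.org:8443/cream-pbs-ops"   # hypothetical CREAM CE id

          JDL = """[
          Executable    = "/bin/hostname";
          StdOutput     = "std.out";
          StdError      = "std.err";
          OutputSandbox = { "std.out", "std.err" };
          ]"""

          # Write the JDL to a temporary file for the CLI.
          with tempfile.NamedTemporaryFile(mode="w", suffix=".jdl", delete=False) as f:
              f.write(JDL)
              jdl_path = f.name

          # -a: automatic proxy delegation; -r: target CREAM CE.
          subprocess.check_call(["glite-ce-job-submit", "-a", "-r", CE_ID, jdl_path])
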
        • c) <big> gLite Release News</big>
          Please find gLite release news in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

          Now in Production
          2008-11-26: gLite 3.1 Update 36 was released to production. The update contains:
          1. FTS bug fix and logrotate for FTA (PATCH:2551)
          2. Fix for the bouncycastle problem for FTS
          3. Bug fixes for the CREAM CE (affects the UI) (PATCH:2417)
          Now in PPS
          2008-10-31: gLite 3.1 PPS Update 40 was released to PPS and is now in the deployment-test phase. This update contains:
          1. Bug fixes for the proxy renewal mechanism on FTA (PATCH:2344)
          2. CREAM CE: bug fixes in CREAM, CEMon and BLAH (PATCH:2415)
          3. MyProxy: info provider configuration + improvements (PATCH:2518)
          4. lcg-vomscerts: all certificates renamed with ".pem" suffixes because of BUG:43395 (PATCH:2598 / 9)
          5. FTS 2.1: configuration fixes (PATCH:2643)
          Release notes in https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update40

          Soon in Production
          2008-11-18: The release of gLite 3.1 Update 37 to production is in preparation. The update is being fast-tracked to production (it was not officially delivered to PPS first). It contains a hot fix for glite-BDII, glite-SE_dcache_info and lcg-CE: the information provider glite-info-provider-ldap has been updated. This version has improved logging, and the protection against recursion has been re-enabled after being accidentally removed in a previous release (PATCH:2649, PATCH:2651). A sketch of how a site typically applies such an update is given below.
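
          For reference, a minimal sketch of how a site might roll such an update onto a gLite 3.1 node, assuming installation from the gLite yum repositories and a standard YAIM configuration. The node type and site-info.def path are placeholders; the official update release notes take precedence.

          import subprocess

          NODE_TYPE = "glite-WN"              # hypothetical node type for this example
          SITE_INFO = "/root/site-info.def"   # hypothetical YAIM site configuration file

          # Pull in the updated packages from the configured gLite repositories.
          subprocess.check_call(["yum", "-y", "update"])

          # Re-run YAIM so that configuration changes shipped with the update are applied.
          subprocess.check_call(["/opt/glite/yaim/bin/yaim", "-c", "-s", SITE_INFO, "-n", NODE_TYPE])
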
        • d) <big> EGEE issues coming from ROC reports </big>
          • ROC France: IN2P3-CC/IN2P3-CC-T2: a reminder that there is a downtime scheduled from Monday to Tuesday evening. Both CEs and SEs are closed. Even though electrical work is taking place, core services should remain available thanks to redundancy; an "at risk" scheduled downtime (3 hours) was nevertheless declared for this period.
        • e) <big> Data management issues caused by BioMed users </big>
          A substantial number of sites from several regions are complaining that some BioMed users are causing problems through poor handling of data on the grid, for example copying a multi-GB file to thousands of SEs (an illustrative sketch of this replication pattern is given below).
          However, at the time of writing this item, there were only 2 GGUS tickets related to this.
          Could all sites experiencing problems that they believe are caused by BioMed users please submit a GGUS ticket. The problem can then be taken to the TCG.
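
          For illustration only (this is not BioMed's actual code), the replication pattern being reported roughly corresponds to a loop like the sketch below, using lcg-rep from lcg_util; the LFN and SE host names are placeholders.

          import subprocess

          LFN = "lfn:/grid/biomed/example/large-file"                  # hypothetical logical file name
          DEST_SES = ["se%03d.example.org" % i for i in range(1000)]   # hypothetical list of SEs

          for se in DEST_SES:
              # Each call creates another multi-GB replica on a different SE and registers
              # it in the catalogue, which is what sites are reporting as heavy load.
              subprocess.call(["lcg-rep", "--vo", "biomed", "-d", se, LFN])
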
      • 3
        WLCG Items
        • a) <big> Possible suspension of site Taiwan-LCG2 </big>
          Many problems have been seen with the Castor storage at the Taiwan-LCG2 site over the last several weeks. A report from Taiwan-LCG2 was received today and is attached.
          Report
        • b) <big> Details of SL5 WN cluster in production at CERN-PROD </big>
          At the Tier-0 there is a CE providing access to several SL5 WNs. A sketch of how a job can be steered to these SL5 WNs via its JDL requirements is given below.
          Speaker: Ulrich Schwickerath
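
          As a hedged illustration of how a VO could steer test jobs to the SL5 worker nodes, the sketch below builds a JDL with a Requirements expression on the Glue operating-system attributes and submits it through the gLite WMS. The attribute values shown are placeholders and should be checked against what the CERN-PROD CE actually publishes (e.g. with lcg-info).

          import subprocess
          import tempfile

          # Placeholder values: check the GlueHostOperatingSystem* attributes actually
          # published for the SL5 cluster before relying on them.
          JDL = """
          Executable    = "/bin/cat";
          Arguments     = "/etc/redhat-release";
          StdOutput     = "std.out";
          StdError      = "std.err";
          OutputSandbox = { "std.out", "std.err" };
          // Match only resources publishing an SL5-like operating system (placeholder strings).
          Requirements  = (other.GlueHostOperatingSystemName == "ScientificCERNSLC")
                          && (other.GlueHostOperatingSystemRelease == "5.3");
          """

          with tempfile.NamedTemporaryFile(mode="w", suffix=".jdl", delete=False) as f:
              f.write(JDL)
              jdl_path = f.name

          # -a: automatic delegation of the VOMS proxy to the WMS.
          subprocess.check_call(["glite-wms-job-submit", "-a", jdl_path])
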
        • c) <big> WLCG issues coming from ROC reports </big>
          1. None this week.
        • d) <big>WLCG Service Interventions (with dates / times where known) </big>
          Links to the CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and the CERN IT Status Board.

          Many interventions are scheduled this week. Please consult the links above for details.

          Time at WLCG T0 and T1 sites.

        • e) <big> WLCG Operational Review </big>
          Speaker: Harry Renshall / Jamie Shiers
        • f) <big> Alice report </big>
          1. Item
        • g) <big> Atlas report </big>
          1. Item
        • h) <big> CMS report </big>
          1. CMS now has a new prodAgent release installed and tested at CERN, so the DataOps people are starting reprocessing activities at the T1 sites. This is the "CRAFT reprocessing" exercise, in CMS jargon. Budgeting extra effort with respect to previous reprocessing rounds, the overall numbers can now be estimated as:

            Site    NewDataset                                    Size
            -------------------------------------------------------------
            IN2P3   /Cosmics/Commissioning08-ReReco-v1/RECO       69.6 TB
            FZK     /Calo/Commissioning08-ReReco-v1/RECO          62.2 TB
            RAL     /MinimumBias/Commissioning08-ReReco-v1/RECO   19.9 TB

            This doesn't seem to pose any capacity problems for any of the aforementioned sites, judging from the status of resources posted on the usual FacilitiesOps blackboard, which is regularly updated by the site contacts. The "acquisition era" (CMS jargon for the namespace) is also unchanged, so the Ops people don't believe that any setup work is required at the sites.

            Daniele: "FYI, I have a clash between this meeting and the weekly CMS DataOps + FacilitiesOps meetings on Mondays. If that happens and I am no longer online when you reach this point in the agenda, I cannot comment on the CMS report myself; you will find it both uploaded to the CIC portal and on this page. Please address any questions to me by mail."
          Speaker: Daniele Bonacorsi
        • i) <big> LHCb report </big>
          1. Item
        • j) <big> Storage services: Recommended base versions </big>
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions
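
          As a complement, a minimal sketch of how a site could check the storage implementation and version it publishes against the baseline table, assuming ldapsearch (openldap-clients) and a top-level BDII. The BDII host is given as an example and the site name is a placeholder.

          import subprocess

          BDII = "ldap://lcg-bdii.cern.ch:2170"   # a top-level BDII (example)
          SITE = "EXAMPLE-SITE"                   # hypothetical Glue site unique ID

          # Query the GlueSE entries attached to the site and print the implementation
          # name and version, to compare by eye with the baseline table above.
          output = subprocess.check_output([
              "ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
              "(&(objectClass=GlueSE)(GlueForeignKey=GlueSiteUniqueID=%s))" % SITE,
              "GlueSEUniqueID", "GlueSEImplementationName", "GlueSEImplementationVersion",
          ])
          print(output.decode())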

        • k) <big> Storage services: this week's updates </big>
      • 4
        OSG Items
        Speaker: Rob Quick (OSG - Indiana University)
        • a) Discussion of open tickets for OSG
          GGUS ticket 43840 https://gus.fzk.de/ws/ticket_info.php?ticket=43840
      • 5
        Review of action items
      • 6
        AOB