WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: Russia + UK/I
          To: DECH + SouthEast Europe


          Report from UK/I:
          • List of sites eligible for suspension (first Ops meeting with the OCC involved):
            • SITE: RO-03-UPB; ROC: SEE; GGUS: 45038
              Reason for escalation: still within 2nd mail escalation, but no updates for last 2 weeks
            • SITE: RRC-KI; ROC: Russia; GGUS: 45788, 45789
              Reason for escalation: reply from SAM testing team is required
          Report from Russia:
          • Nothing to report.
          Other points:
          • IMCSUL-INF site from Northern Europe. The site has been in downtime since 14/01/2009 and the end date is set to 27/02/2009 (more than 1 month).
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          SUMMARY:
          2009-02-16: Definition of procedures for middleware roll-back: in progress
          As previously announced, we want to introduce the concept of "roll-back" of a middleware update within the gLite release process. With the upcoming PPS Update44, the PPS deployment test team will test the first implementation of the procedure. This is being done as a one-off ("una tantum") collaboration, as there are currently no plans to introduce this test as a regular deployment test in PPS.

          2009-02-16: Pilot service of CREAM CE: in progress

          1. Results of the direct submission test against the CREAM CEs in the pilot are now available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py), specifically at http://tinyurl.com/ctwfaz
          2. There is a new version of CREAM ready to be installed on the CREAM PPS pilot. With respect to the previous version, this release fixes bugs BUG:45913, BUG:46283, BUG:46684 and BUG:46916. This software corresponds to PATCH:2748.
          3. Details about the pilot (planning, layout, technical info) can be found on the page https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotCream
          4. Details about the individual tasks can be found in the tracker http://www.cern.ch/pps/index.php?dir=./ActivityManagement/SA1DeploymentTaskTracking, specifically in the listing of the subtasks of TASK:7981

          2009-02-16: Definition of early adopters of gLite releases (staged roll-out)
          As a follow-up of action 000344 from the SA1 coordination tasks (https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_CoordinationTasks), we are publishing the prioritised list of the services currently NOT covered by the Release Testing process.

          • Last week we registered the site GRIF (France) for this activity
          • The EGEE ROCs are welcome to indicate one or more sites that can cover one or more of the services listed below. Candidates should be forwarded to the list pps-support@cern.ch. The selected sites should have a reasonable usage profile, in order for the testing to be meaningful. The list is ordered by priority, so if a site can cover several services, they should be selected starting from the top.
            More info about the release testing (early adoption) process and the relevant interfaces can be read at https://twiki.cern.ch/twiki/bin/view/LCG/PPS_Release_Testing
            List of services not covered:
            • glite-WN (plain and re-locatable)
            • glite-UI (plain and re-locatable)
            • glite-TORQUE_client
            • glite-TORQUE_server
            • glite-TORQUE_utils
            • glite-CONDOR_utils
            • glite-LSF_utils
            • glite-SGE_utils
            • glite-MON
            • glite-SE_dpm_disk
            • glite-MON (registry)
            • glite-MPI_utils
            • glite-FTA_oracle
            • glite-FTM
            • glite-FTS_oracle
            • glite-SE_dcache_admin_gdbm
            • glite-SE_dcache_admin_postgres
            • glite-SE_dcache_info
            • glite-SE_dcache_pool
            • glite-LFC_oracle
            • glite-PX
            • glite-CREAM_ce
            • glite-SE_dpm_mysql
            • glite-LFC_mysql
            • glite-WMS
            • glite-LB
            List of existing Early Adopters:
            • GRIF: glite-SE_dpm_mysql, glite-LFC_mysql, glite-WMS
            • GUP-CERTIF-TB: lcg-CE, glite-BDII (site)
            • SiGNET: lcg-CE, glite-BDII (site)
            • RAL: glite-BDII (top)
            • WCSS: lcg-CE, glite-SE_dpm_mysql
        • gLite Release News
        • EGEE issues coming from ROC reports
          • South East Europe:
            1. We would like to address the issue at https://gus.fzk.de/ws/ticket_info.php?ticket=45785. In particular, for the new look at https://lcg-sam.cern.ch:8443/sam/sam.py:
              "First, we would like to be able to right-click and copy links there directly, instead of bookmarking them. And we would like to be able to do that for the "Details" also."

            2. A non-urgent CA update was broadcast late on Thursday, meaning that most sites saw it on Friday (https://gus.fzk.de/ws/ticket_info.php?ticket=45892). In our opinion this is not good practice and runs counter to http://goc.grid.sinica.edu.tw/gocwiki/Procedure_for_new_CA_release. In the past, problems have been spotted after a release, which in this case would mean that they would have to be dealt with during the weekend.

            3. Quite some time ago we opened https://gus.fzk.de/ws/ticket_info.php?ticket=42028 about DPM. There has been no response since then, but we eventually found assistance from the DPM user mailing list, which we discovered from a post in lcg rollout. Is DPM not supported via GGUS?

          • UK/I: We were asked to check whether our regions had any issues with gLite 3.0 becoming obsolete. In the UK we had 3 sites still running a gLite 3.0 CE. Of these, two can easily move their remaining CE to 3.1. However, the third is running the Condor batch system and the system administrator notes that: "The Condor part of gLite-3.1 is still in the PPS and according to the SA3 batchsystem-condor team, it doesn't seem to work with WMS job submission as well. Hence, it is not possible for Cambridge to upgrade the CE to 3.1 until condor integration work is complete and in production".
        • Grid Service Interventions
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Please consult the URLs above for details.

          In particular, the following sites requested that these downtimes be reported here:

          1. FZK-LCG2 downtime on 17th February (tomorrow):
            • FTS + LFC: down due to an Oracle update of the backends.
            • SRM:
              1. Installing a dCache patch to fix queue allocation and improve throughput
              2. Shrinking the ATLAS PNFS database (may improve throughput for ATLAS)
              3. Upgrading the Postgres DB (which prevents uncontrolled PNFS DB growth)
        • Items from SAM
          • In one week’s time, we will stop submitting SE and SRMv1 tests. They were replaced in December by the SRMv2 tests.

          • Barring any objections from the WLCG MB, and following a request from Mattias Ellert (NDGF), the ARCCE-lfc test will be made critical at the beginning of March. At the same time, the ARCCE-rls test will be made non-critical.
            FROM NDGF: Since ATLAS has moved away from using the RLS server and no other LCG experiment is using it, we would like to mark the ARCCE-rls test as non-critical.
            The experiments are now using LFC instead, and for some time we have had an ARCCE-lfc test that tests the LFC functionality on the clusters. Since this functionality is critical for production, we would like to make this test critical.
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. South West Europe: Some of the SWE sites, especially LIP-Coimbra (GGUS #45728: StoRM SRM TCP port is 8444 not 8443), complain that some of the experiments do not use the information published by the information system but instead use values hardcoded in their frameworks. This causes sites to fail even when everything is configured correctly. Should we submit a GGUS ticket every time this happens?
        • Wiki page containing FTM Endpoints
          Can all Tier-1 sites please keep the list of FTM endpoints up to date? The list is here: https://twiki.cern.ch/twiki/bin/view/LCG/LCGFTMEndpoints

          Note: This requirement will be replaced by information providers publishing the end-points into the information system.

        • Alice items
        • Atlas items

        • CMS items
          1. Please have a look at the daily reports given at WLCG daily calls here.
          Speaker: Daniele Bonacorsi
        • LHCb items
        • WLCG service recommended baseline versions
          The recommended baseline versions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          Information taken from the weekly escalation reports.

          One last issue remains to be checked on the GGUS side: whether OSG ticket closure is (or is not) reflected in GGUS. Testing is based on GGUS #45488.
          OSG did, indeed, remind Felipe Silva again to answer on GGUS #45094. What does one do in such cases? Maybe try to contact him offline, in case all GGUS/OSG ticketing systems' notifications end up in his spam folder? The submitter still expects an answer.

      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB