WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray, Steve Traylen
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: All ROC reports received.
  • VOs: ALICE, ATLAS, CMS, BioMed
  • list of actions
    Minutes
    Recording of the meeting
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC France / ROC SouthEast Europe
          To: ROC Asia Pacific / ROC DECH


          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues from France COD team:
          1. Two sites are expected to attend the meeting.
            • INFN-MILANO (ROC Italy): No answer from the site and no progress (https://gus.fzk.de/pages/ticket_details.php?ticket=27659)
            • PEARL-AMU (ROC Central Europe): A long-standing problem due to network connectivity (https://gus.fzk.de/pages/ticket_details.php?ticket=25346)
              Last answer from the site:
              Dear all,
              Since all our efforts in situation remediation have failed I have requested AMU authorities for Network correction for the pagaj SE host. This will include changing of the subnet and network route for the host what will, hopefully, resolve the connectivity problem for our site.
          Issues from SouthEast Europe COD team:
          1. No major issues to report
        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          Italy, AP

          Issues from EGEE ROCs:
          1. At Cyfronet there was a failure of the main switch for the cluster systems. Both preproduction and production services (including SAM UI) were unavailable for about 24 h. [CE ROC]
          Release News:
          • gLite 3.1.0 PPS Update08 was released to PPS and is currently undergoing pre-deployment testing.
            This update contains (among other patches) the new service:
            • lcg-CE for SLC4
            Because lcg-CE requires the latest version of glite-yaim-core (patch #1413), new versions of the yaim client packages (patch #1415) need to be released to PPS as well, namely:
            • glite-yaim-clients-4.0.1-1.noarch.rpm
            • glite-yaim-torque-client-4.0.1-1.noarch.rpm
            as well as the following new metapackages, which were affected:
            • PPS-glite-TORQUE_client-3.1.0-5.noarch.rpm
            • PPS-glite-UI-3.1.0-8.noarch.rpm
            • PPS-glite-WN-3.1.0-8.noarch.rpm
        • New tool for announcing (and receiving notification of) down-time
          There is a new tool in the CIC Portal for announcing site and service downtimes. Features of the tool are:
          1. Uses standardized templates so all announcements will look similar (easier to scan) and all relevant information will be captured (no missing information)
          2. The template will include a more targeted set of recipients of a broadcast (spam reduction)
          3. You can subscribe to an RSS feed of messages (by type) rather than receiving them in your inbox (spam reduction)
          Speaker: CIC Portal team
        • EGEE issues coming from ROC reports
          1. [NE] When will the SL4 32-bit lcg-CE be released?
          2. [NE] We have submitted a GGUS ticket about a problem with GStat (27724), which has been in the "assigned" status since October 9th. When will somebody take care of this? Details of the ticket are:
            At the moment SARA-MATRIX has the following warning in GStat:
            Missing DN and Attributes:
            ==================
            IN: 'dn: GlueSALocalID=dteam:DTEAM_RAW,GlueSEUniqueID=ant2.grid.sara.nl,mds-vo-name=SARA-MATRIX,o=grid' 'GlueSARoot: .+:.+' ()
            etc.
            However, the use of SARoot is already deprecated in Glue version 1.2. So this test is wrong.
          3. [France ROC, CGG-LCG2] There is no automatic procedure to clean up the $EDG_WL_SCRATCH and the MPI execution directory
          4. [France ROC, GRIF] Request for a SAM test history of at least 7 days.
          5. [SE Europe ROC] I've noticed some discrepancies between GGUS and the CIC Portal dashboard: some PPS sites appear in the production view of the dashboard.
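
The GStat warning quoted in item 2 can be read as follows: each entry (identified by its DN) is paired with a regular expression, and the pair is reported as missing when no attribute line of the entry matches. Below is a minimal Python sketch of that reading. The function name `missing_attributes`, the sample entry and its attribute values are hypothetical, and whether GStat matches whole lines or substrings is an assumption; only the DN and the pattern come from the ticket.

```python
import re


def missing_attributes(entry_lines, required_patterns):
    """Return the patterns that no line of the entry fully matches."""
    missing = []
    for pattern in required_patterns:
        regex = re.compile(pattern)
        # Assumption: a pattern is satisfied when one whole line matches it.
        if not any(regex.fullmatch(line.strip()) for line in entry_lines):
            missing.append(pattern)
    return missing


# Hypothetical LDIF-style entry for the DN quoted in the ticket.
# The attribute values below are invented for illustration.
entry = [
    "dn: GlueSALocalID=dteam:DTEAM_RAW,GlueSEUniqueID=ant2.grid.sara.nl,"
    "mds-vo-name=SARA-MATRIX,o=grid",
    "GlueSALocalID: dteam:DTEAM_RAW",
    "GlueSAPath: /pnfs/example/dteam",
]

# Pattern quoted in the GStat warning.
required = [r"GlueSARoot: .+:.+"]

print(missing_attributes(entry, required))
```

Under this reading, an entry that does not publish GlueSARoot at all is always flagged, which is the behaviour the ticket objects to.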
        • Move of 'default' CERN AFS UI from gLite 3.0 to gLite 3.1
      • 16:30 17:00
        WLCG Items 30m
        • Tier 1 reports
          • Item 1
        • WLCG issues coming from ROC reports
          1. None this week.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Time at WLCG T0 and T1 sites.

        • FTS service review

          Please read the report linked to the agenda.

          Speakers: Gavin McCance (CERN), Steve Traylen
          Paper
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • Problem with VOView consistency: reported 3 weeks ago, and 90 queues still have problems. The new list is at the usual location: http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
        • CMS service
          • Item 1
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • Last week we had a problem at CNAF because the shared area was not working. The problem was related to the migration of the shared areas to GPFS. This suggests that any important change in site configuration should always be broadcast at a high level.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • Item 1.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination
          • WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
          • Common Computing Readiness Challenge - CCRC'08 - meetings page
          • ATLAS throughput tests finished and M5 detector cosmics now running till 5 November. Data export from CERN later in the week.
          • CMS CSA07 now to continue till mid-November.
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55 17:00
        OSG Items 5m
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m