WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-06 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • Tier-1 sites:
  • Tier-1 availability reports:
  • VOs: Alice, Atlas, LHCb
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC Italy (backup: ROC CERN) to ROC CentralEurope (backup: ROC Russia)

          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Tickets:
          lead team:
          Opened new
          Closed
          2nd mails
          Extend
          Quarantine

          Issues:
            There are a few sites now at the final escalation stage, to be addressed at this Monday's operations meeting (Monday 2nd July).
            I suspect that some of these sites have already presented but nothing was done following the meeting. CIC on duty: please take note of the comments made in the meeting and reduce the escalation step or, if need be, suspend the site. I would say this is the job of the primary COD team following the Monday meeting.

            ROC-North T2_Estonia -> https://gus.fzk.de/ws/ticket_info.php?ticket=23178&from=ID (Looks good now)
            BEgrid-UGent -> https://gus.fzk.de/ws/ticket_info.php?ticket=22932&from=ID

            ROC-Russia RU-Phys-SPbSU -> https://gus.fzk.de/ws/ticket_info.php?ticket=23073&from=ID

        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          No ROC reports available at 14:44 UTC


          Release:
        • gLite3.0.2 PPS Update 34 released today to PPS.
          It contains the new version of LFC/DPM (1.6.5-3) with an urgent security fix.
          The fix will be announced today to production and PPS sites
        • gLite3.1 client updates linked to the same fix will be released to pre-production later this week

        • A possible issue was found at the PPS site in Birmingham, where the SAM BrokerInfo test started failing after the SLC4 WNs were re-installed (as recommended). The failure happens because edg-brokerinfo is missing (PPS-glite-WN).
          Apparently the SAM test needs to be upgraded.
          Unfortunately, of the PPS sites still supporting LCG-RB, this is the only one to have re-installed its WNs, so we don't have a second confirmation of the issue. More details are available in the following GGUS and Savannah tickets:
          • https://gus.fzk.de/ws/ticket_info.php?ticket=23866
          • http://savannah.cern.ch/bugs/?27558


          Operations:
        • A second SAM client at PPS-RAL starts operations today. Temporary glitches in the SAM results are to be expected during the synchronization phase.

          Issues from EGEE ROCs:
          No ROC reports available at 14:44 UTC

        Speaker: Antonio Retico (CERN)
  • UPDATE: next versions of YAIM: content and timelines
    Update since last week: The YAIM team has decided to configure the gLite CE and the (old) WMS with YAIM 3.1.1 as well.
    Speaker: Maria Alandes Pradillo (Unknown)
  • OSCT: BioMed proxy lifetime 5m
    Speaker: Mr Romain Wartel (CERN)
  • EGEE issues coming from ROC reports
    • ROC France:
    • Comments on GOC DB3
      It seems that, through GOC DB3, some people (e.g. COD operators) might accidentally set a scheduled downtime on the whole EGEE grid (or a ROC sub-grid). If this is actually the case (we didn't really try it, of course), we do not see the point of such a functionality. It is error-prone and therefore dangerous for production. Choosing a site and taking an action, according to your roles, on that site only was a good way to work. Is there any reason for having changed that?
      Would it be possible to add a per-node "production" field specifying whether or not the node should be considered a current production node of the site? It would save us from removing and adding nodes every time we change the composition of our site. Those actions are quite different from setting a scheduled downtime.
    • UPDATE 27 of M/W
      This is the start of the long vacation period, so site administrators would like to disturb their configuration as little as possible. In particular, do they really have to update their WN/UI as specified on the M/W web pages?
      There is a new version of the Glue Schema (1.3) and, as far as we understand, this new version is compatible with the previous one (1.2). But, as the Top BDII must be updated before the Site BDII, and the Site BDII before the GRIIS, we should certainly propose a global schedule for this update.
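The ordering constraint raised above (Top BDII, then Site BDII, then GRIIS) can be sketched as a simple grouping of services into update waves. This is only a minimal illustration, not an existing tool: the function name `schedule_updates`, the data layout and the host names are all hypothetical.

```python
# Update order taken from the ROC France report: Top BDII, then Site BDII, then GRIIS.
UPDATE_ORDER = ["Top BDII", "Site BDII", "GRIIS"]


def schedule_updates(services):
    """Group (host, tier) pairs into update waves that follow UPDATE_ORDER.

    Returns a list of (tier, hosts) tuples, one per tier, in the order
    in which the waves should be carried out.
    """
    waves = {tier: [] for tier in UPDATE_ORDER}
    for host, tier in services:
        if tier not in waves:
            raise ValueError(f"unknown tier {tier!r} for host {host!r}")
        waves[tier].append(host)
    return [(tier, sorted(waves[tier])) for tier in UPDATE_ORDER]


# Hypothetical host names, for illustration only.
services = [
    ("site-bdii.example.org", "Site BDII"),
    ("ce01.example.org", "GRIIS"),
    ("top-bdii.example.org", "Top BDII"),
    ("se01.example.org", "GRIIS"),
]

for step, (tier, hosts) in enumerate(schedule_updates(services), start=1):
    print(f"Wave {step}: {tier}: {', '.join(hosts)}")
```

A real schedule would also need dates per wave and a check that each wave has completed before the next one starts; neither is modelled here.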
    • ROC NorthernEurope:
    • Recently we saw frequent failures of SAM's replica management tests. Most of them failed on a time-out. We could not find indications of problems at the site, nor did the sub-tests take excessive amounts of time. Therefore, we assume that the source of the problems is somewhere else. This kind of error is not conclusive about its cause. We already have a tendency to ignore such failing tests, assuming that the problem lies somewhere else. That raises the question of whether this test is useful at all (in particular, its triggering an error on a time-out). It looks like other sites are also hit by intermittent failures of this SAM test (sites sometimes just submit another SAM job to get status OK again). If sites don't trust a test anymore, the test does indeed become questionable.
    • ROC UKI:
    • How far back do the SAM test results go and can this be made visible? At present we can only see up to 7 days.
  • 16:30 17:00
    WLCG Items 30m
    • Tier 1 reports
      more information
    • UPDATE: job priorities and YAIM
      Very short term: the FQAN VOViews should disappear from the information system. The VO:atlas view will then show inclusive information for ATLAS jobs submitted with any role. This means the FQAN VOViews should no longer come with the default YAIM configuration (action for SA3) *and* they should disappear from the sites which have already deployed them, whether via YAIM or by hand (action for SA1).
      UPDATE: All T1s have removed the configuration manually except GridKA (no answer) and PIC and TRIUMF (they applied it with YAIM, so will remove it through the new YAIM version).
      Additionally, tickets have been opened to all ROCs with the list of CEs in their region that need to be updated.
      The new YAIM version fixing this was released on the morning of last Friday, the 29th.
      Speaker: Simone Campana (CERN/IT/GD)
    • WLCG issues coming from ROC reports
      1. None this week


    • WLCG Service Interventions (with dates / times where known)
      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

      Time at WLCG T0 and T1 sites.

      1. None escalated to the meeting this week.
    • FTS service review
      Speaker: Gavin McCance (CERN)
    • FTS 2.0 service and client compatibility issues - reminder
      Following some questions last week, please see https://twiki.cern.ch/twiki/bin/view/LCG/FtsChangesFrom15To20

      In particular, "Client Compatibility" and "Upgrade Path" sections at the bottom.

      The relevant client release was made in October 2006.

    • ATLAS service
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service
      • Job processing: Spring07 MC production activities continue. Overall, pre-CSA07 is in full swing and performing well: 24M MinBias, 4.1 QCD events done. Global data taking is also going well. SLC4 has not yet been rolled out to all Tier-1 centers, and CMSSW150 was planned to be SL4-based only: either the migration is finalized soon, or CMS may consider an SLC3 build for 15x.
      • Data transfers: the 'extended' LoadTest is converging into CSA07 preparation activities. The extended plan was presented and agreed during the weekly CMS Computing project meeting; details are to be worked out this week.
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • LHCb service
      1. .
      Speaker: Dr Roberto Santinelli (CERN/IT/GD)
    • ALICE service
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • 16:55 17:00
    OSG Items 5m
    1. Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m