WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-06 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    OR click HERE

    NB: Reports were not received in advance of the meeting from:

  • ROCs: DECH, North Europe
  • Tier-1 sites: IN2P3, INFN, NDGF, PIC, FNAL
  • Tier-1 availability reports:
  • VOs: Alice, ATLAS, LHCb
  • list of actions
    Minutes
      • 4:00 PM 4:05 PM
        Feedback on last meeting's minutes 5m
        Minutes
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From ROC CentralEurope (backup: ROC Russia) to ROC SWE (backup: ROC SEE)

          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Tickets:
          lead team:
          Opened new 61
          Closed 12
          2nd mails 7
          1st mail 2
          site OK 22


          Backup team
          Treated tickets:
          Opened new 55
          Closed 15
          2nd mails 24
          Updated 19
          All together 113
          Issues:
            Many CEs that are not registered in GOCDB are being monitored by SAM



            GGUS ticket #18679 - site: UNI-KARLSRUHE "*CE-sft-job* is failing on ekplcgce.physik.uni-karlsruhe.de" problem raised: 2007-02-19

        • <big> PPS Report & Issues </big>
          PPS reports were not received from these ROCs:
          AP, CERN, DECH, IT, NE, SEE

          Release:
        • A new version of YAIM will be delivered today from certification to PPS with PPS Update35. The new update will contain only the "service oriented" version of YAIM. Although this version of YAIM supports both gLite3.0 and gLite3.1, the rpm will presumably be shipped with the number 3.1.1. This number will be changed to 4.0.0 before the release to production, to avoid confusion with versions 3.0 and 3.1 of gLite.
        • The failure of the SAM BrokerInfo test at the Birmingham PPS site was found to be caused by the "which" command not being installed there and not, as announced at the last meeting, by an issue in SAM. The middleware does not declare an explicit dependency on "which", however, so sites should make sure that the command is properly installed on their WNs for SAM, YAIM (and presumably many other applications) to work.
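A site could check for this kind of missing command before the tests run. The following is a minimal sketch only: the function name and the command list are illustrative assumptions, and only "which" itself comes from the report above.

```python
import shutil

# Commands a worker node is assumed to need. Only "which" comes from the
# report above; extend the tuple to suit your own site.
REQUIRED_COMMANDS = ("which",)


def missing_commands(commands=REQUIRED_COMMANDS, path=None):
    """Return the commands that cannot be found on the search path."""
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Missing on this node: " + ", ".join(missing))
    else:
        print("All required commands found")
```

Run on a WN, the script lists any required command that is not on the search path.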

          Operations:
        • The Service summary pages linked from the PPS website (http://www.egee.cesga.es/gocdb_userview/index.html) are still gathering information from GOCDB2. They have not been taken off-line because they are still useful for reference. The issue has been raised and is being resolved.

        • The DILIGENT VO would like to run a data challenge on the PPS. This would involve running an Image Feature Extraction application.
          The characteristics of the data challenge are:
          • Start date: Wednesday 18 July
          • Duration of data challenge: from 3 to 4 weeks (strongly depending on the ‘efficiency’, ‘reliability’, and failure rate of the infrastructure)
          • Around 500 jobs submitted per day (through 2 WMSs), although this can be increased/decreased as needed.
          • Each job requires at most 50 MB of disk space and at least 512 MB of RAM. Jobs will consume between 20 minutes and 1 hour of CPU time (depending on CPU).
          • Sites do not need to install any particular libraries or other software.
          PPS sites able to support the DILIGENT VO in this data challenge are strongly encouraged to do so.
          More information will follow soon in a broadcast.
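For sites weighing up the request, the figures above give a rough envelope for the whole challenge. This is a back-of-the-envelope sketch: it assumes jobs are submitted seven days a week at a constant rate, which the announcement does not state, and the function name is illustrative.

```python
JOBS_PER_DAY = 500             # "around 500 jobs submitted per day"
DURATION_WEEKS = (3, 4)        # "from 3 to 4 weeks"
CPU_MINUTES_PER_JOB = (20, 60)  # "between 20 minutes and 1 hour of CPU time"


def challenge_envelope(jobs_per_day=JOBS_PER_DAY,
                       weeks=DURATION_WEEKS,
                       cpu_minutes=CPU_MINUTES_PER_JOB):
    """Return (low, high) estimates for total jobs and total CPU hours.

    Assumes submission on all 7 days of every week (not stated in the source).
    """
    low_jobs = jobs_per_day * 7 * weeks[0]
    high_jobs = jobs_per_day * 7 * weeks[1]
    return {
        "total_jobs": (low_jobs, high_jobs),
        "cpu_hours": (low_jobs * cpu_minutes[0] / 60,
                      high_jobs * cpu_minutes[1] / 60),
    }


if __name__ == "__main__":
    for name, (low, high) in challenge_envelope().items():
        print(f"{name}: between {low} and {high}")
```

The totals are for the whole PPS, not per site, and they scale linearly if the submission rate is raised or lowered.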

          Issues from EGEE ROCs:
          1. What is the state of COD monitoring of PPS? Sites are not happy about receiving tickets for obviously unstable services (for example if SAM tests are failing for all PPS sites) - SWE ROC
        Speaker: Nicholas Thackray (CERN)
  • <big> UPDATE: next versions of YAIM: content and timelines </big>
    Clarification: Once YAIM 3.1.1 is released (expected mid July), the previous versions of YAIM will not be further developed or maintained. This means that YAIM bug fixing will only be done in versions equal to or higher than 3.1.1. The new YAIM can use an old site-info.def, since the way site-info.def is handled has not changed. In general, the new YAIM does not introduce big changes for the site admins, since the yaim command has been present since YAIM 3.0.1.
    Speaker: Maria Alandes Pradillo (Unknown)
  • <big> YAIM 3.0.1-22 and SGM/PRD accounts </big>
    • glite-yaim=3.0.1-22 (gLite 3.0.2 Update 27) allows "sgm" and/or "prd" users to be mapped to the traditional static accounts or to their own sets of pool accounts.
      Pool accounts are better w.r.t. the audit trail on the CE and WN.
      The use of pool accounts can be decided per VO and per account type: YAIM will discover which sort of accounts has been put into users.conf for sgm and/or prd users and map those users accordingly.
      The LHC experiments have been asked to adapt their software installation procedures to cope with the possible use of sgm pool accounts at sites.
      LHCb were the only ones to respond so far: in principle their procedure is ready, but they request sites not to switch to sgm accounts during the summer holidays, when the available expertise to debug problems is much reduced.
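To see how a given users.conf would be interpreted, a site admin could classify the sgm and prd entries per VO. The sketch below is not YAIM's own logic. It assumes the colon-separated layout UID:LOGIN:GIDs:GROUPs:VO:FLAG: and uses a hypothetical heuristic: more than one account with the same flag for a VO counts as a pool. The sample entries are made up.

```python
from collections import Counter

# Made-up entries in the assumed layout UID:LOGIN:GIDs:GROUPs:VO:FLAG:
SAMPLE_USERS_CONF = """\
30001:examplesgm:3001,3000:examplesgm,example:example:sgm:
30101:exampleprd001:3002,3000:exampleprd,example:example:prd:
30102:exampleprd002:3002,3000:exampleprd,example:example:prd:
30201:example001:3000:example:example::
"""


def classify_special_accounts(text):
    """Map (vo, flag) to "static" or "pool" for the sgm and prd flags.

    Hypothetical heuristic, not taken from YAIM: a single account for a
    (vo, flag) pair is treated as static, several accounts as a pool.
    """
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(":")
        if len(fields) < 6:
            continue  # not in the assumed layout, skip it
        vo, flag = fields[4], fields[5]
        if flag in ("sgm", "prd"):
            counts[(vo, flag)] += 1
    return {key: ("pool" if n > 1 else "static") for key, n in counts.items()}


if __name__ == "__main__":
    for (vo, flag), kind in sorted(classify_special_accounts(SAMPLE_USERS_CONF).items()):
        print(f"{vo} {flag}: {kind}")
```

Because the decision is per VO and per account type, one VO can keep a static sgm account while using pool accounts for prd, as in the sample.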
  • <big> Staged upgrade of BDIIs to gLite 3.0.2 Update 27</big>
    Dear all,

    gLite 3.0.2 Update 27 includes fixes to the Glue schema:
    The new version of glue-schema updates to Glue version 1.3. The top-level BDIIs should be updated first, followed by the site BDIIs and then the GRISs. New information can be published only once all the nodes have been updated.
    As discussed at this week's operations meeting, we propose a coordinated staged deployment of this fix to the production BDIIs:
    • Step 1: upgrade ONLY all top level BDIIs in the next week, deadline: the end of next Monday 9th of July
    • Step 2: upgrade site BDIIs, starting from Tuesday 10th of July (a reminder will be given at next Monday's operations meeting and broadcast afterwards).

    PLEASE, ALL PRODUCTION SITES: don't upgrade your site BDII yet, wait till Tuesday 10th of July
    (If your site has already been upgraded: we are currently testing the effects of this situation (site BDII upgraded, top level BDII not upgraded) so that we can inform the VOs and find a workaround.)
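The staged order can be written down as a simple gate that a site script might consult before upgrading a node. This is an illustrative sketch: the node type labels and the function are hypothetical, the year 2007 is inferred (the announcement gives only day and month), and no date is given for the GRISs, so they are gated on the site's own BDII having been upgraded.

```python
from datetime import date

# The year is an assumption: the announcement only says "Monday 9th of July"
# and "Tuesday 10th of July"; 9 July fell on a Monday in 2007.
TOP_LEVEL_DEADLINE = date(2007, 7, 9)
SITE_BDII_START = date(2007, 7, 10)


def may_upgrade(node_type, today, site_bdii_upgraded=False):
    """Say whether a node of the given type may be upgraded on `today`.

    node_type is one of "top-bdii", "site-bdii" or "gris" (labels invented
    for this sketch). GRISs follow their site BDII, as no date is announced.
    """
    if node_type == "top-bdii":
        return True  # step 1: upgrade now, by TOP_LEVEL_DEADLINE at the latest
    if node_type == "site-bdii":
        return today >= SITE_BDII_START  # step 2
    if node_type == "gris":
        return today >= SITE_BDII_START and site_bdii_upgraded
    raise ValueError(f"unknown node type: {node_type!r}")


if __name__ == "__main__":
    for day in (date(2007, 7, 9), date(2007, 7, 10)):
        print(day, "site BDII may upgrade:", may_upgrade("site-bdii", day))
```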
  • <big> EGEE issues coming from ROC reports </big>
    • ROC AP:
      • For information: GStat migration to GOCDB3 is not complete, so recent site changes have not propagated to GStat. This work is scheduled to be completed by July 10th. The top level BDII is scheduled to be upgraded to Glue schema 1.3 later today.
    • ROC France:
      • Comments on CIC portal
        In the weekly ROC report, we can fill in the T1 availability text field, but it's impossible to consult it afterwards. Please update the ROC report consultation page to make this field available.
      • official URLs to be used for a TOP BDII configuration
        Is there an official repository (web page) somewhere with the official URLs to be used for a TOP BDII configuration? We mean the value of BDII_UPDATE_URL, providing the list of all the production sites, and BDII_UPDATE_LDIF, providing the list of LDAP filtering rules built by FCR.
    • ROC SEE:
      • Release notes for Update 27 are not correct (reconfiguration of BDII is not mentioned in Update 27 Release notes, but it is apparently necessary). See this ticket for details:
        https://gus.fzk.de/pages/ticket_details.php?ticket=24054
  • 4:30 PM 5:00 PM
    WLCG Items 30m
    • Upgrade to SL4 WN release 5m
      Summary from the WLCG GDB: the SL4 WN release has been in production for a month, but the uptake has been disappointing. Sites were concerned that the SL4 middleware release did not contain all the rpms required by the experiments that were previously included in SL3. Markus said this was a deliberate choice to remove operating system components from the middleware packages, as this had been a criticism from sites in SL3.

      Not all experiments had updated their VO Cards to include any extra rpms that they required. Metapackages would be even better.

      ACTION 1: all experiments to have updated their VO cards by the Operations Meeting on Monday 9th July.

      Separate lists of rpms carry a risk of circular dependencies and clashes for sites that try to install them all for more than one experiment.

      ACTION 2: someone (UK volunteers, but others are welcome too) to attempt installing the rpms for all 4 experiments on a test box to check out the dependencies. Target date: 7 days after Action 1.

      If there are no problems then the message to all sites is to move to SL4 as soon as possible, and by the end of August at the latest. If there are problems then SA3 should help resolve them and/or find out which combinations of experiments are problem-free. The experiments all said they were happy to run on SL4 with CMS expressing
      Speaker: Dr John Gordon (STFC-RAL)
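Before the test installation in ACTION 2, the per-experiment rpm lists could be compared for one obvious kind of clash: the same package requested in different versions. The sketch below uses invented package names and versions. It does not detect circular dependencies or file conflicts, so it does not replace the real installation test.

```python
from collections import defaultdict

# Hypothetical requirement lists: package names and versions are invented
# and do not come from any experiment's VO card.
REQUIREMENTS = {
    "ALICE": {"libfoo": "1.2", "toolbar": "3.0"},
    "ATLAS": {"libfoo": "1.2"},
    "CMS": {"libfoo": "1.4", "libbaz": "2.1"},
    "LHCb": {"libbaz": "2.1"},
}


def find_version_clashes(requirements):
    """Return the packages that are requested in more than one version.

    The result maps package name -> {version: sorted list of experiments}.
    """
    wanted = defaultdict(lambda: defaultdict(set))
    for experiment, packages in requirements.items():
        for name, version in packages.items():
            wanted[name][version].add(experiment)
    return {
        name: {version: sorted(vos) for version, vos in versions.items()}
        for name, versions in wanted.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    for name, versions in find_version_clashes(REQUIREMENTS).items():
        print(f"{name}: {versions}")
```

Restricting the input to a subset of experiments gives a quick view of which combinations are free of this particular problem.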
    • <big> UPDATE: job priorities and YAIM </big>
      UPDATE: The new YAIM version fixing this was released on the morning of last Friday 29th. A list of the sites that have not yet applied this version to remove the configuration is attached.
      Speaker: Simone Campana (CERN/IT/GD)
      more information
    • <big> WLCG issues coming from ROC reports </big>
      1. None this week


    • <big>WLCG Service Interventions (with dates / times where known) </big>
      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

      Time at WLCG T0 and T1 sites.

      1. BNL: HPSS reconfiguration for Thursday, June 28, 10:30 AM to 12:30 PM. The maintenance requires the restart of several major HPSS components.
    • <big>FTS service review</big>
      Speaker: Gavin McCance (CERN)
    • FTS 2.0 service and client compatibility issues - reminder
      Following some questions last week, please see https://twiki.cern.ch/twiki/bin/view/LCG/FtsChangesFrom15To20

      In particular, "Client Compatibility" and "Upgrade Path" sections at the bottom.

      The relevant client release was made in October 2006.

    • <big> ATLAS service </big>
      Speaker: Kors Bos (CERN / NIKHEF)
    • <big>CMS service</big>
      • Job processing: MC production activities, including pre-CSA07 production, continue. Progress is being made in the SLC4 roll-out at the Tiers. The Offline project decided not to make a CMSSW150 build on sl3, so Tiers need to migrate to sl4 to join the parts of the CSA07 exercise requiring CSA15x.
      • Data transfers: the Debugging Data Transfers (DDT) program was launched; the Task Force was defined and its charge written, discussed and agreed. LoadTest continues as part of this 'extended' program.
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • <big> LHCb service </big>
      Speaker: Dr Roberto Santinelli (CERN/IT/GD)
    • <big> ALICE service </big>
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • 4:55 PM 5:00 PM
    OSG Items 5m
    1. Item 1
  • 5:00 PM 5:05 PM
    Review of action items 5m
    list of actions
  • 5:10 PM 5:15 PM
    AOB 5m