WLCG-OSG-EGEE Operations meeting

Nicholas Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
        • Grid-Operator-on-Duty handover
          From ROC DECH (backup: ROC SouthEast Europe) to ROC UK/I (backup: ROC AsiaPacific)

          Opened new: 92
          Closed: 52
          1st mail: 33
          Quarantine: 20
          2nd mail: 9
          Unsolved: 3

        • Tests for R-GMA, LFC, SRM and SE are creating (lots of) alarms (COD teams are learning to handle them...)
        • Issues with certificate tests: most of the tests fail with timeouts. The sites probably have a valid certificate; possible reasons are the use of a nonstandard port for certain services, a service that is simply not available, or a timeout caused by the certificate test itself.
        • Several months ago I requested that a page detailing the most common causes of gLiteCE job failures be created, based on the experience gathered by PPS. The answer was, basically, that we should wait until experience is gathered in the GGUS tickets that CODs are going to open on PPS or production sites, and then produce this web page. So far there is one page in gocwiki, detailing one site-specific cause:
          "Sometimes this is due to the fact that the user is not authorised on the CE."
          I do not believe this is the only site-specific cause ever found for this error.

          I have the feeling that these host-cert-valid tests generate a lot of alarms, and I am not sure they must be handled by CODs, especially before a certificate has expired. Some CAs already send e-mails about expiring certificates one to two months in advance, which is a sensible period.
          Note that the COD dashboard is partly out of sync: it shows older tests that are in error, while the newer tests are OK. Ordering the alarms by status in the Monitoring/Alarms page shows the last test date as 11th May in most cases, only a few times as 12th May, and never newer. This problem appeared at the beginning of our shift, and it seems to me that it reappeared on the 13th.
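
          The distinction made above, between a certificate that has already expired and one that is only approaching expiry, can be sketched as follows. This is a minimal illustration, not the logic of the actual host-cert-valid test: the function name `classify_certificate` and the 30-day warning window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def classify_certificate(not_after, now, warn_days=30):
    """Classify a certificate by its expiry time.

    not_after: expiry time of the certificate (datetime)
    now:       time at which the check is made (datetime)
    warn_days: hypothetical warning window, in days (an assumption)
    """
    if not_after <= now:
        return "expired"      # a real error: the service cannot be trusted
    if not_after - now <= timedelta(days=warn_days):
        return "expiring"     # only a warning: the site still has time to renew
    return "ok"
```

          Under this scheme only the "expired" state would need to raise an alarm for the COD; the "expiring" state could be left to the reminder e-mails that some CAs already send.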
  • PPS reports
    PPS reports were not received from these ROCs: Italy, NE, Russia, SWE, UKI, AP
    • PPS-Update 29 released to the PPS. This contains, among others, the following high-priority patches:
      • #898 LCG-CE modifications for DGAS support
      • #1144 R-GMA Server fix for bugs #21558, #20090 and #23052
      • a new version of the gLite 3.1 Worker Node (glite-WN-3.1.0-3) for SL4/i386 which addresses all known issues.
    • Integration of SRM2.2 test SEs into the PPS progressing:
      • CERN_PPS is for the time being publishing end-points in US in the information system
      • SAM tests are being submitted to all published SRMs.
      • ATLAS transmitted some requirements on FTS channels for preliminary tests. They are being implemented at CERN_PPS.
      • In addition to the sites originally involved in the SRMv2 pilot testing, the PPS sites PIC, IFIC, CNAF, Birmingham, DESY and FZK are also getting involved in this activity.
    • Release process improved: from next week, six PPS sites will perform pre-deployment testing in the PPS. Mario David, at LIP, is coordinating this activity.
    • Hand-over of the SAM PPS service to PPS-CYFRONET and PPS-RAL started (completion date: 8th June)
    • The administrators of the SAM Admin's Page (SAMAP) requested that PPS dedicate two services (BDII and WMS) to support SAMAP service redundancy.
      The request is reasonable, so we are asking here for PPS sites to volunteer to provide these services.
    • Issues coming from the ROCs
      1. UPDATE 29 - FTS2 migration: the DB schema migration script can be run only once in the current release. If it fails for any reason, it needs to be tweaked before it can be run again. [ROC CERN]
      2. UPDATE 29 - VOBOX: the VOBOX could not be upgraded because of a dependency problem. Bug reported (https://savannah.cern.ch/bugs/?26246). [ROC CERN]
      3. UPDATE 29 - PreGR-01-UoM: PPS Update 29 was applied on the site following the guidelines in the Release Notes. The update caused a number of issues at the site and we are in the process of solving them. [ROC SEE]
    Speaker: Nicholas Thackray (CERN)
  • EGEE issues coming from ROC reports
    1. (ROC CentralEurope): [For information] Installing a top-level BDII on SLC4. We compiled a wiki page with instructions on how to set up a top-level BDII on SLC4: http://wiki.grid.cyfronet.pl/CoreServices/SLC4BDII An instance is running at zeus60.cyf-kr.edu.pl. We plan to put it into the production round-robin DNS this week. Any comments are appreciated.

    2. (ROC CentralEurope): A recent YAIM release changed the mapping of SGM users from a single SGM account to a pool of accounts. How should the VO software in the SW_DIR directory at sites now be managed? The VO software must be readable by VO users, so until now we gave the group read access and the SGM user write access to the directory (e.g. mode 0750). Now several accounts need write access to that directory. A document, probably written by the YAIM team, describing the impact of moving from a single-account mapping to pool accounts would be useful.

    3. (ROC France/IN2P3-CC): Could YAIM be improved to allow several sub-clusters to be published per CE? GlueSubCluster defines the maximum memory available to a job, so declaring several sub-clusters would make it possible to set memory limits per queue type. At present, with only one sub-cluster per CE, we cannot express that the memory limit of the medium queue is lower than that of the long queue. This is the cause of many ATLAS job failures (as discussed with Simone Campana).

    4. (ROC SouthEasternEurope): We would appreciate an update from SA3/JRA1 on the status of the development and certification of SL4-based middleware, both 32-bit and 64-bit. An indicative (or estimated) roadmap would also help us to plan ahead: we have stopped deploying new application software in our regional VO while waiting for the major upgrade / switch to SL4, because it affects user and application software as well.

    5. (ROC UK/I): Technical issues to do with the email that CIC-Portal Alarms send:
      a) The From field should be CIC-Portal@in2p3.fr and not just CIC-Portal. Otherwise intervening mail relays add their own spurious @host info and so the mail can be misidentified by mail browsers.
      b) All emails from CIC-Portal, and in2p3.fr generally, are given a Spam-Assassin rating of DNS_FROM_RFC_ABUSE 0.37, plus whatever other spam score the contents of the message might incur. This would be avoided if in2p3.fr got itself de-listed from www.rfc-ignorant.com - that shouldn't be hard!

    6. (ROC UK/I): Spam via the "project-lcg-" mailing lists is currently arriving at about one message per hour, predominantly on project-lcg-security-* and project-lcg-vo-*. What is being done about this? One option, for example, is to change the names of these mailing lists and then keep the new names quiet.
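
    As an illustration of the permission problem raised in item 2 (SGM pool accounts and SW_DIR), the sketch below inspects the mode bits involved. It is only one possible reading of the issue: the mode 0750 comes from the report, while the alternative mode 2775 (setgid plus group write, with the SGM pool accounts sharing one group) is an assumption for the example and not a YAIM recommendation. All names in the code are hypothetical.

```python
import stat

# Mode quoted in the report: owner rwx, group r-x, others none.
SINGLE_ACCOUNT_MODE = 0o750

# Hypothetical alternative (an assumption, not a YAIM recommendation):
# setgid bit plus group write, others read-only.
POOL_ACCOUNT_MODE = 0o2775


def describe(mode):
    """Return an ls-style permission string for a directory with this mode."""
    return stat.filemode(stat.S_IFDIR | mode)


def group_can_write(mode):
    """True if the group write bit is set."""
    return bool(mode & stat.S_IWGRP)


def new_entries_inherit_group(mode):
    """True if the setgid bit is set, so new entries inherit the directory's group."""
    return bool(mode & stat.S_ISGID)


for label, mode in [("single SGM account", SINGLE_ACCOUNT_MODE),
                    ("pool of SGM accounts", POOL_ACCOUNT_MODE)]:
    print(label, describe(mode),
          "group write:", group_can_write(mode),
          "setgid:", new_entries_inherit_group(mode))
```

    With mode 0750 only the single owning account can write, which is why a pool of accounts breaks the old arrangement. Whether files created under the alternative mode are themselves group-writable also depends on the umask of the pool accounts, which this sketch does not cover.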

    Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
  • LHCb service
    Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • ALICE service
    Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
  • WLCG Service Coordination Issues
    WLCG Collaboration workshop September 1-2 2007, Victoria, BC, Canada (co-located with CHEP 2007)
    Speaker: Jamie Shiers / Harry Renshall
  • Operations workshop in Stockholm, 13-15th June, agenda available: