WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
        Feedback on last meeting's minutes
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: SouthEasternEurope and SouthWesternEurope
          To: UK/Ireland and CentralEurope

          • opened: 57
          • closed: 44
          • 2nd mail: 1
          • Quarantine: 3
          • Unsolvable: 2
          Report from South West Europe:
          • Some operational problems:
            1. Incompatibility problems between the gLite 3.0 BDII and GFAL and lcg-infosites (04-09-2008)
            2. Storm of alarms with the new APEL check (05-09-2008)
            3. UKI-LT2-UCL-CENTRAL has a 3-month scheduled downtime

          Report from SouthEasternEurope COD:
          • No issues for this week.
          • Some operational problems:
            1. GGUS was unreachable for 1 hour on 02.09
            2. A GOCDB update is one possible reason for the APEL-pub alarms that appeared
            3. Many fake CE-sft-job alarms arose around 14:20 UTC on 05.09.2008 because of a problem affecting sam-bdii at CERN.
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:

        • gLite Release News
          Please find gLite release news in:


          Now in Production:
          • 2008-09-04: an issue was found with the version of GFAL released with gLite 3.1 Update 30. After the upgrade, some sites in various regions appear to be failing the SAM tests. The issue, still under analysis, appears to be due to LDAP searches that are no longer compatible with the gLite 3.0 top-level BDII. Affected sites (those failing the SAM tests) should make sure that LCG_GFAL_INFOSYS points to a gLite 3.1 top-level BDII. SA1 is preparing an official recommendation for the sites and is interacting with the ROCs to make sure that all regional top-level BDIIs are running a compatible version.
          • 2008-09-02: gLite 3.1 Update 30 was released to production. The update, meant to be released next Thursday, was delayed due to last-minute changes to the repository and further analysis of the impact of bugs found in PPS. It is, however, going to be released today or tomorrow. The update will affect the vast majority of services. Notably, it will contain:
            • A patch to globus VDT, fixing the issue raised with BUG:37563 (limit in proxy delegation chain)
            • dCache 1.8.0-15p5: minor bug fixes and inclusion of the Chimera filesystem, which can be configured through the new yaim-dcache module.
            • GFAL/lcg_util bugfix release
            • gLite YAIM clients update
              • VOBOX specific variables are now distributed under services/glite-vobox and defaults/glite-vobox.pre
              • The AMGA client configuration function is now included in the UI, WN, TAR UI and TAR WN
              • The config_vomsdir function configuring the .lsc files under vomsdir is now included in the UI, WN and VOBOX. There is a known problem with config_vomsdir on the UI_TAR and WN_TAR.
              • Please check also the YAIM-Client Known Issues in https://twiki.cern.ch/twiki/bin/view/LCG/YaimGuide400
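The recommendation above for sites affected by the GFAL issue (make sure LCG_GFAL_INFOSYS points to a gLite 3.1 top-level BDII) can be checked locally with a few lines of Python. This is only a minimal sketch, not an official SA1 tool. It assumes the variable holds one or more comma-separated `host:port` entries, and the function names and the host `top-bdii.example.org` are hypothetical placeholders. The set of known-compatible hosts must be supplied by the site or its ROC.

```python
import os


def parse_infosys(value):
    """Split an LCG_GFAL_INFOSYS-style value into (host, port) pairs.

    Assumption: the value is a comma-separated list of host:port entries.
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"malformed endpoint: {item!r}")
        endpoints.append((host, int(port)))
    return endpoints


def check_infosys(environ, compatible_hosts):
    """Return a list of problems found with LCG_GFAL_INFOSYS in environ."""
    value = environ.get("LCG_GFAL_INFOSYS", "")
    if not value.strip():
        return ["LCG_GFAL_INFOSYS is not set"]
    try:
        endpoints = parse_infosys(value)
    except ValueError as exc:
        return [str(exc)]
    # Flag every configured host that is not in the site-supplied list.
    return [
        f"{host} is not in the list of known gLite 3.1 top-level BDIIs"
        for host, _port in endpoints
        if host not in compatible_hosts
    ]


# Hypothetical usage: the host name below is a placeholder, not a real BDII.
for problem in check_infosys(os.environ, {"top-bdii.example.org"}):
    print(problem)
```

An empty result means every configured endpoint is in the supplied list. The sketch does not contact the BDII or verify its actual version.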
          Now in PPS:
          No releases to production during the last week.

          Soon in Production:
          There are no plans to release anything to production during the next week.
        • EGEE issues coming from ROC reports
          • CERN
            1. See attached report from triumf concerning central software distribution.
          • CentralEurope
            1. WN distribution mechanism: the CE ROC position is rather negative, as it looks like a centralization that breaks the idea of the Grid and raises problems with the independence of NGIs, etc.
            2. In the details of the CE-sft-lcg-rm test we have noticed that LCG_GFAL_BDII_TIMEOUT is set to 10 sec, whereas in the past it was set to 60 sec. We think that 10 sec is too rigoristic and that sites are punished for the low performance of the regional top BDII. A regional BDII may have some timeouts because, for example, each WMS issues ca. 12 MB of queries every 5 minutes, causing high load on the BDII.
              What does rigoristic mean?
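To put the CentralEurope figure in perspective, the average query load implied by "ca. 12 MB every 5 minutes" per WMS can be worked out as below. This is a back-of-the-envelope sketch only. The number of WMS instances querying one regional BDII is not given in the report, so the value of 10 used here is purely hypothetical, and 1 MB is taken as 1000 kB.

```python
def average_query_rate_kb_per_s(mb_per_interval, interval_minutes, wms_count=1):
    """Average data rate in kB/s requested from one BDII (1 MB = 1000 kB)."""
    return mb_per_interval * 1000.0 * wms_count / (interval_minutes * 60.0)


# Figure quoted in the report: ca. 12 MB of queries every 5 minutes per WMS.
per_wms = average_query_rate_kb_per_s(12, 5)

# Hypothetical case: 10 WMS instances querying the same regional BDII.
ten_wms = average_query_rate_kb_per_s(12, 5, wms_count=10)

print(f"one WMS: {per_wms:.0f} kB/s, ten WMS: {ten_wms:.0f} kB/s")
```

These are averages only. In practice queries arrive in bursts, so the peak load on the BDII is higher than this figure, and the peak is what matters against a 10-second timeout.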
          • French ROC report: bad week for production sites
            1. The lcg-utils incompatibility with the gLite 3.0 top BDII showed a weakness in the test and certification process. Moving nodes from gLite 3.0 to gLite 3.1 is not so simple, as it requires an OS upgrade. It would be interesting to evaluate the real impact this incompatibility had on production jobs.
            2. Suggestion to cope with the certification of new client software: as SA3 is proposing a centralized distribution of gLite client software, it could be interesting to use such a mechanism to easily distribute and test a (piece of) production-candidate software directly within the production context by using the OPS VO.
            3. Please don't forget to announce new critical SAM tests. This was not done for the SRMv2.2 and APEL-pub tests. Behind each SAM test failure, there is support staff (TPM, ROC, site, etc.) working to solve it. If those people don't know about SAM changes, they waste time finding out the reason for such failures.
            4. It seems that SAM DB is not synchronized anymore with GOC DB. By the way, current site changes in GOC DB (SD, adding/removing node, etc) are not taken into account by SAM.
          • Germany Switzerland
            1. It would be greatly appreciated if HIGH PRIORITY upgrades were announced a few days before they are released, for example: "we plan to put out a HIGH priority upgrade in the next 10 days or so". This makes planning easier, e.g. if a downtime is planned, the upgrade can be taken into account.
            2. SAM has trouble again (GGUS:40544). Can this be monitored by SAM itself?
            3. The central gLite WN installation on the software area was discussed in our regional DECH meeting, and a few sites have some concerns, e.g. whether all OSes would be supported. What is the current state of affairs?
          • Italy
            1. INFN-T1: CE availability problem caused by a missing SAM CA test. We initially submitted GGUS:40590, then understood the reason from the DE-CH report and ticket GGUS:40544.
          • SouthEasternEurope
            1. We've upgraded a couple of BDIIs to gLite 3.1, and now bdii.egee-see.org contains only gLite 3.1 BDIIs, with a couple more to be included within the week.
          • SouthWesternEurope
            1. Site UMinho-CP was put into certified and production status last week, but did not appear in SAM before this morning.
            2. No accounting entries for SWE at the RGMA server in the UK for the last weekend. Was there a central problem?
            3. CIEMAT reports that the SAM timeout for the SRM test may be too short if the SRM is under heavy load. Any feedback on this from other ROCs?
        • Update from SAM team
          Apologies from the SAM team for not having followed the correct procedure in releasing two new sets of tests, which are:
          1. the APEL-pub test has (finally) been made critical. The new APEL service was created in order to separate the tests from the CE test suite, thus avoiding any impact on site availability calculations. GOCDB automatically associated the APEL service with all CEs, and the new sensors were tested in Validation prior to being run in Production. Sites can now easily check whether they are correctly publishing accounting information (271 are, 75 are not).
          2. Release to Production of the much-anticipated SRMv2 tests last Thursday. The seven tests have been made critical for the time being (but with alarms suppressed). Roughly 25 SRMs were failing the tests because they were advertising fewer protocols than the Information System claimed (after 3 days, the number was down to 10). After a few weeks, this body (Ops meeting) should decide which of the seven tests should remain critical and generate alarms, bearing in mind that GridView should include the critical SRMv2 results in site availability calculations sometime in the future.
        • New Gstat Tests for LFC and BDII Service Types 15m
          Following GGUS:40475 and GGUS:38053 gstat will soon compare GlueServiceTypes for bdii_site, bdii_top, lcg-file-catalog and lcg-local-file-catalog against GOCDB node types Site-BDII, TOP-BDII, LFC and Local-LFC respectively.

          This is done in a very similar fashion to how GlueCEs and GlueSEs are already tested. In the first instance these will generate a "Warn" status within gstat.
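The comparison described in this item can be illustrated with a short Python sketch. It is not gstat's actual implementation. The mapping is taken from the text above, while the function name, the input format (plain sets of strings per site) and the message wording are assumptions made for illustration.

```python
# GlueServiceType -> GOCDB node type, as listed in the agenda item.
SERVICE_TO_NODE_TYPE = {
    "bdii_site": "Site-BDII",
    "bdii_top": "TOP-BDII",
    "lcg-file-catalog": "LFC",
    "lcg-local-file-catalog": "Local-LFC",
}


def compare_services(published, registered):
    """Compare published GlueServiceTypes with GOCDB node types for one site.

    published:  set of GlueServiceType strings seen in the information system
    registered: set of GOCDB node type strings registered for the site
    Returns a list of (status, message) tuples, ordered by service type.
    """
    results = []
    for service_type, node_type in sorted(SERVICE_TO_NODE_TYPE.items()):
        in_bdii = service_type in published
        in_gocdb = node_type in registered
        if in_bdii and not in_gocdb:
            results.append(
                ("Warn", f"{service_type} published but {node_type} not in GOCDB"))
        elif in_gocdb and not in_bdii:
            results.append(
                ("Warn", f"{node_type} in GOCDB but {service_type} not published"))
        elif in_bdii:
            results.append(("OK", f"{service_type} matches {node_type}"))
    return results


# Hypothetical site: publishes a site BDII and an LFC, but GOCDB lists
# a Site-BDII and a Local-LFC.
for status, message in compare_services(
        {"bdii_site", "lcg-file-catalog"}, {"Site-BDII", "Local-LFC"}):
    print(status, message)
```

A mismatch in either direction yields a "Warn", in line with the statement that the new tests will initially only raise warnings.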

        • Top BDII Publishing 15m
          A collection of Top BDIIs that are publishing is visible here. If you have a gLite 3.1 top level BDII then it should appear on this page. Please check.
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. [SWE ROC]: CMS opened a ticket against the site LIP-Coimbra saying that the disk space for CMS is full. Would it not be better to assign this kind of ticket to the VO instead of the site, assuming that the site fulfills the capacities agreed in a MoU or similar?
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.

          Time at WLCG T0 and T1 sites.

        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
        • Atlas report
          1. Downtime procedure: we noticed that some of the downtimes are not broadcast. Could the "broadcast" checkbox be checked by default?
            - downtime calendar for ATLAS: https://twiki.cern.ch/twiki/bin/view/Atlas/ATLASDowntimeCalendar
          2. GGUS: BNL mail for team tickets. Please check whether the team ticket mail for ATLAS is correct.
        • CMS report
          Speaker: Daniele Bonacorsi
        • LHCb report
        • Storage services: Recommended base versions
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

        • Storage services: this week's updates
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          • https://gus.fzk.de/ws/ticket_info.php?ticket=37059
          • https://gus.fzk.de/ws/ticket_info.php?ticket=39303
        Review of action items 5m
