WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Italy
  • Tier-1 sites: INFN, Triumf
  • VOs:
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From ROC CE (backup: ROC AP) to ROC DECH (backup: ROC SEE)

          Tickets:
            1. Backup team
              Opened new: 17
              Closed: 31
              New: 1
              Quarantine: 14
              2nd mail: 10
              Extend: 22

              Issues:
              # Cannot use " " in the contents of a ticket.
              # SAM results for some sites did not update to new (alert #21446).
        • PPS reports
          PPS reports were not received from these ROCs: Italy, Russia
          • PPS-Update 28 released to the PPS. This contains:
            • 1086 RGMA Client Exception Addition
            • 1089 Removal of incorrect apel-* deps from metapackages
            • 1120 FTS 2.0 (update)
            • 1133 glite-yaim 3.0.1-13
          • Significant issue found in the natively compiled SL4 WN (gridFTP ls causes a segmentation fault)
          • A meeting with all PPS sites (VRVS) is fixed for:
            Wednesday 9th May 2007
            from 16:00 to 17:30 GMT+2

            The preliminary agenda is available here.

          • SRM-2 testing in PPS
            Set-up of PPS to support this activity has started; progress in configuring the end-points in PPS is recorded here.

          • Issues coming from the ROCs
            1. CERN_PPS: LB availability improved by applying a workaround suggested by LB developers. Details about the workaround, if other sites are interested, can be found in GGUS (#25976) [CERN ROC]
            2. Admins in CERN_PPS have registered with the new "site alert notification" tool in the CIC Portal. The feedback so far is positive. [CERN ROC]
            3. DESY-PPS and FZK-PPS are involved in testing SRMv2.2 (dCache 1.8 beta). [DECH ROC]
          Speaker: Nicholas Thackray (CERN)
        • Phase out of classic SE 15m
        • EGEE issues coming from ROC reports
          1. (ROC UKI, from last week): Do ad hoc, site-submitted SAM tests get published into the database used to calculate site availability?


          2. (ROC SEE): The release notes of Update 23 to gLite were, yet again, incomplete: https://gus.fzk.de/pages/ticket_details.php?ticket=21392 The quality of the release notes for updates should be improved: they have to reflect the actions that need to be taken on production sites, i.e. on services already running, and not just consider the deployment of new services. The transparency of recent updates (specifically 21 and 22) is highly dubious, since we encountered problems that caused loss of jobs. This needs to be highlighted in the release notes. Another example: https://gus.fzk.de/pages/ticket_details.php?ticket=21155


          3. (ROC SWE): It would be nice to get the VO configuration template files for the new yaim (vo.d structure) from each VO, to prevent misconfiguration. It would also be nice to update the gLite documentation to include a description of the new yaim version. Small and new sites had problems deploying it.


          4. (ROC SEE): Another problem is the continuing nightmare with gCE SAM tests, which is mainly due to SAM WMS problems. We opened two GGUS tickets on this, but there are still no clues as to what may be causing those problems: https://gus.fzk.de/pages/ticket_details.php?ticket=20732 https://gus.fzk.de/pages/ticket_details.php?ticket=21454 We even observe gCE SAM tests where all individual tests pass with OK, but the overall status is JS. While the following GGUS ticket hints at a source of problems, we think that there may be other problems related to rb108.cern.ch, where all these failures occur: https://gus.fzk.de/pages/ticket_details.php?ticket=20625.

          5. (ROC SWE): We still see the intermittent error "BDII Connection Timeout: bdii.pic.es:2170" from the replica manager test, but this only happens with the replica manager client. How is this related?


          6. (ROC SWE): The gLite error "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE" is handled by the COD people as a site problem, even though it should be seen as a middleware problem.


          7. (ROC UKI): The other point we wish to raise is that trying to "research" obscure error messages thrown by the middleware through the web does not seem to be a useful or efficient way to tackle problems. This has been raised in the past and error messaging hasn't improved at all. Now that service level targets are becoming more important and are going to be based on SAM test results, being tagged red with a meaningless error message thrown by the middleware is quite unhelpful and won't necessarily reflect the correct figures for a site's availability.
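
          As an aid to reading the PeriodicHold expression quoted in item 6, here is a minimal Python sketch of its logic. It assumes the expression follows Condor ClassAd semantics, where `=!=` means "is not identical to" (so an undefined `Matched` attribute counts as not matched) and `CurrentTime` and `QDate` are times in seconds. The function name and the use of `None` for an undefined attribute are illustrative choices, not part of gLite.

```python
def periodic_hold(matched, current_time, q_date, grace_s=900):
    """Mirror of: Matched =!= TRUE && CurrentTime > QDate + 900

    matched      -- True, False, or None (None stands in for UNDEFINED)
    current_time -- current time in seconds
    q_date       -- time the job was queued, in seconds
    grace_s      -- seconds a job may wait for a match (900 in the quoted expression)
    """
    # "=!= TRUE" holds for anything that is not exactly TRUE,
    # including an attribute that was never set.
    not_matched = matched is not True
    # The job is put on hold only once the grace period has fully elapsed.
    return not_matched and current_time > q_date + grace_s
```

          Read this way, the message means only that the job waited more than 15 minutes in the queue without being matched to a resource; on its own it does not say which component failed to produce the match.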


      • 16:30 17:00
        WLCG Items 30m
        • Tier 1 reports
          T1 reports
        • Plans for SRM v2.2 deployment in production 5m
          We are preparing for SRM v2.2 deployment in production. For certification purposes we need sites to configure 2 REPLICA-ONLINE test spaces of 200MB each with dteam_test1 and dteam_test2 space token descriptions.
          Speaker: Dr Flavia Donno (CERN AND INFN)
        • WLCG issues coming from ROC reports
          1. (ROC ???): ???


        • Upcoming WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          • None this week

          Time at WLCG T0 and T1 sites.

        • FTS service review
          Speaker: Gavin McCance (CERN)
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information. MC Production: the fix for the Job Priority (publication of DENY tags) has been successfully tested at Nikhef. Tests are ongoing for Valencia (pre-production). If the latest tests are successful as well, we will ask to push the fix to the rest of Pre-Production early this week.
          Speaker: Kors Bos (CERN / NIKHEF)
        • CMS service
          -- Job processing: 'Spring07' MC production based on CMSSW_1_3_0 started on Apr 25th and is in progress. All CMSSW_1_2_0 datasets and 95% of Spring07 GEN_SIM have already been migrated to DBS-2. Progress on CMSSW_1_3_1 production also (>5 Mevts in 7 days).
          -- Production data transfers: Spring07 GEN-SIM data shipping out of CERN/FNAL is needed in order to make room for HLT processing: ~4 TB of data have been shipped; main destinations: FZK, ASGC, Legnaro, CNAF, Florida.
          -- Test data transfers: Last week was week 2 of Cycle 3 of the CMS LoadTest07 [*]. Since Tuesday noon (GVA time), ~800-1000 MB/s of aggregate transfers over the WAN.

          [*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          Point 1. Instability of SRM endpoints at T1.
          The reconstruction activity, after a few days in which all T1 sites were running happily, started to degrade because the SRM response became very slow (srm-get-metadata or lcg-gt for staged files takes a long while). RAL and CERN are currently suffering from this problem: they are not doing as well as during the last week. Could sysadmins investigate?
          IN2P3 seems to be much better since this morning.
          NIKHEF and CNAF are OK.
          We observed that a reboot of SRM would cure all problems.

          Point 2.
          Sites should have a sensor that regularly runs srm-get-metadata on an existing test file and measures the time it takes. In case of slowness, that sensor should trigger an alarm at site level; a similar test might also be part of the SAM test suite.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
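
          The sensor proposed in Point 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the 30-second slowness threshold, the 300-second hard timeout, the function names and the status labels are all hypothetical, and the exact srm-get-metadata arguments depend on the site's client installation.

```python
import subprocess
import time


def time_command(cmd, timeout_s=300.0):
    """Run cmd and return (elapsed_seconds, ok)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the client could not be started at all.
        ok = False
    return time.monotonic() - start, ok


def classify(elapsed_s, ok, slow_threshold_s=30.0):
    """Map one measurement to a status a site alarm could act on."""
    if not ok:
        return "FAILED"
    if elapsed_s > slow_threshold_s:
        return "SLOW"
    return "OK"


def probe(cmd, slow_threshold_s=30.0, timeout_s=300.0):
    """Time one run of cmd and return (status, elapsed_seconds)."""
    elapsed, ok = time_command(cmd, timeout_s)
    return classify(elapsed, ok, slow_threshold_s), elapsed


# Hypothetical usage, with a placeholder SURL for the site's test file:
#   status, elapsed = probe(["srm-get-metadata", "srm://se.example.org/path/to/testfile"])
#   if status != "OK":
#       ...raise the site-level alarm...
```

          Run periodically (for example from cron), a probe like this gives a response-time history for the SRM endpoint, so that slowness is noticed before jobs start to degrade.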
        • ALICE service
          ALICE has just updated the AliEn version to v2.13. We are updating the sites. The open issue I had last week (related to the test of different information providers: IS, GRIS, LB and batch system) can be closed.
          We have been printing this info via MonALISA for weeks and I have announced it via the support-eis list.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination Issues
          Speaker: Jamie Shiers / Harry Renshall
      • 16:55 17:00
        OSG Items 5m
        1. Item 1
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m
      • Operations workshop in Stockholm, 13-15th June, agenda available:
        http://indico.cern.ch/conferenceTimeTable.py?confId=12807