WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service, joining details below)

Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0157610

    Minutes
  • 16:00-16:05  Feedback on last meeting's minutes (5m)
  • 16:05-16:30  EGEE Items (25m)
    • Grid-Operator-on-Duty handover
      From ROC SWE (backup: ROC DECH) to ROC UK/I (backup: ROC Russia)

      Lead team handover
      Tickets:
      Backup team handover:
      • Open: 36
      • 1st mail: 22
      • 2nd mail: 16
      • Quarantine: 13
      • Site ok: 59
      • Solved by ROC: 26
      • Unsolvable: 3

      Notes:
      Report from backup team (DECH):
      • Many sites failing with JS (job submission) errors, most of them probably related to the recent middleware update.
      • Many RM (replica management) errors on Friday, related to sites publishing duplicate LFCs.
    • PPS reports
      PPS reports were not received from these ROCs: Italy, North Europe, Russia.
      • Problem with SAM tests for PPS: job submission tests are failing with WARN status. [Central Europe]

      Speaker: Nicholas Thackray (CERN)
    • EGEE issues coming from ROC reports
      Reports were received from all ROCs (none missing).
      1. Question for ATLAS from HPC2N:
         We need to terminate our storage element ibelieve-i.hpc2n.umu.se (an old "classic" SE) in preparation for a major upgrade and reconfiguration. There are 548 files owned by ATLAS on it, most of them data and log files from the Rome and DC2 runs. This information was broadcast to the ATLAS VO on 14 February but we have still not received any response. Please advise. (NE ROC)


      2. As AEGIS-01 points out in GGUS ticket 19464 (https://gus.fzk.de/pages/ticket_details.php?ticket=19464), when an update to the production service is issued, everything needed should be ready, including documentation and the repository. (SE Europe ROC)


      3. We need to find a way to isolate sites that publish central LFCs by mistake; perhaps a BDII update script could fix this (see the sketch below). (SE Europe ROC)
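      For illustration, a minimal sketch of what such a check could look like, assuming the GLUE 1.x convention that a central LFC is published with GlueServiceType=lcg-file-catalog (a local catalogue uses lcg-local-file-catalog) and that a top-level BDII answers LDAP queries on port 2170. The BDII hostname below is a placeholder, and the output parsing is illustrative rather than a tested tool.

      #!/usr/bin/env python
      # Sketch: flag VOs for which more than one central LFC is published
      # in the BDII. Assumes the GLUE 1.x service types named above.
      import subprocess
      from collections import defaultdict

      BDII = "ldap://lcg-bdii.cern.ch:2170"  # placeholder top-level BDII

      def central_lfc_entries():
          """Query the BDII for central LFC services; return, per endpoint,
          the list of VOs it is published for (from the ACL rules)."""
          out = subprocess.run(
              ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid",
               "(GlueServiceType=lcg-file-catalog)",
               "GlueServiceEndpoint", "GlueServiceAccessControlRule"],
              capture_output=True, text=True).stdout
          vos_by_endpoint = defaultdict(list)
          endpoint = None
          for line in out.splitlines():
              if line.startswith("GlueServiceEndpoint:"):
                  endpoint = line.split(":", 1)[1].strip()
              elif line.startswith("GlueServiceAccessControlRule:") and endpoint:
                  vos_by_endpoint[endpoint].append(line.split(":", 1)[1].strip())
          return vos_by_endpoint

      def report_duplicates():
          """Print every VO for which more than one central LFC endpoint
          is published, i.e. the situation described in item 3."""
          endpoints_by_vo = defaultdict(set)
          for endpoint, vos in central_lfc_entries().items():
              for vo in vos:
                  endpoints_by_vo[vo].add(endpoint)
          for vo, endpoints in sorted(endpoints_by_vo.items()):
              if len(endpoints) > 1:
                  print("VO %s: %d central LFCs published" % (vo, len(endpoints)))
                  for endpoint in sorted(endpoints):
                      print("    " + endpoint)

      if __name__ == "__main__":
          report_duplicates()

      Note that this sketch only detects duplicates; actually removing an offending entry would have to happen at the publishing site or in the top-level BDII's update configuration, not in a read-only query like this one.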


  • 16:30-17:00  WLCG Items (30m)
    Reports were not received from these Tier-1 sites: INFN, NDGF, SARA/NIKHEF, BNL, TRIUMF.
    • gLite WMS/CE – LCG RB/CE deployment strategy
      Speaker: Dr Ian Bird (CERN)
      Material: document, slides
    • Upcoming WLCG Service Interventions (with dates / times where known)

      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • New SRM v1 endpoint for LHCb at INFN-T1.
        The old endpoint will remain active for LHCb for the time needed to complete the FTS and LFC changes.
      • CASTOR nameserver DBs will be migrated to new hardware on Monday 2 April 2007. At the same time, 3 network switches affecting many Grid services at CERN will be replaced. The total intervention will start at 07:30 UTC (08:30 Geneva time) and last for about 2 hours. More details (full list of affected services, etc.) will be provided later.

      Time at WLCG T0 and T1 sites.

    • FTS service review

      Full weekly report

      Main issues this week:

      • Transfer rates ranged from 90 to 420 MB/s, averaging around 200 MB/s per day.
      • All major T1 sites were involved. FZK: very unstable SRM; problems expected until 16 March.
      • Mostly traffic from CMS and ATLAS; little activity from LHCb.
      • Significant service degradation due to the CASTOR ATLAS problem over the weekend.
      • 12 tickets were submitted to sites this week for a variety of problems; 7 have been solved.
      • FTS 2.0 pilot service available for testing by experiments.
      • Throughput plots

      Speaker: Gavin McCance (CERN)
    • ATLAS service / challenge issues & Tier-1/Tier-2 reports
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service / challenge issues & Tier-1/Tier-2 reports
      -- Job processing: CMS MC production activities are switching to a new production round; hence a ramp-down occurred last week.
      -- Data transfers: last week was week 4 of the CMS LoadTest07 (see [*]), with focus on T0-T1 routes. All issues found in weeks 1-2 were addressed and fixed in week 3, and operations were quite smooth in week 4. Not all T1s participated, though (ASGC had just finished migrating to CASTOR-2, RAL was down due to CASTOR-2), and not all participating T1s moved data at the same time, but a fairly constant aggregate transfer rate out of CERN was seen on most days of the week (see GridView). A stop around Thursday noon (CET) was observed, probably due to disturbances to some services after the network interventions. Next week CMS will restart the LoadTest with T0-T1 transfers in the multi-VO scenario, and will additionally test selected T1-T2 routes.
      -- Other preparation activities: as planned, a first working prototype of the set of CMS-specific tests within the SAM infrastructure is in place.
      [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN Bologna, Italy)
    • ALICE service / challenge issues & Tier-1/Tier-2 reports
    • LHCb service / challenge issues & Tier-1/Tier-2 reports
      LHCb want to raise again the issue, brought up last week, of the inconsistent behavior of SRM on dCache sites when the lcg-gt command is issued. They would like to understand from the dCache experts and the involved site admins (RAL, GridKa, IN2P3 and SARA) what could be done to have files staged in when the TURL is requested.
      For the time being, the reprocessing activity in LHCb has been stopped because of this problem, since ROOT (or at least the gsidcap plug-in used by the application) can only open files that are present on disk.
      To tackle this problem they have developed their own pre-stager service (a very old idea in LHCb) that coordinates all their reprocessing jobs, allowing them to run only when all input files are staged on the disk pool. This special agent (now under test) relies on customized tools/scripts/commands that discriminate the SRM back-end storage and accordingly trigger stage-in operations; a sketch of the idea follows the speaker line below.
      Speaker: Dr Roberto Santinelli (CERN/IT/GD)
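      For illustration, a minimal sketch of the pre-stager loop described above, assuming SRM v1 era client commands: lcg-gt (the command discussed above, which requests a TURL and on dCache should also schedule a stage-in) and srm-get-metadata for polling the stage status. The SURLs, the metadata output parsing and the polling interval are hypothetical; the actual LHCb agent and its tools are not described in these minutes.

      #!/usr/bin/env python
      # Sketch of a pre-stager loop: trigger stage-in for all input files
      # of a job, then release the job only once everything is on disk.
      import subprocess
      import time

      # Hypothetical SURLs needed by one reprocessing job.
      JOB_INPUT_SURLS = [
          "srm://gridka-dcache.fzk.de/pnfs/gridka.de/lhcb/prod/file1.dst",
          "srm://gridka-dcache.fzk.de/pnfs/gridka.de/lhcb/prod/file2.dst",
      ]

      def trigger_stage_in(surl):
          """Ask for a gsidcap TURL; on dCache this should also make the
          pool manager stage the file if it is only on tape."""
          subprocess.run(["lcg-gt", surl, "gsidcap"], check=False)

      def is_staged(surl):
          """Poll the SRM for the file's disk status. The output format
          of srm-get-metadata is back-end dependent, so this string
          match is only illustrative."""
          out = subprocess.run(["srm-get-metadata", surl],
                               capture_output=True, text=True).stdout
          return "isCached :true" in out

      def prestage_and_wait(surls, poll_seconds=300):
          """Block until every input file of the job is on a disk pool."""
          for surl in surls:
              trigger_stage_in(surl)
          while not all(is_staged(s) for s in surls):
              time.sleep(poll_seconds)

      if __name__ == "__main__":
          prestage_and_wait(JOB_INPUT_SURLS)
          print("All inputs staged; the reprocessing job can be released.")

      The point of the agent is exactly this ordering: stage-in requests go out first for the whole input set, and the job is only released once the full set is on disk, so ROOT (via the gsidcap plug-in) never tries to open a tape-resident file.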
  • 17:00-17:05  OSG Items (5m)
    Item 1
  • 17:05-17:10  Review of action items (5m)
    list of actions
  • 17:15-17:20  AOB (5m)