WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    list of actions
    Minutes
      • 16:00 16:25
        EGEE Items 25m
        • Grid-Operator-on-Duty handover 5m
          From ROC UKI (backup: ROC Russia) to ROC France (backup: ROC Italy)

          Lead team handover
          1st mail: 13
          2nd mail: 16
          Quarantine: 16
          Site OK: 26
          Unsolvable: 1

          The SAM tests page seems very slow to load; a lot of time was spent just waiting to check the test results.

          Backup team handover
          Opened new: 7
          Closed: 19
          2nd mails: 10
          Updated: 20
          All together: 56

        • PPS reports
          PPS reports were not received from these ROCs: Asia Pacific; Italy; Northern Europe; Russia
        • No errors were reported on the CIC portal for PPS sites.

        • Question to CODs: Last week no SAM results were available for several days. Why did the CODs not report it? [Central Europe]
          ANSWER: This was likely due to a long unavailability of SAM results during last week.
        Speaker: Nicholas Thackray (CERN)
  • Survey: Migration plan to SLC4 5m
    SA3 would like to know the sites' migration plans/timelines to the SLC4 OS so they can plan SLC3 support and the associated backporting (e.g. gLite 3.1). Could the ROCs collect this information and report it through the coming ROC reports?
    The official timeline to phase out SLC3 (from the SLC4 team) is 6 months from now.
  • EGEE issues coming from ROC reports
    Reports were not received from these ROCs:

    1. gLite update 16
      (CE ROC): Do we have any news on why gLite update 16 caused an automatic reconfiguration of services, when such a reconfiguration should be done with site admin assistance?

      (UKI ROC): Yet another upgrade full of bugs and problems (with DPM, LFC). This is not acceptable: not only should sites strive to deliver a production quality of service, but developers should too! Some of the bugs, such as the dpmmgr problem, are not just bugs but a sign of very low software development skills - I understand developers are overworked :) but this raises the question of whether the projects are given enough resources to achieve their goals.

      (DECH ROC): Distinction between major and minor updates: Production update 16 hasn't been deployed smoothly. In our last regional meeting, many sites expressed their disfavour with the fact that small and major updates are all broadcast on the same schedule. Major updates should be announced differently and should, for example, be part of a major release (here, e.g. "3.1"). The ROCs would then have to take care that major updates are sufficiently deployed in their region. In this case, we as a ROC were not sufficiently aware (though there was a corresponding action about the torque update itself) that this major update was deployed, and sites were not warned about it, which caused many problems. It is a general issue to distinguish between major and minor updates.


    2. (SEE ROC, Aegis site): We see a large number of spurious SAM JS test failures for our gCE (eight) that can only be attributed to SAM WMS problems (rb108.cern.ch). The lcg-CE does not have such problems, nor do we see such problems for our gCE in the regional SAM SEE-GRID instance. Any chance that SAM WMS performance can be improved? There are still problems with scheduled downtimes in CIC reports, since SAM failures are sometimes reported even during the downtimes. We have reopened the relevant GGUS ticket.


    3. (UKI ROC): Concerns over a possible shift to a policy of not supporting auto-updates.


  • 16:00 16:05
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:25 16:55
    WLCG Items 30m
    Reports were not received from these tier-1 sites: BNL, INFN, NDGF, NIKHEF
    Reports were not received from these VOs:
    Atlas

    • Tier 1 reports
      more information
    • WLCG issues coming from ROC reports 5m
      Reports were not received from these ROCs:

      1. (CERN ROC): We are still trying to drain CE101, CE102 and CE105 but we recently got new jobs on them although there is a scheduled downtime and they are in draining mode. We ask the experiments not to use such CEs (we need to be able to do some micro management). Experiments have been informed about this via broadcast.


      2. (NE ROC): Information item for the ATLAS and LHCb VOs: one broken pool node. LHCb and ATLAS disk-only data residing on that node was lost. The experiments will be contacted about the missing SURLs.


      3. (UKI ROC): Atlas has been set read-only in dcache.gridpp.rl.ac.uk as they have filled up the space in this SE.


      4. (UKI ROC): GSI DCap ports for LHCb have been opened in the firewall (a quick connectivity-check sketch follows this list).

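      As a connectivity check for item 4 (a sketch only, not the UKI procedure): after such a firewall change, reachability of the dCache GSI dCap door can be verified with a simple TCP probe. The hostname below is a placeholder and 22128 is assumed to be the conventional gsidcap port.

          # Sketch: probe the (assumed) gsidcap port of a dCache door from outside the site.
          import socket

          HOST = "dcache.example.ac.uk"   # hypothetical dCache door
          PORT = 22128                    # assumed default GSI dCap port

          with socket.create_connection((HOST, PORT), timeout=5):
              print(f"TCP connection to {HOST}:{PORT} succeeded")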

    • WLCG Network Problems
      Severe perturbations of the traffic to some T1 sites have been traced to a faulty card in a router. This hardware fault appeared after the router downgrade last Thursday, when the card did not check in properly, but this was not detected. We are working with the manufacturer to understand the cause of this malfunction.

      Unfortunately, it was not understood that the scheduled downgrade of the router software would affect ALL Force 10 routers including the OPN (see intervention announcement), which would have greatly simplified diagnosis of the problems seen.

      Update - March 16. It turns out that the intervention on the OPN was announced by e-mail (see below), but this information was not correctly updated on the CERN status board nor announced via the EGEE broadcast tool.

      Subject: Urgent network maintenance - Thursday 8 March 2007
      Date: Wed, 07 Mar 2007 16:17:05 +0100
      From: Edoardo Martelli <edoardo.martelli@cern.ch>
      To: enoc.support@cc.in2p3.fr, it-dep-gd-gmod@cern.ch, wlcg-tier1-contacts@cern.ch
      CC: It Manageronduty <Mod@cern.ch>

      Dear LHCOPN users

      Please be aware of the emergency network maintenance that CERN will run tomorrow morning (see below).

      IMPACT: The two CERN routers that connect to the LHCOPN will be restarted: all the connections to the Tier1s will be down for a few minutes while the routers reboot.

      Thank you for your understanding.

      Edoardo

      A second EGEE broadcast was sent out by the GMOD, unfortunately with the same message as the first (see attachment - only 1 of the two messages sent a few minutes apart is attached!)

      However, Edoardo's mail will have reached the same WLCG Tier1 contacts (but not the other mailing lists) as the EGEE broadcast.

      Once again, apologies for the many inconveniences resulting from this problem.

      Broadcast text
    • Upcoming WLCG Service Interventions (with dates / times where known)

      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • CNAF srm endpoint intervention plan (see attached plan)

      • [LHCOPN] Maintenance on Tuesday 20/3/2007:
        On Tuesday the 20th of March 2007, between 8:00 and 8:30 AM CET, the connections from CERN to RAL and BNL will be interrupted for 5 minutes to allow the replacement of a module in one of CERN's LHCOPN routers.
        Impact: During the maintenance the primary links to RAL (CERN-RAL-LHCOPN-001) and BNL (CERN-BNL-LHCOPN-001) will be down for 5 minutes. However, the traffic will be re-routed to the backup links.

      Time at WLCG T0 and T1 sites.

      cnaf-srm-intervention
    • FTS service review 5m
        Read the attached report.
        Main issues this week:
        * FZK and BNL are back to normal operation
        * There was a problem with one of the LCG routers connecting to the Tier1s at the start of the week.
        * Data corruption at BNL was found by Atlas on files exported from CERN. This is being investigated by dCache experts with help from the FTS team.
      • FTS report index - status by site and by VO
      • Transfer goals - status by site and VO
      • Transfer Operations Wiki
      Speaker: Gavin McCance (CERN)
      FTS report
    • ATLAS service / "challenge" issues & Tier-1/Tier-2 reports
      Speaker: Kors Bos (CERN / NIKHEF)
    • CMS service /
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning
      -- Job processing: CMS MC production activities are switching to the new production round with CMSSW_1_2_3, which is being installed CMS-wide.
      -- Data transfers: last week was week 5 of the CMS LoadTest07 (see [*]) with focus on both T0-T1 routes and T1-T2 routes. Operations were quite smooth, with a daily average of 300-400 MB/s of CERN outbound traffic to the CMS T1's (the weekend was then basically off due to clean-up), and approximately a daily average of 100-200 MB/s of (aggregated) traffic from T1's to T2's. Since a major Castor intervention will occur at CERN on Wednesday this week, CMS will be partially moving the focus of this week to T1-T2 transfers.

      [*]http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • ALICE service /
      Due to the submission workflow of Alice, it is very important to know in real time the number of agents which are scheduled, finished and done at any moment. If this is not the case, the Alice submission workflow will not be able to decide the number of agents that should be submitted, and there is a risk of leaving the queues empty.

      This problem mostly affects big sites such as FZK and CERN. Alice has realized that the information provided by the LB is quite slow, and that the information provided by the IS is in some cases not accurate. We are therefore looking for a solution to this issue.
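
      A minimal sketch (not ALICE's actual machinery) of how such job counts are read from the information system: an anonymous LDAP query against a BDII for the Glue 1.x job-state attributes a CE publishes. The BDII endpoint and the CE queue name below are placeholders, and the ldap3 Python module is assumed to be available.

          # Sketch: query a BDII for the job counts published by one CE queue.
          from ldap3 import Server, Connection, ALL

          BDII = "ldap://lcg-bdii.cern.ch:2170"                      # placeholder top-level BDII
          CE_ID = "ce.example.org:2119/jobmanager-lcgpbs-alice"      # hypothetical CE queue

          conn = Connection(Server(BDII, get_info=ALL), auto_bind=True)   # anonymous bind
          conn.search("o=grid",
                      f"(GlueCEUniqueID={CE_ID})",
                      attributes=["GlueCEStateWaitingJobs",
                                  "GlueCEStateRunningJobs",
                                  "GlueCEStateEstimatedResponseTime"])
          for entry in conn.entries:
              print(entry)

      The numbers obtained this way are only as fresh as the site's information providers, which is precisely the staleness problem reported above.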
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
    • LHCb service /
      1. Massive removal of physical replicas on the various storages still requires site administrator intervention. LHCb wish to raise this as an issue to ensure that:
      (i) SRM-2 developers continue to be aware of this use case;
      (ii) site administrators are aware of possible requests from LHCb given the limitations of the current middleware. These large deletions need to be coordinated between LHCb & the sites; Marianne Bargiotti is the coordinator for this activity for LHCb.

      2. Following on from last week's report of dCache not staging in: workarounds for making the prestager agent work are either that all LHCb dCache sites open the dcap ports (now waiting for GridKa and IN2P3), or that we use a (still under test) utility that EIS made available for bringing files online from remote.
      The LHCb preferred option is to have lcg-gt stage files (whatever back-end is behind the SRM).
      A less elegant solution is to use dccp against dCache sites and lcg-gt against CASTOR sites (which requires the dcap port to be open at those sites).
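
      A rough illustration (not LHCb's production tooling) of the "less elegant" per-backend workaround just described: use lcg-gt to request a TURL (which stages/pins the file) at CASTOR sites, and a dccp pre-stage at dCache sites where the dcap port is open. The site list, the SURL/TURL values and the use of dccp's "-P" pre-stage option are assumptions for this sketch.

          # Sketch: trigger staging with the tool appropriate to the site's storage back-end.
          import subprocess

          DCACHE_SITES = {"gridka.de", "in2p3.fr"}        # hypothetical dCache site labels

          def trigger_stage(site, surl, dcap_turl=None):
              if site in DCACHE_SITES and dcap_turl:
                  # dCache: ask the pool to stage the file without copying it (assumed -P flag)
                  cmd = ["dccp", "-P", dcap_turl]
              else:
                  # CASTOR (or any SRM where lcg-gt works): requesting a TURL stages/pins the file
                  cmd = ["lcg-gt", surl, "gsiftp"]
              subprocess.run(cmd, check=True)

          # Example call with a hypothetical SURL:
          # trigger_stage("cern.ch", "srm://srm.cern.ch/castor/cern.ch/grid/lhcb/some/file")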

      Speaker: Dr Roberto Santinelli (CERN/IT/GD)
  • 16:55 17:00
    OSG Items 5m
    Item 1
  • 17:00 17:05
    Review of action items 5m
    list of actions
  • 17:10 17:15
    AOB 5m