WLCG-OSG-EGEE Operations meeting

Nick Thackray
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
        Grid-Operator-on-Duty handover
        Migration to SL4 WNs
            The WLCG Management Board and the GDB have requested that all WLCG Tier-1 sites migrate to the SL4/gLite 3.1 WN by the end of August.
            The MB and GDB have also expressed the strong desire that all WLCG sites migrate to the SL4/gLite 3.1 WN as soon as possible.

          Updates from the Tier-1 sites:
          • ASGC: A new CE hosting 200 SL4 cores was brought online on Aug 10, 2007. The remaining 350 cores will be migrated to the new CE in phases.
            Preparing for the next batch of SL3 WNs to be migrated to SL4, but no changes this past week.

          • CERN: CERN is on track to fulfil its commitments to provide SL4-based WNs by the agreed date of end of August.

          • TRIUMF: All new resources will be installed with SL4 and will be coming online around 20th August. The old cluster will be moved and re-installed with SL4 shortly after.

          • GridKa: All WNs at GridKa have been on SL4 since 27-7-2007. gLite WN-package 3.021 ("compatible"). Upgrade to the 3.1 WN package is planned for early September, after allowing some time for testing in PPS at the end of August.

          • INFN: Have migrated half and are unsure when the migration can be completed. Non-LHC VOs are holding them back.

          • SARA/NIKHEF: NIKHEF planned to upgrade their WNs to CentOS-4 this week (week 33); this has now been done.
            SARA will upgrade the WNs in September (no date fixed yet). It is not possible to do this earlier because the people involved are on vacation.

          • PIC: We have migrated nearly 90% of our WNs to SLC4.
            All the Grid WNs at PIC are now running under SLC4 and gLite 3.1.

          • RAL: A new CE has been deployed - lcgce02.gridpp.rl.ac.uk for access to SL4 WNs, and 20% of our worker node capacity has been reinstalled with SL4. The test CE lcgce0371.gridpp.rl.ac.uk has been taken out of service. We are discussing with the experiments further migration of capacity.

        PPS Report & Issues
          Issues from EGEE ROCs:
          1. The SAM client at Cyfronet has been reconfigured to use glite-wms-* commands. There is an unknown problem during matchmaking of jobs directed to ce110.cern.ch; it is under investigation (with Ulrich Schwickerath) [ROC CE].
          Release News:
          • gLite 3.0.2 Update 34 is about to be fast-tracked into production. It will contain the new lcg-vomscerts package (version 4.6.0), which adds the host certificate of the US-ATLAS VOMS server vo.racf.bnl.gov and removes the old lcg-voms.cern.ch certificate that has expired. The release is supposed to be issued today.
        EGEE issues coming from ROC reports
          1. NE: The SAM job never completes successfully if one of the tests times out or blocks/hangs. For our express queue this results in the SAM job running out of wallclock time and getting killed, which in turn means that no SAM results are published and the entire site fails for all tests. Is there a way to prevent this from happening, by killing off hanging/blocking tests in the SAM job within a few minutes? Otherwise it would be useful to know the expected runtime of the SAM job, i.e. with what wallclock limit it should run. Is anything like that specified anywhere?

          2. Should the memory size published in GlueHostMainMemoryRAMSize be per WN or per CPU core?
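
          As an illustration of the per-test timeout asked about in item 1 (NE), the following is a minimal sketch and not the actual SAM implementation. It assumes each test can be started as a separate command; the function name run_tests, the test names and the timeout value are hypothetical. A test that exceeds its time limit is killed and recorded as "timeout", so the remaining tests still run and results can still be published.

```python
import subprocess
import sys


def run_tests(tests, per_test_timeout):
    """Run each (name, argv) test in turn and return {name: status}.

    A test that runs longer than per_test_timeout seconds is killed and
    reported as "timeout" instead of blocking the whole job.
    Note: only the direct child process is killed by subprocess.run.
    """
    results = {}
    for name, argv in tests:
        try:
            completed = subprocess.run(
                argv,
                timeout=per_test_timeout,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            results[name] = "ok" if completed.returncode == 0 else "error"
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed and reaped the child here.
            results[name] = "timeout"
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins for real tests: one passes, one hangs.
    demo = [
        ("quick-test", [sys.executable, "-c", "pass"]),
        ("hanging-test", [sys.executable, "-c", "import time; time.sleep(60)"]),
    ]
    for test_name, status in run_tests(demo, per_test_timeout=5).items():
        print(test_name, status)
```

          With a limit of this kind, the total runtime of the job is bounded by the number of tests multiplied by the per-test timeout, which would also give sites a figure to compare against their queue wallclock limits.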

        Tier 1 reports
        WLCG issues coming from ROC reports
        WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. Decommissioning of SL3 WNs at DESY-HH: Queues of CEs grid-ce0 and grid-ce2.desy.de will be drained on 21.9.07 and finally shut down on 24.9.07. WNs will be reinstalled with SL4 and included in the existing new CE grid-ce3.desy.de.

          Time at WLCG T0 and T1 sites.

        FTS service review

          Please read the report linked to the agenda.

        ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • We noticed that some sites, in order to run ATLAS production, have tried to install python32. https://gus.fzk.de/pages/ticket_details.php?ticket=25690&from=allt We want to remind sites that python32 is no longer supported, and we will discuss in the next ATLAS taskforce (one week from now) which policy is better for ATLAS.
        CMS service
        LHCb service
        ALICE service
        Service Coordination
          The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September. See https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4 The CMS CSA07 service challenge is due to start on 10 September and run for 30 days. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan
        Review of action items 5m
