WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Problems with the CIC Portal. Update from Gilles: all reports have been correctly submitted this week (cf. the broadcast sent to ROC managers this morning).
  • VOs:
  • list of actions
    Minutes
      • 16:00 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC France / ROC SW Europe
          To: ROC Asia-Pacific / ROC SE Europe


          NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues:
            Report from France ROC:
          1. GSTAT BDII sanity check OK, but SAM BDII sanity check not OK (e.g. FZK-PPS) -> which test should be taken into account?
          2. Problem with switching monitoring on/off in GOC DB; sites are complaining. -> At the Ops meeting on August 13th, Andy said this is a bug in GOC DB: monitoring 'on' should only be possible for production sites. GOC DB should fix that asap.
          3. Poor network connectivity causes SAM test failures from time to time (e.g. YerPhi) -> What does COD do with this kind of site?
          4. Request for site suspension: BEIJING-CNIC-LCG2-IA64.
            Site/ROC asked to be present at Ops meeting on 2006-08-14.
            ROC CERN present at Ops meeting on 2007-08-20. ROC CERN asked the site to answer before 2007-08-24.
            No answer from site -> suspension requested.
          5. Transferred to political instances, and still not solved:
            1. DESY-PPS id#5720 - GGUS Ticket #25511
            2. YerPhi id#4542 - GGUS Ticket #21435
              Opened for armgrid1.yerphi.am on 2007-05-02, and CE-sft-job is still not fixed. Last escalation step on 2007-08-09, and problem still there. Ask site or ROC to be present at Ops meeting on 2007-08-24. Site/ROC answer on 2007-08-24 about poor network.
            Report from SW Europe ROC:
          6. Many nodes are monitored by SAM tests but are not registered in GOC DB.
        • Announcing SAM updates
          Speaker: Piotr Nyczyk
        • URL of central BDII configuration
            The top-level BDII configuration URL is referenced on the CERN ROC web site:

            http://www.cern.ch/roc/index.php?dir=./bdii/

            • http://lcg-bdii-conf.cern.ch/bdii-conf/bdii.conf
            which is replicated every 10 min to:
            • http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf

            See the News section on the CIC Portal: new item with the title "top-level BDII config URL"
        • Use of production resources for SRM2.2 testing
          Speaker: Flavia Donno
        • Clarification regarding site names for WLCG tier-2 accounting
            Here is a response from the WLCG GDB to the question raised by ROC DECH last week (starting "ROC DECH: [For Information] Some confusion about Tier-2 accounting/renaming initiative...").

            No-one is asking sites to rename. The only new names required are for Tier2 Federations. These do not currently appear in GOCDB. The T2 Federation name will point to a list of current GOCDB names of the sites which make up the federation. The accounting will then present a view for the T2 which sums the accounting of all the sites in that T2.
            An example: The Romanian T2 is a federation with a shortname of RO-LCG. It is made up of the following sites: NIHAM, RO-01-ICI, RO-02-NIPNE, RO-03-UPB, RO-07-NIPNE, RO-11-NIPNE. None of them has been renamed.

            There is a special case for T2s that consist of only one site.
            Example: SIGNET in Slovenia. The site name in GOCDB remains SIGNET, but in the T2 list it is called SI-SIGNET, formed by adding the country code to the start of the GOCDB name. No renaming.

            There may be issues if there are GOCDB sites which claim to be in two different federations but that is the sort of issue we are looking for. Sue Foffano (Sue.Foffano@cern.ch) is coordinating the report so issues should be reported to her.
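            The mapping described above can be sketched roughly as follows. This is only an illustration, not the accounting portal's implementation: the function names and the usage figures are hypothetical, and only the RO-LCG and SI-SIGNET memberships are taken from the examples in this item. The second function shows one way to spot GOCDB sites that claim to be in two federations.

```python
from collections import defaultdict

# Federation short name -> current GOCDB site names.
# Memberships are copied from the examples in the minutes; nothing else is real data.
FEDERATIONS = {
    "RO-LCG": ["NIHAM", "RO-01-ICI", "RO-02-NIPNE",
               "RO-03-UPB", "RO-07-NIPNE", "RO-11-NIPNE"],
    "SI-SIGNET": ["SIGNET"],  # single-site T2: country code + GOCDB name
}


def federation_totals(federations, site_usage):
    """Sum per-site usage (hypothetical units) into one figure per federation."""
    return {
        name: sum(site_usage.get(site, 0) for site in sites)
        for name, sites in federations.items()
    }


def sites_in_multiple_federations(federations):
    """Return {site: [federations]} for sites listed in more than one federation."""
    membership = defaultdict(list)
    for name, sites in federations.items():
        for site in sites:
            membership[site].append(name)
    return {site: names for site, names in membership.items() if len(names) > 1}
```

            Sites keep their GOCDB names throughout; only the federation keys are new.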
        • Migration to SL4 WNs
            The WLCG Management Board and the GDB have requested that all WLCG Tier-1 sites migrate to the SL4/gLite 3.1 WN by the end of August.
            The MB and GDB have also expressed the strong desire that all WLCG sites migrate to the SL4/gLite 3.1 WN as soon as possible.

          Updates from the Tier-1 sites:
          • ASGC: New CE hosting 200 SL4 cores has been brought online Aug 10, 2007. Remaining 350 cores will be migrated to the new CE in phases.
            Preparing for next batch of SL3 WNs to be migrated to SL4, but no changes this past week.

          • CERN: CERN is on track to fulfil its commitment to provide SL4-based WNs by the agreed date of end of August.

          • BNL: No report. Status???

          • FermiLab: No report. Status???

          • TRIUMF: All new resources will be installed with SL4 and will come online around 20th August. The old cluster will be moved and re-installed with SL4 shortly after.

          • IN2P3: No report. Status???

          • GridKA: All WNs at GridKa on SL4 since 27-7-2007. gLite WN-package 3.021 ("compatible"). Upgrade to 3.1 WN package planned for early September after allowing some time for testing in PPS at the end of August.

          • INFN: Have migrated half and are unsure when this can be completed. Non-LHC VOs are holding them back.

          • SARA/NIKHEF: NIKHEF planned to upgrade their WNs to CentOS-4 this week (week 33). This has now been done.
            SARA will upgrade the WNs in September (no date fixed yet). It is not possible to do this earlier because the people involved are on vacation.

          • PIC: We have migrated nearly 90% of our WNs to SLC4.
            All the Grid WNs at PIC are now running under SLC4 and gLite 3.1.

          • RAL: CE lcgce02.gridpp.rl.ac.uk was deployed for access to SL4 WNs, and 20% of RAL's worker node capacity has been reinstalled with SL4. RAL is discussing with the experiments further migration of capacity. Four Tier-2 sites have migrated clusters to SL4 but other sites are waiting for confirmation of a fix for the CMS and LHCb problems or do not wish to upgrade during the August holiday period.

        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          14:50: ROC Reports not readable from the CIC Portal

          Issues from EGEE ROCs:
          1. [UKI]: Continued problems trying to install the gLite CE at PPS-RAL. Has any other site managed a fresh install of the latest gLite-CE version? Even though an LCG-CE is now being installed we would like to understand where the apparent repository problem came from.

          2. Answer by PPS Coordination: We would just like to point out that installation of the gLite CE has been discouraged in PPS, as it was in production, so most PPS sites are currently diverting the effort needed to install and configure it to other tasks.
          Release News:
          • .
        • EGEE issues coming from ROC reports
          1. UKI: Problem submitting the production report today - status page never showed the report as submitted. PPS report was submitted fine.


          2. Central Europe: When will the lcg-CE be available for SL(C)4? Support for SLC3/i386 stops in October 2007. Also, some modern HW doesn't work on SLC3, so we need the lcg-CE on SLC4. SLC support dates


          3. DECH: Now that all sites are encouraged to upgrade their WNs to the SL4 package, how long can they still expect some support for the SL3 version?


          4. US-ATLAS: A site administrator can schedule a downtime via GOCDB. If the server monitor has to be ticked off to avoid impacting uptime, can GOCDB automatically tick it off when the downtime starts? My experience is that when I schedule a downtime, I have to go back to GOCDB manually and tick it off when the downtime starts. This is very error-prone; I always forget to do it.
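
            One way to picture the behaviour requested in item 4 is to derive the monitoring flag from the scheduled downtime windows instead of toggling it by hand. The sketch below is purely illustrative and is not GOCDB code: the function name and the representation of downtimes as (start, end) pairs are assumptions.

```python
from datetime import datetime


def monitoring_enabled(now, downtimes):
    """Return False while `now` falls inside any scheduled downtime window.

    `downtimes` is a hypothetical list of (start, end) datetime pairs;
    a window covers start <= now < end.
    """
    return not any(start <= now < end for start, end in downtimes)


# Example: one downtime scheduled for the morning of 28 August 2007.
scheduled = [(datetime(2007, 8, 28, 8, 0), datetime(2007, 8, 28, 12, 0))]
during = monitoring_enabled(datetime(2007, 8, 28, 9, 30), scheduled)
after = monitoring_enabled(datetime(2007, 8, 28, 12, 0), scheduled)
```

            With this approach nobody has to remember to switch monitoring off at the start of the downtime or back on at the end.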


      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. RAL Tier-1:

            Problem: LHCb VO Box went offline.
            Diagnosis: After running some diagnostic tools it was determined that the hardware had developed a fault.
            Solution: The VO Box hard disk was moved to a new host and the service was restored.

            Problem: A dcache-tape.gridpp.rl.ac.uk tape snapped.
            Diagnosis: The tape was assigned to Dteam, so no experiment data was affected; the tape appears mainly to have stored files from Service Challenge 3 and from SAM tests.
            Solution: File entries will be removed from the dcache-tape.gridpp.rl.ac.uk SE.

          2. CERN-PROD: CMS have requested that their production WMSs are installed with the latest version of the gLite 3.1/SL3 WMS which has just been released to the PPS. Any objections? Does CMS accept the responsibility and risk of installing software which isn't fully tested?


        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. CERN-PROD: router upgrade (batch/CE paused) + castorpublic/SE upgrade

          Time at WLCG T0 and T1 sites.

        • FTS service review

          Please read the report linked to the agenda.

          On Tuesday the RAL FTS will have an outage to move to version 2.0.

          Speakers: Gavin McCance (CERN), Steve Traylen
          Paper
        • CMS service
          • No report.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • No report.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • No report.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • Service Coordination
          The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September. See https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4
          The CMS CSA07 service challenge is due to start on 10 September and run for 30 days. See https://twiki.cern.ch/twiki/bin/view/CMS/CSA07Plan
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55 17:00
        OSG Items 5m
        1. Item 1
      • 17:00 17:05
        Review of action items 5m
        list of actions
      • 17:10 17:15
        AOB 5m
        • No meeting next week due to CHEP. Next meeting on Monday 10 September.