WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure, based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0157610


    NB: Reports were not received in advance of the meeting from:

  • ROCs: All reports received.
  • VOs: ALICE, CMS, BioMed, LHCb
    Minutes
      • 16:00–16:05
        Feedback on last meeting's minutes 5m
        Minutes
      • 16:01–16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC Russia / ROC SE Europe
          To: ROC Central Europe / ROC DECH


          NB: Could the grid operator-on-duty teams please submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues:
          1. A long-standing alarm, id #4706 (GGUS ticket #21903), was opened for the node grid-srm.physik.rwth-aachen.de (RWTH-Aachen) on 2007-05-14, but the problem is still not fixed.


          2. There are many nodes which are tested but not registered in the GOC DB:
            1. datagrid.lbl.gov (CERN_PPS)
            2. ccsrmtestv2.in2p3.fr (IN2P3-CC-PPS)
            3. test-dpm.nibk.ac.at (HEPHY-UIBK)
            4. w-fts01.grid.sinica.edu.tw (Taiwan-LCG2)
            5. cclcgseli05.in2p3.fr (IN2P3-CC)
            6. cclcgseli06.in2p3.fr (IN2P3-CC)
            7. ant2.grid.sara.nl (CERN_PPS)
            8. storm-fe.cr.cnaf.infn.it (PPS-CNAF)
            9. atlasse01.ihep.ac.cn (BEIJING-LCG2)
            10. srm-durable-lhcb.cr.cnaf.infn.it (INFN-T1)
            11. lxdpm102.cern.ch (CERN-PROD)
            12. wormhole.westgrid.ca (SFU-LCG2)
            13. giis-fzk.gridka.de (FZK-LCG2)
            14. e5grid08.physik.uni-dortmund.de (UNI-DORTMUND)
            15. cclcgip03.in2p3.fr (IN2P3-CC-T2)
            16. host001.hpc.ntcu.edu.tw (TW-NTCU-HPC-01)
            17. host002.hpc.ntcu.edu.tw (TW-NTCU-HPC-01)
            18. host003.hpc.ntcu.edu.tw (TW-NTCU-HPC-01)
            19. nanlcg03.in2p3.fr (IN2P3-SUBATECH)
            20. gridce01.ifca.es (IFCA-LCG2)
            21. dcache-core-cms01.desy.de (DESY-HH)
            22. test-gliteCE.uibk.ac.at (HEPHY-UIBK)
            23. e5grid09.physik.uni-dortmund.de (UNI-DORTMUND)
            24. lxb2039.cern.ch (CERN_PPS)
            25. lxb2090.cern.ch (CERN_PPS)


          3. TAU-LCG2 was discussed at the previous meeting, and the agreement was that either the site would reply to ticket #24591 or the SEE ROC would suspend the site. The site acted on the ticket by simply closing it (the status became "solved" by the ROC), but the issue is still there (I sent an informational email, with no reply so far).


          4. Site to be discussed at the OPS meeting (if the problem is not solved by Monday): Ru-Troitsk-INR-LCG2, ticket #24239. No reply, and the problem is still there.


          5. The user submitting the PPS monitoring jobs is a member of both the ops and dteam VOs. This is known to cause problems in production. The PIC PPS site is affected by this problem; it may be easier to solve on the monitoring side.


          6. The sBDII-sanity test seems unreliable; it repeatedly leads to opening the GSTAT page for the corresponding site, only to see that there is no problem at all.


          7. SARA-MATRIX complains in one ticket that they cannot update their GOC DB page for some nodes. Is this still the case?


          8. Many sites are in downtime, but their alarms still appear on the dashboard. Since we should not open tickets for such sites, these alarms are not useful and only clutter the view (see the filtering sketch below).
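
          As a thought experiment for the last point, here is a minimal sketch of how the dashboard might suppress alarms for sites inside a scheduled downtime window. The field names and the downtime list are illustrative assumptions, not the dashboard's actual data model (alarm id #4999 is invented for the example; alarm #4706 and the INFN-T1 downtime window are taken from elsewhere in this agenda).

            from datetime import datetime

            # Sketch only: field names ("site", "start", "end", "id") are
            # hypothetical stand-ins for the dashboard's real data model.
            def in_scheduled_downtime(site, now, downtimes):
                """True if `site` has a scheduled downtime covering `now`."""
                return any(d["site"] == site and d["start"] <= now <= d["end"]
                           for d in downtimes)

            def visible_alarms(alarms, downtimes, now):
                """Hide alarms for sites in scheduled downtime, since no
                ticket should be opened for such sites anyway."""
                return [a for a in alarms
                        if not in_scheduled_downtime(a["site"], now, downtimes)]

            # INFN-T1 downtime window taken from the interventions list below.
            downtimes = [{"site": "INFN-T1",
                          "start": datetime(2007, 8, 27, 9, 0),
                          "end": datetime(2007, 8, 28, 14, 0)}]
            alarms = [{"site": "RWTH-Aachen", "id": 4706},  # stays visible
                      {"site": "INFN-T1", "id": 4999}]      # suppressed (hypothetical id)
            print(visible_alarms(alarms, downtimes, datetime(2007, 8, 27, 12, 0)))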
        • gLite 3.0/3.1 and Itanium ia64
          Speaker: Oliver Keeble
        • Migration to SL4 WNs
            The WLCG Management Board and the GDB have requested that all WLCG Tier-1 sites migrate to SL4/gLite 3.1 WNs by the end of August.
            The MB and GDB have also expressed the strong desire that all WLCG sites migrate to the SL4/gLite 3.1 WNs as soon as possible.

          Updates from the Tier-1 sites:
          • ASGC: A new CE hosting 200 SL4 cores was brought online on Aug 10, 2007. The remaining 350 cores will be migrated to the new CE in phases.

          • CERN: CERN is on track to fulfil its commitment to provide SL4-based WNs by the agreed date of end of August.

          • BNL: No report.

          • FermiLab: No report.

          • TRIUMF: All new resources will be installed with SL4 and will come online around 20th August. The old cluster will be moved and re-installed with SL4 shortly after.

          • IN2P3: No report.

          • GridKA: All WNs at GridKa have been on SL4 since 27-07-2007, running the gLite WN package 3.021 ("compatible" mode). The upgrade to the 3.1 WN package is planned for early September, after allowing some time for testing in the PPS at the end of August.

          • INFN: No report.

          • SARA: No report.

          • NIKHEF: No report.

          • PIC: We have migrated nearly 90% of our WNs to SLC4.

          • RAL: A new CE, lcgce02.gridpp.rl.ac.uk, has been deployed for access to SL4 WNs, and 20% of our worker-node capacity has been reinstalled with SL4. The test CE lcgce0371.gridpp.rl.ac.uk has been taken out of service. We are discussing further migration of capacity with the experiments.

        • PPS Report & Issues
          PPS reports were not received from these ROCs:
          SEE, IT, AP

          Issues from EGEE ROCs:
          Several ROCs have recommended the decommissioning of gLite CEs to their sites.
        • EGEE issues coming from ROC reports
          1. CERN (CERN-PROD): FOR INFORMATION: On all our lcg-CEs the globus-mds has been replaced by a local BDII. We expect this to reduce the number of "job list match failed" job submission failures (see the query sketch after this list).


          2. BNL: The plots of the hourly report for the VO "OPS" at the BNL Tier-1 site do not seem right. It is strange that all individual services are green while the overall service shows a different result. How is the overall service status generated? Can this problem be fixed?


          3. (ROC DECH): The latest APEL rpms in production date from the end of last year. The APEL rpms in certification fix important bugs for sites in our region. What is holding them back from entering production?


          4. SouthWest Europe ROC (PIC): FOR INFORMATION: We have removed, as requested, our gLite CEs in both production and PPS. These machines were ce02.pic.es (PIC PPS) and ce05.pic.es (PIC production). ce05.pic.es is being reinstalled as an lcg-CE with the same queues as our other CE, ce07.pic.es, which is the front-end of our SLC4 WNs. There is still another CE (ce06.pic.es), the one with the SLC3 queues, which will be removed next week.


          5. UK/I ROC: Many UKI sites are now enforcing a limit on the maximum number of biomed jobs that can be queued, after up to thousands of queued jobs recently overloaded the CEs at four or more sites.
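
          Regarding the first item, a rough sketch of the kind of LDAP query a broker makes against a CE's information system; this is a hedged illustration, not a description of CERN's actual change. The hostname is a placeholder, and port 2170 with base "mds-vo-name=resource,o=grid" are only the conventional defaults for a local/resource BDII, which a site may configure differently.

            import ldap  # python-ldap

            # Placeholder host; the port and base DN are conventional defaults
            # for a local/resource BDII and may differ per site.
            BDII_URI = "ldap://ce.example.org:2170"
            BASE_DN = "mds-vo-name=resource,o=grid"

            conn = ldap.initialize(BDII_URI)
            conn.simple_bind_s()  # anonymous bind, as for info-system queries

            # Fetch the GlueCE entries a broker matches jobs against; a local
            # BDII answering quickly and completely is what should reduce the
            # "job list match failed" submission errors mentioned above.
            results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                                    "(objectClass=GlueCE)",
                                    ["GlueCEUniqueID", "GlueCEStateStatus",
                                     "GlueCEStateFreeCPUs"])
            for dn, attrs in results:
                print(dn, attrs.get("GlueCEStateStatus"))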


      • 16:30–17:00
        WLCG Items 30m
        • Tier 1 reports
        • WLCG issues coming from ROC reports
          1. CERN ROC (CERN): CERN switched to pool accounts for the ATLAS production role on 9/8/2007.


          2. TRIUMF: The network connection to FZK via the OPN (CERN) is blocked. Is there a policy that T1-T1 transfers should use the OPN, i.e. a star topology via CERN? If so, this must be stated.


        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

          Time at WLCG T0 and T1 sites.

          1. CASTOR upgrades (see status board)
          2. Oracle upgrade on Wednesday on the LCG Oracle cluster (transparent).
          3. INFN Tier-1 DOWN: Due to maintenance of the electrical distribution and the installation of new air-cooling systems for the computing facilities, the CNAF Tier-1 will be down from 2007-08-27 09:00 UTC to 2007-08-28 14:00 UTC. The data centre will be completely down except for the WAN network connection, the LSF license node and a few critical services. The queues will be set to draining state on Thursday 23/8/07 in order to have the farm completely empty by Sunday and to prepare the services shutdown.
          4. Network interruption at CERN tomorrow, 16th August: a 5-minute downtime some time between 06:00 and 06:30 UTC. The CERN-RAL path will be lost during these 5 minutes.
          5. ASGC will have two network outages this week. These have been announced via the broadcast tool and can be seen in the News section of the CIC Portal.
          6. TRIUMF will be installing new hardware this weekend and next week. This should be transparent.
        • FTS service review
            Please read the report linked to the agenda.
          Speaker: Gavin McCance (CERN)
          Paper
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • We noticed that sites have transfer errors when getting files from LAL or SACLAY (two sites of GRIF in Paris). This is due to some manipulation of SRMv2, probably in July, when the main person responsible for the site, Stephane Jezequel, was on vacation. The Lyon people found a solution so that FTS handles this problem (they have FTS 1.5). This solution was sent to the FTS developers. Could you check that it has been (or will soon be) propagated to the other T1s which manage FTS channels?
          Speaker: Kors Bos (CERN / NIKHEF)
        • CMS service
          • No report.
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • No report.
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          • No report.
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • Service Coordination
          The ATLAS M4 cosmic ray run is scheduled from 23 August to 3 September. See https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperatonsPlanningM4
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55–17:00
        OSG Items 5m
        1. Item 1
      • 17:00–17:05
        Review of action items 5m
        list of actions
      • 17:10–17:15
        AOB 5m