WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service; joining details below)

Nick Thackray, Steve Traylen
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the running of the production grid infrastructure, based on weekly reports from the attendees. Reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: All received.
  • VOs: ALICE, CMS, BioMed
    Minutes
      • 16:00 - 16:05
        Feedback on last meeting's minutes 5m
      • 16:01 - 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: ROC CERN / ROC Russia
          To: ROC France / ROC SE Europe


          NB: Could the grid ops-on-duty teams please submit their reports no later than 12:00 UTC (14:00 Swiss local time).

          Issues from CERN COD:
          1. None to report.
          Issues from Russian COD:
          1. One site is in the last escalation step because of lack of progress: CNB-LCG2
          2. Several nodes are either not registered in the GOC DB or have monitoring switched off there, yet are still tested by SAM (a consistency-check sketch is given after the lists below):
            • ant1.grid.sara.nl
            • epgce2.ph.bham.ac.uk
            • house.grid.kiae.ru
            • lfc0001.m45.ihep.su
            • se02.athena.hellasgrid.gr
            • xg009.inp.demokritos.gr
            Recall that the following nodes had the same problem in September:
            • ant2.grid.sara.nl
            • lfc0001.m45.ihep.su
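            As an illustration of the consistency check involved, here is a minimal Python sketch. It assumes two plain-text host lists, one exported from the GOC DB and one from SAM; the file names and the export step are hypothetical, not an existing tool.

              # Hypothetical check: hosts tested by SAM but not listed as
              # monitored in the GOC DB. Each input file is assumed to
              # contain one hostname per line.
              def read_hosts(path):
                  """Return the set of non-empty, lower-cased hostnames in a file."""
                  with open(path) as f:
                      return {line.strip().lower() for line in f if line.strip()}

              sam_tested = read_hosts("sam_tested_nodes.txt")      # nodes SAM tests
              gocdb_monitored = read_hosts("gocdb_monitored.txt")  # nodes "monitored" in GOC DB

              # Nodes SAM tests that the GOC DB does not list as monitored:
              for host in sorted(sam_tested - gocdb_monitored):
                  print(host)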
        • PPS Report & Issues
          PPS reports were NOT received from these ROCs:
          Asia Pacific, France, Italy, North Europe, SE Europe

          Issues from EGEE ROCs:
          1. gLite CEs are being decommissioned, but it is still impossible to remove nodes from the GOC DB. [ROC CE]
          Release News:
          • gLite 3.1.0 PPSUpdate07 passed the pre-deployment tests and is being announced (today) to all PPS sites:
            • glite-FTM
            • gLite 3.1 BDII (slc4/ia32)
          • PPS is ready to receive the lcg-CE v3.1 (SLC4). Certification is expected to be completed very soon (today).
        • EGEE issues coming from ROC reports
          1. UKI: GGUS ticket #27747 raises the issue that if a WMS/LB node has its host name changed, the configuration needs to be updated on every UI that uses it. Is there a procedure for making this change? (A hedged sketch of such an update follows this list.)
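          Pending an official procedure, here is a minimal sketch of the manual fix in Python, assuming the UI keeps its WMS endpoints in per-VO glite_wms.conf files under /opt/glite/etc; the paths, file layout and host names are assumptions, so check them against your UI installation first.

            # Hypothetical helper: rewrite an old WMS/LB hostname in UI
            # config files. There is no official procedure yet (GGUS #27747),
            # so the paths and the in-place edit are assumptions.
            import glob
            import shutil

            OLD_HOST = "wms-old.example.org"   # hypothetical old WMS/LB host
            NEW_HOST = "wms-new.example.org"   # hypothetical new WMS/LB host

            for path in glob.glob("/opt/glite/etc/*/glite_wms.conf"):
                with open(path) as f:
                    text = f.read()
                if OLD_HOST in text:
                    shutil.copy(path, path + ".bak")  # keep a backup before editing
                    with open(path, "w") as f:
                        f.write(text.replace(OLD_HOST, NEW_HOST))
                    print("updated", path)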
      • 16:30 - 17:00
        WLCG Items 30m
        • Tier 1 reports
          • PIC Tier-1 Report
            • We discovered that LHCb SAM tests are saving small (40 KB) files into the tape system every few hours. We believe this is not good practice, since it could affect the performance of accessing the data on that tape later. We have changed the properties of the LHCb SAM test directory so that it is not migrated to tape.
            • We discovered ATLAS production jobs doing nothing (sleep 3600) after an apparent failure staging output (apparently log) files into voatlas01.cern.ch. We were not aware of local SE problems at that time. Could ATLAS production jobs stage out files first to the local SE and then, if needed, transfer them to CERN asynchronously through FTS? (A sketch of this fallback is given after this report.)
            • Many CMS production jobs were detected running with very low efficiency. The CMS production manager was contacted and is looking into it. At first we thought the dCache system was not capable of serving data through dcap fast enough, but it could also be related to CMS merge jobs opening many files in an inefficient way. Experts are still investigating.
            • We failed 2 SAM tests in a somewhat confusing way. The first, on 11 Oct 2007 06:26:37, was a Certificate test. We have only seen failures of this kind when the machine is loaded, which was not the case here, so this failure remains under investigation. From the logs on our CE it seems that the user cancelled the job after submitting it to the WMS. Since we have seen this issue happen a few more times, we have opened a GGUS ticket (https://gus.fzk.de/pages/ticket_details.php?ticket=27785) to understand the problem better.
            • We have given the PPS access to production WNs. This was the original configuration, but since we did not see much activity from the experiments in the PPS cluster and wanted a completely separate cluster, we had switched to independent WNs for PPS. After the PPS discussion held in Budapest we have decided to roll back to the original configuration: PPS jobs get access to the production farm, limited of course to the same number of CPUs as before.
            • Still waiting for the WMS bug to be solved (https://savannah.cern.ch/bugs/index.php?29604). In the meantime we use just one account for sgm users at PIC.
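            A minimal sketch of the suggested fallback, driving the gLite-era command-line tools from Python; the SURLs, LFN, FTS endpoint and file names are hypothetical, and exact command options vary by release, so treat this as an illustration only.

              # Sketch of the suggested stage-out strategy: copy/register the
              # output on the local SE first, then hand the long-haul copy to
              # FTS instead of sleeping inside the job. All endpoints are
              # placeholders.
              import subprocess
              import sys

              def run(cmd):
                  """Run a command, returning True on success."""
                  return subprocess.call(cmd) == 0

              local_surl = "srm://se.example.site/atlas/logs/job123.log"  # hypothetical
              cern_surl = "srm://srm.cern.ch/atlas/logs/job123.log"       # hypothetical
              fts = "https://fts.example.site:8443/glite-data-transfer-fts/services/FileTransfer"

              # 1. Stage out to the local SE and register in the catalogue.
              if not run(["lcg-cr", "--vo", "atlas", "-d", local_surl,
                          "-l", "lfn:/grid/atlas/logs/job123.log",
                          "file:/tmp/job123.log"]):
                  sys.exit("local stage-out failed; report the error instead of sleeping")

              # 2. Queue the CERN copy asynchronously through FTS and exit.
              run(["glite-transfer-submit", "-s", fts, local_surl, cern_surl])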
        • WLCG issues coming from ROC reports
          1. [FZK] No final results yet on the problems LHCb is encountering at GridKa. We have found and corrected several configuration errors and moved GridFTP doors off the LHCb tape mover machines in order to alleviate some of the stalls. We are now investigating the SRM, which is indeed proving troublesome.
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Time at WLCG T0 and T1 sites.
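          For reference, a small Python sketch that prints the current local time at a few T0/T1 sites; the site-to-timezone mapping below is our own illustrative assumption, not taken from the linked page.

            # Print the current local time at some WLCG T0/T1 sites.
            # The site-to-timezone mapping is an illustrative assumption.
            from datetime import datetime
            from zoneinfo import ZoneInfo

            SITE_TZ = {
                "CERN (T0)": "Europe/Zurich",
                "FZK/GridKa": "Europe/Berlin",
                "PIC": "Europe/Madrid",
                "RAL": "Europe/London",
                "TRIUMF": "America/Vancouver",
                "BNL": "America/New_York",
            }

            for site, tz in SITE_TZ.items():
                print(f"{site:12s} {datetime.now(ZoneInfo(tz)):%Y-%m-%d %H:%M %Z}")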

        • FTS service review

          Please read the report linked to the agenda.

          Speakers: Gavin McCance (CERN), Steve Traylen
        • ATLAS service
          See also https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071 and https://twiki.cern.ch/twiki/bin/view/Atlas/ComputingOperations for more information.

          • The VOViews problem reported at the last meeting is still present. The list of the affected queues is in: http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
            (a snapshot of the affected CEs is given below, followed by a hedged query sketch).
            The LFC servers are running version 1.6.3 of the LFC server, which does not support secondary groups. They should be upgraded to 1.6.5-3.


          List of affected CEs:
          agh2.atlas.unimelb.edu.au
          atlasce.phys.sinica.edu.tw
          bigmac-lcg-ce.physics.utoronto.ca
          ce-iep-grid.saske.sk
          ce.gina.sara.nl
          ce.keldysh.ru
          ce.phy.bg.ac.yu
          ce.ui.savba.sk
          ce.ulakbim.gov.tr
          ce00.hep.ph.ic.ac.uk
          ce01.afroditi.hellasgrid.gr
          ce01.ariagni.hellasgrid.gr
          ce01.athena.hellasgrid.gr
          ce01.kallisto.hellasgrid.gr
          ce02.athena.hellasgrid.gr
          ce02.lip.pt
          ce04-lcg.cr.cnaf.infn.it
          ce05.pic.es
          ce07.pic.es
          ce1-egee.srce.hr
          ce101.cern.ch
          ce102.cern.ch
          ce106.cern.ch
          ce107.cern.ch
          ce108.cern.ch
          ce109.cern.ch
          ce112.cern.ch
          ce113.cern.ch
          ce114.cern.ch
          ce115.cern.ch
          ce116.cern.ch
          ce117.cern.ch
          ce118.cern.ch
          ce119.cern.ch
          ce120.cern.ch
          ce123.cern.ch
          ce2.triumf.ca
          ceitep.itep.ru
          cox01.grid.metu.edu.tr
          fornax-ce.itwm.fhg.de
          glite-ce-01.cnaf.infn.it
          golias25.farm.particle.cz
          grid-ce3.desy.de
          grid003.roma2.infn.it
          grid01.cu.edu.tr
          grid01.erciyes.edu.tr
          gridba2.ba.infn.it
          gridce.ilc.cnr.it
          gridce.pi.infn.it
          gridce2.pi.infn.it
          gridit-ce-001.cnaf.infn.it
          hep-ce.cx1.hpc.ic.ac.uk
          heplnx206.pp.rl.ac.uk
          kalkan1.ulakbim.gov.tr
          lcg-ce.rcf.uvic.ca
          lcgce0.shef.ac.uk
          lgdce01.jinr.ru
          mars-ce2.mars.lesc.doc.ic.ac.uk
          mu9.matrix.sara.nl
          node001.grid.auth.gr
          skurut17.cesnet.cz
          snowpatch.hpc.sfu.ca
          t2-ce-02.lnl.infn.it
          t2ce03.physics.ox.ac.uk
          tbit01.nipne.ro
          u2-grid.ccr.buffalo.edu
          yildirim.grid.boun.edu.tr
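          To check whether a given CE now publishes ATLAS VOViews, the information system can be queried directly. A minimal sketch using the ldap3 Python library, assuming a top-level BDII on the usual port 2170; the BDII host and the CE queue identifier are placeholders, so substitute your own.

            # Sketch: query a top-level BDII for GlueVOView entries attached
            # to a CE, to see whether ATLAS views are published. Host and CE
            # ID are placeholders; attribute names follow the Glue 1.x schema.
            from ldap3 import Server, Connection, ALL

            BDII = "ldap://bdii.example.org:2170"               # placeholder
            CE = "ce.example.org:2119/jobmanager-lcgpbs-atlas"  # placeholder queue ID

            conn = Connection(Server(BDII, get_info=ALL), auto_bind=True)  # anonymous bind
            conn.search(
                search_base="o=grid",
                search_filter="(&(objectClass=GlueVOView)"
                              f"(GlueChunkKey=GlueCEUniqueID={CE}))",
                attributes=["GlueVOViewLocalID", "GlueCEAccessControlBaseRule"],
            )
            for entry in conn.entries:
                print(entry.GlueVOViewLocalID, list(entry.GlueCEAccessControlBaseRule))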
        • CMS service
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • LHCb service
          • GridKa is still unusable. Last week we were able to stage in and process just three files, despite the fact that they claimed to have restored their whole system. This outage has now lasted a month, and the LHCb reprocessing activity has been heavily penalized. Can site representatives comment on this?
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • ALICE service
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • WLCG Service Coordination
          • WLCG Service Reliability workshop, CERN, November 26 - 30 - agenda - wiki
          • Common Computing Readiness Challenge - CCRC'08 - meetings page
          • ATLAS Cosmics M5 is scheduled from 23 Oct to 5 Nov. Rates are not yet known but will be higher than for M4 (which ran at about 100 MB/s from CERN).
          • ALICE FDR phase 1 should be starting soon, building up to 100 MB/s export from CERN.
          Speaker: Harry Renshall / Jamie Shiers
      • 16:55 - 17:00
        OSG Items 5m
      • 17:00 - 17:05
        Review of action items 5m
      • 17:10 - 17:15
        AOB 5m