WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Maite Barroso Lopez (CERN)
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum where sites get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    list of actions
    Minutes
      • 16:00 16:25
        EGEE Items 25m
        • <big> Grid-Operator-on-Duty handover </big> 5m
          From ROC France (backup: ROC Italy) to ROC AsiaPacific (backup: ROC Central Europe)

          1. Russian TOP BDIIs: Found repeated IS timeout problems on ru-IMPB-LCG2 and other RU sites, due to central BDIIs. Found several sites pointing GFAL to:
            lcg15.sinp.msu.ru (single host) or
            lcgbdii.jinr.ru (single host)
            plus others pointing to CERN TOP BDIIs. TOP BDIIs were discussed at the last few phone meetings, but I can't see any BDII status for Russia in the minutes. Should we:
            - suggest some reorganization?
            - wait for GFAL improvements and see?
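Until GFAL itself can fail over between BDIIs, one site-side workaround is to probe the candidate TOP BDIIs and point GFAL at one that answers. A minimal sketch (hostnames taken from the item above plus the conventional CERN alias; the selection logic is illustrative and not part of GFAL):

```python
import socket

# Candidate TOP BDII hosts (from the report above); 2170 is the standard BDII port.
CANDIDATE_BDIIS = ["lcg15.sinp.msu.ru", "lcgbdii.jinr.ru", "lcg-bdii.cern.ch"]

def pick_reachable_bdii(hosts, port=2170, timeout=5.0):
    """Return the first host accepting TCP connections on the BDII port, or None."""
    for host in hosts:
        try:
            # create_connection resolves the name and opens a TCP connection;
            # any resolution or connect failure raises an OSError subclass.
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue
    return None
```

The chosen host would then be exported as `LCG_GFAL_INFOSYS=<host>:2170` before running GFAL-based clients.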


          2. Some disturbance caused by unscheduled GOC-DB downtime:
            - 2007-03-20 morning
            - 2007-03-21 afternoon
            - 2007-03-22 late morning
            * If needed, GridICE (http://gridice2.cnaf.infn.it:50080/gridice/site/site.php) provides a reasonably up-to-date cache of downtimes (the GOCDB failover replica is not yet ready).


          3. Again on the very long-standing PPS ticket #15574 (PreGR-01-UoM). I am not sure the ROC's opinion ("It's a SAM problem") is right. I will try to transfer it to the PPS unit and ask at the Operations Meeting whether this is correct. If we simply close it, someone will soon open a new one, because the tests are still failing!


          4. Just a remark to sites (and ROCs): many sites declare a scheduled downtime (SD) after a problem is detected, and then extend the SD while they try to solve it. This may or may not be reasonable on a case-by-case basis, but in general it should not be the regular practice on a production system.
        • <big> PPS reports </big>
          PPS reports were not received from these ROCs: Italy, North Europe, Asia Pacific
        • gLite 3.0 PPS-update 24 deployed. This update contains:
          • improved slapd cache on BDII
          • vulnerability fix in gsiopenssh

        • gLite 3.0 PPS-update 25 coming soon. Among other things, this update contains patches for YAIM, so at least formally all meta-packages will be affected. We will try to be as specific as we can in the release notes, but sysadmins are asked to cross-check carefully which services are actually affected.
          1. Problems in submission of SAM tests to gLite CEs are still under investigation. Submission of SAM tests has been restarted through a workaround [CERN]


          2. gLiteCE (zeus76.cyf-kr.edu.pl) has been put into downtime, because of some unresolved problems [Central Europe]

          3. Question from PPS-Coordination: Are the problems you experience due to the release? Any GGUS tickets existing?
        Speaker: Nicholas Thackray (CERN)
  • <big> EGEE issues coming from ROC reports </big>
    Reports were not received from these ROCs:
    1. (CERN ROC): Maintenance day correctly handled by GOC, but the timezones in SAM were all wrong: it had the maintenance starting 8 hours earlier than it did, i.e. it interpreted the GOC time as UTC rather than PST. This led to SAM reporting failures during the downtime, and affects the efficiency statistics. https://gusiwr.fzk.de/pages/ticket_details.php?ticket=12884.
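The 8-hour shift in that ticket is just the PST/UTC offset; a short sketch reproduces it (the timestamp is illustrative; requires Python 3.9+ for zoneinfo):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Illustrative GOC entry: the times were given in US Pacific time,
# but SAM read them as UTC, shifting the window 8 hours earlier.
naive = datetime.strptime("2007-03-10 09:00", "%Y-%m-%d %H:%M")  # winter: PST = UTC-8

as_pacific = naive.replace(tzinfo=ZoneInfo("America/Los_Angeles"))  # intended reading
as_utc = naive.replace(tzinfo=ZoneInfo("UTC"))                      # SAM's reading

# Misreading PST as UTC makes the downtime appear 8 hours too early.
shift_hours = (as_pacific - as_utc).total_seconds() / 3600
```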


    2. (CERN ROC): SAM tests in every job on the WN seem to take up to 300 s at the beginning and end of each job - an enormous waste of CPU that makes job turnaround poor. Can we disable it? Do other sites see this, or could it be a local MON box (R-GMA) problem?


    3. (France ROC): How is a VOMS proxy mapped on a grid node (CE, SE, etc.) using LCMAPS? Is there an official document that explains this mapping mechanism?
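As background to the question: on gLite nodes LCMAPS typically matches the FQANs in the VOMS proxy against mapfiles to pick a local account and group. A sketch of the two mapfiles conventionally involved (paths are the usual defaults; the VO names and accounts are illustrative):

```
# /etc/grid-security/voms-grid-mapfile  -- FQAN -> account (leading "." = pool account)
"/atlas/Role=production" .atlasprd
"/atlas" .atlas

# /etc/grid-security/groupmapfile  -- FQAN -> local group
"/atlas/Role=production" atlasprd
"/atlas" atlas
```

Entries are tried top-down against the proxy's FQANs, so more specific roles should be listed before the plain VO entry.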


    4. (DECH ROC): 64-bit support: Do others have experience finding workarounds? (in addition to discussion e.g. on LCG Rollout, "Who's planning to move to SL/SLC/CentOS 4.x and when?")


    5. (DECH ROC): Problems with LFC upgrade - Impression: testing/certification of MySQL related middleware features has flaws. Improve MySQL support for the future? Is the current testing of MySQL in PPS enough?


    6. (SE Europe ROC): It seems that CIC daily reports for sites contain incorrect links to SAM failures details as of today: https://gus.fzk.de/pages/ticket_details.php?ticket=20043


    7. (SE Europe ROC): One site in IL reports that they get "submitter proxy expired" errors (GGUS ticket https://gus.fzk.de/pages/ticket_details.php?ticket=19854); any ideas?


    8. (UK/I ROC): The site is marked as having failed some replica management tests on 22-03-2007. However, the "details" link does not display any data about this job or the reasons for this job failure.


  • 16:00 16:05
    Feedback on last meeting's minutes 5m
    Minutes
  • 16:30 17:00
    WLCG Items 30m
    Reports were not received from these tier-1 sites: INFN
    Reports were not received from these VOs:

    • <big> WLCG issues coming from ROC reports</big>
      Reports were not received from these ROCs:
      1. (AsiaPacific ROC): Do we have an updated estimate of when the following will be available:
        * SLC4 WN
        * unified version of RFIO client for DPM and Castor
        We are asking on behalf of our CMS coordinators.


      2. (Central Europe ROC): Sites can migrate their CEs to SLC4 only once the VO software runs on SLC4. Specifically, we identified ATHENA (ATLAS VO) as something that could prevent migration to SLC4. The site reporting this was not sure about the ATHENA status and migration plan. Can someone from the ATLAS VO comment on the ATHENA status?


    • <big>Upcoming WLCG Service Interventions (with dates / times where known) </big>
      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      • On Tuesday 27th, the Castor system at RAL will be offline for upgrades between 09:00 and 15:00; this will affect the ralsrm[a-f].rl.ac.uk endpoints. At the same time there will be some maintenance on the tape robot, preventing restores from tape on dcache-tape.gridpp.rl.ac.uk. Ops VO CE SAM Replica Management tests will be moved to dcache.gridpp.rl.ac.uk while ralsrma.rl.ac.uk is down.

      Time at WLCG T0 and T1 sites.

    • <big>FTS service review</big>
      Speaker: Gavin McCance (CERN)
    • <big> ATLAS service </big>
      Speaker: Kors Bos (CERN / NIKHEF)
    • <big>CMS service</big>
      See also CMS Computing Commissioning & Integration meetings (Indico) and https://twiki.cern.ch/twiki/bin/view/CMS/ComputingCommissioning

      -- Job processing: CMS MC production activities are surveying CMSSW_1_2_3 installations; a new round of MC production is starting soon.
      -- Data transfers: last week was week 1 of the CMS LoadTest07 (see [*]), with focus on both T0-T1 and T1-T2 routes. Good stop&start exercise by PhEDEx to handle the scheduled downtime at CERN due to the Castor intervention (firmware upgrades, Wednesday March 21st): no problems, good synchronization and communication with the Castor@CERN people, and no problems seen on the CMS Castor pool after the intervention either.
      --- T0-T1 exercises were quite smooth throughout the week: all 7 T1's joined, and CMS ran at 300-350 MB/s of aggregate transfer rate to all T1's (daily average).
      --- T1-T2 exercises performed differently in different regions. ~27 T2's joined, and CMS ran at 250-400 MB/s of aggregate transfer rate from T1's to T2's.
      --- This week we will focus on multi-VO transfers (as requested by WLCG), while still exploring and debugging T1-T2 routes.
      [*] http://www.cnaf.infn.it/~dbonacorsi/LoadTest07.htm
      Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
    • <big> ALICE service </big>
      Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
    • <big> LHCb service </big>
      Only one major issue to report this week (to be followed closely): the format of the tURLs returned by SRM is, from time to time, inconsistent with the underlying application (ROOT) and therefore useless (we had a similar experience at CERN more than a year ago). From GGUS ticket #20160 it looks like the tURL returned by SRM for accessing data on CASTOR 1 at PIC is not in a format that ROOT can understand. Even some manipulation of the tURL string returned by SRM does not help; ROOT still cannot open the file.

      The procedure used in DIRAC is the following:
      1. The SURL at the site is obtained from the LFC, given the LFN.
      2. The SURL is converted into a tURL (and the file is pre-staged) using lcg-gt with protocol rfio.
      3. This tURL is used by Gaudi/POOL/ROOT to open the file.

      An example follows:
      [lxplus014] ~/DIRAC > lcg-lr lfn:/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      srm://castorsrm.pic.es:8443/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      [lxplus014] ~/DIRAC > lcg-gt srm://castorsrm.pic.es:8443/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi rfio
      rfio://cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 575079557 0

      For simplicity we run a simple Python script that uses ROOT (5.13.04c) with just a TFile.Open(); here is the result:
      Executing result = TFile.Open("rfio://cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133")
      Error in: file cfs0163.pic.es//stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 does not exist
      Result is: None

      We have tried to remove the stager host name (this seems to work at CERN?), without success:
      New tURL is: rfio:/stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133
      Executing result = TFile.Open("rfio:/stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133")
      Error in: file /stage/cfs0163/lh/stage/00001464_00000722_4.digi.133133 does not exist
      Result is: None

      Here is a similar (successful) attempt at CERN Castor 1:
      root [0] TFile::Open("rfio:/shift/lxfsrk5504/data03/z5/stage/00001355_00034199_5.digi.162180")
      (class TFile*)0x8c1fdc0

      The same problem was faced some time ago at CERN, and the only solution found so far was to return a simple tURL of the form rfio:/castor/pic.es/grid/lhcb/production/DC06/v1-lumi2/00001464/DIGI/0000/00001464_00000722_4.digi
      Is this something that can be configured in the SRM server at PIC? For the time being this is a major show-stopper for any kind of reconstruction or analysis job at PIC.
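The tURL rewrite the report describes (dropping the stager host name) can be sketched as a small helper (illustrative only, not part of DIRAC; as noted above, the rewrite did not help at PIC):

```python
import re

def strip_stager_host(turl):
    """Drop the stager host from an rfio tURL, e.g.
    rfio://cfs0163.pic.es//stage/...  ->  rfio:/stage/...
    tURLs without a host component are returned unchanged."""
    m = re.match(r"rfio://[^/]+/(/.*)", turl)
    return "rfio:" + m.group(1) if m else turl
```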
      Speaker: Dr roberto santinelli (CERN/IT/GD)
  • 16:55 17:00
    OSG Items 5m
    Item 1
  • 17:00 17:05
    Review of action items 5m
    more information
  • 17:10 17:15
    AOB 5m