WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives

To dial in to the conference:
  a. Dial +41227676000
  b. Enter access code 0148141

  OR click HERE
  (Please specify your name & affiliation in the web interface)

Click here for minutes of all meetings

Click here for the List of Actions

      • 16:01 16:30
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From: CERN and DECH
          To: Italy and France


          Report from CERN :
          • The UNI-KARLSRUHE site is in scheduled downtime (SD) until the end of February, a downtime of 45 days. I therefore think this site should be temporarily suspended and recertified in March.

          • The FTS-infosites alarm on fts-t1import.cern.ch is failing because the middleware does not foresee the production scenario currently in use at CERN. The developers are aware and a bug has been opened; the alarm will keep failing until the bug is fixed. What shall we do with the alarm?

          • List of sites escalated to "political instances":
            1. SITE NAME: SDU-LCG2
              ROC NAME: CERN
              GGUS TICKET NUMBER: ???
              Reason for escalation: site has not responded for 3 weeks.

          • A few nodes appeared not to be registered in the GOCDB:
            1. ROC DECH:
              udo-dcache01.grid.uni-dortmund.de
              udo-dcache03.grid.uni-dortmund.de
              udo-ce01.grid.uni-dortmund.de
              rb-goegrid.local
            2. ROC ITALY:
              gridit002.pd.infn.it
              atlas-ce-02.roma1.infn.it
          Report from DECH :
          • Nothing to report.
        • <big> PPS Report & Issues </big>
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          2009-02-02: Pilot of SCAS in preparation

          The gLite release team informed us that they consider the new SCAS service (Site Central Authorization Service) sufficiently stable for a pilot service to be set up. In particular, the most severe issues found earlier (memory leaks, bad configuration) have been solved. The software is currently undergoing stress testing in certification. In parallel, we contacted the LHC experiments (specifically CMS, ATLAS and LHCb) about this activity, and they were in favour of a controlled deployment in production of a pilot service based on some instances of SCAS. Specifically, LHCb would like a supporting T1 to be involved in the pilot, and suggests IN2P3 and/or FZK as first choices. The hardware set-up is presumably not very demanding (one node to host the SCAS server). The patches in question are:

          • https://savannah.cern.ch/patch/index.php?2767
          • https://savannah.cern.ch/patch/index.php?2635

          The kick-off meeting of the pilot activity (definition of service, goals and duration) is expected by the end of this week (Thursday or Friday). Interested sites can contact pps-support@cern.ch and they will be invited to the kick-off. More information about the SCAS service can be found at http://indico.cern.ch/getFile.py/access?contribId=235&sessionId=95&resId=0&materialId=slides&confId=32220

        • <big> gLite Release News</big>
          Please find gLite release news in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

          Now in Production: Nothing new.
          Now in PPS: Nothing new.
          Soon in Production: 2009-01-30: The release of gLite 3.1 Update 40 and of gLite 3.0 Update 45 to production is in preparation. The update, scheduled for Wednesday, 4th of February, contains an upgrade to lcg-vomscerts-5.3.0. It adds 3 new host certificates:

          • cclcgvomsli01.in2p3.fr (biomed + egeode);
          • next cert for vo.racf.bnl.gov (atlas);
          • cert for voms.fnal.gov (cms)
        • <big> EGEE issues coming from ROC reports </big>
          • DECH: Concerning the WN scratch space discussion:
            • Excessive use of scratch space on WNs has also happened at DESY and UNI-BONN.
            • What procedure should the sites follow?
              • Kill jobs that exceed the value in the VO card?
              • What about sites that cannot provide e.g. 15GB for ATLAS?
            • Some users (even production frames) tend to misuse /tmp on WNs. Is there any recommendation to the sites on how to proceed here?

          • SouthWest: We saw at several SWE sites that the CE SAM job submission test was failing because of a WMS "request expired" error.
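
As a rough illustration of the "kill jobs that exceed the value in the VO card" option raised by DECH above, a site could start by only reporting which jobs are over the limit before deciding on any action. The sketch below assumes that each job writes to its own scratch directory on the WN. The function names, the job-to-directory mapping and the limit value are hypothetical and are not taken from any gLite tool or VO card format.

```python
import os


def directory_usage_bytes(path):
    """Sum the sizes of all files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable while the job runs; skip it.
                continue
    return total


def jobs_over_limit(job_dirs, limit_gb):
    """Return {job_id: usage_in_GB} for job directories using more than limit_gb.

    job_dirs is a hypothetical mapping from job identifier to scratch directory.
    limit_gb stands in for the scratch-space value published in the VO card.
    """
    limit_bytes = limit_gb * 1024 ** 3
    over = {}
    for job_id, path in job_dirs.items():
        used = directory_usage_bytes(path)
        if used > limit_bytes:
            over[job_id] = used / 1024 ** 3
    return over
```

A call such as jobs_over_limit({"job1": "/scratch/job1"}, 15) would list the jobs above a 15GB limit; whether such jobs are then killed, or only reported to the VO, remains the policy question asked above.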
        • <big> Obsoletion of gLite 3.0 </big>
          It is proposed that all remaining gLite 3.0 clients and services will be obsoleted at the end of April 2009. This proposal will go to the TMB for approval. Concerns, questions, etc. should be sent to nick.thackray@cern.ch and Maite.Barroso@cern.ch
      • 16:30 17:00
        WLCG Items 30m
        • <big> WLCG issues coming from ROC reports </big>
          1. SouthWest: We would like to ask CMS if they could update their requirements table on the VO card.
        • <big>WLCG Service Interventions </big>
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.
          Also...
            SAM: The intervention scheduled for next Monday on the SAM and GridView databases has been moved to next Wednesday, 4th of February. During this downtime the SAM and GridView services will be down, including submissions, web services and interfaces. The downtime is required to improve the database schemas of these two services by moving common objects to a separate account, which will make future modifications easier. Thank you for your understanding.

          Time at WLCG T0 and T1 sites.

        • <big> FTM endpoints </big>
          Can the tier-1 sites please correct the following list:
          • ASGC: http://w-ftm01.grid.sinica.edu.tw/transfer-monitor-report/
          • BNL: ???
          • CERN: https://ftsmon.cern.ch/transfer-monitor-report/
          • FNAL: https://cmsfts3.fnal.gov:8443/transfer-monitor-report/
            https://cmsfts3.fnal.gov:8443/transfer-monitor-gridview
          • FZK: http://ftm-fzk.gridka.de/transfer-monitor-report/
          • IN2P3?: http://cclcgftmli01.in2p3.fr/transfer-monitor-report/
          • INFN: https://tier1.cnaf.infn.it/ftmmonitor/
          • NDGF: ???
          • PIC: http://ftm.pic.es/transfer-monitor-report/
          • RAL: ???
          • SARA/Nikhef: http://ftm.grid.sara.nl/transfer-monitor-report
            http://ftm.grid.sara.nl/transfer-monitor-gridview
          • TRIUMF: http://ftm.triumf.ca/transfer-monitor-report/
        • <big> WLCG Operational Review </big>
          Speaker: Harry Renshall / Jamie Shiers
        • <big> ALICE items </big>
        • <big> ATLAS items </big>

        • <big> CMS items </big>
          Speaker: Daniele Bonacorsi
        • <big> LHCb items </big>
        • <big> WLCG service recommended baseline versions </big>
          The recommended baseline versions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB