WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: CERN & France
          To: Italy & UK/I


          Report from CERN :

          • List of unresponsive sites:
            • None.
          • Problems Encountered during shift:
            • [Information for next COD] Because the SRM tests are no longer running, there may still be some old tickets or alarms for them that are no longer relevant, since no new results are arriving. This is expected: just close the tickets or set the alarms to off. They should not take long to clear out.

          Report from France :

          • List of unresponsive sites:
            • None this week.
          • Problems Encountered during shift:
            1. Could the SAM team reverse the history order on the SAM portal, putting the current day at the top of the page as in the old SAM display?
            2. Gridview graphs on the SAM portal are not as up to date as the detailed SAM test results: frequently an alarm is raised against a site while the Gridview graph on the SAM portal still shows OK, but the detailed results show that the last test failed.
              Could the Gridview graphs be fully synchronized with the detailed results and alarms?
        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          • Pilot service of CREAM CE: in progress
            • Results of the direct and WMS-based submission tests against the CREAM CEs in the pilot are now available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py), specifically at http://tinyurl.com/ctwfaz
            • Details about the pilot (planning, layout, technical info) can be found in the page https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotCream
            • Details about the single tasks can be found in the tracker http://www.cern.ch/pps/index.php?dir=./ActivityManagement/SA1DeploymentTaskTracking specifically listing the subtasks of TASK:7981


          • Pilot service of glexec/SCAS: in progress
            • Check-point meeting held on the 19th
            • release of a new version of glexec implementing the error codes scheduled for the 20th of February
            • release of a new version of glexec implementing a fault tolerance mechanism (support for multiple SCAS servers) scheduled for the 25th of February
            • ATLAS will use the installation at FZK to try out the integration of the new error codes
            • ramp-up of FZK production and IN2P3 rescheduled to start on the 6th of March
            • first results of the certification stress tests are now available at https://twiki.cern.ch/twiki/bin/view/EGEE/SCAStestsresults
            • NiKHEF has upgraded the production system with the current version
            • Minutes in http://indico.cern.ch/conferenceDisplay.py?confId=52981
            • Details about the pilot (planning, layout, technical info) can be found in the page https://twiki.cern.ch/twiki/bin/view/LCG/PpsPilotSCAS
            • Details about the single tasks can be found in the tracker http://www.cern.ch/pps/index.php?dir=./ActivityManagement/SA1DeploymentTaskTracking specifically listing the subtasks of TASK:8986

        • gLite Release News
          Please find gLite release news in: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

          • Now in Production
            • gLite 3.1 Update 41 went into production last week. The update contains:
              • update to WMS 3.1 with numerous bug fixes
              • New version of CREAM CE (PATCH:2667, PATCH:2669). Among other things, this version provides:
                • Short-term proxy renewal solution in the CREAM-based CE
                • fixes, in particular for BUG:44712 (problem with the lcmaps conf file used for glexec), currently affecting ALICE

          • Now in PPS
            • gLite 3.1 PPS Update 44 passed the deployment test and is now being installed by the remaining PPS sites. The update contains:
              • New version of CREAM CE (PATCH:2667, PATCH:2669). Among other things, this version provides:
                • Short-term proxy renewal solution in the CREAM-based CE
                • fixes, in particular for BUG:44712 (problem with the lcmaps conf file used for glexec), currently affecting ALICE
              • [YAIM] glite-yaim-core 4.0.6 with many bug fixes (PATCH:2636, PATCH:2697)
              • [BDII] Default DB cache size reduced to 50 MB for x86_64 (PATCH:2679)
              • [WN] New glite-wn-info command designed to be executed on the WN by a job submitter. It returns information about that worker node to be used in a grid context (PATCH:2757, PATCH:2758)

          • Soon in Production
        • EGEE issues coming from ROC reports
          • France ROC:
            • GRID2-FR CA: The French catch-all GRID-FR CA has recently been replaced by the GRID2-FR CA. GRID-FR certificates will no longer be issued. The GRID2-FR CA was distributed with the last CA update. Some problems were nevertheless detected with some services (including CERN VOMRS and GOC DB); they are not really understood, as the CA was correctly deployed at the sites. In some cases, restarting the services solved the problem. In other cases, the problem disappeared after a while.
              To make life easier for users, Steve Traylen has also proposed pre-adding the GRID2-FR certificate to the CERN VOMRS/VOMS for every member who is registered with a GRID-FR certificate.
        • Grid Service Interventions
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Please consult the URLs above for details.

          In particular, the following sites requested that these downtimes be reported here:

          1. IN2P3-CC T1/T2 will be in scheduled downtime (SD) from March 9th to Wednesday March 11th at noon. The electrical work is scheduled for Tuesday March 10th, so the LRMS will be drained 24 hours beforehand. All core services should be kept online. The dCache SEs will work for data import only.

          2. Planned downtime for FZK-LCG2: 9th March 08:00 - 13:00 UTC
            Service: SRM - gridka-dcache.fzk.de
            Due to the migration of ATLAS to a new SRM instance, this service will be unavailable on the morning of 9 March.
      • 16:30 17:00
        WLCG Items 30m
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
      • 17:30 17:35
        Review of action items 5m
      • 17:35 17:35
        AOB