WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs:
  • VOs:
  • list of actions
    Minutes
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From: France / Central Europe
          To: CERN / AsiaPacific


          Issues from Central Europe ROC:
          • No issues
          Issues from France ROC:
          • Being COD is becoming more and more burdensome because of the synchronization problem between GOCDB and SAM. Tickets #30046 and #30306 are about this problem.
            Why does SAM need 3 days of retention (which becomes 2 weeks!) to update the SAM DB when a node is removed from GOCDB and the information system? Protecting against a GOCDB failure is not a good reason.
            Moreover, since the last GOCDB update, some old nodes that had monitoring switched off in the previous version are now monitored. This was announced neither to site admins nor to COD.
            Site admins must now delete from GOCDB all old nodes that are no longer used. These nodes must also be removed from the information system so that SAM does not monitor them.
            Should we open tickets against sites for such nodes?
        • <big> PPS Report & Issues </big>
          PPS reports were not received from these ROCs:

          Issues from EGEE ROCs:
          1. Item 1

          Release News:
          1. Item 1
        • <big> EGEE issues coming from ROC reports </big>
          1. (ROC Central Europe): New GGUS ticket with no response (from Gridview Team) for 2 weeks: https://gus.fzk.de/ws/ticket_info.php?ticket=30025
          2. (ROC France): SAM DB not updated (GGUS #30046): no real answer from SAM. Why does SAM need 3 days of retention (which becomes 2 weeks!) to update the SAM DB when a node is removed from GOCDB and the information system? Protecting against a GOCDB failure is not a good reason. See also GGUS ticket #30561.
          3. (ROC France): Downtime not taken into account by Gridview (GGUS #30042): Gridview says this comes from a bug in the GOC DB synchronization script. Is it fixed? On the question of SAM removing tests in Gridview (GGUS #30044): no answer from Gridview.
          4. (ROC France): IN2P3-SUBATECH Comments:
            With gLite 3.0, it seems that the GRIS is now implemented with a BDII (instead of globus-mds) on the LCG-CE node. In that case, is it still possible to combine the LCG-CE and the site BDII on the same machine? If yes, how should this combined node be configured with YAIM?
          5. (ROC North Europe): Because of a security incident, several certificates issued by the Dutch CA had to be revoked. However, it was noticed that some services still accepted revoked certificates a day later. Details:
            Services still accepting:
            • CIC portal (e.g. this page)
            • GOC database
            • Web portal of VOMS server at Sara (this is under investigation by SARA)
            • SAM Admin page


            Services OK:
            • GGUS
            • SAM test results page
          6. (ROC UK/I): SAM test timeouts need reviewing. Examples: "SAM test: CE-host-cert-valid... Checking svr016.gla.scotgrid.ac.uk (130.209.239.16:2119) using SSL protocol Timeout after 60 seconds" and "Timeout when executing test CE-sft-job after 600 seconds!" Connections can hang in many ways, and a timeout is often an invalid test result; at the least, it should not always change the site's status. Timeouts might be better interpreted as status "unknown" rather than "failed".
          7. PKU from Beijing has contacted us and asked for deployment support. Should we forward this request to CERN, or should APROC start supporting Chinese sites?
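The suggestion in item 6 (treat a timeout as "unknown" rather than "failed") could look roughly like the sketch below. It is only an illustration of the proposal, not how SAM works: the function names, the status strings and the rule that a site keeps its previous status when every result is inconclusive are all assumptions made for this example.

```python
from typing import Iterable


def classify(passed: bool, timed_out: bool) -> str:
    """Map one test outcome to a status (hypothetical status names)."""
    if timed_out:
        # A hung connection says nothing reliable about the service itself.
        return "unknown"
    return "ok" if passed else "failed"


def next_site_status(previous: str, outcomes: Iterable[str]) -> str:
    """Derive the new site status from the latest test outcomes.

    Assumption for this sketch: "unknown" outcomes are ignored, so a run
    consisting only of timeouts leaves the previous status untouched.
    """
    conclusive = [o for o in outcomes if o != "unknown"]
    if not conclusive:
        return previous
    return "failed" if "failed" in conclusive else "ok"


if __name__ == "__main__":
    outcomes = [classify(passed=False, timed_out=True),
                classify(passed=True, timed_out=False)]
    print(next_site_status("ok", outcomes))
```

With this rule a timeout alone never moves a site from "ok" to "failed", while a real test failure still does.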
        • <big>Implications When No SAM Results for 24 Hours</big> 15m
          The French ROC asked:
          What are the implications of no SAM test results at a site for >24 hours? How does it affect availability/reliability calculations?

          Rajesh from the GridView team:
          By default, each SAM test result is considered to be valid for 24 hours. The VO running the test can change this to any other value, but at present there is no interface to do that; the value has to be entered directly in our DB upon request.

          If Gridview finds no SAM results less than 24 hours old for a service, it considers the status of that service to be Unknown. If the service is one of CE, SE, SRM or sBDII, the site status also goes to Unknown. Documentation on this availability algorithm is available on the Gridview twiki page at the address below:

          LCG.GridView

          You can look in the section on documents and the presentations. The presentation file has a slide which explains the algorithm nicely.

          Rajesh
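The rule described in the reply above can be sketched as follows. This is a minimal illustration and not GridView's actual implementation: the function names, the (timestamp, status) tuple layout and the status strings are assumptions made for the example. Only the 24-hour default validity and the CE, SE, SRM, sBDII list come from the reply.

```python
from datetime import datetime, timedelta

# Default validity of a SAM test result, as stated in the reply.
DEFAULT_VALIDITY = timedelta(hours=24)
# Service types whose Unknown status also makes the site Unknown.
SITE_LEVEL_SERVICE_TYPES = {"CE", "SE", "SRM", "sBDII"}


def service_status(results, now, validity=DEFAULT_VALIDITY):
    """Return the status of one service.

    results: list of (timestamp, status) tuples (hypothetical layout).
    A result counts only if it is no older than `validity`.
    """
    fresh = [(ts, st) for ts, st in results
             if timedelta(0) <= now - ts <= validity]
    if not fresh:
        return "UNKNOWN"
    # Use the most recent result that is still valid.
    return max(fresh, key=lambda r: r[0])[1]


def site_is_unknown(services, now):
    """services: dict mapping service type to its list of results."""
    return any(
        stype in SITE_LEVEL_SERVICE_TYPES
        and service_status(results, now) == "UNKNOWN"
        for stype, results in services.items()
    )


if __name__ == "__main__":
    now = datetime(2000, 1, 2, 12, 0)  # arbitrary example time
    stale = [(now - timedelta(hours=30), "OK")]
    recent = [(now - timedelta(hours=2), "OK")]
    print(service_status(stale, now), site_is_unknown({"CE": stale, "SE": recent}, now))
```

A VO-specific validity, once entered in the DB as described, would correspond to passing a different `validity` value in this sketch.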

      • 16:30 17:00
        WLCG Items 30m
        • <big> WLCG issues coming from ROC reports </big>
        • <big>WLCG Service Interventions (with dates / times where known) </big>
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board
          1. [Announcement]

          Time at WLCG T0 and T1 sites.

        • <big>FTS service review</big> 5m

          Please read the report linked to the agenda.
          In particular ?

          Speakers: Gavin McCance (CERN), Steve Traylen
        • <big>CMS service</big>
          • Item 1
          Speaker: Mr Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)
        • <big> LHCb service </big>
          1. GGUS ticket #30562
          2. CNAF is not usable for reprocessing activity because files cannot be opened through the rfio protocol (the connection gets stuck after the file has been opened). CNAF staff are waiting for instructions from CASTOR support on this issue. One solution is to install rootd and access files through it (that cured a similar problem experienced at CERN in the past). I would set a very urgent action on CASTOR support to give CNAF the recommendations needed to get the site working again.
          3. LHCb wants to emphasize that NIKHEF/SARA has not been used (it was felt to be unusable) because of continuous downtimes and storage problems in recent weeks. Can we have a report from SARA?
          Speaker: Dr Roberto Santinelli (CERN/IT/GD)
        • <big> ALICE service </big>
          • Item 1
          Speaker: Dr Patricia Mendez Lorenzo (CERN IT/GD)
        • <big> WLCG Service Coordination </big>
          • Item 1
          Speaker: Harry Renshall / Jamie Shiers
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
      • 17:30 17:35
        Review of action items 5m
        list of actions
      • 17:35 17:35
        AOB
        1. Item 1