WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))


Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 4:00 PM 4:00 PM
        Feedback on last meeting's minutes
      • 4:01 PM 4:30 PM
        EGEE Items 29m
        • <big> Grid-Operator-on-Duty handover </big>
          From: France and Italy
          To: UKI and Russia

          Report from France:
            2 cases transferred to political instances
          • IN-DAE-VECC-01:
            GGUS ticket #40782: APEL failure on gridce01.tier2-kol.res.in. Ticket submitted on 11/09/08. No answer.
            GGUS ticket #41152: SRM failure on gridse001.tier2-kol.res.in. Ticket submitted on 22/09/08. No answer.
            => Suspension of IN-DAE-VECC-01 was already discussed at the Ops meeting, but the site has still not been suspended by the ROC.
            -> CODs have rights in GOCDB to suspend, but are they allowed to do it?
          • RU-Phys-SPbSU: APEL failure on phys5.gridzone.ru. GGUS ticket #40521, submitted on 05/09/08. No answer. => Ask for suspension.
          • UKI-LT2-QMUL: RGMA failure on mon01.esc.qmul.ac.uk. GGUS ticket #40945, submitted on 16/09/08. Answered on 04/10/08: the site did not receive the ticket. => ROC_UKI does not seem to be answering. It seems ROC_UKI does not receive GGUS notifications. This should be fixed.
          • KR-KISTI-HEP: APEL failure on hep001.kisti.re.kr. GGUS ticket #40773. Answered on 03/10/08.
          • srm.pps.cern.ch (CERN-PROD): in SD until 03/10/09. Is it a test node or a CERN-PPS node? If yes, it would be better to change the SD description to "Test node". => There is still nothing about the possibility of declaring a test node in GOCDB (see https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus). What is the status of the 'test node' problem?


          Report from Italy:
            2 cases transferred to political instances
          • GGUS ticket #40521. Affected site: RU-Phys-SPbSU. Responsible unit: ROC_Russia. No replies.
          • GGUS ticket #40945. Affected site: UKI-LT2-QMUL. Responsible unit: ROC_UK/Ireland. Apologies received on 2008-10-04: "The delay in responding was related to the fact that the QMUL site admin email list was left off the orginal list of assignees. Anyway, mon01 has a problem which should be fixed early next week."
        • <big> PPS Report & Issues </big>
          Please find Issues from EGEE ROCs and general info in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
        • <big> gLite Release News</big>
          Please find gLite release news in:

          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

          Now in Production:
          Now in PPS:


          Soon in Production:
        • <big> EGEE issues coming from ROC reports </big>
          • France: What is the status of the SAM problem raised in GGUS ticket #40565? It seems that some nodes might not be taken into account by SAM after an SD.
        • <big> Comparison of BDII and GOCDB entries for LFC in GSTAT</big> 10m
          Some sites have noticed that GSTAT is now comparing the LFC GlueService entries in the BDII with the node entries in GOCDB.

          Example: http://gstat.gridops.org/gstat/CERN-PROD shows prod-lfc-atlas-local.cern.ch as being present in the BDII with GlueServiceType lcg-local-file-catalog, but in the GOCDB this host is entered as node type LFC. Assuming it is a local LFC, it should have node type Local-LFC.

          History: GGUS:38053

          While this test does produce an error in GSTAT, it is not critical in the sense of availability or for the CODs, i.e. fix it in your own time.

          To pass this test the following comparison is made.

          Service        BDII Service Type         GOCDB Node Type
          Central LFC    lcg-file-catalog          LFC
          Local LFC      lcg-local-file-catalog    Local-LFC
          If you have an LFC which is both a central and local LFC for different VOs then you should enter the node in the GOCDB as both an LFC and a Local-LFC.
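
          The comparison above can be sketched as follows. This is only an illustration, not the actual GSTAT code: the function name and the input format (plain lists of strings) are assumptions, and it only checks that every published LFC service type has its matching GOCDB node type.

```python
# Expected GOCDB node type for each BDII GlueServiceType,
# taken from the LFC comparison described in the agenda.
EXPECTED_NODE_TYPE = {
    "lcg-file-catalog": "LFC",
    "lcg-local-file-catalog": "Local-LFC",
}


def missing_node_types(bdii_service_types, gocdb_node_types):
    """Return the GOCDB node types a host still needs in order to pass.

    bdii_service_types: GlueServiceType values published for the host.
    gocdb_node_types: node types registered for the host in GOCDB.
    Both arguments are hypothetical inputs for this sketch.
    """
    required = {
        EXPECTED_NODE_TYPE[service]
        for service in bdii_service_types
        if service in EXPECTED_NODE_TYPE  # ignore non-LFC services
    }
    return sorted(required - set(gocdb_node_types))


# A local LFC registered only as "LFC" is reported as missing "Local-LFC".
print(missing_node_types(["lcg-local-file-catalog"], ["LFC"]))
```

          A host that is both a central and a local LFC passes this check only when both node types are registered, which matches the advice above.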
        • <big>Comparison of BDII and GOCDB Entries for bdii_site and bdii_top Services</big> 5m
          Similar to the LFC test above, GSTAT also runs a test that compares the BDII entries for Site BDII and Top BDII endpoints with their GOCDB node types. The combinations that pass are:
          Service      GlueServiceType    GOCDB Node Type
          Top BDII     bdii_top           Top-BDII
          Site BDII    bdii_site          Site-BDII

          History: GGUS:40475

          In the case of the top BDII there is an existing bug that can make this harder to resolve than it should be when you wish to publish a host alias as the service endpoint: BUG:41361. A fix for this trivial bug will be pushed forward.

        • <big>New LFC SAM tests</big> 5m
          Later this week, two new services will be added to SAM production: LFC_L and LFC_C. The associated tests will be made critical so that history can be viewed in the SAM portal, but they will be ignored for availability calculations, and COD alarms will be suppressed. At some stage in the future, and after suitable notifications, they will replace the existing LFC service. The new tests avoid trying to write to read-only LFCs, and include an lfc-ping test on which the others are dependent.
          Speaker: John Shade (CERN)
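
          The dependency on lfc-ping could look roughly like the sketch below. It is not the SAM implementation: the function, the result strings and the dependent test name "read-entry" are hypothetical, and only the idea that the other tests depend on lfc-ping comes from the announcement.

```python
def run_lfc_tests(ping, dependent_tests):
    """Run the ping test first and skip the dependent tests if it fails.

    ping: callable returning True on success.
    dependent_tests: dict mapping a test name to a callable returning True/False.
    Returns a dict mapping each test name to "ok", "error" or "skipped".
    """
    results = {"lfc-ping": "ok" if ping() else "error"}
    for name, test in dependent_tests.items():
        if results["lfc-ping"] != "ok":
            # No point contacting an LFC that does not answer the ping.
            results[name] = "skipped"
        else:
            results[name] = "ok" if test() else "error"
    return results


# Hypothetical example: the ping fails, so the dependent test never runs.
print(run_lfc_tests(lambda: False, {"read-entry": lambda: True}))
```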
        • <big>gLite 3.0 services to be obsoleted</big> 5m
          • glite-SE_classic
          • glite-VOBOX
          • glite-WMS
          • glite-PX
          • glite-MON

          An announcement of this retirement is already on the gLite 3.0 page:
          http://glite.web.cern.ch/glite/packages/R3.0/
          This corresponds to the procedure (until we have a new one) that was discussed in the ops meeting in Feb 08:
          https://twiki.cern.ch/twiki/bin/view/EGEE/WlcgOsgEgeeOpsMinutes2008x02x25#Support_for_gLite_3_0_services
          PLEASE LET US KNOW OF ANY OBJECTIONS BY NEXT WEEK!
      • 4:30 PM 5:00 PM
        WLCG Items 30m
        • <big>Changes in VO Cards, e.g. change in required OS Software</big> 10m

          Following recent requests made by a VO member directly to sites to install a particular extra piece of OS software, here is a recap of the policy.

          VOs wishing to change what they need sites to support should of course use the VO cards as the definitive reference.

          Any change to the VO card by any VO which would trigger site action should be discussed first at the weekly EGEE/WLCG operations meeting.

          The purpose is to allow other VOs and sites to raise concerns. A sensible timeline can also be decided for the sites to implement the changes.

        • <big>Job Storm for Last Friday's GridFest.</big> 5m
          For last Friday's LHC GridFest several hundred thousand jobs were submitted.

          It is clear that sites and resource centres should have been notified about this. Thanks to all sites who propped up services during this time. To my knowledge only one 3.0 lcg-CE actually died.

          Apologies for not informing the sites. All jobs should now have exited and be clear of the system.

        • <big> WLCG issues coming from ROC reports </big>
          1. France: Is there a procedure to notify sites and GGUS automatically about changes in the LHC alarm DN list? (cf. https://twiki.cern.ch/twiki/bin/view/LCG/OperationsAlarmsPage) Checking this list manually is not very user-friendly and could lead to an alarm from a newly authorized person being rejected if sites or GGUS are not up to date. This kind of change could be notified to sites and GGUS by a GGUS ticket. This would ensure that everyone is aware of the changes and that they have been taken into account. This should also cover possible changes to the alarm email addresses for sites/VOs.
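
          As an illustration of the kind of automatic check being asked for, the sketch below compares two versions of an alarm DN list and reports what changed. No such procedure is described here: the function name and the example DNs are made up, and how the result would be turned into a GGUS ticket is left open.

```python
def dn_list_changes(old_dns, new_dns):
    """Return (added, removed) DNs between two versions of an alarm DN list."""
    old, new = set(old_dns), set(new_dns)
    return sorted(new - old), sorted(old - new)


# Hypothetical DNs, for illustration only.
previous = ["/DC=example/CN=Alice", "/DC=example/CN=Bob"]
current = ["/DC=example/CN=Bob", "/DC=example/CN=Carol"]

added, removed = dn_list_changes(previous, current)
print("added:", added)
print("removed:", removed)
```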
        • <big>WLCG Service Interventions (with dates / times where known) </big>
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Many interventions scheduled this week. Please consult the URLs above for details.

          Time at WLCG T0 and T1 sites.

        • <big> WLCG Operational Review </big>
          Speaker: Harry Renshall / Jamie Shiers
        • <big> Alice report </big>
        • <big> Atlas report </big>
        • <big> CMS report </big>
          None.
          Speaker: Daniele Bonacorsi
        • <big> LHCb report </big>
        • <big> Storage services: Recommended base versions </big>
          The recommended baseline versions for the storage solutions can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

        • <big> Storage services: this week's updates </big>
      • 5:00 PM 5:30 PM
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
      • 5:30 PM 5:35 PM
        Review of action items 5m
      • 5:35 PM 5:35 PM
        AOB