WLCG-OSG-EGEE Operations meeting

Europe/Zurich
28-R-15 (CERN conferencing service (joining details below))

Nick Thackray
Description
grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0148141

    OR click HERE
    (Please specify your name & affiliation in the web-interface)

    Click here for minutes of all meetings

    Click here for the List of Actions

      • 4:00 PM
        EGEE Items
        • Grid-Operator-on-Duty handover
          "Old" COD: UK/I => South East Europe (SEE)

          Report from UK/I (regular COD) :
          Quiet Week. No Serious Alarms.
          Outstanding issues are three tickets at the 3rd-mail-to-site-admin level. No response has been received from the grid site MPI-K (ROC_DECH).
          It looks like the entire site is failing and should probably be put into downtime.
          • GGUS:47952 grid-mon.mpi-hd.mpg.de MPI-K
          • GGUS:47920 grid-se.mpi-hd.mpg.de MPI-K
          • GGUS:47872 grid-ce.mpi-hd.mpg.de MPI-K


          Report from cCOD (SouthWest Europe):
          Problems with ROC CERN. We don't know whether they have received our mails or notifications, as they have not reacted to them; they have a lot of "OK" timeouts and alarms without tickets. However, this ROC has only just started on C-COD and may still be adapting.

          Northern Europe has a ticket (10045) that has been timed out since 10-04-2009.

          Italy is still in its adaptation period.

        • PPS Report & Issues
          Please find Issues from EGEE ROCs and general info in:
          https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps

          Highlights:

        • gLite Release News
        • EGEE issues coming from ROC reports
          • Central Europe: In the CE ROC we have some sites with 32-bit machines, so we would like to know when 32-bit gLite 3.1 is planned to be discontinued. We need this to check whether those sites will be able to upgrade by that time.

          • UK/I: SAM results in the ROC report this week are missing.

          • UK/I: Last August there was a security incident in which a UKI site was involved. As a precaution, a UKI user was banned at sites pending investigation. No connection with, or compromise of, this user's account was found. The user still finds that they are banned at some sites, although their grid certificate was reissued. What is the procedure here: should the user send GGUS tickets to all sites? Is this a ROC responsibility?
            Response from EGEE Security Officer: The VO is the entity who can suspend a user. Of course some sites may manually ban the user (and sometimes they are asked to), but usually the problem is solved after the affected site sends its security incident resolution report.
            However, ultimately, any site can decide to never re-enable the affected user again. In this case the user should contact his VO, who should talk to the sites. When this happens, usually users open a GGUS ticket.
            To answer your question specifically: the agreement is between the sites and the VOs, hence the VO should be contacting the site(s). However, whenever appropriate (e.g. when there is clearly no security risk anymore), the ROC could help and simply recommend that all its sites un-ban the user. Usually this poses no problem.
        • Grid Service Interventions


          ALL TIMES IN UTC+2

          Downtimes affecting the WLCG Tier-1 sites:

          RAL: At Risk: From 09:00 29 April to 14:00 29 April. Services: srm-hone.gridpp.rl.ac.uk ; srm-mice.gridpp.rl.ac.uk ; srm-minos.gridpp.rl.ac.uk ; lcgce03.gridpp.rl.ac.uk ; lcgce04.gridpp.rl.ac.uk ; srm-cms.gridpp.rl.ac.uk ; lcgce02.gridpp.rl.ac.uk ; srm-dteam.gridpp.rl.ac.uk ; lcgce01.gridpp.rl.ac.uk ; srm-atlas.gridpp.rl.ac.uk ; srm-lhcb.gridpp.rl.ac.uk ; lcgce05.gridpp.rl.ac.uk ; srm-alice.gridpp.rl.ac.uk ; srm-ilc.gridpp.rl.ac.uk

          RAL: At Risk: From 10:00 28 April to 18:00 28 April. Services: srm-hone.gridpp.rl.ac.uk ; srm-mice.gridpp.rl.ac.uk ; srm-minos.gridpp.rl.ac.uk ; lcgce03.gridpp.rl.ac.uk ; lcgce04.gridpp.rl.ac.uk ; srm-cms.gridpp.rl.ac.uk ; lcgce02.gridpp.rl.ac.uk ; srm-dteam.gridpp.rl.ac.uk ; lcgce01.gridpp.rl.ac.uk ; srm-atlas.gridpp.rl.ac.uk ; srm-lhcb.gridpp.rl.ac.uk ; lcgce05.gridpp.rl.ac.uk ; srm-alice.gridpp.rl.ac.uk ; srm-ilc.gridpp.rl.ac.uk

          CERN-PROD: At Risk: 09:30 28 April - 12:30 28 April. Services: srm-public.cern.ch ; srm-atlas.cern.ch

          RAL: OUTAGE: From 11:00 27 April to 19:00 30 April. Services: lcgce03.gridpp.rl.ac.uk

          NDGF-T1: OUTAGE: 16:10 17 April - 16:00 01 May. Services: ce01.titan.uio.no

          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          Please consult the URLs above for details.

        • REMINDER: Retirement of gLite 3.0
          As previously announced, it is planned that all remaining gLite 3.0 services will be retired by the end of April. At this point, all support for these services will cease.
          All sites should ensure that they are running up-to-date versions of their services.
          If any site sees a need to keep a gLite 3.0 service in the middleware stack, please submit a GGUS ticket as soon as possible.

          List of sites still publishing gLite 3.0 : https://twiki.cern.ch/twiki/bin/view/EGEE/SitesPublishinggLite30
      • 4:01 PM
        OSG Items
        Speakers: Maria Dimou, Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          OSG
          Information on GOCDB-to-OIM migration for USA WLCG T1 sites
          • ggus #44104. This ticket is waiting on the OSG GOC to roll out changes to their production BDII that will publish entries by their OSG resource group, not the OSG resource name. This will remove this issue before it gets to the BDII. Next action deadline in OIM is in Feb 2010. Should we close as unsolved to free the escalation reports?
          • ggus #46988. Nothing happened since my comment of 2009-03-30.
          • ggus #47716. Misrouted ticket. In status 'Closed' in OIM. Why not in GGUS? Rob/Guenter, please comment in the ticket.
          • ggus #47786. Urgent. Submitted 2009-04-08!
      • 4:02 PM
        Review of action items
      • 4:03 PM
        AOB

        Please join the USAG meeting this Thursday 2009-04-30 at 9:30am CEST with the theme "EGEE-->EGI New software process; impact on User Support". Agenda here!