WLCG-OSG-EGEE Operations meeting
Date/Time: Monday, 4 June 2007 - 16:00 (Europe/Zurich)
Location: CERN conferencing service (joining details below) ( 28-R-15 )
Chairperson: Steve Traylen (CERN)
Description: grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610

    OR click HERE

    NB: Reports were not received in advance of the meeting from:

  • ROCs: AP, Russia, SWE
  • Tier-1 sites: ASGC, INFN, IN2P3, NDGF, PIC
  • Tier-1 availability reports: ASGC, FZK, IN2P3, INFN, PIC, SARA/NIKHEF
  • VOs: CMS, Alice
  • Material: Minutes, list of actions

     
     Monday, 4 June 2007
     16:00
    Feedback on last meeting's minutes (5')   Minutes link    
     16:01
    EGEE Items (29')    
    • Grid-Operator-on-Duty handover
      From ROC Russia (backup: ROC Italy) to ROC France (backup: ROC SWE)

      NB: Please can the grid ops-on-duty teams submit their reports no later than 12:00 UTC (14:00 Swiss local time).

      Tickets:
      lead team:
      Opened new 53
      Closed 23
      2nd mails 21
      Updated 57
      All together 154

      Issues:
        There are no e-mail addresses under "contacts" in the new version of the dashboard.
        Problems with the SAM test took place on Friday 2007-06-01.

        Backup team
        - Huge number of tickets on our side; we asked the backup team (Russia) for help. Help was given, thanks!
        - SAM problems on Friday
     
    • PPS Report & Issues
      PPS reports were not received from these ROCs:

      Issues from EGEE ROCs:
      1. (ROC xxx):


    Nicholas Thackray (CERN)  
    • EGEE issues coming from ROC reports
      • (ROC Central Europe):
      • Several sites started to fail SAM OPS tests with "7 authentication with the remote server failed". The cause is the new mapping of SAM tests to the SGM account: with the new YAIM, SGMs are mapped to pooled accounts instead of a single account. Sites were advised to change the mapping in /opt/edg/etc/lcmaps/gridmapfile temporarily and, as a long-term solution, to create a pool of accounts.
      • Core Services: one machine in the top-level BDII pool was replaced due to disk problems.
      • Submitted GGUS tickets: https://gus.fzk.de/pages/ticket_details.php?ticket=22664 regarding a DPM issue and the new YAIM; https://gus.fzk.de/pages/ticket_details.php?ticket=21937 about R-GMA instabilities that prevent accounting data from being published (developers cannot trace the source of the problems); https://gus.fzk.de/pages/ticket_details.php?ticket=22244 about issues related to introducing pooled accounts in YAIM for SGM groups.
      • (ROC CERN):
      • TRIUMF-LCG2: How about adding a CRL expiry check to the SAM tests to give a warning?
      • USCMS-FNAL-WC1: On June 11, I am decommissioning the 2 LCG gateway CEs to USCMS resources. All access after that point will be through the OSG gateways.
      • (ROC Germany-Switzerland):
      • (issue) Ticket https://gus.fzk.de/ws/ticket_info.php?ticket=21600 (middleware problem, ownership on AFS) is currently said to be forwarded to the EMT.
      • DESY HH provides a UI for the users in our AFS space. Some commands like voms-proxy-init require some files to belong to root or to the user who executes the command. This is not feasible with AFS, so users have to copy the files locally. We reported the problem in GGUS ticket 21600, but no real action has been taken by the developers. This middleware bug was especially annoying this week, as several users complained because their copy was no longer consistent with the CERN VOMS server certificates.
      • (for information) FZK: Part of the availability pages do not seem to cover actual availability after reported SAM problems
      • (ROC South Eastern Europe):
      • No major issues to report, other than that a ticket for TAU-LCG2 has reached the final escalation step. Yan Ben Hammou will be present at the meeting to sort this one out.
      • (ROC United Kingdom and Ireland):
      • UKI-SCOTGRID-GLASGOW Had to clear jobs which were stalled due to lcg-cr commands hanging (http://scotgrid.blogspot.com/2007/05/users-and-stalled-jobs.html). No response from biomed user who was responsible for most of these (https://gus.fzk.de/pages/ticket_details.php?ticket=22717). User will be banned from our site if no response is forthcoming. We believe this is a reasonable policy for our site, but are there official guidelines on this?
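For the SGM mapping issue reported by ROC Central Europe above, a site admin may want to see which entries in a mapping file point to a single account and which point to a pool. The following is a minimal Python sketch and is not taken from the reports. It assumes the common grid-mapfile convention in which a local account name with a leading dot denotes a pool of accounts. The function name `classify_mapping` and the example FQANs and account names are hypothetical.

```python
import shlex


def classify_mapping(line):
    """Classify one mapping line as 'static' or 'pool'.

    Returns (subject, account, kind), or None for blank lines,
    comment lines, or lines without exactly two fields.
    """
    # shlex handles the double-quoted subject and drops '#' comments.
    fields = shlex.split(line, comments=True)
    if len(fields) != 2:
        return None
    subject, account = fields
    # Assumption: a leading dot on the account marks a pool of accounts.
    kind = "pool" if account.startswith(".") else "static"
    return subject, account.lstrip("."), kind


# Hypothetical entries; real FQANs and account names depend on the site.
example_lines = [
    '"/VO=ops/GROUP=/ops/ROLE=lcgadmin" opssgm',
    '"/VO=ops/GROUP=/ops/ROLE=lcgadmin" .opssgm',
    "# comment",
]
for entry in example_lines:
    print(classify_mapping(entry))
```

Running a check like this over the file before and after a YAIM reconfiguration would show whether the SGM entries changed from a single account to a pool.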
     
     16:30
    WLCG Items (30')    
     
    • WLCG issues coming from ROC reports
      1. None this week.


     
    • WLCG Service Interventions (with dates / times where known)
      Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

      See also this weekly summary of past / upcoming interventions at WLCG Tier0 and Tier1 sites (extracted manually from EGEE broadcasts and other sources).

      1. FTS service upgrade at CERN foreseen for Monday 18th June (TBC)
      2. CASTOR upgrades for C2PUBLIC, C2ATLAS, C2CMS, (C2LHCB) foreseen for next week

      Time at WLCG T0 and T1 sites.

     
    Gavin McCance (CERN)  
    Kors Bos (CERN / NIKHEF)  
    • CMS service
    Daniele Bonacorsi (CNAF-INFN BOLOGNA, ITALY)  
    • LHCb service
      1. LHCb ask for the possibility to integrate specific SRM endpoint tests in the official "ops" SAM suite. LHCb have developed a set of such tests, which we would like to see integrated. One area of concern is that some of the tests point to LHCb-specific endpoints that have to be reachable by the “ops” VO (in addition, there are a few minor integrations of the tests). The motivations for the LHCb request for these tests are:
        1. It is highly desirable to have close monitoring of a critical service like SRM and a correspondingly prompt reaction from WLCG operations.
        2. The site availability should be computed by also taking into account this SRM test. This will provide a strong incentive for sites to provide a reliable SE service.
      2. Operational point: issue with the new CERN CA certificates at dCache sites. Users using VOMS proxies generated from this CA cannot access data. Flavia and Maarten are looking at this problem together with dCache experts and affected site admins (e.g. Ron and Patrick). Joel's old certificate will expire in 15 days from now. Afterwards he will not be able to run on the grid. What is the current status of this problem? Wouldn't it be worth running SAM SRM tests with the new CA certificates?
    Roberto Santinelli (CERN/IT/GD)  
    • ALICE service
    Patricia Mendez Lorenzo (CERN IT/GD)  
    Jamie Shiers / Harry Renshall  
     16:55
    OSG Items (5')    
    1. Item 1
     17:00
    Review of action items (5')   list of actions link
     17:10
    AOB (5')    
  • Operations workshop in Stockholm, 13-15th June, agenda available:
    http://indico.cern.ch/conferenceTimeTable.py?confId=12807