WLCG-OSG-EGEE Operations meeting
Date/Time: Monday, 18 December 2006 - 16:00 (Europe/Zurich)
Location: CERN conferencing service (joining details below) ( 28-R-15 )
Chairperson:
Description: grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0157610


    Material: list of actions link minutes link

     
     Monday, 18 December 2006
     16:00
    Feedback on last meeting's minutes (5')   minutes link    
     16:05
    EGEE Items (25')    
    • Grid-Operator-on-Duty handover (5')
      From ROC Russia (backup: ROC CERN) to ROC CE (backup: ROC SEE)

      Tickets:
      New tickets :
      2nd email:
      Quarantine:

      Notes:

    • There is no point in opening tickets on a Thursday or Friday that would expire by the following Monday. On Monday 11 there were a lot of such tickets.
    • Please include SUSE Linux in the list of valid operating systems.
    • Merry Christmas and Happy New Year!
     
    • SLC4 testing in PPS (5')
    Nicholas Thackray (CERN)  
    • VOMS configuration update effective Jan 9th 2007 due to new hostcert of voms.cern.ch (5')
    Maria Dimou (CERN)  
    • Operational issues raised by VOs (5')
      Communication channels and escalation steps for VOs to raise operational issues:
    • VO to contact site and ROC (GGUS is preferred way)
    • if no answer: escalate it to operations meeting

    • IN GENERAL: Communication from sites (e.g. in broadcasts and site reports) should be as detailed as possible
     
    • EGEE issues coming from ROC reports (10')
      Reports were not received from these ROCs:
      Reports were not received from these non-HEP VOs:

      1. There were no SAM test results for lcg00125.grid.sinica.edu.tw from Nov 24-28. We just wanted the SAM team to know about this.
        * https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=CE&vo=ops&nodename=lcg00125.grid.sinica.edu.tw
        321 29-Nov-2006 13:49:59 OK ok 3.0.0 ok ok ok ok
        322 23-Nov-2006 11:35:44 ERROR error na na na na na
        (AP ROC)
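
        A gap like the one reported above can be spotted mechanically from the history rows. The following is a minimal sketch, assuming that the second and third whitespace-separated fields of each row are the date and time and the fourth is the overall status; the function names and the one-day threshold are hypothetical and not part of SAM.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# The two history rows quoted in the report above.
ROWS = [
    "321 29-Nov-2006 13:49:59 OK ok 3.0.0 ok ok ok ok",
    "322 23-Nov-2006 11:35:44 ERROR error na na na na na",
]


def parse_row(row: str) -> Tuple[datetime, str]:
    """Return (timestamp, status) for one row.

    Assumption: fields 1 and 2 are date and time, field 3 is the status.
    Note that %b matches English month names only in an English/C locale.
    """
    fields = row.split()
    stamp = datetime.strptime(f"{fields[1]} {fields[2]}", "%d-%b-%Y %H:%M:%S")
    return stamp, fields[3]


def find_gaps(rows: List[str], max_gap: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return pairs of consecutive timestamps that are more than max_gap apart."""
    stamps = sorted(parse_row(row)[0] for row in rows)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]


if __name__ == "__main__":
    # Hypothetical threshold: flag anything longer than one day without a result.
    for start, end in find_gaps(ROWS, timedelta(days=1)):
        print(f"no results between {start} and {end} ({end - start})")
```

        Sorting the timestamps first means the rows can be supplied in either order, as they are newest-first in the report.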


      2. There are periodic failures of SAM ops VO jobs on the gCE, while at the same time SAM dteam VO jobs do not fail. The error is:
        Got a job held event, reason: "The job attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900'
        No problems were found on the CE at all, so it looks like a SAM problem, and other sites appear to be experiencing the same problem.
        This seems to be a well-known issue, introduced when the lcg-CE was upgraded to gLite; other sites also see the same error once in a while. While this error was occurring, no problems were found on the CE.
        See also https://gus.fzk.de/ws/ticket_info.php?ticket=16530&from=search (SEE ROC)
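
        For readers unfamiliar with the expression in the error message, the following is a minimal Python sketch of how it can be read, assuming standard Condor ClassAd semantics in which `=!=` means "is not identical to" and therefore also yields true when the attribute is undefined. The function name is hypothetical, and `None` stands in for an undefined `Matched` attribute. On this reading, a job is put on hold if it has not been matched more than 900 seconds after it was queued.

```python
from typing import Optional

# Threshold taken from the expression quoted in the report.
HOLD_AFTER_SECONDS = 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Mirror 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched=None models an undefined Matched attribute (an assumption
    about how the ClassAd looks before the job is matched).
    current_time and qdate are Unix timestamps in seconds.
    """
    not_identical_to_true = matched is not True
    return not_identical_to_true and current_time > qdate + HOLD_AFTER_SECONDS


if __name__ == "__main__":
    # Queued at t=0, still unmatched at t=1000: the job would be held.
    print(periodic_hold(None, 1000, 0))
    # Matched jobs are never held by this expression.
    print(periodic_hold(True, 1000, 0))
```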


      3. The LIP-Lisbon site switched off the monitoring of their old CE (ce01.lip.pt) in the GOCDB about one week ago, but they are still receiving SAM tests from OPS. (SWE ROC)


      4. lcg-voms.cern.ch has been having serious reliability problems over the last week or two. Andrea Sciaba suggested adding voms.cern.ch as a secondary VOMS server for CMS (after we took it out everywhere?)
        Is this the official route to take? Is voms.cern.ch synched with lcg-voms.cern.ch at a reasonable rate? (FNAL site, CERN ROC)


      5. FOR INFORMATION:
        1. CE has finished regional certification of bulk submission method on WMS - results are here:
        http://wiki.grid.cyfronet.pl/RegionalCertification/WMSLB-3.0.2u10-bulk
        2. yaim 3.0.1 certification ended - results are here:
        http://wiki.grid.cyfronet.pl/RegionalCertification/Yaim-3.0.1-preview
        3. We've improved the stability of the regional top-level BDII by balancing the load across a second machine, adding a second IP address to the top-level BDII entry.
        see:
        host bdii.cyf-kr.edu.pl
        bdii.cyf-kr.edu.pl has address 149.156.9.24
        bdii.cyf-kr.edu.pl has address 161.53.0.229
        Every second BDII query is handled by the same machine. If a machine fails, the lcg-* utils use the other one. We have encountered no problems with this setup for 2 weeks, and there are almost no RM errors in CE related to the top-level BDII. (CE ROC)
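
        The failover behaviour described for the two-address BDII entry can be illustrated from the client side. This is a minimal sketch using only the Python standard library, not the actual lcg-* utils logic: it resolves every address published for a host name and tries them in order until one accepts a connection. The function names are hypothetical, and port 2170 in the usage comment is an assumption about the BDII port.

```python
import socket
from typing import Callable, List, Tuple

Address = Tuple[str, int]


def resolve_all(host: str, port: int) -> List[Address]:
    """Return every distinct IPv4 address published for host, in resolver order."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses: List[Address] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        address = (sockaddr[0], sockaddr[1])
        if address not in addresses:
            addresses.append(address)
    return addresses


def connect_with_failover(
    addresses: List[Address],
    connect: Callable = socket.create_connection,
    timeout: float = 5.0,
):
    """Try each address in turn and return the first successful connection."""
    last_error = None
    for address in addresses:
        try:
            return connect(address, timeout)
        except OSError as exc:
            # This machine is unreachable; fall through to the next address.
            last_error = exc
    raise ConnectionError(f"no address reachable, last error: {last_error}")


# Example use (requires network access; the port is an assumption):
#   conn = connect_with_failover(resolve_all("bdii.cyf-kr.edu.pl", 2170))
```

        Passing the connect function as a parameter keeps the failover loop testable without network access.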


      6. Today, update 10 to gLite-3.0.2 was announced, but there are problems with the documentation for this update: it does not specify at all whether service reconfiguration or restart is needed for nodes other than gLite-CE. This is a very important issue, and we have postponed the update until it is resolved. A GGUS ticket has been opened on this issue:
        https://gus.fzk.de/pages/ticket_details.php?ticket=16551
        The description of update 10 to gLite-3.0.2 has been changed and information on fixed bugs has been entered for each node type, but what is very problematic is that the entry on the need for service reconfiguration or restart reads:
        Information not available.
        Until this is resolved (GGUS ticket 16551), this update will not be deployed on our site. (SEE ROC)


      7. The site IFAE found a problem publishing accounting results with APEL and on 21st Nov opened a GGUS ticket (15884) to request support from the APEL experts.
        Carsten Preuss established a very fast first contact with IFAE, but the problem was not solved and there has been no further contact since then.
        On the 1st December, a ticket was opened against IFAE because it was not publishing accounting.
        APEL support should be notified that they need to contact IFAE again urgently. (SWE ROC)


     
     16:30
    OSG Items (20')    
    No items for discussion.
     16:50
    WLCG Items (45')   Paper word file pdf file    
     
    Harry Renshall  
    • WLCG related Issues coming from experiment VOs and Tier-1/Tier-2 reports (15') Tier-1 reports pdf file  
      Reports were not received from:
      > Tier-1 sites:
      BNL, NIKHEF, TRIUMF
      > VOs: NO VO REPORT RECEIVED

    • Item 1
     
     17:35
    Review of action items (20')   list of actions link    
     17:55
    AOB (5')    
    Next operations meeting: Monday 8th January 2007