WLCG-OSG-EGEE Operations meeting
Date/Time: Monday, 30 June 2008 - 16:00 (Europe/Zurich)
Location: CERN conferencing service (joining details below) ( 28-R-15 )
Description: grid-operations-meeting@cern.ch
Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up, and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
Attendees:
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Russia
  • VOs: Only LHCb submitted a report

     Monday, 30 June 2008
     16:00
    Feedback on last meeting's minutes   Minutes link    
     16:01
    EGEE Items (29')    
    • Grid-Operator-on-Duty handover
      From: CE / UK/I
      To: South Western Europe / Taiwan


      No reports from this week's COD teams. (NB: cic.in2p3.fr uses an invalid security certificate; it belongs to a different site.)
     
     
     
    • EGEE issues coming from ROC reports
      1. [ROC CE]: Multi-valued LCG_GFAL_INFOSYS

        Do the other federations have experience with multi-valued LCG_GFAL_INFOSYS?

        We suggest that SAM extend the RM test timeout when multi-valued LCG_GFAL_INFOSYS is introduced. This setting allows the test to fail over, but the test will probably take longer to execute.

        FYI: a ticket has been created (GGUS ticket #37754) reporting that SAM does not recognize SE downtime. The answer was that this is just an error in the visualization layer and that GridView scores are properly updated, but this report also doesn't recognize the downtime.

      2. [ROC France]: Air conditioning trouble at IN2P3-CC due to excessive heat.

      3. [ROC DECH]: DESY: What is the procedure when users use site resources in a denial-of-service manner? Contacting and/or banning the user is an immediate solution, but not a scalable one. The case in question is a memory fork bomb on a gLite WN (Torque client). Do generic Linux or Torque/Maui configurations or tools exist to prevent these, or at least to monitor them? We would appreciate feedback from other ROCs/sites.

      4. [ROC Northern Europe]: A bug report about a crashing glite-proxy-renewd was submitted on 11 June (GGUS ticket 37334). It is still in "assigned" status. Could someone have a look at it?

      5. [ROC South Eastern Europe]: AEGIS-01 and AEGIS-07 are asking if one monbox can handle the accounting for two sites.
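
On item 1 above: a minimal sketch of why a multi-valued LCG_GFAL_INFOSYS can lengthen a test run. It assumes the variable holds a comma-separated list of host:port BDII endpoints that are tried in order, each with its own timeout. The hostnames, the default port 2170, the 15-second timeout, and all function names are illustrative assumptions, not SAM or GFAL code.

```python
import socket
from typing import Callable, List, Optional, Tuple

Endpoint = Tuple[str, int]
DEFAULT_PORT = 2170  # assumed default BDII port when none is given


def parse_infosys(value: str) -> List[Endpoint]:
    """Split an assumed comma-separated 'host:port,host:port' value."""
    endpoints: List[Endpoint] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.partition(":")
        endpoints.append((host, int(port) if sep else DEFAULT_PORT))
    return endpoints


def tcp_probe(endpoint: Endpoint, timeout: float) -> bool:
    """Return True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    endpoints: List[Endpoint],
    timeout: float,
    probe: Callable[[Endpoint, float], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Try each endpoint in order and return the first one that answers."""
    for endpoint in endpoints:
        if probe(endpoint, timeout):
            return endpoint
    return None


def worst_case_seconds(endpoints: List[Endpoint], timeout: float) -> float:
    """Upper bound on time spent failing over: every endpoint times out."""
    return len(endpoints) * timeout


# Hypothetical two-endpoint setting, for illustration only.
example = parse_infosys("bdii1.example.org:2170, bdii2.example.org:2170")
print(worst_case_seconds(example, 15.0))
```

Under these assumptions, the time lost to fail-over grows linearly with the number of listed endpoints, which is the margin an extended RM test timeout would need to absorb.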
     
     16:30
    WLCG Items (30')    
    • WLCG issues coming from ROC reports
      1. [ROC France]: Many jobs (from Alice and Atlas) had to be cancelled to solve a problem which resulted from a massive job submission by Atlas (>30'000 jobs).
     
     
    • Status of deployment of FTM at tier-1 sites
      Which LCG tier-1 sites have successfully deployed FTM?
      For those tier-1 sites which have not deployed FTM, when is this planned to take place?
      The experiments want this because the FTM publishes transfer logs to GridView (thanks Steve ;o)

      Responses:
      • ASGC: Already deployed and operational.
      • BNL: Already deployed and operational.
      • CNAF: Installed last week but still being tested.
      • DE-KIT (FZK/GridKa): Already deployed and operational.
      • IN2P3-CC: Not yet installed. Hope to have it in place during July.
      • NDGF: Not installed. Will take at least 3 weeks if needed.
      • PIC: A test instance is being deployed now and is planned to be in production by mid-July.
      • RAL: Already deployed and operational.
      • SARA: Intend to install FTM early in July.
      • TRIUMF: Already deployed and operational.
     
    • WLCG Operational Review
    Harry Renshall / Jamie Shiers  
    • Alice report
     
    • Atlas report
     
    • CMS report
    Daniele Bonacorsi  
    • LHCb report
      1. IN2P3 gsidcap file access issue: https://gus.fzk.de/pages/ticket_details.php?ticket=36625&from=allt The problem has finally been understood (the global GSI environment gets screwed up by multiple connections into the same gsidcap door). A new patch (1.8.0-15p8, out next week) will cure this problem; it has to be rolled out very quickly.

      2. SARA SRMv1: no pools configured. https://gus.fzk.de/pages/ticket_details.php?ticket=37712

     
    • Recommended base versions for storage services: URL link  
     
     17:00
    OSG Items (30')   Rob Quick (OSG - Indiana University)  
     
     17:30
    Review of action items (5')   list of actions link    
     17:35
    AOB    
    Suggestion to use EVO rather than the CERN conferencing system in the future.
    We could use the EGEE community which exists in EVO:
    http://evo.caltech.edu/evoGate/