WLCG-OSG-EGEE Operations meeting

28-R-15 (CERN conferencing service (joining details below))



Weekly OSG, EGEE, WLCG infrastructure coordination meeting.
We discuss the weekly running of the production grid infrastructure based on weekly reports from the attendees. The reported issues are discussed, assigned to the relevant teams, followed up and escalated when needed. The meeting is also the forum for the sites to get a summary of the weekly WLCG activities and plans.
  • OSG operations team
  • EGEE operations team
  • EGEE ROC managers
  • WLCG coordination representatives
  • WLCG Tier-1 representatives
  • other site representatives (optional)
  • GGUS representatives
  • VO representatives
  • To dial in to the conference:
    a. Dial +41227676000
    b. Enter access code 0140768


    NB: Reports were not received in advance of the meeting from:

  • ROCs: Russia
  • VOs: Only LHCb submitted a report
      • 16:00 16:00
        Feedback on last meeting's minutes
      • 16:01 16:30
        EGEE Items 29m
        • Grid-Operator-on-Duty handover
          From: CE / UK/I
          To: South Western Europe / Taiwan

          No reports from this week's COD teams. (NB: cic.in2p3.fr uses an invalid security certificate; it belongs to a different site.)
        • PPS Report & Issues
          Please find issues from EGEE ROCs and general info in:

        • gLite Release News

          Release News:
          Please find gLite release news in:

        • EGEE issues coming from ROC reports
          1. [ROC CE]: Multi-valued LCG_GFAL_INFOSYS

            Do the other federations have experience with multi-valued LCG_GFAL_INFOSYS?

            We suggest that SAM extend the RM test timeout with the introduction of multi-valued LCG_GFAL_INFOSYS. This setting allows the test to fail over, but it will probably take longer to execute.

            FYI: a ticket has been created (GGUS ticket #37754) reporting that SAM does not recognize SE downtime. The answer was that this is just an error in the visualization layer and that GridView scores are properly updated, but this report also doesn't recognize the downtime.

          2. [ROC France]: Air conditioning trouble at IN2P3-CC due to excessive heat.

          3. [ROC DECH]: DESY: What is the procedure when users use site resources in a denial-of-service manner? Contacting and/or banning the user is an immediate solution, but not a scalable one. The problem in this case is a memory fork bomb on a gLite WN (Torque client). Do generic Linux or Torque/Maui configurations or tools exist to prevent these, or at least to monitor them? We would appreciate feedback from other ROCs/sites.

          4. [ROC Northern Europe]: A bug report about a crashing glite-proxy-renewd was submitted on June 11th (GGUS ticket 37334). It is still in assigned status. Could someone have a look at it?

          5. [ROC South Eastern Europe]: AEGIS-01 and AEGIS-07 are asking if one monbox can handle the accounting for two sites.
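
Regarding item 1 above, the following is a minimal Python sketch of why a multi-valued LCG_GFAL_INFOSYS can make a test run longer: each listed endpoint may have to time out before the next one is tried. It assumes the value is a comma-separated list of host:port entries and that 2170 is the default port; the host names are hypothetical, and this is an illustration only, not how SAM or GFAL actually implement fail-over.

```python
import socket


def parse_infosys(value):
    """Split a comma-separated LCG_GFAL_INFOSYS value into (host, port) pairs.

    Assumption: entries look like "host:port"; a missing port falls back to 2170.
    """
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        endpoints.append((host, int(port) if port else 2170))
    return endpoints


def first_reachable(endpoints, timeout=5.0, connect=socket.create_connection):
    """Try each endpoint in order and return the first that accepts a connection."""
    for host, port in endpoints:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Endpoint unreachable or timed out: fail over to the next one.
            continue
        conn.close()
        return host, port
    return None


def worst_case_seconds(endpoints, timeout=5.0):
    """Rough estimate of the time spent if every endpoint has to time out."""
    return len(endpoints) * timeout


# Hypothetical two-endpoint setting
endpoints = parse_infosys("bdii1.example.org:2170,bdii2.example.org:2170")
print(worst_case_seconds(endpoints, timeout=5.0))
```

With two endpoints and a 5-second connection timeout, the estimate doubles from 5 to 10 seconds, which is the kind of margin a test timeout would need to absorb.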
      • 16:30 17:00
        WLCG Items 30m
        • WLCG issues coming from ROC reports
          1. [ROC France]: Many jobs (from Alice and Atlas) had to be cancelled to solve a problem which resulted from a massive job submission by Atlas (>30'000 jobs).
        • WLCG Service Interventions (with dates / times where known)
          Link to CIC Portal (broadcasts/news), scheduled downtimes (GOCDB) and CERN IT Status Board

          1. Due to network maintenance SARA's 3D database, saradb, will be unavailable on 30/06/2008 starting 16:00 UTC until 18:00 UTC.

          Time at WLCG T0 and T1 sites.

        • Status of deployment of FTM at tier-1 sites
          Which LCG tier-1 sites have successfully deployed FTM?
          For those tier-1 sites which have not deployed FTM, when is this planned to take place?
          The experiments want this because the FTM publishes transfer logs to GridView (thanks Steve ;o)

          • ASGC: Already deployed and operational.
          • BNL: Already deployed and operational.
          • CNAF: Installed last week but still being tested.
          • DE-KIT (FZK/GridKa): Already deployed and operational.
          • IN2P3-CC: Not yet installed. Hope to have it in place during July.
          • NDGF: Not installed. Will take at least 3 weeks if needed.
          • PIC: A test instance is being deployed now and is planned to be in production by mid-July.
          • RAL: Already deployed and operational.
          • SARA: Intend to install FTM early in July.
          • TRIUMF: Already deployed and operational.
        • WLCG Operational Review
          Speaker: Harry Renshall / Jamie Shiers
        • Alice report
        • Atlas report
        • CMS report
          Speaker: Daniele Bonacorsi
        • LHCb report
          1. IN2P3 gsidcap file access issue: https://gus.fzk.de/pages/ticket_details.php?ticket=36625&from=allt The problem has finally been understood (the global GSI environment gets corrupted by multiple connections to the same gsidcap door). A new patch (1.8.0-15p8, out next week) will cure this problem and has to be rolled out very quickly.

          2. SARA SRMv1: no pools configured. https://gus.fzk.de/pages/ticket_details.php?ticket=37712

        • Recommended base versions for storage services:
      • 17:00 17:30
        OSG Items 30m
        Speaker: Rob Quick (OSG - Indiana University)
        • Discussion of open tickets for OSG
          GGUS ticket
      • 17:30 17:35
        Review of action items 5m
        list of actions
      • 17:35 17:35
        Suggestion to use EVO rather than the CERN conferencing system in the future. We could use the EGEE community which exists in EVO: http://evo.caltech.edu/evoGate/