Operations team & Sites

EVO - GridPP Operations team meeting

- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 78425 with code: 4880.

Apologies: Sam, David, Mark, Stuart W, David Colling, Duncan, Elena
    • 11:00 AM - 11:20 AM
      Meetings & updates 20m
      - EGI updates
        UPDATE 30 for gLite 3.2 is now ready for production use. The priority of the updates is: Normal. The highlights of the update are:
          - New version of glite-BDII_top
          - New version of glite-CREAM
          - New version of glite-LB
          - New version of glite-SGE_utils
        All details of the update can be found in: http://glite.cern.ch/R3.2/sl5_x86_64/updates/30/
      - ROD team update
      - Nagios status
        -- Note Steve Lloyd's email about his SAM pages. Who uses them?
        Problem yesterday, "WN-RepCr SAM test failing across UK":
        "The problem with gridppnagios has been fixed and sites which have failed this test should be OK within an hour. A little detail about the problem. We are changing network switches at Oxford site and made sure that gridppnagios and storage-monit.physics.ox.ac.uk which is used as primary storage for replication should not be affected but I missed the point that without site bdii jobs at WN would not be able to locate storage-monit machine. I am also using heplnx204.pp.rl.ac.uk as backup replication storage server but unfortunately it started failing for some other reason. I removed both storage server and added two new SEs and I have checked that it is working".
      - Tier-1 update
      - Security update
      - WLCG update: A new WLCG Technology Evolution Work Group is being formed with Markus Schulz and Jeff Templon as chairs:
        “The overall goal is to ensure the long term support of the LHC community use cases, taking into account experiments, sites, and operational needs. Reducing where possible complexity and manpower needs for users, sites and developers. Improving functionality and performance where needed…. to define the vision for evolution according to the WLCG collaboration, and secondly to coordinate work being done… The group will cover topics such as: Security Model, Job Management, Virtualization, Data Management, Data Access, Information and Service Discovery etc.
        To get started we ask the Computing Coordinators to nominate for their experiments a permanent member and deputy. We will try to identify suitable site delegates, security and operations watchdogs.”
      -- T2 issues
        Please check the site data here under "Tier-2": http://wlcg-rebus.cern.ch/apps/topology/
        - Specific question for Peter/Durham: Is 1920 logical CPUs correct?
        - Several sites still publishing "EGEE"
      -- General notes.
        - GGUS summary for UKI of open tickets: http://tinyurl.com/6a93yme
        - 10 red tickets (5 on hold)
    • 11:20 AM - 11:40 AM
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb
      - CMS
      - ATLAS
      - Other
      - Experiment blacklisted sites
      - Experiment known events affecting job slot requirements
      - Site performance/accounting issues
      - Metrics review
    • 11:40 AM - 11:55 AM
      Open discussion 15m
      Some areas that could be covered:
      - glexec issues (https://gridppnagios.physics.ox.ac.uk/myegi/history/) [click simple/advance filter: select glexec from profile tab]. Today it shows RHUL; Liverpool; Brunel and Oxford. Last week we had 9 sites!?
      - perf-sonar work (http://tinyurl.com/6a7dshg)
      - topics we want explored at the WLCG workshop in July (https://indico.desy.de/conferenceTimeTable.py?confId=4019#all)
    • 11:55 AM - 12:00 PM
      Actions 5m
      - http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
    • 12:00 PM - 12:01 PM
      AOB 1m