Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the biweekly ops & sites meeting. The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 126540 with code 4880.
Apologies: Rob Harper, Daniela, Alessandra.
Minutes
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO:
      - LHCb
      - CMS
      - ATLAS
        * Two new tickets last night for Durham and Lancaster: both SRM problems.
        * UCL is back online in production. The GOCDB bug was resolved last week and AGIS has been synchronised. The site tests are back in automatic mode. Analysis will be kept in static test mode until production has been stable for a few weeks.
        * cvmfs timeout problem affecting a few sites. This is a summary from Jakob of what has been found so far (an illustrative sketch of the drainout hypothesis follows after this item):
          o) It is not a problem with the network or the Squid proxies. The logs show no I/O errors, fail-over actions, or exceptionally long response times.
          o) The problem is not caused by mount races of any kind. cvmfs does not indicate that it has to wait to acquire its lock file.
          o) The problem is not related to automatic cleanups of the cache.
          o) It does not depend on a particular SL5 version.
          We have been able to trigger the problem by jumping back in time, although the problem also appears with correct system time. This makes me believe that the cause might be the cvmfs "drainout mode". When a new catalog is applied, cvmfs switches for 60 seconds to drainout mode, in which the Linux kernel caches are not used; this is necessary to avoid stale entries being served from the kernel caches. Perhaps there are circumstances that stop cvmfs from switching back out of drainout mode. Given the large number of stat() calls in asetup, missing kernel caches combined with other load on the system could push the running time up to the order of minutes. This is also in line with the fact that the problem arose at the time when the frequency of new repository revisions increased. I will look into the corresponding code spots. It remains baffling that the problem repeatedly appears on some, but not all, sites, and that I have not been able to reproduce it on one of our machines.
        * At last week's ADC meeting there was a summary of the Technical Interchange Meeting. Interesting points:
          * Plan to expand xrootd federations, building on the current US effort.
          * Slowly phase out SRM, although some MMS functionality is still needed at the T1s; MMS cannot be replaced as yet.
          * Add xrootd as a mandatory protocol together with gridftp, and do more tests with http.
          * ATLAS/CMS common analysis project using glide-ins.
          * Better integration with the SSB.
          * To reduce failures: increase the number of retries where possible (this requires better error diagnostics in Athena, and not only there), and simplify job recovery so that it is easier for sites to use (in fact it was not working).
      - Other
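      A minimal sketch of Jakob's drainout-mode hypothesis (in Python, purely illustrative; the class and its methods are assumptions for this sketch, not the real cvmfs C++ code). It models the 60-second kernel-cache bypass after a catalog switch, and the comments note how a deadline taken from the wall clock rather than a monotonic clock would be consistent with the reported time-jump trigger:

          import time

          DRAINOUT_SECONDS = 60  # cvmfs bypasses the kernel caches for this long after a catalog switch

          class CatalogState:
              """Toy model of cvmfs drainout mode (an assumption for illustration, not real cvmfs code)."""

              def __init__(self):
                  self.drainout_until = 0.0  # deadline after which kernel caching resumes

              def apply_new_catalog(self):
                  # A new repository revision arrived: serve without kernel caches
                  # for a while so that stale entries are never returned.
                  self.drainout_until = time.monotonic() + DRAINOUT_SECONDS

              def kernel_cache_enabled(self):
                  # If this deadline were computed from the wall clock (time.time())
                  # instead of a monotonic clock, a backwards time jump could keep
                  # drainout active far past 60 seconds; stat()-heavy workloads such
                  # as asetup would then slow down to the order of minutes.
                  return time.monotonic() >= self.drainout_until

      With a monotonic clock the mode always ends after 60 seconds; the open question in the minutes is what stops the real switch-back from happening.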
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
      - Tier-1 status
      - Accounting
      - Documentation
      - Interoperation (reminder of the monthly NGI discussion next Tuesday)
      - Monitoring
      - On-duty
      - Rollout
      - Security
      - Services
      - Tickets
      - Tools
      - VOs
      - Site updates
    • 11:40 12:00
      UK NGI - monthly discussion 20m
      * Helpdesk update - setting the site notification emails has been tested successfully. Proposal: sites now set the "in progress" status.
      * UKI decommissioning and dteam
      * Virtual sites (to cover services such as VA and VOMS)
      Slides
    • 12:00 12:05
      GDB review 5m
      If time permits. See bulletin summary.
    • 12:05 12:06
      AOB 1m