Review of weekly issues by experiment/VO
- LHCb
- CMS
- ATLAS
* Two new tickets last night for Durham and Lancaster: both are SRM problems.
* UCL is back online in production. The GOCDB bug was resolved last week and AGIS has been synchronised. The site tests are back in automatic mode. Analysis will be kept in static test mode until production has been stable for a few weeks.
* cvmfs timeout problem affecting a few sites: this is a summary from Jakob of what has been found so far.
o) It is not a problem of the network or the Squid proxies. The logs do not show any I/O errors, fail-over actions, or exceptionally long response times.
o) The problem is not caused by mount races of any kind. Cvmfs does not indicate that it has to wait to acquire its lock file.
o) The problem is not related to automatic cleanups of the cache.
o) It does not depend on a particular SL5 version.
We have been able to trigger the problem by jumping back in time, although the problem also appears with correct system time. This makes me believe that the cause of the problem might be the cvmfs "drainout mode". When a new catalog is applied, cvmfs switches for 60 seconds to "drainout mode", in which the Linux kernel caches are not used. This is necessary to avoid serving stale entries from kernel caches. Perhaps there are circumstances that stop cvmfs from switching back from "drainout mode". Due to the large number of stat() calls in asetup, missing kernel caches combined with other load on the system might increase the running time to the order of minutes. This is also in line with the fact that the problem arose at the time when the frequency of new repository revisions increased. I will look into the corresponding code spots.
It remains baffling that the problem appears repeatedly on some, but not all, sites, and I have not been able to reproduce it on one of our machines.
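A site wanting to check whether repeated stat() calls are being served from the kernel caches could time several passes over the same set of files. The following is only a minimal sketch, not part of cvmfs or asetup: the function names, the file limit and the number of passes are assumptions made for illustration. If later passes are not clearly faster than the first one, that would be consistent with (but not proof of) the caches not being used.

```python
import os
import time


def collect_paths(root, limit=5000):
    """Gather up to `limit` file paths below `root` (limit is an arbitrary choice)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
            if len(paths) >= limit:
                return paths
    return paths


def time_stat_pass(paths):
    """Return the wall-clock seconds needed to stat() every path once."""
    start = time.perf_counter()
    for path in paths:
        try:
            os.stat(path)
        except OSError:
            # A vanished or unreadable file still exercises the lookup path.
            pass
    return time.perf_counter() - start


def stat_timings(root, passes=3, limit=5000):
    """Return (number of files, list of per-pass timings in seconds)."""
    paths = collect_paths(root, limit)
    timings = [time_stat_pass(paths) for _ in range(passes)]
    return len(paths), timings
```

Calling `stat_timings(root)` with `root` set to a directory inside the mounted repository on an affected worker node, and again on an unaffected one, would give a rough comparison of first-pass and repeat-pass stat() times.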
* At the ADC meeting last week there was a summary of the Technical Interchange Meeting.
Interesting points were about
* Plan to expand xrootd federations building on the current US effort
* Slowly phase out SRM, although some MMS functionalities needed at T1s cannot be replaced as yet.
* Add xrootd as mandatory protocol together with gridftp and do more tests with http
* ATLAS/CMS common analysis project using glide-ins
* Better integration with SSB
* To reduce failures
* Increase the number of retries where possible; this requires better error diagnostics in Athena (and not only there).
* Simplify job recovery so that it's easier for sites to use (it wasn't in fact working)
- Other