Review of weekly issues by experiment/VO
- LHCb
- CMS
- ATLAS
UKI-NORTHGRID-GLASGOW:
Problems with the aircon over the weekend. The site was put in downtime and the system set the panda queues and storage offline.
Job Recovery:
Recovery has now been tested at both RAL and Lancaster and has caused no problems. Sites have to create a suitable space on the WN and monitor it with tmpwatch to delete data that has become too old. To avoid hardcoding the path in panda schedconfig, they can set an env var pointing at the recovery path. Once this is done, they should contact cloud support so that recovery can be enabled in schedconfig. More information is in Alaistair's email and slides:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1207&L=TB-SUPPORT&F=&S=&P=107128
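As a rough illustration of the cleanup step described above, the sketch below does in Python roughly what a tmpwatch cron job would do on the recovery area. The variable name ATLAS_RECOVERY_DIR and the 48 hour threshold are assumptions made for the example; the minutes do not say what the env var is called or how old the data may get. Unlike tmpwatch, which looks at access time by default, this sketch uses modification time.

```python
import os
import time
from pathlib import Path

# Hypothetical name: the minutes do not say what the env var is called.
RECOVERY_ENV_VAR = "ATLAS_RECOVERY_DIR"


def stale_files(root, max_age_hours, now=None):
    """Return regular files under root not modified within max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and not p.is_symlink() and p.stat().st_mtime < cutoff
    )


def clean_recovery_dir(max_age_hours=48, dry_run=True, now=None):
    """Delete stale files from the directory named by the env var.

    Returns the list of stale files found. With dry_run=True nothing
    is deleted, so the result can be inspected first.
    """
    root = os.environ.get(RECOVERY_ENV_VAR)
    if not root or not Path(root).is_dir():
        return []  # variable unset or path missing: nothing to do
    old = stale_files(root, max_age_hours, now)
    if not dry_run:
        for path in old:
            path.unlink()
    return old
```

Calling clean_recovery_dir(dry_run=True) first shows what would be removed; sites already running tmpwatch from cron do not need anything like this.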
Memory leaks and how to control them:
I wrote a post on how to set the limits without killing everything off. The most important one is the limit on vmem, because torque will kill jobs that exceed their allocated vmem. The limit on mem is enforced by torque only when a job arrives with memory requirements, not when the job exceeds them; to keep the memory in check you need to set pmem.
http://northgrid-tech.blogspot.co.uk/2012/07/atlas-jobs-with-memory-leaks-containment.html
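For orientation only, the sketch below builds qmgr commands that set default vmem and pmem values on a torque queue. It does not reproduce the settings from the post: the queue name "atlas" and the values 4gb and 2gb are placeholders, and setting the limits as queue defaults is an assumption about how a site might apply them.

```python
def torque_limit_commands(queue, vmem, pmem):
    """Build qmgr commands that set default vmem and pmem on a queue."""
    settings = [
        ("resources_default.vmem", vmem),  # per-job virtual memory
        ("resources_default.pmem", pmem),  # per-process physical memory
    ]
    return [
        f'qmgr -c "set queue {queue} {attr} = {value}"'
        for attr, value in settings
    ]


# Placeholder queue name and values, for illustration only.
for command in torque_limit_commands("atlas", "4gb", "2gb"):
    print(command)
```

The commands are only printed, not executed, so they can be reviewed against the post before anything is applied on the batch server.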
- Other