On Monday the SAM tests on all AAA machines went from completely green to constantly red. The failing test was 'federation', with an error showing it was unable to connect to the global redirector. ARC-CE tests also failed constantly over the same period because the associated 'xrootd-access' test, which uses the AAA machines, was failing. This came on top of the intermittent ARC-CE SAM test failures at submission, which have been ongoing for some months but became worse over Christmas / New Year / January.
Tom Birkett has been following up a suspected network/firewall problem with DI. There was suspicion that the intermittent ARC-CE test failures could be caused by this, along with many other observed problems, such as a variable number of CMS jobs running despite work being available, a lack of ATLAS jobs running, general slowness on Tier 1 machines, etc.
On Tuesday morning around 10:30 DI made a change, removing one port from a network component. After this, many or all of the above problems seemed miraculously fixed or improved immediately!
AAA tests went green; the ARC-CE xrootd-access test went green; intermittent submission failures are looking much better.
UPDATE, Wed morning: AAA tests went red again last night. Jyothish did some clean-up and restarts, and the tests are going green again.
Note that where CMS jobs have run, performance has in general been good, except that from Monday night into Tuesday there was a spike of Production job failures in which jobs attempted to read remotely and got a FileOpen error.
AAA OOM errors under high load are still to be followed up, as is the problem with svc20 continuously dropping its monitoring in Vande.
CMS / CERN IT jumbo frames testing ongoing all week.
/* Sorry, I am on leave on 15.01, so the data below may be outdated; it was valid as of the evening of 14.01 */
News:
Operational issues:
Low activity over Christmas
Networking between the Butler DB and the BatchFarm is an issue. As changes are being made to the network soon, LSST was moved to 2020 BF nodes on the new network last night so that they may have access to the Butler.
After draining old jobs this morning and disabling job types that were not affected, only the "DC2" jobs remain. These are long-running and none have finished so far (despite running for nearly 2 hours), because they remain on older nodes rather than the new ones specified.