Update on last week's reported SAM test issues:
- Timeout failures on svc20 (AAA server) on Friday - Jyothish removed it from the cluster. Telegraf and Icinga were also down. Jyothish has a ticket with Fabric. - UPDATE: it was running out of RAM. Jyothish added the missing memory limits and reinstated it in the cluster just before the meeting.
- Network problems - continuing with several problem periods throughout the week.
- Following the network problems (item 2), the other AAA servers and the manager have failed the 'federation' test fairly consistently. Restarts of the usual services by Katy and Jyothish have not fixed it. - UPDATE: restarts on the UK redirector helped with this.
- ARC-CE xrootd-access test requires AAA. - Has not been a problem this week.
- New tokens tests for CEs are generally working, but the 'basic' test is in warning because jobs are landing almost entirely on 2018/9 WNs, which do not have IPv6 (Tom Birkett might comment). - UPDATE: CMS said they are OK with the test being yellow.
- 'Connection' test for Antares endpoints in warning due to no IPv6 - how are the tests for the new EOS nodes going? UPDATE: perf tests ongoing but some improvement.
CMS took advantage of other VOs dropping out and claimed a huge number of WNs over the weekend. In general job performance has been good, with just a couple of clear efficiency drops or failure spikes throughout the week.
Transfers:
Periods of excellent transfer rate to buffer and tape. Some 'file exists' errors, likely due to network disruption - Katy is investigating whether clean-up is necessary if the automatic mechanism is not effective.
Disk transfer failures with Echo as destination have calmed (the cause could have been at the other end of the transfer in any case). With Echo as source, errors still look bad - investigating.