AAA was clearly struggling with data accesses from remote sites. SAM tests were intermittently red and other sites raised complaints. The files being accessed were 'premix library' files that are usually accessed only from CERN or FNAL. Katy found that these files had been stored on Antares, which is normal, but of course the only way they get to Antares is via Echo (multihop). By design, Rucio does not necessarily delete these intermediate copies from Echo immediately, and jobs around Europe then start using them. This usage (popularity) may even reduce the chance of the files being removed to make way for other files arriving on Echo.
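As a quick way to check where such premix files currently sit, something like the sketch below could be run with the Rucio Python client (the scope, filename and RSE names are purely illustrative, not a real premix DID):

    # Sketch only: list where replicas of a suspect premix file currently live.
    # The scope/name below are placeholders, not a real premix DID.
    from rucio.client import Client

    client = Client()
    did = {'scope': 'cms', 'name': '/store/mc/Premix/example_premix_file.root'}  # hypothetical

    for rep in client.list_replicas([did]):
        # 'rses' maps each RSE holding a copy (e.g. the Echo and Antares RSEs) to its PFNs
        print(rep['name'], list(rep['rses'].keys()))

    # The rules attached to the file show why a copy is (or is not yet) eligible for deletion
    for rule in client.list_did_rules(scope=did['scope'], name=did['name']):
        print(rule['rse_expression'], rule['state'])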
Jyothish raised the throttle limit on the AAA gateways (previously the throttling on these machines was very restrictive, allowing only 100MB/s in total across the 3 gateways) and everything looks much better now.
Job performance is commensurate with other T1s, although there has been a drop in performance. Yet again CMS are running jobs that use only 1 core while requesting 8 - I complained again.
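A sketch of how such jobs could be spotted on the batch farm, assuming the htcondor Python bindings are available and that the schedd history keeps the usual RequestCpus / RemoteUserCpu / RemoteSysCpu / RemoteWallClockTime attributes (the owner name in the constraint is a placeholder):

    # Sketch: flag recently completed jobs that requested 8 cores but used roughly 1.
    import htcondor

    schedd = htcondor.Schedd()  # default local schedd
    attrs = ["ClusterId", "ProcId", "RequestCpus",
             "RemoteUserCpu", "RemoteSysCpu", "RemoteWallClockTime"]

    # "cms_pilot" is a placeholder for the actual CMS pilot account name
    constraint = 'RequestCpus == 8 && Owner == "cms_pilot"'

    for ad in schedd.history(constraint, attrs, match=500):
        wall = ad.get("RemoteWallClockTime", 0)
        if not wall:
            continue
        cores_used = (ad.get("RemoteUserCpu", 0) + ad.get("RemoteSysCpu", 0)) / wall
        if cores_used < 1.5:  # well below the 8 cores requested
            print(ad["ClusterId"], ad["ProcId"], round(cores_used, 2))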
CE tokens seem to be passing SAM tests quite well now, but I'm told there are still problems at the test end.
Mini-DC for CMS UK sites next week - the tests are aimed at Tier 2s, but the Tier 1 will be used as a source and sink, taking care not to put unnecessary pressure on the Tier 1 at this stage. (ATLAS already started their tests this week and will continue next week.)
TO DO: Get RAL-FTS configured for CMS token access.
INCIDENTS
Last week's call-outs in Opsgenie
Call-outs over the weekend:
No call-outs over the weekend.
There was a “Tier 1 External connectivity issue between Opsgenie and Icinga” this morning (02/12/24); however, this was the result of DI re-doing their network intervention from Wednesday (27/11/24).
Call-outs:
There were a number of call-outs on the morning of 27/11/24 as a result of the failed DI network intervention (see report below).
No other call-outs over the week.
Antares
Despite the network problem on 27/11/24, the Antares reboot intervention was completed without issue.
Batch Farm
Echo
There appear to have been some issues with the CMS AAA service; @Katy Ellis was reporting “intermittent but significant numbers”.
image-20241202-101554.png
In the absence of @Thomas, Jyothish (STFC,RAL,SC) (who was on leave), @Brian Davies performed a reboot of gw10 and gw11 to pick up any updates. This did initially appear to help, but issues soon reappeared.
image-20241202-101942.png
The plot above shows the number of connections, which indicates dramatically higher usage over the last few weeks. The number of connections seems high for the throughput, which remains around 100MB/s across the 3 gateways.
image-20241202-102253.png
More recently the number of connections has reduced, although it is still significantly above the historical average. Memory usage has increased, which is not yet understood.
Network
On Wednesday 27th November between 08:05 and 09:15 there was a network outage as a result of a firewall problem caused when a scheduled upgrade went wrong. This had a mixed impact on the Tier-1: many routes bypass the firewall (e.g. LHCOPN and LHCONE), but a significant amount of control traffic appears to still go via the firewall, and spikes of failures were seen by the VOs.
“During a routine Firewall upgrade this morning (27 November) we experienced significant networking issues. These started during the regular at risk period, but continued until approximately 09:15.
The team worked quickly to resolve the issues as soon as they became apparent and we believe them to now be resolved (reports from across campus are that service is now restored) but we will continue to monitor the situation. The network connection to the Internet is currently running at risk and will continue to be so until we are able to restore both firewalls in a stable manner. Additional work will be needed and this will be done out of hours and with notice.
These issues will have affected all network traffic to and from the RAL site, including VPN access from offsite.”
“Following the disruption to the RAL network on Wednesday 27th November, I wanted to provide you with an update on the causes and next steps.
The interruption happened following an upgrade of the firewalls connecting the RAL campus to the internet. During the upgrade, one of the two firewalls failed to complete successfully and as a result caused instability across the pair. This was resolved when the firewall that had failed was disconnected. At that stage, the network was stable, but was running at risk due to having no fallback option if the remaining firewall failed.
Since Wednesday we have successfully upgraded the disconnected firewall and on Monday 2nd December at 0730 we will be reconnecting the firewalls to bring back resilience. We anticipate that this should not have any impact but will be carrying it out early morning to minimise any potential disruption.”
The intervention on the morning of 02/12/24 appears to have gone without issue.
Martin Bly 9:56 AM
From DI, apropos the firewalls: "The change earlier this morning was successful and the firewalls are now working properly again. They've remained stable since, so the change freeze is now lifted."