INCIDENTS

Last week's callouts in Opsgenie

Callouts over the weekend:

No call-outs over the weekend.

There was a “Tier 1 External connectivity issue between Opsgenie and Icinga” this morning (02/12/24); however, this was a result of DI re-doing their network intervention from Wednesday (27/11/24).

Callouts:

There were a number of call-outs on the morning of 27/11/24 as a result of the failed DI network intervention (see report below).

No other call-outs over the week.

Antares

Despite the network problem on 27/11/24, the Antares reboot intervention was completed without issue.

Batch Farm

 

Echo

There appear to have been some issues with the CMS AAA Service; @Katy Ellis was reporting “intermittent but significant numbers”.
[Image: image-20241202-101554.png]

In the absence of @Thomas, Jyothish (STFC,RAL,SC) (who was on leave), @Brian Davies performed a reboot of gw10 and gw11 to pick up any pending updates. This did initially appear to help; however, the issues soon reappeared.
[Plot of the number of connections to the gateways: image-20241202-101942.png]

The plot above shows the number of connections, which indicates dramatically higher usage in the last few weeks. The number of connections seems high for the throughput, which remains around 100 MB/s across the three gateways.

 
[Image: image-20241202-102253.png]

More recently the number of connections has reduced, although it remains significantly above the historical average. Memory usage has also increased, which is not yet understood.
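
For reference, a minimal sketch of how the connection counts in the plots could be cross-checked directly on one of the gateways (e.g. gw10 or gw11). The port number (1094, the XRootD default) and the use of psutil are assumptions for illustration, not part of the existing monitoring:

    # Sketch: count ESTABLISHED TCP connections to the assumed XRootD port,
    # grouped by remote IP, to spot unusually chatty clients.
    import collections
    import psutil  # assumed to be available on the gateway

    XROOTD_PORT = 1094  # assumption: default XRootD port

    def count_gateway_connections(port=XROOTD_PORT):
        per_client = collections.Counter()
        for conn in psutil.net_connections(kind="tcp"):
            if (conn.status == psutil.CONN_ESTABLISHED
                    and conn.laddr and conn.laddr.port == port):
                per_client[conn.raddr.ip if conn.raddr else "unknown"] += 1
        return per_client

    if __name__ == "__main__":
        clients = count_gateway_connections()
        print(f"Total established connections: {sum(clients.values())}")
        for ip, n in clients.most_common(10):
            print(f"{ip:>40}  {n}")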

Network

On Wednesday 27th November, between 08:05 and 09:15, there was a network outage as a result of a firewall problem caused when a scheduled upgrade went wrong. This had a mixed impact on the Tier-1: many routes bypass the firewall (e.g. LHCOPN and LHCONE), but a significant amount of control traffic still appears to go via the firewall, and spikes of failures were seen by the VOs.

The announcement sent out on the morning of the incident read:

“During a routine Firewall upgrade this morning (27 November) we experienced significant networking issues. These started during the regular at risk period, but continued until approximately 09:15.

The team worked quickly to resolve the issues as soon as they became apparent and we believe them to now be resolved (reports from across campus are that service is now restored) but we will continue to monitor the situation. The network connection to the Internet is currently running at risk and will continue to be so until we are able to restore both firewalls in a stable manner. Additional work will be needed and this will be done out of hours and with notice.

These issues will have affected all network traffic to and from the RAL site, including VPN access from offsite.”

A follow-up message later in the week explained:

“Following the disruption to the RAL network on Wednesday 27th November, I wanted to provide you with an update on the causes and next steps.

The interruption happened following an upgrade of the firewalls connecting the RAL campus to the internet. During the upgrade, one of the two firewalls failed to complete successfully and as a result caused instability across the pair. This was resolved when the firewall that had failed was disconnected. At that stage, the network was stable, but was running at risk due to having no fallback option if the remaining firewall failed. 

Since Wednesday we have successfully upgraded the disconnected firewall and on Monday 2nd December at 0730 we will be reconnecting the firewalls to bring back resilience. We anticipate that this should not have any impact but will be carrying it out early morning to minimise any potential disruption.”

The firewall intervention on the morning of 02/12/24 appears to have gone without issue.

    Martin Bly  9:56 AM
    From DI, apropos the firewalls:  "The change earlier this morning was successful and the firewalls are now working properly again. They've remained stable since, so the change freeze is now lifted."