IT-ASDF: Post-mortem of Datacenter network switch down, partial SSO unavailability and CERNphone service down

Europe/Zurich
513/1-024 (CERN)

Emil Kleszcz (CERN), Roberto Valverde Cameselle (CERN)
Videoconference
Zoom Meeting ID
63445832154
Description
IT Activities and Services Discussion Forum (ASDF)
Host
Jorge Garcia Cuervo
Alternative hosts
Charles Delort, Karolina Przerwa, Stefan Nicolae Stancu, Enrico Bocchi, Nikos Papakyprianou, Pablo Martin Zamora, Ismael Posada Trobo

ASDF 18/07/24 - Minutes

  • Participants: 51 (25 on-site, 26 online)

Postmortem of Datacenter network switch down (OTG0149898)

Speaker: Daniele Pomponi

  • No questions / comments

Postmortem of SSO (Single Sign-On) partially unavailable (OTG0149903)

Speakers: Paul Van Uytvinck, Sebastian Lopienski

Q - Do you have a map of dependencies of SSO on external services?
A - The service is hosted on Kubernetes, not in the main cluster but in a separate one managed by the IT-PW-ARW section, where EDMS, EDH, and other administrative applications are also hosted. Other dependencies include databases, such as the main Keycloak database (HA MySQL), and Active Directory.

Q - Which kind of critical services depend on SSO?
A - Probably many services depend on SSO, including many critical ones. The impact depends on how each critical service handles SSO unavailability. We are aware of many applications that we consider sensitive or critical, and we know, for example, whom to contact when announcing interventions. However, there could be non-critical services that are used for critical workflows without our being aware of it.

Comment - Service dependencies are something the CERN-wide Enterprise Architecture Team is working on (with Tim Bell for IT). FYI, they are using a tool called ABACUS, which should be able to show dependencies in the future.
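
Editor's illustration - The dependency question above can be made concrete with a small graph walk. The following is a minimal sketch in Python, not a description of ABACUS or of any existing CERN tooling. The SSO edges are taken from the answer given in the discussion. The entry "example-application" is hypothetical and stands in for any service that authenticates through SSO. The function name affected_by is likewise an assumption made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Edges read "service -> components it depends on".
DEPENDS_ON: Dict[str, List[str]] = {
    # Dependencies named in the discussion above.
    "SSO": [
        "Kubernetes cluster (IT-PW-ARW)",
        "Keycloak main database (HA MySQL)",
        "Active Directory",
    ],
    # Hypothetical entry: any application that logs users in via SSO.
    "example-application": ["SSO"],
}


def affected_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the graph: component -> services that depend on it.
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    # Depth-first walk over the inverted graph.
    affected: Set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                stack.append(service)
    return affected
```

For example, affected_by("Active Directory", DEPENDS_ON) reports both SSO and the hypothetical application. This is the kind of indirect dependency the second question was asking about.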

Postmortem of CERNphone services down (OTG0150847, OTG0150866)

Speaker: Ihor Olkhovskyi

Incident OTG0150847

Q - Was the issue spotted in the development environment during the CentOS migration and the transition from Python 2 to Python 3?
A - The issue happened in production during the first week after the migration. It was not spotted during development because, although we have functional tests, no stress testing was involved.

Incident OTG0150866

Q - Do you rely on DHCP for the floating IP failover?
A - No, we use keepalived. A single floating interface moves from one server to the other; DHCP is not involved. keepalived runs a set of checks that a server must pass to be considered operational. If the checks fail, the IP address is simply announced from the other server, which is already ready to take over.
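
Editor's illustration - The failover logic described in this answer can be modelled in a few lines. The following is a simplified Python sketch of check-driven failover, not keepalived's actual VRRP implementation and not the CERNphone configuration. The server names, priorities, and check functions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Server:
    name: str
    priority: int  # a higher value is preferred as holder of the floating IP
    checks: List[Callable[[], bool]] = field(default_factory=list)

    def is_healthy(self) -> bool:
        # The server counts as operational only if every check passes.
        return all(check() for check in self.checks)


def elect_holder(servers: List[Server]) -> Optional[Server]:
    """Return the healthy server with the highest priority, or None if none is healthy."""
    healthy = [s for s in servers if s.is_healthy()]
    return max(healthy, key=lambda s: s.priority, default=None)


# Hypothetical two-node setup; the dict stands in for a real health check.
state: Dict[str, bool] = {"primary_ok": True}
primary = Server("server-a", priority=150, checks=[lambda: state["primary_ok"]])
backup = Server("server-b", priority=100, checks=[lambda: True])

holder_before = elect_holder([primary, backup])  # primary holds the floating IP
state["primary_ok"] = False                      # a check on the primary starts failing
holder_after = elect_holder([primary, backup])   # the backup takes over
```

The backup never needs to request an address. It is already configured and only starts announcing the floating IP once the preferred server stops passing its checks, which is why DHCP plays no role.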

Q - Could you add some service-side checks to detect this misconfiguration?
A - We haven’t considered it, but this is something you do only once, so it does not need to be checked continuously.
Comment - Maybe this is something to document better in the server installation procedures.
Comment - These servers are outside the datacenter (they are in B58 and B866) and were not installed through the normal datacenter registration process. This would rarely happen with normal datacenter installations, but in this case the nodes were configured manually.
