IT-ASDF: PuppetDB Outage post-mortem

Europe/Zurich
31/3-004 - IT Amphitheatre (CERN)


105
Videoconference
Zoom Meeting ID
63445832154
Description
IT Activities and Services Discussion Forum (ASDF)
Host
Jorge Garcia Cuervo
Alternative hosts
Charles Delort, Ismael Posada Trobo, Stefan Nicolae Stancu, Enrico Bocchi, Pablo Martin Zamora, Karolina Przerwa, Nikos Papakyprianou

PuppetDB Outage post-mortem and Summary of Impacted Services

28 - online
21 - in person

Minutes summary

  • For DBOD, there were two issues to be tackled. The first one was restoring the databases for Puppet manually, and the second was fixing the issue with ProxySQL.
  • The timeline presented for the incident's impact on Hadoop contained a mismatch in times, which has since been fixed.
  • Not only LanDB: every application behind SSO could have been affected, at least partially.
  • The infrastructure was prepared for some failure modes, but not for this particular one (an empty fact). Had it been, simply failing the Puppet run would have been sufficient. This safeguard has now been implemented.
  • Attendees were reminded of the security concerns around relying on the data stored in PuppetDB.
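
The empty-fact safeguard mentioned in the minutes can be illustrated with a minimal sketch. This is not the implementation deployed at CERN, which the minutes do not describe. It only shows the general idea of refusing to continue when a required fact is missing or empty, rather than proceeding with incomplete data. The function name `require_facts`, the exception `EmptyFactsError`, and the fact name `hostgroup` are hypothetical and chosen for illustration.

```python
class EmptyFactsError(RuntimeError):
    """Raised when a required fact is missing or empty (hypothetical name)."""


def require_facts(facts, required):
    """Return the required facts, or raise if any is missing or empty.

    `facts` is a plain dict of fact name -> value; `required` lists the
    fact names the caller cannot safely proceed without.
    """
    facts = facts or {}  # treat None the same as "no facts at all"
    missing = [
        name for name in required
        if facts.get(name) in (None, "", [], {})
    ]
    if missing:
        # Fail loudly instead of continuing with incomplete data.
        raise EmptyFactsError(
            "missing or empty facts: " + ", ".join(sorted(missing))
        )
    return {name: facts[name] for name in required}


if __name__ == "__main__":
    # "hostgroup" is an illustrative fact name, not taken from the minutes.
    print(require_facts({"hostgroup": "example/group"}, ["hostgroup"]))
    try:
        require_facts({"hostgroup": ""}, ["hostgroup"])
    except EmptyFactsError as err:
        print("run aborted:", err)
```

The design choice is to abort early: a run that stops with a clear error is easier to recover from than one that applies configuration derived from empty data.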
There are minutes attached to this event.