Up until September 2017, LHCb Online was running Puppet 3.5 in a non-redundant master/server architecture. As a result, we had problems with outages, both planned and unplanned, as well as with scalability (How do you run 3000 nodes at the same time? How do you even run 100 without bringing down the Puppet Master?). On top of that, Puppet 5.0 had been released, so we were now running two major versions behind!
As Puppet 4.9 was the de facto standard, something had to be done right away, so a quick, self-inflicted, three-week-long nonstop hackathon had to happen. This talk will cover the pitfalls, mistakes and architecture decisions we made when migrating our entire Puppet codebase, nearly from scratch, to a more modular one, addressing existing exceptions and anticipating future ones, all while our entire infrastructure was running in physics production and without causing a single outage. We will cover the mistakes we made in our Puppet 3 installation and how we eventually fixed them in order to lower catalogue compile time and reduce our overall codebase by around 50%.
We will also cover how we set up a quickly scalable Puppet core infrastructure (Masters, CAs, Foreman, etc.).