At Dropbox, with thousands of MySQL servers, failures such as hardware errors are normal, not exceptional. Not a day goes by without replacing at least one server because of some kind of hardware error. Our on-call engineers are not alerted for these failures; they are alerted when the automation itself is not working properly.
This kind of automation is harder for stateful systems, so we wrote a general framework for it called Wheelhouse. In this framework, state machines describe the good states of a system and the transition steps between them.
In this talk we will show the following:
- What happens to a slave when it has a hardware error
- What happens to a master when it has a hardware error
- What happens when we want to upgrade kernels
- How we use this framework to coordinate schema changes across shards
- How we use this framework to verify data consistency
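
To make the state-machine idea concrete, here is a minimal sketch of how good states and transition steps could be modeled for a slave (replica) with a hardware error. It is not Wheelhouse's actual API: the class `StateMachine`, the state names, the step functions, and the host name are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Host = Dict[str, str]
Step = Callable[[Host], None]


@dataclass
class StateMachine:
    """Hypothetical sketch: a set of good states plus one transition step per other state."""
    good_states: Set[str]
    transitions: Dict[str, Tuple[Step, str]] = field(default_factory=dict)

    def add_transition(self, source: str, step: Step, target: str) -> None:
        self.transitions[source] = (step, target)

    def run(self, host: Host, max_steps: int = 10) -> List[str]:
        """Apply transition steps until the host is in a good state; return the visited states."""
        visited = [host["state"]]
        for _ in range(max_steps):
            if host["state"] in self.good_states:
                return visited
            if host["state"] not in self.transitions:
                # No automated way forward: this is where a human would be alerted.
                raise RuntimeError(f"no transition from state {host['state']!r}")
            step, target = self.transitions[host["state"]]
            step(host)
            host["state"] = target
            visited.append(target)
        if host["state"] in self.good_states:
            return visited
        raise RuntimeError("host did not reach a good state")


# Hypothetical transition steps; real steps would call out to provisioning and MySQL tooling.
def drain(host: Host) -> None:
    host["serving"] = "no"


def provision_replacement(host: Host) -> None:
    host["hardware"] = "new"


def resume_replication(host: Host) -> None:
    host["serving"] = "yes"


def build_replica_machine() -> StateMachine:
    """Assumed states for a replica: hardware_error -> drained -> reprovisioned -> healthy."""
    machine = StateMachine(good_states={"healthy"})
    machine.add_transition("hardware_error", drain, "drained")
    machine.add_transition("drained", provision_replacement, "reprovisioned")
    machine.add_transition("reprovisioned", resume_replication, "healthy")
    return machine


if __name__ == "__main__":
    host = {"name": "db-example-1", "state": "hardware_error"}
    print(build_replica_machine().run(host))
```

In this sketch, a host stuck in a state with no registered transition raises an error instead of being retried silently, which mirrors the idea that engineers are involved only when the automation cannot make progress.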