30 January 2017 to 1 February 2017
SURFSara
Europe/Zurich timezone

Automated error handling at Dropbox

31 Jan 2017, 11:40
40m
Amsterdam (SURFSara)

Science Park

Speakers

Karoly Nagy (Dropbox Inc.), Maxim Bublis (Dropbox Inc.)

Description

At Dropbox, with thousands of MySQL servers, failures such as hardware errors are normal, not exceptional. Not a day passes without at least one server being replaced because of some kind of hardware error. Our on-call engineers are not alerted for these failures; they are alerted if the automation itself is not working properly.
This kind of automation is harder with stateful systems, so we wrote a general framework for it called Wheelhouse. In this framework, state machines describe the good states of a system and the transition steps between them.
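
To make the state-machine idea concrete, here is a minimal sketch in Python. It is not Wheelhouse code: the state names, step names, and the `TRANSITIONS` table are all hypothetical, chosen only to illustrate how good states and the steps between them might be written down.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states for a single database server.
    HEALTHY = "healthy"
    FAILED = "failed"
    REPLACING = "replacing"


# Hypothetical transition table: for each state, the steps that may be taken
# and the state each step leads to. These names are illustrative only and
# are not taken from Wheelhouse.
TRANSITIONS = {
    State.HEALTHY: {"detect_hardware_error": State.FAILED},
    State.FAILED: {"allocate_replacement": State.REPLACING},
    State.REPLACING: {"finish_restore": State.HEALTHY},
}


def step(state, action):
    """Return the next state, or raise if the step is not allowed from here."""
    try:
        return TRANSITIONS[state][action]
    except KeyError:
        raise ValueError(
            f"{action!r} is not a valid step from {state.name}"
        ) from None


def run(state, actions):
    """Apply a sequence of steps in order and return the final state."""
    for action in actions:
        state = step(state, action)
    return state
```

In this sketch, a step that is not listed for the current state is rejected rather than silently ignored, which is one plausible way for automation to notice that something unexpected happened and alert a human.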
In this talk we will show the following:

  • What happens to a slave when it has a hardware error
  • What happens to a master when it has a hardware error
  • What happens when we want to upgrade kernels
  • How we use this framework to coordinate schema changes between shards
  • How we use this framework to verify data consistency

Primary authors

Karoly Nagy (Dropbox Inc.), Maxim Bublis (Dropbox Inc.)
