Apache Mesos is a resource management system for large data centres, initially developed at UC Berkeley and now maintained under the umbrella of the Apache Software Foundation. It is widely used in industry by companies such as Apple, Twitter, and Airbnb, and is known to scale to tens of thousands of nodes. Together with other tools in its ecosystem, such as Mesosphere Marathon or Chronos, it provides an end-to-end solution for data centre operations and a unified way to exploit large distributed systems.
We present the experience of the ALICE Experiment Offline & Computing in deploying the Apache Mesos ecosystem and using it in production for a variety of tasks on a small 500-core cluster built from hybrid OpenStack and bare-metal resources.
We will first introduce the architecture of our setup and its operation. We will then describe the tasks it performs, including release building and QA, release validation, and simple Monte Carlo production.
We will show how we developed Mesos-enabled components (a.k.a. Mesos frameworks) to address ALICE-specific needs. In particular, we will illustrate our effort to integrate Work Queue, a lightweight batch processing engine developed by the University of Notre Dame, which ALICE uses to run release validation.
Finally, we will give an outlook on how to use Mesos as the resource manager for DDS, a software deployment system developed by GSI, which will be the foundation of system deployment for the ALICE next-generation Online-Offline (O2) system.
|Primary Keyword (Mandatory)|Cloud technologies|
|Secondary Keyword (Optional)|Computing middleware|