The online computing environment at STAR has generated demand for high availability of services (HAS) and a resilient uptime guarantee. Such services include databases, web servers, and storage systems that users and sub-systems rely on for their critical workflows. Deploying services directly on bare metal creates a problem if the underlying hardware fails or loses connectivity. Additionally, configuring redundant fail-over nodes (a secondary Web service, for example) requires constant syncing of configuration and content, and sometimes manual intervention to switch hardware (changing a DNS name, for instance). Beyond those uses, and within any computing environment, over-provisioned systems with unused CPU and memory resources could be put to use with container or virtualization technologies. Our presentation will describe how to achieve HAS using open-source packages and report our experience in doing so.
We will focus on two tools: oVirt and Ceph. For Ceph, we have presented at past conferences our testing experience, our attempts at performance improvement, and the status of our production system. Since Ceph is growing in popularity as a distributed storage system, it appeared natural to leverage our experience with it for this project. oVirt is an open-source virtualization management application that enables central management of the hardware nodes, storage, and network resources used to deploy and monitor virtual machines. It supports the deployment of a virtual environment for a data center through automatic provisioning, live migration, and the ability to easily scale the number of hypervisors. oVirt also supports multiple storage technologies, so that virtual machines, images, and templates can be stored within one or several storage systems. Since STAR’s recent efforts have focused on deploying a POSIX-compliant CephFS distributed storage system, we aim to couple our Ceph storage with the oVirt virtualization management system.
When designing an intricate system for hosting critical services, eliminating single points of failure is a requirement. This work will test the viability of such an approach, with test cases covering high availability, live migration, and service on demand.