(Universita e INFN (IT))
Over the last few years we have seen an increasing number of services and applications needed to manage and maintain cloud computing facilities. This is particularly true for computing in high energy physics, which often requires complex configurations and distributed infrastructures. In this scenario a cost-effective rationalization and consolidation strategy is the key to success in terms of scalability and reliability.
In this work we describe an IaaS (Infrastructure as a Service) cloud computing system with high-availability and redundancy features, which is currently in production at the INFN-Naples and ATLAS Tier-2 data centre.
The main goal we intended to achieve was a simplified method to manage our computing resources and deliver reliable user services, reusing existing hardware without incurring heavy costs.
The combined use of virtualization and clustering technologies allowed us to consolidate our services on a small number of physical machines, reducing electric power costs.
As a result of our efforts, we developed a complete solution for data and computing centres that can be easily replicated using commodity hardware.
Our architecture consists mainly of two subsystems: a clustered storage solution, built on top of disk servers running the Gluster file system, and a virtual machine execution environment. The hypervisor hosts run Scientific Linux with KVM as the virtualization technology, and host both Windows and Linux guests. Virtual machines have their root file systems on qcow2 disk-image files, stored on a Gluster network file system. Gluster performs parallel writes on multiple disk servers (two, in our system), thereby providing live replication of data. The failure of a disk server does not cause glitches in, or stop, any of the running virtual guests, as each hypervisor host still has full access to the disk-image files. When the failed disk server returns to normal activity, Gluster's integrated self-healing mechanism performs a transparent background reconstruction of the missing replicas.
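The replication and self-healing behaviour described above can be illustrated with a toy model. This is a minimal sketch, not the Gluster API: the class `ReplicatedVolume`, its methods and the server names `ds1` and `ds2` are hypothetical, and real self-healing tracks pending changes per file rather than copying everything from a peer.

```python
class ReplicatedVolume:
    """Toy model of a two-way replicated volume (illustrative only)."""

    def __init__(self, server_names):
        # One in-memory "brick" (path -> data) per disk server.
        self.bricks = {name: {} for name in server_names}
        self.online = {name: True for name in server_names}

    def write(self, path, data):
        # A write goes to every disk server that is currently reachable.
        live = [n for n in self.bricks if self.online[n]]
        if not live:
            raise IOError("no disk server available")
        for name in live:
            self.bricks[name][path] = data

    def read(self, path):
        # Any reachable replica can serve the read.
        for name, files in self.bricks.items():
            if self.online[name] and path in files:
                return files[path]
        raise FileNotFoundError(path)

    def fail(self, name):
        self.online[name] = False

    def recover(self, name):
        # On return to service, rebuild replicas before serving reads.
        self.self_heal(name)
        self.online[name] = True

    def self_heal(self, name):
        # Simplification: copy every file from the first healthy peer.
        for peer, files in self.bricks.items():
            if peer != name and self.online[peer]:
                self.bricks[name].update(files)
                break
```

In this model a guest writing to `vm1.qcow2` while `ds1` is down keeps working against `ds2`, and `recover("ds1")` brings the stale replica up to date.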
High availability is also achieved via a network configuration using redundant switches and multiple paths between hypervisor hosts and disk servers. Linux channel bonding provides adaptive load balancing of network traffic over multiple links, and dedicated VLANs guarantee isolation of the storage subsystem from the general-purpose network.
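As an illustration of the bonding setup, the sketch below renders Red Hat style `ifcfg` files for a bonded interface in `balance-alb` mode, which is the Linux bonding driver's adaptive load balancing mode. The device names, the address `192.0.2.10`, the `miimon` value and the helper functions themselves are assumptions made for the example, not the configuration used in production.

```python
def render_bond_ifcfg(device="bond0", ipaddr="192.0.2.10",
                      netmask="255.255.255.0",
                      mode="balance-alb", miimon=100):
    """Return the text of an ifcfg file for the bonded interface."""
    lines = [
        f"DEVICE={device}",
        "ONBOOT=yes",
        "BOOTPROTO=none",
        f"IPADDR={ipaddr}",
        f"NETMASK={netmask}",
        # miimon is the link-monitoring interval in milliseconds.
        f'BONDING_OPTS="mode={mode} miimon={miimon}"',
    ]
    return "\n".join(lines) + "\n"


def render_slave_ifcfg(device, master="bond0"):
    """Return the text of an ifcfg file for one physical link of the bond."""
    lines = [
        f"DEVICE={device}",
        "ONBOOT=yes",
        "BOOTPROTO=none",
        f"MASTER={master}",
        "SLAVE=yes",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_bond_ifcfg())
    for nic in ("eth0", "eth1"):  # hypothetical interface names
        print(render_slave_ifcfg(nic))
```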
We also developed a set of management scripts to easily perform basic system administration tasks such as automatic deployment of new virtual machines, adaptive scheduling of virtual machines on hypervisor hosts, live migration and automated restart in case of hypervisor failures.
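The scheduling and automated-restart logic of such scripts might look like the following sketch. It assumes a simple policy (place each guest on the live hypervisor with the most free memory) that is not taken from our scripts; the names `Hypervisor`, `pick_host`, `place` and `restart_guests_of` are hypothetical, and the actual start and migration of guests through the hypervisor is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Hypervisor:
    name: str
    total_mem_mb: int
    alive: bool = True
    guests: dict = field(default_factory=dict)  # guest name -> memory (MB)

    @property
    def free_mem_mb(self):
        return self.total_mem_mb - sum(self.guests.values())


def pick_host(hosts, mem_mb):
    """Choose the live hypervisor with the most free memory (assumed policy)."""
    candidates = [h for h in hosts if h.alive and h.free_mem_mb >= mem_mb]
    if not candidates:
        raise RuntimeError("no hypervisor has enough free memory")
    return max(candidates, key=lambda h: h.free_mem_mb)


def place(hosts, guest, mem_mb):
    """Record a guest on the chosen host and return the host name."""
    host = pick_host(hosts, mem_mb)
    host.guests[guest] = mem_mb
    return host.name


def restart_guests_of(failed, hosts):
    """Mark a hypervisor as failed and re-place its guests elsewhere."""
    failed.alive = False
    orphans, failed.guests = failed.guests, {}
    return {guest: place(hosts, guest, mem) for guest, mem in orphans.items()}
```

Because the disk images live on shared replicated storage, a guest can be restarted on any surviving hypervisor without copying its data, which is what makes a placement function this simple sufficient in principle.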
The work is organized as follows:
In the first part we identify the main requirements and the goals we want to achieve in terms of system reliability and availability. We then introduce a set of currently available open-source technologies and discuss the motivation for our choices. After that, we describe our cloud computing model: its architecture and its main features.
In the second part we present our implementation at INFN-Naples, describing the hardware, the network topology, the storage configuration and the migration of our services from physical machines to the cloud infrastructure. The description is accompanied by stress-test benchmark results and a technical analysis of system utilization during the last year.
In the last part we illustrate other possible application scenarios with a set of recommendations based on our local experience.