Giacinto Donvito (INFN), Vincenzo Spinoso (INFN)
INFN-Bari is involved in PRISMA and RECAS, two national projects that aim, respectively, to set up an OpenStack-based cloud infrastructure for public administration and scientific data analysis, and to upgrade the computing resources to a new T1-sized infrastructure. As Bari is also a T2 for the CMS and ALICE experiments, one of the main goals is to set up the cloud resources so that they can be used for high energy physics: the PaaS+IaaS platform provided by PRISMA will be installed on the RECAS resources at the beginning of 2015, providing about 2000 additional cores that will act as regular worker nodes. This makes it essential to rethink the whole monitoring infrastructure in order to obtain an elastic, scalable and automatic setup. This work presents a new setup for monitoring the cloud resources: in particular, it reports the availability of the underlying IaaS infrastructure and the status of all the IaaS/PaaS services running on the OpenStack testbed throughout the life of each virtual machine. The history of all the sensors related to the infrastructure is also available, together with graphs. Finally, users are provided with "Monitoring as a Service" features: they can instantiate a service of their choice together with a machine that monitors the service and shows its status to the user. The monitoring infrastructure is based on Zabbix, a powerful and flexible monitoring tool, together with Ceilometer and the APIs provided by OpenStack itself. We also show how Zabbix serves the monitoring needs of the rest of the farm, which is set up with a classic batch system, cluster file system and grid middleware. All the nodes have sensors monitoring the services they run; the network infrastructure is also monitored for topological loops, high packet collision rates and general unavailability.
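To make the notion of a topological loop concrete, the following minimal sketch flags the links that close a cycle in an undirected graph of network links, using a union-find structure. It is an illustration only, not the check used in the Bari setup: the function name `find_loop_links`, the switch names and the representation of links as pairs of device names are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]  # hypothetical representation: (device_a, device_b)


def find_loop_links(links: Iterable[Link]) -> List[Link]:
    """Return the links that close a cycle in an undirected link graph."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    loop_links: List[Link] = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this link closes a loop.
            loop_links.append((a, b))
        else:
            parent[root_a] = root_b
    return loop_links


if __name__ == "__main__":
    # Hypothetical switch names, for illustration only.
    ring = [("sw1", "sw2"), ("sw2", "sw3"), ("sw3", "sw1")]
    print(find_loop_links(ring))
```

In a real deployment the link list would have to come from the monitoring data itself; the sketch only shows the graph check that could be applied to such a list.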
As the network topology is known in advance, we cross-reference the information coming from the monitoring tools to build a dynamic map of the whole farm: if a machine is moved, the map updates the position of the server after some time. Thanks to this new design, monitoring effectively helps the system administrator provide stable services, even when dealing with a large new infrastructure and new services.
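The cross-referencing idea behind the dynamic map can be sketched as a join between a static table and monitored data. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the known topology is available as a mapping from (switch, port) to a physical location, and that monitoring reports the switch port where each host was last seen. The names `build_farm_map`, `port_locations` and `observed_ports`, as well as the host and rack labels, are hypothetical.

```python
from typing import Dict, Tuple

SwitchPort = Tuple[str, int]  # hypothetical key: (switch name, port number)


def build_farm_map(
    port_locations: Dict[SwitchPort, str],
    observed_ports: Dict[str, SwitchPort],
) -> Dict[str, str]:
    """Place each host at the location of the switch port where it was last seen.

    port_locations: static topology, known in advance.
    observed_ports: host -> switch port, as last reported by monitoring.
    Hosts seen on a port missing from the topology are marked "unknown".
    """
    return {
        host: port_locations.get(switch_port, "unknown")
        for host, switch_port in observed_ports.items()
    }


if __name__ == "__main__":
    # Hypothetical topology and observations, for illustration only.
    topology = {("sw1", 1): "rack-A", ("sw2", 7): "rack-B"}
    before = build_farm_map(topology, {"wn001": ("sw1", 1)})
    # The machine is moved; the next monitoring cycle sees it on another port.
    after = build_farm_map(topology, {"wn001": ("sw2", 7)})
    print(before, after)
```

Rebuilding the map at every monitoring cycle is what makes it dynamic: a moved machine appears at its new location once monitoring has observed it on the new port, which matches the delay mentioned above.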