Speaker
Mr
Michael Poat
(Brookhaven National Laboratory)
Description
The STAR online computing environment is an intensive, ever-growing system used for real-time data collection and analysis. It is composed of heterogeneous and sometimes custom-tuned machine groups with differing requirements (Data Acquisition or DAQ computing, the Trigger group, Slow Control, and user-facing data quality monitoring resources). This infrastructure was long managed through manual configuration and inconsistently monitored by a combination of well-known tools (Ganglia for monitoring, for example) and home-made scripts sending email reports to the (many) administrators of the diverse systems. This situation led to configuration inconsistencies and an overload of repetitive tasks handed from system group to system group, with no global configuration and no global view of problems. Worse, as the need for communication between systems increased, securing the cyber-infrastructure globally was impossible to achieve, and as more resources move closer to where the data is generated (due to event filtering and so-called "High Level Trigger" farms), an agile, policy-driven system ensuring consistency was sought.
STAR has narrowed its strategy toward deploying a versatile and sustainable solution by leveraging the configuration management tool CFEngine to automate configurations, along with the deployment of the infrastructure monitoring system Icinga, which provides a dashboard view of the systems' health. In its first incarnation, Icinga 1 can be seen as a fork of the better-known monitoring tool Nagios, while version 2 is a core framework replacement and rewrite. Together, CFEngine and Icinga have strengthened automation and enabled in-depth development of the monitoring system. STAR has over 150 online systems spanning four major sub-systems, each becoming critical during Runs. With a bird's-eye view of each system, keeping track of the infrastructure becomes easy. Likewise, STAR can now swiftly upgrade and modify the environment to its needs, and promptly react to cyber-security needs, whether a global patch for the Shellshock vulnerability or a reconfiguration of Secure Shell (SSH). However, since the DAQ resources do not need the same configurations as the user-facing resources (to give one comparative example) yet all are required to follow the same baseline, the infrastructure layout is intricate, and our strategy allows each system group (and sometimes each machine) to be configured and monitored for its own particulars. Modular configuration is the key to consistency: differentiated plug-ins are distributed and configurations are updated ubiquitously. By creating a sustainable long-term monitoring solution, the time to detect failures has dropped from days to minutes, allowing rapid action before an issue becomes a dire problem potentially causing the loss of precious experimental data.
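To give a flavor of this approach, the following is a minimal CFEngine 3 sketch, not STAR's actual policy: the bundle names, the daq_group class, and the sshd settings shown are hypothetical, while the set_config_values and if_repaired bodies come from the CFEngine standard library. It enforces a common SSH baseline on every host and layers a group-specific bundle on top:

    bundle agent online_baseline
    {
      vars:
          # Baseline sshd settings enforced on every online host.
          "sshd[PermitRootLogin]"        string => "no";
          "sshd[PasswordAuthentication]" string => "no";

      files:
          # Converge sshd_config to the key/value pairs above;
          # raise a class only if the file actually changed.
          "/etc/ssh/sshd_config"
            edit_line => set_config_values("online_baseline.sshd"),
            classes   => if_repaired("sshd_repaired");

      commands:
        sshd_repaired::
          # Restart sshd only when the configuration was repaired.
          "/sbin/service sshd restart";

      methods:
        daq_group::
          # Hosts classed as DAQ additionally receive their own bundle.
          "daq" usebundle => daq_specific_config;
    }

    bundle agent daq_specific_config
    {
      reports:
          "applying DAQ-specific promises";
    }

Running cf-agent on each host then converges that host to its group's configuration without manual intervention, which is what makes a single baseline with per-group differentiation sustainable.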
Alternatives to our configuration management choice of CFEngine, such as Chef and Puppet, were proposed and evaluated. We chose an open-source, agile tool with minimal dependencies that is both powerful and simple to use. Our monitoring tool offers STAR administrators a crisp interface reporting every system's state at once and allowing historical system-status lookups.
In this report, we will briefly review and compare the diverse configuration management systems available and justify our choice by requirements and functionality. We will discuss the details and procedures for developing practical uses of configuration management and infrastructure monitoring. Our extensions to the community's plugins have been re-integrated into the main development branch; we will provide examples of extending Icinga and demonstrate, through example, the versatility of the framework for adding metrics.
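As background for that versatility, Icinga 1 inherits the Nagios plugin API: a check is any executable that exits 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN and may append performance data after a "|" for metric graphing. The sketch below illustrates the contract only; the /data path, the disk-usage metric, and the thresholds are illustrative choices, not one of STAR's actual checks:

    #!/usr/bin/env python
    # Minimal Icinga/Nagios-style check plugin sketch: exit code conveys
    # state, stdout carries a message plus optional perfdata after '|'.
    import os
    import sys

    WARN, CRIT = 80.0, 90.0  # percent-used thresholds (illustrative)

    def main():
        try:
            st = os.statvfs("/data")  # hypothetical data buffer path
            used = 100.0 * (1 - float(st.f_bavail) / st.f_blocks)
        except (OSError, ZeroDivisionError) as err:
            print("UNKNOWN - cannot stat /data: %s" % err)
            sys.exit(3)

        perfdata = "used=%.1f%%;%s;%s" % (used, WARN, CRIT)
        if used >= CRIT:
            print("CRITICAL - /data %.1f%% full | %s" % (used, perfdata))
            sys.exit(2)
        if used >= WARN:
            print("WARNING - /data %.1f%% full | %s" % (used, perfdata))
            sys.exit(1)
        print("OK - /data %.1f%% full | %s" % (used, perfdata))
        sys.exit(0)

    if __name__ == "__main__":
        main()

Because the contract is only an exit code and a line of output, a new metric can be added to the dashboard by dropping in such a script and declaring it as a check command.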
Authors
Dr
Jerome Lauret
(Brookhaven National Laboratory)
Mr
Michael Poat
(Brookhaven National Laboratory)
Wayne Betts
(Brookhaven National Laboratory)