Speaker
Tomasz Wlodek
(Brookhaven National Laboratory)
Description
Managing a large number of heterogeneous grid servers with different service
requirements poses great challenges. We describe a cost-effective integrated
operation framework which manages hardware inventory, monitors services, raises
alarms with different severity levels and tracks the facility response to them.
The system is based on open source components: RT (Request Tracker) tracks user
requests, AT (Asset Tracker) manages the site inventory, and Nagios performs facility
monitoring. We will discuss the integration of these components.
AT serves as the central repository for information about machines, services,
groups of machines and services, their interdependencies and their configuration.
Problem reports sent to RT by users are reflected in the asset history stored in
the AT database.
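
As an illustration of the kind of repository described above, the following is a minimal sketch of an asset inventory with dependencies and a per-asset history. It is not the actual AT schema: the class names (Asset, Inventory), the field names and the host names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Asset:
    """One inventory entry. All field names are illustrative assumptions."""
    name: str
    kind: str  # "machine", "service" or "group"
    depends_on: List[str] = field(default_factory=list)
    config: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


class Inventory:
    """In-memory stand-in for a central asset repository."""

    def __init__(self) -> None:
        self.assets: Dict[str, Asset] = {}

    def add(self, asset: Asset) -> None:
        self.assets[asset.name] = asset

    def record_ticket(self, asset_name: str, ticket_id: int, subject: str) -> None:
        # Mirror a problem report in the history of the affected asset.
        self.assets[asset_name].history.append(f"ticket #{ticket_id}: {subject}")

    def dependents_of(self, name: str) -> List[str]:
        # Assets that directly depend on the named asset, sorted by name.
        return sorted(a.name for a in self.assets.values() if name in a.depends_on)


# Hypothetical example data.
inventory = Inventory()
inventory.add(Asset("node01", "machine", config={"rack": "A1"}))
inventory.add(Asset("gridftp@node01", "service", depends_on=["node01"]))
inventory.record_ticket("node01", 101, "disk failure reported by user")
```

Keeping dependencies as names rather than object references keeps the sketch close to how such links would be stored as rows in a database table.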
Nagios uses AT to obtain information about the components to be monitored.
Detected problems are classified according to their severity, reported to experts and
fed into RT, where the progress towards their resolution is tracked.
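
One way the monitoring configuration could be derived from the inventory is sketched below: a list of inventory records is rendered into Nagios host and service object definitions. This is only an assumption about how such an integration might look, not the mechanism used at BNL. The record layout, the host name and address, and the function names are hypothetical; the generic-host and generic-service templates and the check_tcp command are assumed to be defined elsewhere in the Nagios configuration, as they are in the sample files shipped with Nagios.

```python
from typing import Dict, List


def render_host(name: str, address: str) -> str:
    """Render one Nagios host object definition."""
    return (
        "define host {\n"
        "    use        generic-host\n"  # template assumed to exist
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def render_service(host: str, description: str, check_command: str) -> str:
    """Render one Nagios service object definition attached to a host."""
    return (
        "define service {\n"
        "    use                  generic-service\n"  # template assumed to exist
        f"    host_name            {host}\n"
        f"    service_description  {description}\n"
        f"    check_command        {check_command}\n"
        "}\n"
    )


def render_config(records: List[Dict]) -> str:
    """Turn inventory records (hypothetical layout) into Nagios config text."""
    blocks = []
    for rec in records:
        blocks.append(render_host(rec["name"], rec["address"]))
        for svc in rec["services"]:
            blocks.append(
                render_service(rec["name"], svc["description"], svc["check_command"])
            )
    return "\n".join(blocks)


# Hypothetical inventory record; the address is from a documentation range.
records = [
    {
        "name": "gridftp01",
        "address": "192.0.2.10",
        "services": [
            {"description": "GridFTP", "check_command": "check_tcp!2811"},
        ],
    }
]
config_text = render_config(records)
```

Generating the object definitions from a single inventory source means a machine added to the inventory is picked up by monitoring without a separate manual edit of the Nagios files.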
The paper will describe the AT data model, the integration between AT and Nagios,
and the interfacing of RT to other problem tracking systems.
The described system provides a scalable solution to commission grid servers,
automate the error-prone manual system configuration, and leverage the existing
ticket system for problem tracking. It allows BNL to operate its Tier-1 facility
24x7 and to meet the service level agreements for each WLCG grid middleware
component, each with a different class of service requirements.
Author
Tomasz Wlodek
(Brookhaven National Laboratory)
Co-authors
Carlos Gamboa
(Brookhaven National Laboratory)
Dantong Yu
(Brookhaven National Laboratory)
Jason Smith
(Brookhaven National Laboratory)
Robert Petkus
(Brookhaven National Laboratory)
Shigeki Misawa
(Brookhaven National Laboratory)
Tom Throwe
(Brookhaven National Laboratory)
Yingzi Wu
(Brookhaven National Laboratory)
Zhenping Liu
(Brookhaven National Laboratory)