Speaker
Dr
Scott Teige
(Indiana University)
Description
The Open Science Grid (OSG) Operations Team operates a distributed set of services and
tools that enable several HEP projects to use the OSG. Without these
services, users of the OSG would not be able to run jobs, locate resources, obtain
information about the status of systems, or generally use the OSG. For this reason, these
services must be highly available. This paper describes the automated monitoring and
notification systems used to diagnose and report problems. It covers the means
used by OSG Operations to monitor physical facilities, network
operations, server health, service availability, and software error events.
Once detected, an error condition generates a message sent via channels such as
email, SMS, Twitter, or an instant message server.
Particular emphasis is placed on the approach used to integrate these monitoring
systems into a prioritized and configurable alarming mechanism. This system, along
with the ability to quickly restore interrupted services, has allowed consistent
operation of critical services with near 100% availability.
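The idea of a prioritized and configurable alarming mechanism can be illustrated with a minimal sketch. This is not the OSG Operations implementation, which the abstract does not detail; the names Priority, Alarm, ROUTES, channels_for and dispatch, as well as the priority levels and per-channel thresholds, are assumptions made purely for illustration. The sketch assumes each notification channel is configured with a minimum priority and that an alarm is sent to every channel whose threshold it meets.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; higher values are more urgent."""
    LOW = 1
    HIGH = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Alarm:
    source: str        # monitoring system that raised it, e.g. "server health"
    message: str       # human-readable description of the error condition
    priority: Priority


# Hypothetical configuration: the minimum priority each channel accepts.
ROUTES = {
    "email": Priority.LOW,
    "instant_message": Priority.HIGH,
    "sms": Priority.CRITICAL,
}


def channels_for(alarm, routes=ROUTES):
    """Return the sorted names of channels whose threshold the alarm meets."""
    return sorted(
        name for name, minimum in routes.items() if alarm.priority >= minimum
    )


def dispatch(alarm, senders, routes=ROUTES):
    """Call the sender of each selected channel; return the channels used.

    `senders` maps a channel name to a callable that takes the message text.
    """
    selected = channels_for(alarm, routes)
    text = f"[{alarm.priority.name}] {alarm.source}: {alarm.message}"
    for name in selected:
        senders[name](text)
    return selected
```

With this layout, changing who is notified for a given severity means editing the routing table only; the monitoring systems that raise alarms do not need to change.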
Author
Dr
Scott Teige
(Indiana University)
Co-authors
Robert Quick
(Indiana University)
Soichi Hayashi
(Indiana University)