21-25 May 2012
New York City, NY, USA
US/Eastern timezone

The Event Notification and Alarm System for the Open Science Grid Operations Center

22 May 2012, 13:30
4h 45m
Dr Scott Teige (Indiana University)


The Open Science Grid (OSG) Operations Team operates a distributed set of services and tools that enable several HEP projects to use the OSG. Without these services, users of the OSG would not be able to run jobs, locate resources, obtain information about the status of systems, or generally use the OSG. For this reason, these services must be highly available. This paper describes the automated monitoring and notification systems used to diagnose and report problems, including the means used by OSG Operations to monitor physical facilities, network operations, server health, service availability, and software error events. Once detected, an error condition generates a message sent to destinations such as email, SMS, Twitter, or an instant message server. Particular emphasis is placed on the approach used to integrate these monitoring systems into a prioritized and configurable alarming mechanism. This system, along with the ability to quickly restore interrupted services, has allowed consistent operation of critical services with nearly 100% availability.
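
The abstract does not give implementation details, so the following is only a minimal sketch of what a prioritized, configurable alarm router could look like. The priority levels, the `ROUTES` table, and the `Alarm`, `channels_for`, and `dispatch` names are all hypothetical and are not taken from the OSG Operations system; the channel names simply mirror the destinations listed in the abstract.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; higher values are more urgent."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alarm:
    source: str        # e.g. "server health" or "service availability"
    message: str
    priority: Priority


# Assumed routing table: the minimum priority that triggers each channel.
# In a configurable system this would be loaded from a config file.
ROUTES = {
    "email": Priority.INFO,
    "instant_message": Priority.WARNING,
    "twitter": Priority.WARNING,
    "sms": Priority.CRITICAL,
}


def channels_for(alarm, routes=ROUTES):
    """Return the sorted channel names whose threshold the alarm meets."""
    return sorted(name for name, minimum in routes.items()
                  if alarm.priority >= minimum)


def dispatch(alarm, senders, routes=ROUTES):
    """Call the sender for each selected channel; return the channels used."""
    used = []
    for name in channels_for(alarm, routes):
        sender = senders.get(name)
        if sender is not None:
            sender(f"[{alarm.priority.name}] {alarm.source}: {alarm.message}")
            used.append(name)
    return used
```

In this sketch, `senders` maps each channel name to a callable that delivers the text, so real email or SMS gateways can be swapped in without changing the routing logic, and editing `ROUTES` changes which events reach which channel.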

Dr Scott Teige (Indiana University)


Robert Quick (Indiana University), Soichi Hayashi (Indiana University)

