Speaker
Dr
Alfredo Pagano
(INFN/CNAF, Bologna, Italy)
Description
Worldwide grid projects such as EGEE and WLCG need services with high availability,
not only for grid usage, but also for associated operations. In particular, tools
used for daily activities or operational procedures are considered critical.
In this context, the goal of the EGEE failover work is to propose, implement and
document well-established mechanisms and procedures that limit service outages for
the operations and monitoring tools used by regional and global grid operators to
control the status of the EGEE grid.
The operations activity of EGEE relies on different tools developed by teams from
different countries. Prior to this work, only one instance of each tool was deployed,
making each tool a single point of failure. We solved the problem by replicating the
tools at different sites and using specific DNS features to switch automatically to
another instance of a service when a failure occurs.
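
The switching decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EGEE implementation: the host names, the `pick_active` helper and the simulated health state are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_active(instances: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy instance in priority order, or None."""
    for host in instances:
        if is_healthy(host):
            return host
    return None


# Hypothetical replicas of one tool, listed in priority order (master first).
REPLICAS = ["portal-master.example.org", "portal-backup.example.org"]

# Simulated health state (assumption): the master is down, the backup is up.
status = {
    "portal-master.example.org": False,
    "portal-backup.example.org": True,
}

active = pick_active(REPLICAS, lambda host: status[host])
print(active)
```

In a real deployment the health check would probe the service itself, and the selected instance would then become the target of the DNS record that users resolve.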
After a DNS test phase in a virtual machine (VM) environment focused on nsupdate,
NS/zone configuration and short TTLs, a new domain for grid operations (gridops.org)
was registered. In addition, replication of databases, web servers and web services
has also been investigated and configured.
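
As an illustration of how nsupdate and short TTLs combine, the sketch below builds the batch input that nsupdate reads to repoint a service alias at a replica. Only the gridops.org domain comes from this abstract; the server name, alias, target host, the 60-second TTL and the `build_nsupdate_script` helper are assumptions made for the example.

```python
def build_nsupdate_script(dns_server: str, zone: str, alias: str,
                          target: str, ttl: int = 60) -> str:
    """Build nsupdate batch input that repoints a CNAME alias to a new target."""
    lines = [
        f"server {dns_server}",                      # DNS server receiving the update
        f"zone {zone}",                              # zone that holds the alias
        f"update delete {alias} CNAME",              # drop the current alias record
        f"update add {alias} {ttl} CNAME {target}",  # point the alias at the replica
        "send",                                      # submit the update
    ]
    return "\n".join(lines) + "\n"


# Hypothetical names; the trailing dots make them fully qualified.
script = build_nsupdate_script(
    dns_server="ns1.example.org",
    zone="gridops.org.",
    alias="tool.gridops.org.",
    target="tool-backup.example.org.",
)
print(script)
```

The resulting text could be fed to nsupdate on standard input, typically authenticated with a key file passed via `-k`. A short TTL limits how long resolvers keep serving the old record after the change.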
In this paper, we describe the technical mechanism used in our approach and show the
replication procedure implemented for the EGEE/WLCG CIC Operations Portal use case.
We also discuss the relevance of failover procedures to other grid projects and grid
services, and describe future plans for improving the procedures.
Authors
Mr
Alessandro Cavalli
(INFN/CNAF, Bologna, Italy)
Dr
Alfredo Pagano
(INFN/CNAF, Bologna, Italy)
Mr
Cyril L'Orphelin
(IN2P3/CNRS Computing Centre, Lyon, France)
Mr
Gilles Mathieu
(IN2P3/CNRS Computing Centre, Lyon, France)
Mr
Osman Aidel
(IN2P3/CNRS Computing Centre, Lyon, France)
Co-author
Mr
Rafal Lichwala
(PSNC, Poznan, Poland)