Speaker
Dr
Marc Dobson
(CERN)
Description
The ATLAS experiment will use on the order of three thousand nodes for its online processing farms.
Administering such a large cluster is a challenge, especially given the high impact of any
downtime. The ability to turn machines on and off quickly and remotely, especially following a power
cut, and the ability to monitor hardware health whether a machine is on or off were among
the major issues the ATLAS SysAdmin Team faced. To solve these problems, ATLAS has
decided to use the Intelligent Platform Management Interface (IPMI) wherever possible for its
nodes.
This paper presents the mechanisms developed to distribute
management and monitoring commands to the cluster machines in parallel. These commands
were run simultaneously on the prototype farm and on the small-scale final farm already
purchased. The commands and their distribution take into account the specifics of the
different IPMI versions and implementations, and the network topology of the ATLAS Online
system.
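As an illustration only, parallel distribution of a power command could be sketched as below. This is not the mechanism described in the paper: it assumes the open-source ipmitool command-line client and a simple thread pool. The host names, user name, password file path, worker count and the helper names build_command, run_ipmitool and dispatch are all hypothetical. The one version-specific detail shown is the choice of ipmitool interface, "lan" for IPMI v1.5 and "lanplus" for IPMI v2.0.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# ipmitool LAN interface per IPMI version: "lan" for v1.5, "lanplus" for v2.0.
INTERFACE = {"1.5": "lan", "2.0": "lanplus"}


def build_command(host, version, action,
                  user="admin", password_file="/etc/ipmi.pass"):
    """Return the ipmitool argv for one chassis power action on one BMC.

    `user` and `password_file` are placeholder values (assumptions).
    """
    return ["ipmitool", "-I", INTERFACE[version], "-H", host,
            "-U", user, "-f", password_file,
            "chassis", "power", action]


def run_ipmitool(argv, timeout=30):
    """Run one command and return (exit code, stdout); -1 on local failure."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
        return done.returncode, done.stdout.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return -1, str(exc)


def dispatch(nodes, action, runner=run_ipmitool, max_workers=64):
    """Send `action` to every (host, version) pair in parallel.

    Returns a dict mapping each host to the runner's result.
    """
    commands = {host: build_command(host, version, action)
                for host, version in nodes}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {host: pool.submit(runner, argv)
                   for host, argv in commands.items()}
        return {host: future.result() for host, future in futures.items()}
```

A call such as dispatch([("node001-bmc", "2.0")], "status") would query one management controller. Threads are used here because each command spends nearly all of its time waiting on the network, and the runner argument lets the dispatch logic be exercised without any IPMI hardware.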
Results from timing measurements of command distribution to many nodes will be
shown. These measurements cover the times for booting and shutting down the nodes
and will be extrapolated to the final cluster size.
Authors
Dr
Marc Dobson
(CERN)
Dr
Usman Ahmad MALIK
(NCP, Quaid-E-Azam University)