Speaker
Larry Pezzaglia
(LBNL)
Description
This talk will provide a case study of cluster consolidation at NERSC.
In 2012, NERSC began deployment of "Mendel", a 500+ node,
InfiniBand-attached Linux "meta-cluster" that transparently
expands NERSC production clusters and services in a scalable and
maintainable fashion. The success of the software automation
infrastructure behind the Mendel multi-clustering model encouraged
investigation into even more aggressive consolidation efforts.
This talk will detail one such effort: under the constraints of a
24x7, disruption-sensitive environment, NERSC staff merged a 400-node
legacy production cluster, consisting of multiple hardware generations
and ad-hoc software configurations, into Mendel's automation
infrastructure. By leveraging the hierarchical management features of
the xCAT software package in combination with other open-source and
in-house tools, such as CFEngine and CHOS, NERSC abstracted away the
unique characteristics of both clusters beneath a unified management
interface. Consequently, both cluster components are now managed as a
single, albeit complex, integrated system.
Additionally, this talk will provide an update on the PDSF system at
NERSC, including improvements to trending data collection and ongoing
CHOS development.
Primary author
Larry Pezzaglia
(LBNL)