This talk will provide a case study of cluster consolidation at NERSC.
In 2012, NERSC began deployment of "Mendel", a 500+ node,
InfiniBand-attached Linux "meta-cluster" that transparently
expands NERSC production clusters and services in a scalable and
maintainable fashion. The success of the software automation
infrastructure behind the Mendel multi-clustering model encouraged
investigation into even more aggressive consolidation efforts.
This talk will detail one such effort: under the constraints of a
24x7, disruption-sensitive environment, NERSC staff merged a 400-node
legacy production cluster, consisting of multiple hardware generations
and ad hoc software configurations, into Mendel's automation
infrastructure. By leveraging the hierarchical management features of
the xCAT software package in combination with other open-source and
in-house tools, such as Cfengine and CHOS, NERSC abstracted away the
unique characteristics of both clusters beneath a unified management
interface. Consequently, both cluster components are now managed as a
single, albeit complex, integrated system.
Additionally, this talk will provide an update on the PDSF system at
NERSC, including improvements to trending data collection and ongoing
CHOS development.