19-23 May 2014
Cluster Consolidation at NERSC

22 May 2014, 17:25
Auditorium Marcel Vivargent (LAPP)

Larry Pezzaglia (LBNL)


This talk will present a case study of cluster consolidation at NERSC. In 2012, NERSC began deploying "Mendel", a 500+ node, InfiniBand-attached Linux "meta-cluster" that transparently expands NERSC production clusters and services in a scalable and maintainable fashion. The success of the software automation infrastructure behind the Mendel multi-clustering model encouraged investigation into even more aggressive consolidation efforts. This talk will detail one such effort: under the constraints of a 24x7, disruption-sensitive environment, NERSC staff merged a 400-node legacy production cluster, consisting of multiple hardware generations and ad-hoc software configurations, into Mendel's automation infrastructure. By combining the hierarchical management features of the xCAT software package with other open-source and in-house tools, such as Cfengine and CHOS, NERSC abstracted away the unique characteristics of both clusters beneath a unified management interface. Consequently, the two clusters are now managed as a single, albeit complex, integrated system. Additionally, this talk will provide an update on the PDSF system at NERSC, including improvements to trending data collection and ongoing CHOS development.

