Description
After two years of maintenance and upgrade, the Large Hadron Collider (LHC) has started its second four-year run. In the meantime, the CMS experiment at the LHC has also undergone two years of maintenance and upgrade, particularly of the Data Acquisition (DAQ) system and the online computing cluster, which were largely redesigned and replaced. Various aspects of the supporting computing system will be addressed here.
Increasing processing power and the use of high-end networking technologies (10/40 Gb/s Ethernet and 56 Gb/s Infiniband) have reduced the number of DAQ event building nodes, since the performance of the individual nodes has increased by an order of magnitude since the start of the LHC. The pressure to use the systems optimally has increased accordingly, which also raises the importance of proper configuration and of careful monitoring to catch any deviation from standard behaviour. The upgraded monitoring system, based on Ganglia and Icinga2, will be presented together with the different mechanisms used to monitor and troubleshoot the crucial elements of the system.
The evolution of the various sub-detector applications, the data acquisition and the high level trigger, following their upgraded hardware and designs over the upgrade and running periods, requires a performant and flexible management and configuration infrastructure. The Puppet-based configuration and management system put in place for this phase will be presented, showing its flexibility to support a large heterogeneous system as well as its ability to do bulk installations from scratch or rapid cluster-wide installations of CMS software. A number of custom tools have been developed to allow end users to update RPM-based installations, a feature not typically supported in a datacenter environment. The performance of the system will also be presented, with insights into how it scales with the increasing farm size over this data-taking run.
Such a large and complex system requires redundant, flexible core infrastructure services to support it. Details will be given on how a flexible and highly available infrastructure has been put in place by leveraging various high availability technologies, from network redundancy, through virtualisation, to high availability services with Pacemaker/Corosync.
To conclude, a roundup of the different tools and solutions used in the CMS cluster administration will be given, showing how all of the above come together in a coherent, performant and scalable system.
Primary Keyword (Mandatory) | Computing facilities |
---|---|
Secondary Keyword (Optional) | DAQ |