The CMS collaboration operates a large distributed computing infrastructure to meet the computing requirements of the experiment. About half a million CPU cores and an exabyte of storage are utilized to reconstruct the recorded data, simulate signals of physics processes, and analyze data. Computing resources are located at about one hundred sites around the world.
Monitoring the performance of the computing resources and promptly alerting local administrators to any issues is paramount for smooth operation and the success of the experiment. CMS utilizes three tools to test and monitor services in its computing grid: the Service Availability Monitor (SAM) of the Worldwide LHC Computing Grid (WLCG), HammerCloud (HC), and File Transfer Service (FTS) test transfers.
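The basic idea behind such test-job monitoring can be sketched as follows: standardized probe jobs are submitted to each site, and their pass/fail outcomes are aggregated into per-site success rates that feed availability dashboards and alerts. The snippet below is a minimal, hypothetical illustration of that aggregation step; the record layout and site names are assumptions for this example, not the actual SAM or HammerCloud data model.

```python
from collections import defaultdict

def site_success_rates(jobs):
    """Aggregate test-job outcomes into a per-site success rate.

    `jobs` is an iterable of (site, succeeded) pairs -- a deliberately
    simplified stand-in for real SAM/HammerCloud job records, which
    carry far more detail (test name, timestamps, error categories).
    """
    totals = defaultdict(lambda: [0, 0])  # site -> [passed, total]
    for site, ok in jobs:
        if ok:
            totals[site][0] += 1
        totals[site][1] += 1
    return {site: passed / total for site, (passed, total) in totals.items()}

# Example input using CMS-style site names (values are invented).
jobs = [
    ("T1_DE_KIT", True), ("T1_DE_KIT", True), ("T1_DE_KIT", False),
    ("T2_US_MIT", True), ("T2_US_MIT", True),
]
rates = site_success_rates(jobs)
```

In a real deployment, a rate falling below an agreed availability threshold would trigger an alert to the site administrators.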
In this presentation, we show recent updates to the HammerCloud test-job system and provide detailed site-performance analyses based on the HC jobs, which serve as a standardized reference workload across sites. Some of the uncovered issues are not specific to CMS and stem from site configuration choices made many years ago.