We describe the development of Trident, a tool that takes a three-pronged approach to analysing node utilisation while aiming to be user-friendly. The three areas of focus are data I/O, CPU cores and memory.
Compute applications running on a batch system node stress different parts of the node over time. It is usual to look at metrics such as CPU load average and memory consumed. However, these metrics often do not provide enough information to form a detailed picture of how the system is performing, and in most cases they make it impossible to detect performance problems.
Monitoring and collecting further performance metrics in near real time is intended to give a better understanding of compute demands and of which changes can improve utilisation. We are investigating methodologies at the CERN Tier-0 that allow the collection of metrics such as memory bandwidth, detailed CPU core utilisation and active processor cycles. This is done with minimal overhead and without instrumenting the user code. When combined with modern analytics, the metrics can provide information relevant to users, developers and site administrators. The raw metrics are often difficult to interpret, hence the development of a tool that allows the target communities to both collect and interpret resource utilisation data more easily.
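As an illustration of collecting one such metric without instrumenting user code, the sketch below derives per-core utilisation from two snapshots of the Linux `/proc/stat` counters. It is not taken from Trident: the function names (`parse_proc_stat`, `core_utilisation`, `sample_utilisation`) are hypothetical, and it covers only CPU core utilisation. Metrics such as memory bandwidth or active processor cycles would typically need hardware performance counters, which this sketch does not touch.

```python
import time
from typing import Dict, Tuple

# Hypothetical helper names; a sketch, not Trident's implementation.

def parse_proc_stat(text: str) -> Dict[str, Tuple[int, int]]:
    """Return {core: (busy_jiffies, total_jiffies)} for each per-core cpuN line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal.
        # Guest time is already included in user, so it is not summed again.
        values = [int(v) for v in fields[1:9]]
        total = sum(values)
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        counters[fields[0]] = (total - idle, total)
    return counters


def core_utilisation(before: Dict[str, Tuple[int, int]],
                     after: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Fraction of time each core was busy between two snapshots (0.0 to 1.0)."""
    util = {}
    for core, (busy1, total1) in after.items():
        if core not in before:
            continue
        busy0, total0 = before[core]
        elapsed = total1 - total0
        util[core] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return util


def sample_utilisation(interval: float = 1.0, path: str = "/proc/stat") -> Dict[str, float]:
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return core_utilisation(first, second)
```

On a Linux node, calling `sample_utilisation(1.0)` returns a mapping from core name to busy fraction over the sampling interval. Reading kernel counters in this way keeps the overhead low and requires no changes to the monitored application.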