Speaker
Prof.
David Britton
(University of Glasgow)
Description
Modern Linux kernels include a feature set, called Cgroups, that enables
the control and monitoring of system resources. Cgroups have been
enabled on a production HTCondor pool located at the Glasgow site of the
UKI-SCOTGRID distributed Tier-2. A system has been put in place to
collect and aggregate metrics extracted from Cgroups on all worker nodes
within the Condor pool. From this aggregated data, memory and CPU usage
footprints are extracted. From the extracted footprints the resource
usage for each type of ATLAS and GridPP workload can be obtained and
studied. This system has been used to identify broken payloads and to
measure real-world memory usage and job efficiencies.
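The collection and aggregation step described above can be illustrated with a minimal sketch. It assumes the cgroup v1 interface, where the memory controller exposes `memory.max_usage_in_bytes` and the cpuacct controller exposes `cpuacct.usage` (CPU time in nanoseconds). The function names, the per-job directory arguments, and the job-type labels are hypothetical and are not taken from the Glasgow system.

```python
from collections import defaultdict
from pathlib import Path


def read_cgroup_sample(memory_dir, cpuacct_dir):
    """Read one job's peak memory and CPU time from cgroup v1 files.

    memory_dir and cpuacct_dir are the job's directories under the memory
    and cpuacct controllers; where they live on a node is site specific
    and is an assumption left to the caller.
    """
    peak_bytes = int(Path(memory_dir, "memory.max_usage_in_bytes").read_text().strip())
    cpu_ns = int(Path(cpuacct_dir, "cpuacct.usage").read_text().strip())
    return {"peak_mem_bytes": peak_bytes, "cpu_seconds": cpu_ns / 1e9}


def aggregate_by_job_type(samples):
    """Build a per-job-type footprint from (job_type, sample) pairs."""
    grouped = defaultdict(list)
    for job_type, sample in samples:
        grouped[job_type].append(sample)

    footprints = {}
    for job_type, items in grouped.items():
        footprints[job_type] = {
            "jobs": len(items),
            "max_peak_mem_bytes": max(s["peak_mem_bytes"] for s in items),
            "mean_cpu_seconds": sum(s["cpu_seconds"] for s in items) / len(items),
        }
    return footprints
```

In this sketch each worker node would call `read_cgroup_sample` for its running jobs and ship the results to a central collector, which then calls `aggregate_by_job_type`. How the production system transports and stores the samples is not stated in the abstract.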
The system has been running in production for one year and a large amount of data
has been collected. From these statistics we can see the difference
between the memory originally requested and the real-world memory usage of different
types of jobs. These results were used to reduce the amount of memory requested from the batch system (for scheduling purposes), and an increase in cluster utilisation was observed, at around the 10% level. By analysing the overall real-world job performance we have been able to increase the utilisation of the Glasgow site of the UKI-SCOTGRID distributed Tier-2.
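One way to turn the gap between requested and observed memory into a smaller scheduling request is sketched below. The abstract does not say how the reduced requests were chosen, so the rule used here (a nearest-rank percentile of observed peaks plus a fixed headroom, never exceeding the original request) is purely an assumption, as are the function name and its default parameters.

```python
import math


def suggest_memory_request(requested_mb, observed_peaks_mb,
                           percentile=95, headroom_percent=10):
    """Suggest a memory request (MB) for one job type.

    Hypothetical rule: take the nearest-rank percentile of the observed
    peak usage, add headroom, and never exceed the original request.
    """
    if not observed_peaks_mb:
        raise ValueError("need at least one observed peak")

    ordered = sorted(observed_peaks_mb)
    # Nearest-rank percentile: smallest rank covering `percentile` % of jobs.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    candidate = math.ceil(ordered[rank - 1] * (100 + headroom_percent) / 100)
    return min(requested_mb, candidate)
```

Capping the result at the original request keeps the sketch conservative: job types that already use nearly everything they ask for are left unchanged, and only over-provisioned types free up memory for additional jobs.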
Author
Gang Qin
(University of Glasgow (GB))
Co-authors
David Britton
(University of Glasgow (GB))
David Crooks
(University of Glasgow (GB))
Gareth Douglas Roy
(University of Glasgow (GB))
Dr
Gordon Stewart
(University of Glasgow)
Samuel Cadellin Skipsey