Active job monitoring in pilots

Apr 14, 2015, 2:45 PM
C209



Oral presentation
Track 6: Facilities, Infrastructure, Network
Track 6 Session


Speaker

Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))


Recent developments in high energy physics (HEP), including multi-core jobs and multi-core pilots, require data centres to gain a deep understanding of their systems in order to design and upgrade computing clusters correctly. Networking in particular is a critical component, as the increased usage of data federations relies on WAN connectivity and availability as a fallback for data access. The specific demands of different experiments and communities, as well as the need to identify misbehaving batch jobs, require active monitoring. Existing monitoring tools cannot measure fine-grained information at batch job level, which complicates network-specific scheduling and optimisation. In addition, pilots add another layer of abstraction: they behave like batch systems themselves, managing and executing job payloads internally. Because the original batch system has no access to the scheduling process inside a pilot, the number of jobs a pilot executes is unpredictable. Jobs and pilots are therefore not directly comparable when predicting runtime behaviour or network performance, and identifying the actual payload becomes important.

At the GridKa Tier 1 centre, a dedicated monitoring tool is in use that records network traffic information at batch job level. A first analysis using machine learning algorithms showed the relevance of the measured data, but indicated that results could be improved by subdividing pilots into separate jobs. This contribution presents the current monitoring approach and discusses recent efforts to identify pilots and their substructures inside the batch system, and why this identification matters. It also shows how to determine the monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated and adapted to the earlier analysis, and the results are presented.

Primary author

Eileen Kuhn (KIT - Karlsruhe Institute of Technology (DE))


Co-authors

Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
Christopher Jung (KIT - Karlsruhe Institute of Technology (DE))
Manuel Giffels (KIT - Karlsruhe Institute of Technology (DE))
Max Fischer (KIT - Karlsruhe Institute of Technology (DE))
