9-13 July 2018
Sofia, Bulgaria
Europe/Sofia timezone

A Feasibility Study about Integrating HTCondor Cluster Workload with SLURM Cluster Workload

12 Jul 2018, 11:30
15m
Hall 10 (National Palace of Culture)

Presentation: Track 8 – Networks and facilities

Speaker

Ran Du

Description

Two production clusters coexist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager; the other is a High Performance Computing (HPC) cluster with SLURM as the workload manager. The resources of the HTCondor cluster are provided by multiple experiments, and a dynamic resource-sharing mechanism has raised its resource utilization above 90%. Nevertheless, a bottleneck arises when several experiments request more resources at the same moment. Parallel jobs running on the SLURM cluster, on the other hand, have distinctive attributes: a high degree of parallelism, a small number of jobs and long wall times. These attributes tend to leave free resource slots that are suitable for jobs from the HTCondor cluster. A mechanism that transparently schedules jobs from the HTCondor cluster onto the SLURM cluster would therefore improve resource utilization on both clusters. HTCondor provides HTCondor-C to schedule jobs to other clusters managed by different workload managers, for example SLURM. However, this is not enough if we want to decide which jobs may be scheduled by SLURM, and when and where they may run. Managing the rescheduled jobs running on the SLURM cluster is a further problem. Furthermore, HTCondor and SLURM differ in design philosophy and application scenarios, so a large number of jobs arriving in a short period may impose extra scheduling load on SLURM. In this paper, after a brief background introduction, we describe the problems of integrating the two cluster workloads and present possible solutions to these problems.
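
The abstract does not specify how jobs would be chosen for forwarding, so the following is only an illustrative sketch of the "which, when and where" decision, not the authors' mechanism. It assumes that free SLURM slots can be summarized as free cores plus the minutes remaining before the node is needed again, and that each HTCondor job carries a core request and a wall-time estimate. All names here (`HtcJob`, `FreeSlot`, `select_jobs_for_slots`, `max_forwarded`) are hypothetical; `max_forwarded` stands in for a cap that limits the extra scheduling load placed on SLURM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HtcJob:
    """Hypothetical view of an idle HTCondor job."""
    job_id: str
    cores: int
    est_walltime_min: int  # assumed user- or history-based estimate


@dataclass(frozen=True)
class FreeSlot:
    """Hypothetical view of idle capacity on one SLURM node."""
    node: str
    free_cores: int
    free_minutes: int  # assumed time until the node is needed again


def select_jobs_for_slots(
    jobs: List[HtcJob],
    slots: List[FreeSlot],
    max_forwarded: int,
) -> List[Tuple[str, str]]:
    """Greedy first-fit: place jobs, in queue order, on the first slot
    with enough free cores and enough free time. Stops once
    max_forwarded jobs are placed, to bound the load on SLURM."""
    remaining = {slot.node: slot.free_cores for slot in slots}
    placements: List[Tuple[str, str]] = []
    for job in jobs:
        if len(placements) >= max_forwarded:
            break
        for slot in slots:
            fits_cores = job.cores <= remaining[slot.node]
            fits_time = job.est_walltime_min <= slot.free_minutes
            if fits_cores and fits_time:
                remaining[slot.node] -= job.cores
                placements.append((job.job_id, slot.node))
                break  # job placed; move on to the next job
    return placements


if __name__ == "__main__":
    queue = [
        HtcJob("a", cores=1, est_walltime_min=60),
        HtcJob("b", cores=4, est_walltime_min=30),
        HtcJob("c", cores=1, est_walltime_min=600),  # too long for any slot
        HtcJob("d", cores=2, est_walltime_min=45),
    ]
    idle = [FreeSlot("n1", 2, 120), FreeSlot("n2", 4, 50)]
    print(select_jobs_for_slots(queue, idle, max_forwarded=3))
```

With the sample data, jobs `a` and `b` are placed on `n1` and `n2`, while `c` is skipped for exceeding every free time window and `d` finds no node with two free cores left. A real deployment would need to obtain the slot and job information from the two schedulers themselves, which this sketch deliberately leaves out.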

Primary authors

Ran Du
Jingyan Shi (IHEP)
Mr Xiaowei Jiang (IHEP, Institute of High Energy Physics, Chinese Academy of Sciences)
Jiaheng Zou (IHEP)
Mr Zhenyu Sun
Ms Hongnan Tan