Speaker
Description
Two production clusters coexist at the Institute of High Energy Physics (IHEP): a High Throughput Computing (HTC) cluster with HTCondor as the workload manager, and a High Performance Computing (HPC) cluster with SLURM as the workload manager. The resources of the HTCondor cluster are contributed by multiple experiments, and a dynamic resource sharing mechanism has raised resource utilization above 90%. Nevertheless, a bottleneck arises when several experiments request additional resources at the same time. Parallel jobs on the SLURM cluster, by contrast, have distinctive attributes: a high degree of parallelism, small job counts, and long wall times. These attributes tend to leave free resource slots that are suitable for jobs from the HTCondor cluster. A mechanism that transparently schedules jobs from the HTCondor cluster onto the SLURM cluster would therefore improve the resource utilization of both clusters. HTCondor provides HTCondor-C to schedule jobs to other clusters managed by different workload managers, for example SLURM. However, this alone is not enough if we want to decide which jobs may be scheduled by SLURM, and when and where they may run. Managing the rescheduled jobs once they are running on the SLURM cluster is a further problem. Furthermore, HTCondor and SLURM differ in design philosophy and application scenarios, so a large number of jobs arriving in a short period may impose extra scheduling load on SLURM. In this paper, after a brief introduction to the background, we describe the problems of integrating the workloads of the two clusters and present possible solutions to them.
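To make the "which, when and where" decision concrete, the following is a minimal sketch of one possible selection policy, not the mechanism presented in the paper. It assumes that free SLURM slots and idle HTCondor jobs have already been collected into plain records. All names here (`FreeSlot`, `IdleJob`, `select_jobs_for_slurm`, `max_per_cycle`) are hypothetical and belong to neither HTCondor nor SLURM. The per-cycle cap illustrates one simple way to limit the extra scheduling load placed on SLURM.

```python
from dataclasses import dataclass


@dataclass
class FreeSlot:
    """Hypothetical record of idle capacity on one SLURM node."""
    node: str
    cores: int
    minutes_free: int  # assumed time before a parallel job needs the node


@dataclass
class IdleJob:
    """Hypothetical record of an idle HTCondor job."""
    job_id: str
    cores: int
    est_minutes: int  # assumed wall time estimate


def select_jobs_for_slurm(slots, jobs, max_per_cycle):
    """Greedy first-fit: return (job_id, node) pairs for jobs that fit a slot.

    At most max_per_cycle jobs are selected per call, so that a burst of
    short jobs does not overload the SLURM scheduler.
    """
    remaining = {s.node: s.cores for s in slots}
    window = {s.node: s.minutes_free for s in slots}
    placements = []
    # Shortest jobs first, since they are most likely to fit a free window.
    for job in sorted(jobs, key=lambda j: j.est_minutes):
        if len(placements) >= max_per_cycle:
            break
        for node in remaining:
            fits_cores = remaining[node] >= job.cores
            fits_time = window[node] >= job.est_minutes
            if fits_cores and fits_time:
                remaining[node] -= job.cores
                placements.append((job.job_id, node))
                break
    return placements


if __name__ == "__main__":
    slots = [FreeSlot("n1", 4, 60), FreeSlot("n2", 2, 600)]
    jobs = [IdleJob("a", 1, 30), IdleJob("b", 2, 120), IdleJob("c", 4, 50)]
    print(select_jobs_for_slurm(slots, jobs, max_per_cycle=3))
```

Jobs that fit no slot simply stay in the HTCondor queue, so the policy never delays a job that would otherwise run locally. A production policy would also need to handle preemption when a parallel job reclaims its nodes.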