9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

Research and Exploit of Resource Sharing Strategy at IHEP

10 Jul 2018, 16:00
1h

National Culture Palace, Boulevard "Bulgaria", 1463 NDK, Sofia, Bulgaria
Poster Track 3 – Distributed computing Posters

Speakers

Xiaowei Jiang (IHEP(中国科学院高能物理研究所)), Jingyan Shi (IHEP), Jiaheng Zou (IHEP)

Description

At IHEP, computing resources are contributed by different experiments, including BES, JUNO, DYW, HXMT, etc. The resources were divided into separate partitions to meet each experiment's dedicated data processing requirements. IHEP had a local Torque/Maui cluster with 50 queues serving more than 10 experiments. The separate resource partitions led to an imbalanced resource load. Sometimes the BES partition was fully busy, with no free slots and many jobs waiting in the queue, while the JUNO resources stayed idle for a long time; at other times the situation was reversed.
After the resources were migrated from Torque/Maui to HTCondor in 2016, job scheduling efficiency improved considerably. To address the imbalanced resource load, we designed and introduced an efficient sharing strategy to improve overall resource utilization. We created a sharing pool that supports all experiments. The resources of each experiment were divided into two parts: dedicated resources and shared resources. Slots in the dedicated part run only jobs of the owning experiment, while slots in the shared part are available to jobs of all experiments. The default ratio of dedicated to shared resources is 1:4. To maximize the benefit of sharing, the ratio is adjusted dynamically between 0:5 and 4:1 based on the number of jobs from each experiment.
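The dedicated/shared split can be illustrated with a minimal Python sketch. The ratios and bounds (1:4 by default, between 0:5 and 4:1) come from the description above; the function names and the adjustment rule in `choose_dedicated_parts` are hypothetical, since the abstract does not say how the ratio is derived from the job counts.

```python
import math

TOTAL_PARTS = 5                        # 1:4, 0:5 and 4:1 all sum to 5 parts
DEFAULT_DEDICATED = 1                  # default ratio 1:4 (dedicated:shared)
MIN_DEDICATED, MAX_DEDICATED = 0, 4    # bounds 0:5 and 4:1


def split_slots(total_slots, dedicated_parts=DEFAULT_DEDICATED):
    """Return (dedicated, shared) slot counts for one experiment."""
    if not MIN_DEDICATED <= dedicated_parts <= MAX_DEDICATED:
        raise ValueError("dedicated_parts must be between 0 and 4")
    dedicated = total_slots * dedicated_parts // TOTAL_PARTS
    return dedicated, total_slots - dedicated


def choose_dedicated_parts(own_queued_jobs, total_slots):
    """Hypothetical policy (an assumption, not from the abstract):
    reserve just enough fifths of the slots to cover the experiment's
    own queued jobs, clamped to the allowed range."""
    if total_slots <= 0:
        return MIN_DEDICATED
    wanted = math.ceil(own_queued_jobs * TOTAL_PARTS / total_slots)
    return max(MIN_DEDICATED, min(MAX_DEDICATED, wanted))
```

Under this assumed policy, an experiment with no queued jobs gives all of its slots to the shared part (0:5), while a heavily loaded experiment keeps at most four fifths for itself (4:1).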
We have developed a central control system to allocate resources to each experiment group. The system consists of two parts: a server side and a client side. A management database on the server side stores resource, group and experiment information. Whenever the sharing ratio needs to be adjusted, the resource groups are changed and the update is written to the database. The resource group information is published to the server buffer in real time. The client periodically pulls the resource group information from the server buffer over HTTPS, and the resource scheduling conditions on the client side are changed according to this dynamic information. Through this process, the sharing ratio can be regulated dynamically.
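The client-side pull loop might look roughly like the following Python sketch. It is not the IHEP implementation: the JSON payload layout (a `nodes` object mapping worker node names to group lists), the function names and the polling interval are all assumptions made for illustration. Only the overall pattern, a periodic HTTPS pull followed by a local update, is taken from the description.

```python
import json
import time
import urllib.request


def parse_resource_groups(payload):
    """Map each worker node to the experiment groups it may serve.
    Assumed payload: {"nodes": {"<node>": ["<group>", ...]}}."""
    data = json.loads(payload)
    return {node: sorted(set(groups)) for node, groups in data["nodes"].items()}


def fetch_resource_groups(url, timeout=10.0):
    """Pull the current resource group information from the server buffer."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_resource_groups(response.read().decode("utf-8"))


def poll(url, node, apply_groups, interval=300.0, rounds=None):
    """Periodically pull the groups for `node` and call `apply_groups`
    whenever they change. `rounds=None` means poll forever."""
    last_groups = None
    done = 0
    while rounds is None or done < rounds:
        try:
            groups = fetch_resource_groups(url).get(node, [])
        except (OSError, ValueError, KeyError) as exc:
            # Keep the last known configuration if the server is unreachable
            # or the payload is malformed.
            print(f"pull failed, keeping previous groups: {exc}")
        else:
            if groups != last_groups:
                apply_groups(groups)
                last_groups = groups
        done += 1
        time.sleep(interval)
```

Keeping the previous groups when a pull fails is a design choice of this sketch: a worker node continues scheduling with its last known configuration rather than dropping out of the pool.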
We have implemented the resource sharing strategy by combining the central control system with HTCondor. The ClassAd mechanism and the accounting groups provided by HTCondor make it straightforward to apply the sharing strategy on the IHEP computing cluster. With the sharing strategy, overall resource utilization of the IHEP computing cluster has increased from about 50% to more than 90%. The total wall time was 40,645,124 CPU hours in 2016 without the sharing strategy and 73,341,585 CPU hours in 2017 with it, an increase of 80.44%. The results indicate that the sharing strategy is efficient and benefits experiment data processing as a whole.
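One way the resource group information could be turned into an HTCondor scheduling condition is sketched below in Python. The abstract does not give the actual expressions used at IHEP, so this is an assumption: it builds a `START` configuration line that admits only jobs whose `AcctGroup` job ClassAd attribute is in the allowed list, using the ClassAd function `stringListMember`. The group names and the helper name `start_expression` are hypothetical.

```python
def start_expression(allowed_groups):
    """Build a START line that admits only jobs from the given
    accounting groups (illustrative; group names are hypothetical)."""
    names = sorted(set(allowed_groups))
    for name in names:
        # Reject anything that could break out of the quoted string list.
        if not name.replace("_", "").replace(".", "").isalnum():
            raise ValueError(f"unexpected group name: {name!r}")
    if not names:
        # No group assigned: the slot accepts no jobs.
        return "START = False"
    return 'START = stringListMember(TARGET.AcctGroup, "%s")' % ",".join(names)
```

For a dedicated slot the list would hold only the owning experiment's group, for example `start_expression(["bes"])`, while a shared slot would list the groups of all experiments.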

Primary authors

Xiaowei Jiang (IHEP(中国科学院高能物理研究所)), Jingyan Shi (IHEP), Jiaheng Zou (IHEP), Qingbao Hu (IHEP), Ran Du, Mr Zhenyu Sun
