Nov 4 – 8, 2019
Adelaide Convention Centre
Australia/Adelaide timezone

An automatic solution to make HTCondor more stable and easier

Nov 7, 2019, 3:30 PM
1h
Hall F (Adelaide Convention Centre)

Poster Track 7 – Facilities, Clouds and Containers Posters

Speaker

Mr Xiaowei Jiang (IHEP)

Description

HTCondor, with its high scheduling performance, has been widely adopted for HEP clusters. Unlike other schedulers, HTCondor provides only loose management functions for worker nodes. We developed a Maintenance Automation Tool, abbreviated as "HTCondor MAT", which focuses on dynamic resource management and automatic error handling.
A central database records the attributes of all computing resources and the requirements of each experiment. MAT configures each worker node automatically. Worker node status is collected and analyzed in real time. If the analysis shows that an error has occurred, the worker node is reconfigured at once to limit the impact on jobs. A smart wrapper script deployed on each worker node monitors every job while it runs.
HTCondor MAT has been deployed on the IHEP HTC cluster, which has more than 14,000 CPU cores. It reduces the routine maintenance work of administrators, and anomalies that occur in the cluster can be cleared automatically and promptly.
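
The check-and-reconfigure step described above can be illustrated with a minimal Python sketch. It is not the MAT implementation: the names NodeStatus, REQUIREMENTS, allowed_experiments and reconfigure, the check names, and the shape of the resulting configuration are all hypothetical, and the sketch makes no HTCondor calls. It only assumes that the central database maps each experiment to the health checks it needs, and that a node should accept jobs only for experiments whose checks all pass.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical snapshot of the health checks collected from one worker node."""
    name: str
    # Maps a check name (e.g. a file system an experiment needs) to pass/fail.
    checks: dict = field(default_factory=dict)


# Hypothetical stand-in for the central database:
# the checks each experiment requires on a node.
REQUIREMENTS = {
    "exp_a": {"fs_a", "cvmfs"},
    "exp_b": {"fs_b", "cvmfs"},
}


def allowed_experiments(status, requirements=REQUIREMENTS):
    """Return the experiments whose required checks were all reported and all passed."""
    reported = set(status.checks)
    failed = {check for check, ok in status.checks.items() if not ok}
    return sorted(
        exp
        for exp, needed in requirements.items()
        if needed <= reported and not (needed & failed)
    )


def reconfigure(status, requirements=REQUIREMENTS):
    """Build a new node configuration as a plain dict (assumed format)."""
    allowed = allowed_experiments(status, requirements)
    return {
        "node": status.name,
        "accept_jobs": bool(allowed),  # stop accepting jobs if nothing is usable
        "experiments": allowed,
    }
```

In this sketch a node whose "fs_b" check fails keeps accepting "exp_a" jobs, while a failure of the shared "cvmfs" check stops the node from accepting any jobs. Restricting a node per experiment rather than draining it entirely is a design assumption of the example, not a statement about how MAT behaves.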

Primary authors

Dr Jingyan Shi (IHEP), Dr Jiaheng Zou (IHEP), Mr Qingbao Hu (IHEP), Mr Xiaowei Jiang (IHEP)