HTCondor, with its high scheduling performance, has been widely adopted for HEP clusters. Unlike other schedulers, HTCondor provides only loose management functions for worker nodes. We developed a Maintenance Automation Tool, abbreviated "HTCondor MAT", which focuses on dynamic resource management and automatic error handling.
A central database records the attributes of all computing resources and the requirements of each experiment. MAT configures each worker node automatically. Worker node status is collected and analyzed in real time. If the analysis shows that an error has occurred, the worker node is reconfigured at once to avoid adverse effects on jobs. A smart wrapper script deployed on each worker node monitors every job throughout its run.
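The analyze-and-reconfigure step described above can be illustrated with a minimal sketch. This is not the MAT implementation: the names `NodeStatus`, `plan_reconfiguration`, and `analyze`, the check names, the node names, and the rule that a node with any failed check stops accepting jobs are all assumptions made for illustration, and a plain dictionary stands in for the central database.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one worker node (not the MAT schema)."""
    name: str
    # Health check name -> whether the check passed.
    checks: dict = field(default_factory=dict)
    # Experiments whose jobs the node currently accepts.
    accepts: set = field(default_factory=set)


def plan_reconfiguration(status, desired_accepts):
    """Return the experiments the node should accept after analysis.

    Assumed rule: a node with any failed check accepts nothing;
    a healthy node returns to the centrally recorded configuration.
    """
    if not all(status.checks.values()):
        return set()
    return set(desired_accepts)


def analyze(statuses, database):
    """Map node name -> new accept set, only for nodes that must change."""
    changes = {}
    for st in statuses:
        target = plan_reconfiguration(st, database.get(st.name, set()))
        if target != st.accepts:
            changes[st.name] = target
    return changes


# A dictionary stands in for the central database (illustrative data only).
database = {"wn001": {"expA", "expB"}, "wn002": {"expA"}}
statuses = [
    NodeStatus("wn001", {"disk": True, "mount": True}, {"expA", "expB"}),
    NodeStatus("wn002", {"disk": True, "mount": False}, {"expA"}),
]
changes = analyze(statuses, database)
print(changes)
```

In this sketch only `wn002` appears in `changes`, because its failed `mount` check means it should stop accepting jobs, while the healthy `wn001` already matches the recorded configuration. Applying the change to a real worker node is outside the scope of the sketch.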
HTCondor MAT has been deployed on the IHEP HTC cluster, which has more than 14,000 CPU cores. It reduces administrators' routine maintenance work, and anomalous worker nodes can be evicted from the cluster automatically and in good time.