Speaker
Ms
Bowen Kan
(Institute of High Energy Physics, Chinese Academy of Sciences)
Description
The scheduler is one of the most important components of a high-performance cluster. This paper introduces a self-adaptive dispatching system (SAPS) based on Torque/Maui, which effectively increases the resource utilization of the cluster and guarantees high reliability of the computing platform. It also makes it much more convenient for users to run various tasks on the computing platform. First, SAPS implements GPU scheduling together with multi-core scheduling. This provides the basis for effective integration and utilization of computing resources and greatly improves the computing capacity of the cluster. Second, SAPS analyzes the relationship between the number of queued jobs and the idle resources remaining, and dynamically tunes the priority of users’ jobs. In this way, more resources are allocated to running jobs and fewer resources sit idle. Third, by integrating online error detection on the worker nodes, SAPS can automatically exclude faulty nodes and re-include recovered nodes.
In addition, SAPS provides fine-grained monitoring and management, a comprehensive scheduling accounting module, and a real-time scheduling alarm function, all of which ensure that the cluster runs more efficiently and reliably.
Currently, SAPS has been running stably on the IHEP local cluster (more than 10,000 cores and 30,000 jobs per day). Resource utilization has improved by more than 26%, and SAPS has greatly reduced costs for both administrators and users.
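As an illustration only, the dynamic priority tuning and automatic node exclusion described above might be sketched as follows. This is not the SAPS implementation: the names `QueueState`, `tuned_priorities`, and `schedulable_nodes`, the boost rule, and the parameters `boost_per_job` and `max_boost` are all assumptions made for this example, and it does not call Torque or Maui.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one user or group queue."""
    name: str
    queued_jobs: int    # jobs currently waiting
    base_priority: int  # static priority configured by the administrator


def tuned_priorities(queues, idle_cores, boost_per_job=1, max_boost=100):
    """Return {queue name: priority} using an assumed tuning rule.

    A queue with waiting jobs gets a boost while idle cores exist, so that
    idle resources are filled. The boost is proportional to the number of
    jobs that could actually start and is capped at max_boost.
    """
    priorities = {}
    for q in queues:
        if idle_cores > 0 and q.queued_jobs > 0:
            startable = min(q.queued_jobs, idle_cores)
            boost = min(max_boost, boost_per_job * startable)
        else:
            boost = 0
        priorities[q.name] = q.base_priority + boost
    return priorities


def schedulable_nodes(all_nodes, health):
    """Keep only nodes whose latest health check passed.

    `health` maps node name to True (healthy) or False (faulty). Nodes with
    no report are treated as faulty; a recovered node is included again as
    soon as its entry turns True.
    """
    return [node for node in all_nodes if health.get(node, False)]
```

For example, with 20 idle cores, a queue holding 50 waiting jobs at base priority 10 would be raised to 30 under this rule, while an empty queue keeps its base priority. A production scheduler would apply the result through its own configuration or administrative interface.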
Author
Ms
Bowen Kan
(Institute of High Energy Physics, Chinese Academy of Sciences)
Co-author
Jingyan Shi
(IHEP)