10–14 Oct 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Design and Implementation of Elastic Computing Resource Management System with HTCondor on Openstack

12 Oct 2016, 11:45
15m
Sierra C (San Francisco Marriott Marquis)

Oral Track 7: Middleware, Monitoring and Accounting

Speaker

qiulan huang (Institute of High Energy Physics, Beijing)

Description

As a new approach to resource management, virtualization technology is increasingly applied in high-energy physics. A virtual computing cluster based on OpenStack was built at IHEP, with HTCondor as the job queue management system. An accounting system that records the resource usage of different experiment groups in detail was also developed. There are two types of virtual computing cluster: static and dynamic. In a traditional static cluster, a fixed number of virtual machines is pre-allocated to the job queue of each experiment. However, this approach increasingly fails to meet the peak requirements of the different experiments. To solve this problem, we designed and implemented an elastic computing resource management system with HTCondor on OpenStack.
This system performs unified management of virtual computing nodes on the basis of the job queues in HTCondor. It consists of four loosely coupled components: job status monitoring, computing node management, load balancing, and the daemon. The job status monitoring component communicates with HTCondor to get the current status of each job queue and each computing node of a specific experiment. The computing node management component communicates with OpenStack to launch or destroy virtual machines. After a VM is created, it is added to the resource pool of the corresponding experiment group, and jobs then run on it. After the job finishes, the virtual machine is shut down. When the VM is shut down in OpenStack, it is removed from the resource pool. The computing node management component also provides an interface to query virtual resource usage. The load balancing component provides an interface to get information about the virtual resources available to each experiment. The daemon component asks the load balancing component how many virtual resources are available. It also communicates with the job status monitoring component to get the number of queued jobs. Finally, it calls the computing node management component to launch or destroy virtual computing nodes accordingly.
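The interaction between the daemon and the other three components can be illustrated with a minimal sketch. It is not the IHEP implementation: the names QueueStatus, plan_scaling, FakeNodeManager and daemon_step are hypothetical, the node manager is an in-memory stand-in rather than a real OpenStack client, and the policy assumes one job per virtual machine, which the abstract does not state.

```python
from dataclasses import dataclass, field


@dataclass
class QueueStatus:
    """Snapshot a job status monitor might report for one experiment (hypothetical)."""
    queued_jobs: int  # jobs waiting in the experiment's queue
    idle_nodes: int   # virtual nodes in the pool with no job running


def plan_scaling(status: QueueStatus, available_slots: int) -> int:
    """Return how many VMs to launch (positive) or destroy (negative).

    Assumes one job per VM, a simplification made for this sketch.
    """
    shortfall = status.queued_jobs - status.idle_nodes
    if shortfall > 0:
        # More waiting jobs than idle nodes: grow, capped by what the
        # load balancing component reports as available.
        return min(shortfall, max(available_slots, 0))
    # Idle nodes match or exceed waiting jobs: shrink by the surplus (or do nothing).
    return shortfall


@dataclass
class FakeNodeManager:
    """In-memory stand-in for the computing node management component."""
    pool: list = field(default_factory=list)
    next_id: int = 0

    def launch(self, count: int) -> None:
        for _ in range(count):
            self.pool.append(f"vm-{self.next_id}")
            self.next_id += 1

    def destroy(self, count: int) -> None:
        count = min(count, len(self.pool))
        del self.pool[len(self.pool) - count:]


def daemon_step(status: QueueStatus, available_slots: int, nodes: FakeNodeManager) -> int:
    """One pass of the daemon: decide, then ask the node manager to act."""
    delta = plan_scaling(status, available_slots)
    if delta > 0:
        nodes.launch(delta)
    elif delta < 0:
        nodes.destroy(-delta)
    return delta
```

In this sketch the decision function is kept separate from the component that acts on it, mirroring the loosely coupled design described above: the policy can be checked without contacting HTCondor or OpenStack.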
This paper will present several use cases from the LHAASO and JUNO experiments. The results show that virtual computing resources dynamically expand or shrink as computing requirements change. Additionally, the CPU utilization of computing resources is significantly higher than with traditional resource management. The system also performs well when there are multiple HTCondor schedulers and multiple job queues, and it is stable and easy to maintain.

Primary Keyword: Cloud technologies
Secondary Keyword: Virtualization

Author

Co-authors

Gang CHEN (INSTITUTE OF HIGH ENERGY PHYSICS)
Yaodong Cheng (IHEP)
qiulan huang (Institute of High Energy Physics, Beijing)