Multi-VO support based on DIRAC has been set up to provide workload and data management for several high energy physics experiments at IHEP. The distributed computing platform comprises 19 heterogeneous sites, including cluster, grid and cloud resources, which belong to different Virtual Organizations (VOs). Because of this scale and heterogeneity, monitoring and managing these resources manually is complicated. Moreover, experts with deep knowledge of the underlying system are scarce. An easy-to-use system that monitors and manages these resources accurately and effectively is therefore required. The system should take all the heterogeneous resources into account and be suitable for multi-VO use. Adopting ideas from the Resource Status System (RSS) of DIRAC, this paper presents the design and implementation of a resource monitoring and automatic management system.
The system is composed of three parts: information collection, status decision and automatic control, and information display. Information collection gathers status from different sources, both actively and passively, and stores it in databases. For passive collection, the system periodically retrieves information such as storage occupancy and user job efficiency from third-party systems. For active collection, periodic testing services have been designed and developed that send standard jobs to all sites to determine their availability and status. These tests are defined and classified for each VO according to its specific requirements. The status decision and automatic control part evaluates the status of resources and takes control actions on them automatically. Policies are pre-defined as rules for judging the status in different situations. By combining the collected information with these policies, the system reaches a decision and automatically takes the appropriate actions, such as sending out alarms and applying controls. A web portal has been designed to display both monitoring and control information. A summary page gives a quick view of the status of all sites, and detailed information can be obtained by drilling down from this top-level view. Besides the real-time information, historical information is also recorded and displayed to give a global view of resource status over a given period of time. All implementations are based on the DIRAC framework. The information and controls for different VOs, including sites, policies and web portal views, can be defined and distinguished within the DIRAC user and group management infrastructure.
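The policy-based status decision described above can be illustrated with a minimal Python sketch. It is not the DIRAC RSS API or the implementation presented in this paper: the status names, field names, thresholds, VO labels (`vo_a`, `vo_b`) and action strings are all hypothetical and chosen only to show how collected information and per-VO policies might combine into a decision and an action.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Status(IntEnum):
    # Hypothetical status levels, ordered so the worst verdict is the largest.
    ACTIVE = 0
    DEGRADED = 1
    BANNED = 2


@dataclass
class SiteInfo:
    """Collected information for one site (all field names are assumptions)."""
    name: str
    vo: str
    test_success_rate: float  # active collection: fraction of standard test jobs passed
    storage_occupancy: float  # passive collection: fraction of storage in use
    job_efficiency: float     # passive collection: user job efficiency


Policy = Callable[[SiteInfo], Status]


def job_test_policy(info: SiteInfo) -> Status:
    # Thresholds are illustrative, not values taken from the paper.
    if info.test_success_rate < 0.5:
        return Status.BANNED
    if info.test_success_rate < 0.9:
        return Status.DEGRADED
    return Status.ACTIVE


def storage_policy(info: SiteInfo) -> Status:
    return Status.DEGRADED if info.storage_occupancy > 0.95 else Status.ACTIVE


def efficiency_policy(info: SiteInfo) -> Status:
    return Status.DEGRADED if info.job_efficiency < 0.3 else Status.ACTIVE


# Each VO selects the policies that match its own requirements.
POLICIES_BY_VO: Dict[str, List[Policy]] = {
    "vo_a": [job_test_policy, storage_policy],
    "vo_b": [job_test_policy, efficiency_policy],
}

# Hypothetical mapping from a decided status to an automatic action.
ACTIONS: Dict[Status, str] = {
    Status.ACTIVE: "none",
    Status.DEGRADED: "alarm",
    Status.BANNED: "ban+alarm",
}


def decide(info: SiteInfo) -> Status:
    """Evaluate every policy of the site's VO and keep the worst verdict."""
    policies = POLICIES_BY_VO.get(info.vo, [job_test_policy])
    return max(policy(info) for policy in policies)


def action_for(status: Status) -> str:
    return ACTIONS[status]
```

For example, `decide(SiteInfo("SITE.A", "vo_a", 0.95, 0.97, 0.8))` evaluates the test-job and storage policies of `vo_a` and keeps the worse of the two verdicts. Taking the maximum is only one possible way to combine policies; a real deployment could weight them or require several consecutive failures before acting.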
|Primary Keyword (Mandatory)||Monitoring|