Speaker
Bogdan Lobodzinski
(DESY, Hamburg, Germany)
Description
Small Virtual Organizations (VOs) employ all components of the EMI or gLite middleware. Within this framework, a monitoring system has been designed for the H1 experiment to identify the GRID resources most suitable for executing CPU-intensive Monte Carlo (MC) simulation tasks (jobs). The monitored resources are Computing Elements (CEs), Storage Elements (SEs), WMS servers (WMSs) and the CernVM File System (CVMFS) available to the VO "hone", as well as local GRID User Interfaces (UIs).
The monitoring of the GRID elements is based on the execution of short test jobs on different CE queues, submitted both through various WMSs and directly to the CREAM-CEs. Real H1 MC production jobs with a small number of events are used as tests. Test jobs are periodically submitted to GRID queues and their status is checked; the output files of completed jobs are retrieved, the result of each job is analyzed, and the waiting time and run time are derived. Using this information, the status of the GRID elements is estimated and the most suitable ones are included in automatically generated configuration files for use in the H1 MC production. The monitored information is stored in a MySQL database and is presented in detail on web pages and in the MonALISA visualisation system.
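The derivation of waiting and run times from test jobs, and the selection of suitable queues, might be sketched as follows. This is only an illustration under stated assumptions, not the H1 implementation: the `ProbeJob` record, its field names, the `max_wait` threshold and the ranking by total turnaround time are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ProbeJob:
    """One test job (hypothetical record layout)."""
    queue: str                  # CE queue identifier
    submitted: float            # submission time, epoch seconds
    started: Optional[float]    # None if the job never started
    finished: Optional[float]   # None if the job never finished
    ok: bool                    # True if the job output passed the analysis


def queue_timings(jobs):
    """Return {queue: (mean waiting time, mean run time)} from successful jobs."""
    per_queue = {}
    for j in jobs:
        if j.ok and j.started is not None and j.finished is not None:
            wait = j.started - j.submitted
            run = j.finished - j.started
            per_queue.setdefault(j.queue, []).append((wait, run))
    return {
        q: (mean(w for w, _ in t), mean(r for _, r in t))
        for q, t in per_queue.items()
    }


def rank_queues(jobs, max_wait):
    """List queues whose mean wait is within max_wait, fastest turnaround first."""
    timings = queue_timings(jobs)
    usable = [(w + r, q) for q, (w, r) in timings.items() if w <= max_wait]
    return [q for _, q in sorted(usable)]
```

The returned list could then be written into a generated configuration file; queues with no successful test job never appear in it.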
The monitoring system allows problems at GRID sites to be identified and reacted to promptly (for example, by sending GGUS trouble tickets). The system can easily be adapted to identify the optimal resources for tasks other than MC production, simply by switching to the relevant test jobs. The monitoring system is written mostly in Python and Perl, with a few shell scripts.
In addition to the test monitoring system, we use information from real production jobs to monitor the availability and quality of the GRID resources. The monitoring tools record the number of job resubmissions and the percentages of failed and finished jobs relative to all jobs on the CEs, and determine the average waiting and running times for the GRID queues involved. CEs that do not meet the set criteria can be removed from the production chain by including them in an exception table. Together, these monitoring actions lead to more reliable and faster execution of MC requests.
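As a rough illustration of how such criteria could populate an exception table, the sketch below flags CEs from production job records. The record layout `(ce, status, resubmissions)`, the status strings and both thresholds are assumptions made for this example; the abstract does not state the actual criteria.

```python
from collections import Counter


def build_exception_table(job_records, max_failed_fraction=0.2,
                          max_mean_resubmissions=3.0):
    """Return the set of CEs to exclude from production.

    job_records: iterable of (ce, status, resubmissions) tuples, where
    status is "failed" or "finished" (hypothetical layout).
    """
    total = Counter()
    failed = Counter()
    resubmissions = Counter()
    for ce, status, n_resub in job_records:
        total[ce] += 1
        resubmissions[ce] += n_resub
        if status == "failed":
            failed[ce] += 1

    excluded = set()
    for ce, n in total.items():
        # A CE is excluded if either criterion is violated.
        if (failed[ce] / n > max_failed_fraction
                or resubmissions[ce] / n > max_mean_resubmissions):
            excluded.add(ce)
    return excluded
```

A production submitter could consult the returned set before choosing a CE, so that a flagged site is skipped until its records improve.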
Primary authors
Alexander Fomenko
(Lebedev Institute, Moscow, Russia)
Bogdan Lobodzinski
(DESY, Hamburg, Germany)
Lena Bystritskaya
(ITEP Moscow, Russia)
Nelly Gogitidze
(Lebedev Institute, Moscow, Russia)