Monitoring CREAM Jobs in RTM via L&B
Presented by Mr. Aleš KŘENEK on 12 Apr 2010 from 18:15 to 18:18
Session: Poster session
Track: Software services exploiting and/or extending grid middleware (gLite, ARC, UNICORE etc)
The Real Time Monitor (RTM) is a high-level monitoring tool which aggregates information on grid jobs and presents it in a suitable form. The success of this tool depends on the accuracy of the information it is able to receive from lower layers. However, with the recent increase in the number of jobs submitted directly to Computing Elements (CE), the fraction of jobs seen by the RTM is decreasing. The gLite Logging and Bookkeeping service (L&B), with its recent extension to CREAM jobs, provides the necessary level of abstraction and the glue between different job flavours, allowing high-level tools to see the full range of grid jobs.
Recently we introduced extensions to the CREAM Computing Element and the L&B service which unify the view of jobs executed by CREAM regardless of their submission path (via gLite WMS or directly to CREAM). CREAM is able to recognise direct submission; in this case the incoming job is first registered with L&B. In both scenarios CREAM logs events on the progress of job execution, as well as possible failures, to L&B. L&B processes this information into a view of the overall job state which is consistent between WMS and CREAM-only jobs. On the RTM side, the tool was modified to receive notifications of job state changes from L&B, rather than extracting job state information from raw data in the L&B database as before. Besides improving overall reliability, this binding works at the level of the L&B job state, which is already common to both WMS and CREAM-only jobs. Therefore the RTM need not make any further distinction between job types.
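The flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the L&B or RTM code: the class ToyBookkeeping, the event names and the reduced set of job states are all hypothetical and chosen for illustration. It shows events being folded into one job state per job, and a subscriber (standing in for the RTM) receiving the same notifications whether the job was registered by the WMS or by CREAM on direct submission.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical event names mapped to a simplified subset of job states.
# The real L&B state machine is richer; this table is an assumption.
EVENT_TO_STATE: Dict[str, str] = {
    "registered": "SUBMITTED",
    "transferred_to_ce": "SCHEDULED",
    "running": "RUNNING",
    "done_ok": "DONE",
    "done_failed": "ABORTED",
}


@dataclass
class JobRecord:
    job_id: str
    registered_by: str  # "WMS", or "CREAM" for direct submission
    state: str = "UNKNOWN"


class ToyBookkeeping:
    """Toy stand-in for a bookkeeping service (hypothetical, not the L&B API)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRecord] = {}
        self._subscribers: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        # A monitor registers once and is then told about every state change.
        self._subscribers.append(callback)

    def register(self, job_id: str, registered_by: str) -> None:
        # Registration is the first event of every job, whoever performs it.
        self._jobs[job_id] = JobRecord(job_id, registered_by)
        self.log_event(job_id, "registered")

    def log_event(self, job_id: str, event: str) -> None:
        job = self._jobs[job_id]
        new_state = EVENT_TO_STATE[event]
        if new_state == job.state:
            return  # no state change, so no notification
        job.state = new_state
        for notify in self._subscribers:
            notify(job_id, new_state)


def demo() -> Dict[str, List[str]]:
    """Run one WMS job and one direct job through the same event sequence."""
    seen: Dict[str, List[str]] = {}
    lb = ToyBookkeeping()
    # The monitor only records (job, state) pairs; it never sees the job type.
    lb.subscribe(lambda job_id, state: seen.setdefault(job_id, []).append(state))
    lb.register("job-via-wms", registered_by="WMS")
    lb.register("job-direct", registered_by="CREAM")
    for event in ("transferred_to_ce", "running", "done_ok"):
        lb.log_event("job-via-wms", event)
        lb.log_event("job-direct", event)
    return seen
```

Calling demo() returns, for each job identifier, the list of states the subscriber was notified about. In this toy model the two lists are identical, which mirrors the point that a monitor bound to the job state does not need to know how a job was submitted.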
Some grid users prefer to use their own workload management systems (e.g. ATLAS PanDA), bypassing gLite WMS and submitting jobs directly to CEs. The amount of workload distributed to grid sites in this way is not negligible. Our work, by unifying the view of all grid jobs (going through WMS or directly to a CE) at the level of L&B, enables uniform monitoring of all grid jobs with high-level tools like the RTM. Consequently, the real-time view of the grid state is considerably improved. An additional benefit is the extended time span of CE job data (CREAM purges them soon after job completion), which enables better post-mortem analysis of problems. The RTM is to date the only system which has access to distributed L&B servers worldwide. This makes it an important tool not only for dissemination purposes and individual users, but also for large experimental communities, to which it delivers a single monitoring entry point. Data collected by the RTM may also be analysed off-line, providing an opportunity to study the performance of the grid in greater detail. Adding direct submissions to the RTM monitoring system will make this tool more attractive to communities that do not use WMS resources in their work.
The described work is mostly integration. With relatively small effort we were able to put together recently developed pieces of code and gain considerable additional benefits. Besides the practical result of a more accurate RTM view of the EGEE grid, this improves the overall architecture, with L&B acting as the monitoring glue between the various job types and the widely exposed high-level tools. In the near future we will therefore concentrate on hardening the existing prototype towards production quality and on its wide-scale deployment on the infrastructure.
CREAM, L&B, RTM