Sep 2 – 9, 2007
Victoria, Canada
Europe/Zurich timezone
Please book accommodation as soon as possible.

glideinWMS - A generic pilot-based Workload Management System

Sep 4, 2007, 11:40 AM
20m
Carson Hall C (Victoria, Canada)

oral presentation Grid middleware and tools

Speaker

Mr Igor Sfiligoi (FNAL)

Description

Grids are making it possible for Virtual Organizations (VOs) to run hundreds of thousands of jobs per day. However, the resources are distributed among hundreds of independent Grid sites, so a higher-level Workload Management System (WMS) is necessary.

glideinWMS is a pilot-based WMS and inherits several useful features of that approach:

1) Late binding: Pilots are sent to all suitable Grid sites. Real jobs are selected for a resource only once a pilot has started on it, so no forecasting is needed.
2) Reliability: A broken Grid site will either kill the pilot jobs, or the pilots will detect the problem at startup. Real jobs only start on well-behaved resources.
3) Grid-wide fair share: The relative priorities between jobs of the same VO are set inside the WMS. Grid sites only manage priorities between different VOs.

glideinWMS is based on the Condor glidein concept, i.e. a regular Condor pool in which the Condor daemons (startd) are started by pilot jobs. The real jobs are vanilla, standard or MPI universe jobs.

glideinWMS is composed of Glidein Factories and VO Frontends, which communicate using Condor ClassAds:

* Factories publish the available Grid sites.
* Frontends match the Grid attributes to job attributes and publish requests for a stream of glideins to suitable Grid sites.
* Factories pick up the requests and submit the glideins.

A detailed description of the system will be presented, along with the systems currently deployed inside the USCMS production and user analysis frameworks. Integration with frameworks of other VOs will also be presented, as well as the measured scalability limits.
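
The Factory/Frontend exchange and the late-binding step described above can be illustrated with a small Python sketch. It is not glideinWMS or Condor code: plain dictionaries stand in for ClassAds, and every name in it (the site names, the `arch` and memory attributes, and the functions `site_matches`, `frontend_requests`, `factory_submit` and `pilot_pick_job`) is a hypothetical choice made for illustration only.

```python
from collections import Counter

# Hypothetical site ads, as a Factory might publish them.
# Plain dicts stand in for Condor ClassAds; attribute names are assumptions.
site_ads = [
    {"site": "SiteA", "arch": "x86_64", "max_memory_mb": 2048},
    {"site": "SiteB", "arch": "x86_64", "max_memory_mb": 1024},
    {"site": "SiteC", "arch": "i686", "max_memory_mb": 2048},
]

# Hypothetical idle jobs waiting in the VO's queue.
idle_jobs = [
    {"id": 1, "arch": "x86_64", "memory_mb": 1500},
    {"id": 2, "arch": "x86_64", "memory_mb": 512},
    {"id": 3, "arch": "i686", "memory_mb": 1024},
]


def site_matches(job, site):
    """Assumed matching rule: same architecture and enough memory."""
    return job["arch"] == site["arch"] and job["memory_mb"] <= site["max_memory_mb"]


def frontend_requests(jobs, sites):
    """Frontend step: count, per site, how many idle jobs could run there."""
    wanted = Counter()
    for job in jobs:
        for site in sites:
            if site_matches(job, site):
                wanted[site["site"]] += 1
    return dict(wanted)


def factory_submit(requests):
    """Factory step: expand each request into one pilot submission per glidein."""
    return [site for site, count in sorted(requests.items()) for _ in range(count)]


def pilot_pick_job(site, queue):
    """Late binding: a pilot that has started on `site` takes the first suitable job."""
    for job in queue:
        if site_matches(job, site):
            queue.remove(job)  # safe: we return immediately after mutating
            return job
    return None


requests = frontend_requests(idle_jobs, site_ads)
pilots = factory_submit(requests)
```

In this sketch a job that fits two sites is counted at both, mirroring the idea that pilots go to all suitable sites and the real job is bound only when a pilot actually starts. A pilot that finds no suitable job simply gets `None` back and can exit.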

Primary author

Mr Igor Sfiligoi (FNAL)
