3rd EGEE User Forum

Name: 3rd EGEE User Forum
Start: 2008-02-11T13:30:00+01:00
End: 2008-02-15T18:00:00+01:00
Location: Le Polydôme, Clermont-Ferrand, FRANCE

11–14 Feb 2008

<a href="http://www.polydome.org">Le Polydôme</a>, Clermont-Ferrand, FRANCE

Europe/Zurich timezone

Support

egee-uf3@healthgrid.org

Modeling the EGEE latency to optimize job time-outs

12 Feb 2008, 11:10

20m

Champagne (<a href="http://www.polydome.org">Le Polydôme</a>, Clermont-Ferrand, FRANCE)

Oral

Existing or Prospective Grid Services

From research to production grids: interaction with the Grid'5000 initiative

Dr Diane Lingrand (UNSA)

Jobs submitted to a production grid infrastructure are subject to a variable delay resulting from the grid submission cost (middleware overhead and queuing time). The actual execution time of a job depends on the process execution time, which can be known through benchmarks, and on the variable grid latency, which is difficult to anticipate because of the complexity of the grid infrastructure and the variable load patterns it endures. We aim to estimate the grid latency through a probabilistic approach that is well suited to modeling complex systems. We derived a model of the expected execution time of a job as a function of the time-out value in a time-out and resubmission setting. To track the variable load conditions, a monitoring service regularly sends probe jobs to the infrastructure and measures their latency. These measurements are fed into the model, and a numerical minimization yields the time-out value that minimizes the expected execution time.
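The abstract does not spell out the model, so the following is only a sketch of how such an expectation can be derived under simple assumptions; the cited paper should be consulted for the authors' exact formulation. The notation (R, f_R, F_R, \rho, t_\infty) is assumed here for illustration, and successive submissions are taken to be independent with the same latency distribution. The number of timed-out attempts before the first success is then geometric, and each of them costs one full time-out.

```latex
% Assumed notation (not taken from the abstract):
%   R          latency of a job that is not an outlier, with density f_R and CDF F_R
%   \rho       outlier ratio: fraction of jobs that never return
%   t_\infty   time-out value
% Assumption: successive attempts are independent and identically distributed.
\begin{align*}
p(t_\infty) &= (1-\rho)\,F_R(t_\infty)
  && \text{probability that one attempt returns before the time-out} \\
E_J(t_\infty) &= \underbrace{\frac{1}{F_R(t_\infty)}\int_0^{t_\infty} u\,f_R(u)\,du}_{\text{latency of the successful attempt}}
  + \underbrace{t_\infty\,\frac{1-p(t_\infty)}{p(t_\infty)}}_{\text{time lost in timed-out attempts}} \\
 &= \frac{1}{F_R(t_\infty)}\int_0^{t_\infty} u\,f_R(u)\,du
  + \frac{t_\infty}{(1-\rho)\,F_R(t_\infty)} - t_\infty
\end{align*}
```

A very small time-out makes the second term blow up because almost every attempt is cancelled, while a very large one lets the heavy tail of the latency dominate the first term; the optimum lies in between.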

Provide a set of generic keywords that define your contribution (e.g. Data Management, Workflows, High Energy Physics)

Workload management, time-outing strategy

URL for further information:

http://www.i3s.unice.fr/~glatard/publis/ccgrid07.pdf

4. Conclusions / Future plans

The model was tested on the EGEE production infrastructure using thousands of probe jobs over hours of execution. A faulty-job ratio of 2.5% was measured. Recovering from these faults by time-out and resubmission protects the application from unbounded execution times. The model can be adapted to more or less reliable systems by varying the outlier ratio. Simulation of a fault-free system (e.g. a cluster) under load conditions similar to those of the grid shows that a minimum speed-up of 1.36 is achieved.

3. Impact

A study of the EGEE grid workload pattern has shown that the latency endured by jobs follows heavy-tailed distributions. Consequently, a non-negligible fraction of the jobs is likely to encounter very long latencies, which dramatically penalize multi-job applications. Setting up a time-out strategy protects the application from these faults while introducing very little overhead. The time-out estimation service is currently a prototype deployed on top of the EGEE middleware. The monitoring activity uses the workload management system to submit probe jobs and monitor their durations. The model computation is a lightweight numerical integration that can be embedded in any application. When the time-out of a job expires, the application has to cancel and resubmit it to avoid abnormal computation times. An interesting perspective would be direct access to the Resource Broker (RB) logs, which would avoid application-level probing of the infrastructure.
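To illustrate how light such a computation can be, here is a minimal Python sketch that estimates the expected latency directly from a batch of probe measurements and picks the best time-out by exhaustive search. It is not the prototype's code: the function names, the convention of recording never-returning probes as `math.inf`, and the sample values are all assumptions made for this example. The estimate used is the mean latency of the probes that returned before the time-out, plus one full time-out for each expected cancelled attempt, assuming independent resubmissions.

```python
import math


def expected_time(latencies, timeout):
    """Empirical expected latency of a job under time-out and resubmission.

    `latencies` holds probe latencies in seconds; a probe that never
    returned (an outlier) is recorded as math.inf (assumed convention).
    """
    done = [x for x in latencies if x <= timeout]
    if not done:
        # No probe returned before this time-out: the job would never finish.
        return math.inf
    failed = len(latencies) - len(done)
    # Mean latency of a successful attempt, plus `timeout` seconds for each
    # of the expected failed/len(done) cancelled attempts (geometric retries).
    return sum(done) / len(done) + timeout * failed / len(done)


def best_timeout(latencies):
    """Return the observed finite latency that minimizes expected_time."""
    candidates = sorted(x for x in latencies if math.isfinite(x))
    if not candidates:
        raise ValueError("no probe returned; cannot estimate a time-out")
    return min(candidates, key=lambda t: expected_time(latencies, t))


# Made-up probe data for illustration only: four ordinary probes,
# one very slow probe and one that never returned.
probes = [100.0, 120.0, 150.0, 200.0, 3000.0, math.inf]
t_opt = best_timeout(probes)
print(t_opt, expected_time(probes, t_opt))
```

Only observed latencies are tried as candidates because the empirical estimate can only reach its minimum at one of them. A production service would more likely fit a smooth heavy-tailed distribution to the probes and refresh the estimate as new measurements arrive.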

1. Short overview

Applications submitting a large number of jobs to the grid infrastructure have to consider and recover from faulty jobs that are due to system failures or abnormally long job durations. A simple time-out and resubmission strategy protects the application from very long durations when outliers occur. However, determining the time-out value is not straightforward, especially for shorter jobs, whose execution time depends significantly on the grid workload conditions.

Dr Diane Lingrand (UNSA), Dr Johan Montagnat (CNRS), Dr Tristan Glatard (CNRS)

Slides

UF3_EGEE_Model_080212.pdf
