Description
1. Short overview
The IV Grid Plugtests took place in Beijing, China, from October 29 to November 1, 2007. Organized by ETSI and INRIA, the event included a contest on the N-Queens problem, designed to test grid technologies.
We report on our experience of running our N-Queens application efficiently across an entire computing grid, Grid'5000, by composing tools that cover every step from reservation and deployment to task scheduling.
3. Impact
To run our N-Queens application on the grid, we composed three tools: ProActive, TakTuk and Kaapi. The Plugtests organizers described the grid's architecture in a deployment descriptor file, which contains the information required to reserve and contact nodes (gateways, resource managers). ProActive was in charge of reserving all the nodes and creating a tunnel to each cluster of the grid. TakTuk then used these tunnels to connect all the nodes of all the clusters and started the Kaapi processes.
Our N-Queens application ran successfully during the Plugtests. We deployed our Kaapi processes on 1364 nodes of Grid'5000 (one process per node) in less than 3 minutes. The computation used 3654 cores (each Kaapi process creates one computation thread per core). Using this deployment during the one-hour slot, we computed all the solutions of one 23-Queens instance (35min 7s) and of six 22-Queens instances (about 2min 21s each). These results earned us first place in the contest.
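For readers unfamiliar with the problem, the following Python sketch counts N-Queens solutions with bitmask backtracking and splits the search into independent subtasks, one per queen position in the first row. It is an illustration only, not the Kaapi code used in the contest: the function names (count_from, first_row_tasks, count_solutions) and the first-row splitting strategy are assumptions made for this example.

```python
def count_from(n, row, cols, diag1, diag2):
    """Count the ways to complete a partial placement, rows `row`..n-1.

    cols, diag1 and diag2 are bitmasks of the columns attacked in the
    current row by queens already placed (vertically and diagonally).
    """
    if row == n:
        return 1
    full = (1 << n) - 1
    free = full & ~(cols | diag1 | diag2)  # columns still safe in this row
    total = 0
    while free:
        bit = free & -free                 # lowest safe column
        free ^= bit
        total += count_from(
            n,
            row + 1,
            cols | bit,
            ((diag1 | bit) << 1) & full,   # diagonal attacks move one column up
            (diag2 | bit) >> 1,            # anti-diagonal attacks move one column down
        )
    return total


def first_row_tasks(n):
    """Return one independent subtask per queen position in the first row."""
    full = (1 << n) - 1
    tasks = []
    for col in range(n):
        bit = 1 << col
        tasks.append((bit, (bit << 1) & full, bit >> 1))
    return tasks


def count_solutions(n):
    """Sum the subtask results; each subtask could run on a different core."""
    return sum(count_from(n, 1, *task) for task in first_row_tasks(n))


if __name__ == "__main__":
    print(count_solutions(8))
```

Because the subtasks share no state, a scheduler can hand them to idle cores in any order. A real run at the scale described above would need much finer-grained tasks than one per first-row column.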
Provide a set of generic keywords that define your contribution (e.g. Data Management, Workflows, High Energy Physics)
Grid, Deployment, Work-stealing scheduling, Tools for the grids
4. Conclusions / Future plans
We learnt two main lessons from this experience:
- The Kaapi middleware allows us to scale up to thousands of heterogeneous cores while preserving efficiency. Ongoing work aims to improve scalability on highly heterogeneous networks.
- Fault tolerance is essential to run applications at such a scale. Many times during the contest, our application crashed because some nodes in the grid failed. Two fault tolerance protocols are currently in development for Kaapi.
URL for further information:
http://www-id.imag.fr/Laboratoire/Membres/Besseron_Xavier/IV_Grid_Plugtests/