In December 2009, over 100 sites were publishing support for MPI on the EGEE infrastructure. Some 50 of these sites pass the MPI validation tests, which probe the range of basic MPI distributions available at each site. Most sites correctly support at least one version of MPI. The introduction of critical SAM validation tests is expected to help isolate problems at the sites and contribute to the development of a "trouble-shooting" guide. Updated user documentation covering numerous use cases and recipes will help users and developers avoid unforeseen issues and, in turn, get the best out of the infrastructure. Similarly, much-needed updates to the MPI/gLite-related middleware will accommodate the diverse system setups currently supported for non-MPI job execution.
The MPI-TF consists of representatives from multiple EGEE-III activities, including JRA1, NA4, SA1 and SA3, with oversight from the EGEE TMB. We shall evaluate the impact of the MPI-TF by measuring improvements in the number of sites actively supporting MPI and in the number of sites successfully passing the MPI SAM tests, and by seeking feedback from the user community through their virtual organisations.
Conclusions and Future Work
The MPI-WG delivered a set of recommendations to improve MPI support on EGEE based on user and site administrator feedback. The MPI-TF was established to deliver many of the short-term improvements that could be dealt with within the lifetime of EGEE-III. However, some issues cannot easily be resolved within this timeframe. The first is full interoperability with other grid infrastructures and middleware such as ARC and UNICORE. The second is the emergence of multi-core and hybrid systems, which needs to be evaluated and requires a generic "Parallel" job type. These will be examined in the EGI era.
The reported high failure rates for MPI jobs on the EGEE grid are a source of frustration for many of its users. At the EGEE09 conference, the Earth Science community reported that only 7 of the 26 sites supporting both MPI and its virtual organisation ran their applications without errors. Computational Chemistry reported that only 53% of jobs ran successfully at the CompChem-enabled sites when requiring 8 processors, and that this dropped to 21% when requesting 16 to 64 processors. Similar issues were reported by the Astronomy and Astrophysics community.
The site administrators reported on problems associated with installing and configuring MPI at their institutes.
This was backed up by a report from the MPI Working Group (MPI-WG), which gathered extensive feedback from a large sample of users and site administrators.
The provision of a single MPI solution for gLite is non-trivial. This is in part due to a lack of solid support from upstream OS providers, and in part due to the diversity of site system setups, which differ in batch systems, multi-core worker nodes, and shared versus non-shared filesystems. We present a pragmatic approach to systems support.
|URL for further information||https://twiki.cern.ch/twiki/bin/view/EGEE/MpiTools|
|Keywords||MPI, User Support|