Describe the scientific/technical community and the scientific/technical activity using (planning to use) the EGEE infrastructure. A high-level description is needed (neither a detailed specialist report nor a list of references).
The Xi Computer Architecture Lab (University of Cyprus) works on memory hierarchy optimizations, speculation techniques, reliability methods for hard errors, and mechanisms to reduce temperature problems on multi-cores. A routine activity in our group, as in most architecture research teams, is running simulation experiments to investigate the potential of the new techniques we design. EGEE provides a powerful high-throughput computing platform that matches our simulation needs.
Describe the added value of the Grid for the scientific/technical activity you (plan to) do on the Grid. This should include the scale of the activity and of the potential user community and the relevance for other scientific or business applications
The grid provides a powerful computing resource for performing design space exploration more quickly and comprehensively, which is crucial for finding good solution points in a timely manner. In general, every set of experiments in our projects may require several hundred simulations, due to a plethora of interacting parameters, with individual runs taking anywhere from several hours to a few days.
Some jobs require particular hardware configurations, such as a large memory size or a certain number of processors. With grid functionality, we can define criteria for the resources that best match our needs and then run as many jobs in parallel as there are available resources, achieving a high-throughput simulation methodology. The usually quick turnaround from submission to completion lets us determine good solution points more efficiently.
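As an illustration of defining resource criteria per job, the following is a minimal sketch, not our actual scripts. It assumes jobs are described in the JDL format used by the gLite middleware and that sites publish the Glue attributes `GlueHostMainMemoryRAMSize` and `GlueCEInfoTotalCPUs`. The script name `run_sim.sh`, the `--l2-size` argument and the threshold values are hypothetical.

```python
def make_jdl(executable, arguments, min_memory_mb, min_cpus):
    """Return the text of a JDL job description with resource requirements."""
    # Only match computing elements with enough memory per host and enough CPUs.
    requirements = (
        f"other.GlueHostMainMemoryRAMSize >= {min_memory_mb} && "
        f"other.GlueCEInfoTotalCPUs >= {min_cpus}"
    )
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f'InputSandbox = {{"{executable}"}};',
        'OutputSandbox = {"std.out", "std.err"};',
        f"Requirements = {requirements};",
    ]
    return "\n".join(lines) + "\n"


# One job description per point of a hypothetical parameter sweep.
jdls = [
    make_jdl("run_sim.sh", f"--l2-size {kb}", min_memory_mb=4096, min_cpus=64)
    for kb in (256, 512, 1024)
]
print(jdls[0])
```

Each generated description can then be written to a file and submitted independently, so the number of simulations running at once is limited only by the resources that satisfy the requirements.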
Report on the experience (or the proposed activity). It would be very important to mention key services which are essential for the success of your activity on the EGEE infrastructure.
A good interface for job submission, job management (resubmit, cancel, etc.) and result retrieval is necessary. We developed scripts for this, but a complete suite of such services would be very useful, since the grid is intended for high throughput, which assumes thousands of jobs running simultaneously for each user. Using the Storage Element (SE) to save the results has the advantage that they are not affected by failures of the user interface machine, but it requires the user to delete unused files to avoid flooding the SE. Initializing the proxy on every submission avoids unexpected proxy expiration and the resulting job failures. Specifying different requirements for different job categories can increase throughput if slow jobs go to faster but smaller clusters and fast jobs go to slower but bigger clusters (more nodes). A more detailed report on job failures would be very useful for understanding and dealing with them. At times the grid seemed very unreliable: usually only 80% of the submitted jobs finished in less than 12 hours.
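The idea of giving each job category its own requirements can be sketched as follows. This is an illustrative sketch under stated assumptions, not the scripts we used: it assumes JDL-style requirement expressions and that sites publish the Glue attributes `GlueHostBenchmarkSI00` (CPU speed) and `GlueCEInfoTotalCPUs` (cluster size). The function name and all threshold values are hypothetical.

```python
def requirements_for(expected_hours, long_job_hours=12,
                     fast_cpu_si00=2000, big_cluster_cpus=100):
    """Pick a requirement expression based on the expected run time of a job."""
    if expected_hours >= long_job_hours:
        # Slow jobs: insist on fast CPUs, even if the cluster is small.
        return f"other.GlueHostBenchmarkSI00 >= {fast_cpu_si00}"
    # Fast jobs: prefer big clusters, where many of them can run at once.
    return f"other.GlueCEInfoTotalCPUs >= {big_cluster_cpus}"


# Hypothetical job categories with their expected run times in hours.
for name, hours in [("short-trace", 3), ("full-benchmark", 40)]:
    print(name, "->", requirements_for(hours))
```

The returned expression would be placed in the `Requirements` attribute of the job description for that category.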