Description
In Grid middleware, efficient job matching ensures that tasks are assigned to the most compatible worker nodes. Matching depends on evaluating a worker node's suitability for a given task, which requires assessing the node's infrastructure characteristics. However, adjusting job matching parameters is difficult because both the central and the site services of the Grid middleware are involved. Any change requires deploying a new middleware version across the entire Grid, which can introduce bugs and raises the risk of a single point of failure.
Furthermore, the number of available job matching parameters is limited because pilot jobs perform little infrastructure monitoring, which further complicates the task for Grid middleware developers.
This paper introduces a new approach for dynamically adding and modifying job matching parameters in Grid middleware, built on the Site Sonar Grid infrastructure monitoring framework. With this solution, Grid administrators can add or modify job matching parameters without altering the core middleware code, so jobs can be matched dynamically on diverse infrastructure properties of worker nodes. By decoupling job matching parameters from the Grid middleware, the proposed approach increases flexibility and reduces the complexity and risk of introducing and changing job matching parameters.
The approach makes Grid middleware more adaptable to heterogeneous systems and supports better-optimized resource allocation.
References
https://indico.jlab.org/event/459/contributions/11495/ (Previous presentation on introduction of Site Sonar)
Significance
Currently, job matching in all the popular Grid middleware systems relies on a set of hardcoded parameters. This causes a major problem when introducing a new matching parameter or changing an existing one, because both central and site services must be updated to accommodate the change. Hence Grid administrators keep the number of matching parameters small and avoid changing them often.
Another reason is that a pilot job submitted to a site has limited capability to collect infrastructure information, so most of the information it collects is hardcoded. These parameters therefore cannot be changed dynamically when necessary.
If this approach is made more flexible, so that Grid administrators can define job matching parameters dynamically, the job matching process improves. This leads to more efficient use of the available worker nodes in a Grid and could also help reduce job failure rates.
We have previously introduced Site Sonar, a new Grid infrastructure monitoring system that can collect a wide range of information from a Grid worker node. In this project we have integrated Site Sonar with the Grid middleware so that any infrastructure information collected through Site Sonar can be used in the job matching process. Further, we have changed how central services handle job matching parameters so that the parameters can be changed dynamically. This approach can be used by any Grid middleware to improve its job matching process and achieve more optimized resource allocation.
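To make the idea of decoupled matching parameters concrete, the following is a minimal sketch in Python, not the actual Site Sonar or ALICE middleware implementation. It assumes that monitoring results for a worker node are available as a plain dictionary and that central services keep a runtime-editable registry of parameter names. All names here (`MatchingParameters`, `Constraint`, `node_matches`, and the example properties `os_name` and `cgroups_v2`) are hypothetical and chosen only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Set

# Illustrative set of comparison operators a job constraint may use.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda value, allowed: value in allowed,
}


@dataclass(frozen=True)
class Constraint:
    parameter: str  # name of a monitored node property (hypothetical)
    op: str         # key into OPERATORS
    expected: Any   # value the job requires


class MatchingParameters:
    """Hypothetical central registry of parameters usable for matching.

    Parameters are plain data, so they can be added or removed at
    runtime without changing or redeploying the matching code.
    """

    def __init__(self) -> None:
        self._known: Set[str] = set()

    def register(self, name: str) -> None:
        self._known.add(name)

    def remove(self, name: str) -> None:
        self._known.discard(name)

    def is_known(self, name: str) -> bool:
        return name in self._known


def node_matches(node_info: Dict[str, Any],
                 constraints: List[Constraint],
                 registry: MatchingParameters) -> bool:
    """Return True if the node satisfies every constraint of the job."""
    for c in constraints:
        if not registry.is_known(c.parameter):
            raise ValueError(f"unregistered matching parameter: {c.parameter}")
        if c.parameter not in node_info:
            return False  # the node did not report this property
        try:
            if not OPERATORS[c.op](node_info[c.parameter], c.expected):
                return False
        except TypeError:
            return False  # values of incomparable types count as no match
    return True


if __name__ == "__main__":
    registry = MatchingParameters()
    registry.register("os_name")
    # Made-up monitoring result for one worker node.
    node = {"os_name": "AlmaLinux", "cgroups_v2": True}
    job = [Constraint("os_name", "==", "AlmaLinux")]
    print(node_matches(node, job, registry))

    # An administrator enables a new parameter at runtime.
    registry.register("cgroups_v2")
    job.append(Constraint("cgroups_v2", "==", True))
    print(node_matches(node, job, registry))
```

In this sketch, a new matching parameter is a registry entry plus a value reported by monitoring, so enabling it changes data rather than code. Whether an unregistered parameter should raise an error, as here, or simply be ignored is a policy choice that the sketch makes arbitrarily.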
Experiment context, if any
The project was conducted on the ALICE Computing Grid in the ALICE experiment at CERN (https://alice-collaboration.web.cern.ch)