8–12 Sept 2025
Hamburg, Germany
Europe/Berlin timezone

Simulating the ATLAS Distributed Computing Infrastructure to Optimize Workload Allocation Strategies

Not scheduled
30m

Poster
Track 1: Computing Technology for Physics Research
Poster session with coffee break

Speaker

Raees Ahmad Khan (University of Pittsburgh (US))

Description

In large-scale distributed computing systems, workload dispatching and the associated data management are critical factors that determine key metrics such as resource utilization and the resilience of scientific workflows. As the Large Hadron Collider (LHC) advances into its high-luminosity era, the ATLAS distributed computing infrastructure must improve these metrics to manage vastly larger data volumes (exceeding the exabyte scale) and support the demanding needs of high-energy physics research.
To improve distributed computing operations, existing workload allocation strategies can be optimized or novel strategies can be designed. In practice, however, it is not viable to test new workload allocation strategies on the production ATLAS distributed computing system. To address this, we have developed an agile simulation framework of the ATLAS distributed computing system, built on the SimGrid toolkit, to evaluate and refine workload dispatching strategies for heterogeneous computing infrastructure. It is also crucial to address the inherent overhead and potential bottlenecks associated with managing the large data volumes these workloads require. We therefore extensively analyze historical remote transfers to understand the root causes of slowdowns, vulnerabilities, and inefficient resource utilization linked to data movement.
To ensure the accuracy and reliability of our framework, we calibrate and validate the simulated ATLAS distributed computing system by running real workloads drawn from historical ATLAS data. The calibrated framework will serve as a testbed for evaluating custom allocation algorithms and will also generate the datasets required to train ML surrogates that enable fast and scalable simulations. In addition, an interactive monitoring interface is being developed to visualize workload dispatching and resource utilization. Beyond serving as a platform for testing and executing new strategies that can improve the resilience of ATLAS distributed computing, the framework is experiment-agnostic and open source: it provides an example case that lets users configure large-scale distributed computing grids and implement custom workload allocation algorithms through dynamic plugins.

Authors

Kuan-Chieh Hsu (Brookhaven National Laboratory (US))
Raees Ahmad Khan (University of Pittsburgh (US))
Sairam Sri Vatsavai (Brookhaven National Laboratory (US))

Co-authors

Adolfy Hoisie (Brookhaven National Laboratory (US))
Alexei Klimentov (Brookhaven National Laboratory (US))
David Park (Brookhaven National Laboratory)
Fatih Furkan Akman (University of Massachusetts (US))
Frederic Suter
Mr Jaehyung Kim (Carnegie Mellon University)
John Rembrandt Steele (University of Massachusetts (US))
Joseph Boudreau (University of Pittsburgh)
Norbert Podhorszki (Oak Ridge National Laboratory)
Ozgur Ozan Kilic (Brookhaven National Laboratory)
Paul Nilsson (Brookhaven National Laboratory (US))
Ray Ren (Brookhaven National Laboratory (US))
Sankha Baran Dutta (Brookhaven National Laboratory (US))
Scott Klasky
Shengyu Feng
Shinjae Yoo
Tadashi Maeno (Brookhaven National Laboratory (US))
Tania Korchuganova
Dr Tasnuva Chowdhury (Brookhaven National Laboratory (US))
Verena Ingrid Martinez Outschoorn (University of Massachusetts (US))
Wei Yang (SLAC National Accelerator Laboratory (US))
Yiming Yang
