8–12 Sept 2025
Hamburg, Germany
Europe/Berlin timezone

Evaluating HEP Workflow Portability and Performance using Liquid Argon TPC (LAr TPC) detector simulations Across HPC Systems

Not scheduled
30m

Poster Track 1: Computing Technology for Physics Research Poster session with coffee break

Speaker

Ozgur Ozan Kilic (Brookhaven National Laboratory)

Description

This study evaluates the portability, performance, and adaptability of Liquid Argon TPC (LAr TPC) detector simulations on three HPC platforms: Polaris, Frontier, and Perlmutter. The LAr TPC workflow is computationally demanding: it simulates neutrino interactions and the resulting detector responses in a modular liquid argon TPC, integrating various subsystems to validate design choices and refine analysis tools. The computational complexity comes from integrating multiple high-fidelity simulation modules, reconstruction algorithms, and real-time calibration processes across heterogeneous and massive data streams using parallel processing techniques. We explore the challenges of deploying the simulation workflow and find that the issues encountered vary significantly between platforms. For example, on Polaris we investigate how constrained network conditions affect software distribution technologies such as CVMFS and container solutions; on Frontier, we study complications arising from CUDA-specific libraries, which are incompatible with its AMD GPUs. We also describe effective mitigation strategies: on Polaris we used portable solutions such as CVMFSExec, proxies (e.g., Squid), and container technologies (e.g., Singularity); on Frontier we investigate CPU-only execution and explore alternatives such as HIP and CuPy. In contrast, deploying on Perlmutter using the Superfacility API proved more straightforward, highlighting the potential of standardized HPC APIs to simplify workflow management.
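
As an illustration of the CPU-only fallback idea, the following is a minimal sketch, not the workflow's actual code. It tries CuPy first, then NumPy, and finally plain Python lists, probing each backend with a tiny allocation so that an installed but unusable GPU library is skipped. The function names `pick_array_backend` and `scale_samples` are hypothetical and chosen only for this example.

```python
import importlib
import importlib.util


def pick_array_backend(candidates=("cupy", "numpy")):
    """Return (name, module) for the first candidate that imports and works.

    Falls back to ("builtin", None) so callers can use plain Python lists.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is None:
            continue  # not installed on this system
        try:
            module = importlib.import_module(name)
            # Probe with a tiny allocation: a GPU library may import fine
            # yet fail at first use on a node without a matching device.
            module.asarray([0.0])
            return name, module
        except Exception:
            continue
    return "builtin", None


def scale_samples(samples, gain, backend):
    """Multiply samples by gain using the selected backend."""
    _, module = backend
    if module is None:
        return [gain * s for s in samples]
    return (gain * module.asarray(samples)).tolist()


backend = pick_array_backend()
scaled = scale_samples([1.0, 2.0, 3.0], 2.0, backend)
print(backend[0], scaled)
```

The same calling code then runs unchanged on a GPU node, a CPU-only node, or a login node with neither library installed; only the selected backend differs.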

Our experiences further highlight the value of closely coordinating the development of facility-specific APIs, such as NERSC's Superfacility API, Globus Compute, and the Integrated Research Infrastructure (IRI) initiative, with that of scientific workflows. We discuss the benefits of evolving these APIs in response to the real-world challenges and practical demands encountered in complex workflows. Generalizing these issues beyond the LAr TPC context, we emphasize adaptable strategies such as container overlays, environment bridging scripts, and comprehensive dependency documentation to achieve sustainable, facility-independent scientific workflows.
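
One way to approach dependency documentation is to record a snapshot of the runtime environment alongside each run. The sketch below is an assumption about what such a record might contain, not the procedure used in this work. The function name `snapshot_environment` and the choice of environment variables are hypothetical; only the Python standard library is used.

```python
import json
import os
import platform
import sys
from importlib import metadata


def snapshot_environment(env_vars=("PATH", "LD_LIBRARY_PATH")):
    """Collect a JSON-serializable description of the current environment."""
    packages = sorted(
        f"{dist.metadata.get('Name', 'unknown')}=={dist.version}"
        for dist in metadata.distributions()
    )
    return {
        "hostname": platform.node(),
        "machine": platform.machine(),
        "system": platform.system(),
        "python": sys.version.split()[0],
        "packages": packages,
        # Missing variables are recorded as None rather than omitted.
        "env": {name: os.environ.get(name) for name in env_vars},
    }


snapshot = snapshot_environment()
report = json.dumps(snapshot, indent=2)
```

Comparing such snapshots between facilities can help pinpoint which library or variable differs when a workflow runs on one system but fails on another.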

Significance

This presentation provides novel insights into addressing portability barriers that significantly impact scientific workflows in HPC environments. Rather than merely reporting status, we offer actionable solutions and generalized strategies to overcome common challenges like heterogeneous GPU support, complex software dependencies, and constrained external access. These incremental yet substantial developments facilitate broader adoption of adaptive, sustainable workflows across diverse HPC platforms.

Author

Ozgur Ozan Kilic (Brookhaven National Laboratory)

Co-authors

Bruno Moreira Coimbra (Fermi National Accelerator Lab. (US))
Dr Charles Leggett (Lawrence Berkeley National Lab (US))
Meifeng Lin (Brookhaven National Laboratory (US))
