Description
This study evaluates the portability, performance, and adaptability of Liquid Argon TPC (LAr TPC) detector simulations on different HPC platforms, specifically Polaris, Frontier, and Perlmutter. The LAr TPC workflow is computationally complex: it simulates neutrino interactions and the resulting detector responses in a modular liquid argon TPC, integrating various subsystems to validate design choices and refine analysis tools. The computational complexity stems from combining multiple high-fidelity simulation modules, reconstruction algorithms, and real-time calibration processes across heterogeneous, high-volume data streams using parallel processing techniques. We explore the diverse challenges of deploying the simulation workflow, noting that the issues encountered vary significantly across platforms. For example, we investigate how constrained network conditions on Polaris affect software distribution technologies such as CVMFS and container solutions, while on Frontier we study complications arising from CUDA-specific libraries that are incompatible with its AMD GPUs. We also describe effective mitigation strategies: on Polaris we used portable solutions such as CVMFSExec, proxies (e.g., Squid), and container technologies (e.g., Singularity); on Frontier we investigated CPU-only execution strategies and explored alternative libraries such as HIP and CuPy. Conversely, deploying on Perlmutter via the Superfacility API proved more straightforward, highlighting the potential of standardized HPC APIs to simplify workflow management.
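To make the proxy-based mitigation concrete, the following minimal Python sketch routes outbound HTTP from a worker process through a Squid-style proxy by exporting the standard proxy environment variables, which child tools (pip, conda, HTTP-based CVMFS transports) then inherit. The proxy host and port are placeholders, not actual Polaris values.

    # Minimal sketch of routing outbound HTTP through a site proxy,
    # assuming compute nodes reach the outside world only via a Squid
    # instance on a service node. Host and port below are placeholders.
    import os
    import urllib.request

    PROXY = "http://proxy.example.org:3128"  # hypothetical Squid endpoint
    for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
        os.environ[var] = PROXY

    # urllib (and tools launched from this process) pick up the proxy
    # settings from the environment automatically.
    with urllib.request.urlopen("http://example.org", timeout=30) as resp:
        print("outbound HTTP via proxy OK:", resp.status)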
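The Frontier fallback strategy can likewise be sketched as a backend-agnostic array kernel: bind one array namespace to CuPy when it is installed (CuPy targets CUDA on NVIDIA GPUs and also ships ROCm builds for AMD) and to NumPy for CPU-only execution otherwise. The drift_charge kernel below is an illustrative toy, not a function from the actual LAr TPC workflow.

    # Minimal sketch of a GPU backend with a CPU-only fallback.
    try:
        import cupy as xp   # GPU path: CUDA on NVIDIA, ROCm builds on AMD
        ON_GPU = True
    except ImportError:
        import numpy as xp  # CPU-only fallback
        ON_GPU = False

    def drift_charge(q, t, v_drift=1.6e-3):
        """Toy stand-in for a detector-response kernel: attenuate deposited
        charge q after drift time t (units arbitrary)."""
        return q * xp.exp(-t * v_drift)

    charges = xp.linspace(0.0, 1.0, 1_000_000)
    times = xp.full_like(charges, 10.0)
    result = drift_charge(charges, times)  # same call on either backend
    print("backend:", "GPU (CuPy)" if ON_GPU else "CPU (NumPy)")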
Our experiences further highlight the value of closely coordinating the development of facility-specific APIs, such as NERSC's Superfacility API, Globus Compute, and the Integrated Research Infrastructure (IRI) initiative, alongside the scientific workflows that use them. We discuss the benefits of evolving these APIs in response to the real-world challenges and practical demands of complex workflows. By generalizing these issues beyond the specific LAr TPC context, we emphasize adaptable strategies, such as container overlays, environment-bridging scripts, and comprehensive dependency documentation, to achieve sustainable, facility-independent scientific workflows.
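As an illustration of how such an API can simplify orchestration, the sketch below queries Perlmutter's availability through the Superfacility API's public status endpoint before dispatching work. The response field names are assumptions based on the public API, and authenticated operations such as job submission, which require an OAuth2 client, are omitted.

    # Minimal sketch of gating workflow dispatch on machine status via
    # NERSC's Superfacility API; response field names are assumptions.
    import requests

    SFAPI_BASE = "https://api.nersc.gov/api/v1.2"

    def perlmutter_is_up() -> bool:
        """Return True if the public status endpoint reports Perlmutter active."""
        resp = requests.get(f"{SFAPI_BASE}/status/perlmutter", timeout=30)
        resp.raise_for_status()
        return resp.json().get("status") == "active"

    if perlmutter_is_up():
        print("Perlmutter up: dispatch the simulation stage")
    else:
        print("Perlmutter unavailable: queue locally or fail over")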
Significance
This presentation provides novel insights into addressing portability barriers that significantly impact scientific workflows in HPC environments. Rather than merely reporting status, we offer actionable solutions and generalized strategies for overcoming common challenges such as heterogeneous GPU support, complex software dependencies, and constrained external network access. These incremental yet substantive developments facilitate broader adoption of adaptive, sustainable workflows across diverse HPC platforms.