High energy physics (HEP) workflows are approaching the throughput limits of traditional grid/HTC computing, as the LHC and DUNE drive O(10–100)× data growth and rising GPU demand. This motivates a practical path to routinely use leadership-class HPC resources remotely. A key challenge is the heterogeneous authentication, authorization, and job submission mechanisms across HPC centers. In this work, we extend the PanDA/Harvester workflow management system with an edge service model that keeps Harvester as a centralized control plane while deploying small, site-resident clients to interface with facility services and batch schedulers. We evaluate three complementary edge mechanisms: (1) Globus Compute (\textbf{GC}) via a Multi-user Endpoint for site-agnostic submission and control; (2) the NERSC Superfacility API (\textbf{SFAPI}); and (3) the OLCF Secure Scientific Service Mesh (\textbf{S3M}), the latter two for facility-native batch job control. Starting from NERSC's Perlmutter and OLCF, we run PanDA pilots with CVMFS-based runtime delivery and demonstrate practical resource acquisition, robust environment setup, and a clean association between Slurm allocations, Harvester workers, and pilot execution units. To simplify operations for remote and multi-site deployments, we add a remote credential manager that supports controlled issuance, renewal, and isolation of the credentials needed for edge-side execution. We also strengthen launch and control paths for parallel workloads with scheduler-aware wrappers that support three levels of parallelism: multiple pilots per node, multi-node allocations per job, and multiple concurrent jobs/allocations. Finally, we improve observability with a monitoring plugin that cross-checks edge service state against native scheduler queries for accurate task lifecycle tracking and failure diagnosis.
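To make the scheduler-aware wrapper idea concrete, the helper below is a minimal, hypothetical sketch (the function name, account, and pilot command are assumptions, not the production wrapper): it renders a Slurm batch script that launches several pilots per node within a single allocation, so each pilot occupies one task slot and remains individually traceable back to its Harvester worker.

```python
def build_sbatch_script(account: str, nodes: int, pilots_per_node: int,
                        walltime: str, pilot_cmd: str) -> str:
    """Render a Slurm batch script for the 'multiple pilots per node' level.

    Hypothetical illustration: one allocation of `nodes` nodes, with
    `pilots_per_node` pilot processes fanned out per node via srun.
    """
    ntasks = nodes * pilots_per_node
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --ntasks-per-node={pilots_per_node}",
        f"#SBATCH --time={walltime}",
        # A single srun step starts every pilot task; Slurm assigns each
        # task its own SLURM_PROCID, which can key the pilot-to-worker map.
        f"srun --ntasks={ntasks} {pilot_cmd}",
    ]
    return "\n".join(lines)
```

The other two parallelism levels would vary the same knobs: multi-node allocations per job raise `nodes` while keeping one pilot spanning them, and multiple concurrent jobs/allocations repeat this submission with independent `sbatch` calls.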
Our results highlight the expected trade-off between portability and depth of integration: GC offers a uniform interface across sites, while SFAPI and S3M enable tighter coupling with NERSC and OLCF capabilities, respectively. This points to a clear migration path from Grid to HPC within the PanDA/Harvester ecosystem. The work also aligns with Project Genesis as a representative use case for API-driven, multi-facility workflow automation.