Description
The development of Neural Simulation-Based Inference (NSBI) algorithms requires training a large ensemble of neural networks, on the order of one thousand, which makes a serial single-node approach impractical. To address this, we are developing a scalable high-throughput training workflow built around Snakemake [1] and deployed on an HTCondor-based GPU facility. Each neural network training task is treated as an independent job within a well-defined directed acyclic graph, enabling efficient parallel execution while preserving reproducibility and fault tolerance.
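As a rough illustration of why independent training jobs parallelize well inside a directed acyclic graph, the sketch below builds a toy graph with Python's standard-library `graphlib` rather than Snakemake itself. The node names (`train_0000`, `aggregate`) and the single downstream `aggregate` step are assumptions made for illustration; they are not taken from the actual workflow.

```python
from graphlib import TopologicalSorter

N_MODELS = 1000  # "on the order of one thousand", as stated in the abstract


def build_dag(n_models):
    """Map each node to the set of nodes it depends on (hypothetical names)."""
    train_jobs = [f"train_{i:04d}" for i in range(n_models)]
    dag = {job: set() for job in train_jobs}  # training jobs have no dependencies
    dag["aggregate"] = set(train_jobs)        # one downstream step needs all of them
    return dag


sorter = TopologicalSorter(build_dag(N_MODELS))
sorter.prepare()

# Nodes whose dependencies are all satisfied can be dispatched right away.
ready = sorter.get_ready()
print(len(ready))  # all training jobs are ready at once

# Once every training job is marked done, the downstream step is unlocked.
sorter.done(*ready)
final = sorter.get_ready()
print(final)
```

Because no training node depends on another, a scheduler is free to run as many of them concurrently as the cluster allows, and a failed node can be rerun without touching its siblings.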
The workflow is designed to automatically map job-level resource requirements, such as GPU and CPU requests, to the underlying cluster, allowing the system to fully utilize available hardware without manual intervention. All training runs are isolated and fully documented, with model weights, performance metrics, and resolved configuration files stored as explicit artifacts. This structure naturally supports a scatter-gather pattern, in which large numbers of models are trained independently and their outputs are later aggregated for downstream analysis. The result is a robust and reproducible foundation for the large-scale neural network training campaigns that applications such as NSBI depend on. The workflow is being developed as part of the IRIS-HEP ecosystem of tools for NSBI analysis at the LHC [2].
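The scatter-gather pattern with per-run artifacts can be sketched in plain Python as follows. This is a minimal stand-in under stated assumptions, not the workflow's implementation: the "training" is a seeded random-number stub, a local thread pool replaces the HTCondor cluster, and the file names (`config.json`, `weights.json`, `metrics.json`), the `model_NNNN` directory layout, and the `val_loss` metric are all hypothetical.

```python
import json
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def train_one(run_dir, config):
    """Stand-in for one training job: writes its artifacts into its own directory."""
    run_dir.mkdir(parents=True, exist_ok=True)
    rng = random.Random(config["seed"])  # seeded, so each run is reproducible
    weights = [rng.gauss(0.0, 1.0) for _ in range(4)]  # placeholder "model weights"
    metrics = {"seed": config["seed"], "val_loss": rng.random()}  # placeholder metric
    (run_dir / "config.json").write_text(json.dumps(config, sort_keys=True))
    (run_dir / "weights.json").write_text(json.dumps(weights))
    (run_dir / "metrics.json").write_text(json.dumps(metrics))
    return run_dir


def gather(root):
    """Gather step: read every run's metrics file and summarize the ensemble."""
    metrics = [json.loads(p.read_text()) for p in sorted(root.glob("*/metrics.json"))]
    losses = [m["val_loss"] for m in metrics]
    return {"n_models": len(losses), "mean_val_loss": sum(losses) / len(losses)}


def run_campaign(root, n_models=8):
    """Scatter step: run independent jobs in parallel, then gather their outputs."""
    configs = [{"seed": i, "learning_rate": 1e-3} for i in range(n_models)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda c: train_one(root / f"model_{c['seed']:04d}", c), configs))
    return gather(root)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_campaign(Path(tmp)))
```

Keeping each run's resolved configuration next to its weights and metrics means the gather step only needs the directory tree, and any single run can be inspected or repeated in isolation.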
[1] https://snakemake.github.io/
[2] https://github.com/iris-hep/NSBI-workflow-tutorial/tree/main