19–25 Oct 2024
Europe/Zurich timezone

Efficiency, Reproducibility, and Portability in HEP Machine Learning Training - ML Training Facility at Vanderbilt University

24 Oct 2024, 14:42
18m
Large Hall B

Talk, Track 9 - Analysis facilities and interactive computing, Parallel (Track 9)

Speaker

Andrew Malone Melo (Vanderbilt University (US))

Description

The success and adoption of machine learning (ML) approaches to solving HEP problems have been widespread and fast. As useful a tool as ML has been to the field, the growing number of applications, larger datasets, and increasing complexity of models create a demand for both more capable hardware infrastructure and cleaner methods for reproducibility and deployment. We have developed a prototype ML Training Facility (MLTF) with the goal of meeting these demands. The proof-of-concept MLTF is based at ACCRE, Vanderbilt's computing cluster, with sufficient GPU storage and networking to efficiently test very large models. The software component of MLTF is developed with an eye on reproducibility and portability. We adapt MLflow as an end-to-end ML solution for its capabilities as a user-friendly job submission interface; as a tracking server for model and run details, arbitrary metrics logging, and system diagnostics logging; and as an inference server.
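
To make the tracking-server role concrete, here is a minimal sketch of how a training script might record run parameters and a per-epoch metric through MLflow's Python tracking API. It is an illustration only, not the MLTF's actual interface: the function name `track_run`, the experiment name `mltf-demo`, the `tracker` argument, and any tracking URI passed in are hypothetical. The MLflow calls used (`set_tracking_uri`, `set_experiment`, `start_run`, `log_params`, `log_metric`) are part of the public `mlflow` module.

```python
def track_run(params, losses, tracking_uri=None, experiment="mltf-demo", tracker=None):
    """Log one training run's parameters and per-epoch loss, then return the run ID.

    `tracker` defaults to the real `mlflow` module. Any object exposing the same
    functions can be passed instead (a hypothetical hook for dry runs and tests).
    """
    if tracker is None:
        # Imported lazily so this module still loads where MLflow is not installed.
        import mlflow as tracker

    if tracking_uri is not None:
        # Assumption: a facility would publish the URI of its own tracking server.
        tracker.set_tracking_uri(tracking_uri)
    tracker.set_experiment(experiment)

    with tracker.start_run() as run:
        # Hyperparameters are stored once per run.
        tracker.log_params(params)
        # Metrics are stored as a series indexed by step (here, the epoch number).
        for epoch, loss in enumerate(losses):
            tracker.log_metric("loss", loss, step=epoch)
        return run.info.run_id
```

With MLflow installed, a call such as `track_run({"lr": 1e-3, "batch_size": 256}, [0.91, 0.54, 0.37])` and no tracking URI writes the run to MLflow's default local store. Pointing `tracking_uri` at a shared server would instead make the run visible to collaborators, which is the reproducibility benefit the abstract describes.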

Author

Jethro Taylor Gaglione (Vanderbilt University (US))
