Description
The success and adoption of machine learning (ML) approaches to solving high-energy physics (HEP) problems have been widespread and fast. As useful a tool as ML has been to the field, the growing number of applications, larger datasets, and increasing complexity of models create a demand for both more capable hardware infrastructure and cleaner methods of reproducibility and deployment. We have developed a prototype ML Training Facility (MLTF) with the goal of meeting these demands. The proof-of-concept MLTF is based at ACCRE, Vanderbilt's computing cluster, with sufficient GPU, storage, and networking resources to efficiently test very large models.

The software component of MLTF is developed with an eye on reproducibility and portability. We adapt MLflow as an end-to-end ML solution for its capabilities as a user-friendly job submission interface; as a tracking server for model and run details, arbitrary metrics logging, and system diagnostics logging; and as an inference server.