2019 Meeting of the Division of Particles & Fields of the American Physical Society

Name: 2019 Meeting of the Division of Particles & Fields of the American Physical Society
Start: 2019-07-29T08:00:00-04:00
End: 2019-08-02T17:00:00-04:00
Location: Northeastern University

29 July 2019 to 2 August 2019

US/Eastern timezone

Contact

n.wong@northeastern.edu

Large-scale HPC deployment of Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN)

1 Aug 2019, 17:00

20m

Shillman 425 (Northeastern University)

Oral Presentation (Computing, Analysis Tools, & Data Handling)

Speakers: Mike Hildreth (University of Notre Dame (US)), Kenyi Paolo Hurtado Anampa (University of Notre Dame (US))

The NSF-funded Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN) project aims to develop and deploy artificial intelligence (AI) and likelihood-free inference (LFI) techniques and software using scalable cyberinfrastructure (CI) built on top of existing CI elements. Specifically, the project has extended the CERN-based REANA framework, a cloud-based data analysis platform deployed on top of Kubernetes clusters that was originally designed to enable analysis reusability and reproducibility. REANA can orchestrate extremely complicated multi-step workflows. It uses Kubernetes both to schedule and distribute container-based workloads across a cluster of available machines and to instantiate and monitor the concrete workloads themselves.

This work describes the challenges and development efforts involved in extending REANA, and the components developed to enable large-scale deployment on High Performance Computing (HPC) resources. These components include an abstraction layer that lets the user's workflow application use different container technologies and different protocols for transferring files and directories between the HPC facility and the REANA cluster edge service.
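
As a rough illustration of what a container abstraction layer of this kind can look like, the following Python sketch wraps a user command for two container runtimes behind one interface. It is not SCAILFIN or REANA code. The class names, the `build_argv` helper, and the choice of Singularity and Shifter as the two runtimes are assumptions made for this example; the abstract does not name the container technologies that were supported.

```python
import shlex
from abc import ABC, abstractmethod
from typing import Dict, List


class ContainerRuntime(ABC):
    """Hypothetical interface: turn (image, command) into an argv list."""

    @abstractmethod
    def wrap(self, image: str, command: str) -> List[str]:
        ...


class SingularityRuntime(ContainerRuntime):
    def wrap(self, image: str, command: str) -> List[str]:
        # Form: singularity exec docker://<image> <command...>
        return ["singularity", "exec", f"docker://{image}", *shlex.split(command)]


class ShifterRuntime(ContainerRuntime):
    def wrap(self, image: str, command: str) -> List[str]:
        # Form: shifter --image=docker:<image> <command...>
        return ["shifter", f"--image=docker:{image}", *shlex.split(command)]


# Registry keyed by a runtime name chosen for this sketch (an assumption).
RUNTIMES: Dict[str, ContainerRuntime] = {
    "singularity": SingularityRuntime(),
    "shifter": ShifterRuntime(),
}


def build_argv(runtime_name: str, image: str, command: str) -> List[str]:
    """Build the argv for a workflow step without the caller knowing the runtime."""
    return RUNTIMES[runtime_name].wrap(image, command)
```

With this shape, a workflow step only states its image and command, and the site configuration selects the runtime, for example `build_argv("shifter", "python:3.11", "python train.py")`.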

Using the Virtual Clusters for Community Computation (VC3) infrastructure as a starting point, we enabled REANA to work with a number of different workload managers, both high-performance and high-throughput, while removing REANA's dependence on Kubernetes support at the worker level. Performance results from running AI/LFI training workflows on a variety of large HPC sites will be presented.
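
One simple way to target several workload managers from the same workflow engine is a dispatch table that maps a manager name to the command that submits a prepared job file. The Python sketch below shows that idea only. It is not the project's implementation, and the function names, the `SUBMITTERS` table, and the choice of Slurm and HTCondor as the example managers are assumptions, since the abstract does not list the managers that were used.

```python
from typing import Callable, Dict, List


def slurm_argv(job_file: str) -> List[str]:
    # Slurm submits a batch script with: sbatch <script>
    return ["sbatch", job_file]


def htcondor_argv(job_file: str) -> List[str]:
    # HTCondor submits a submit description file with: condor_submit <file>
    return ["condor_submit", job_file]


# Hypothetical registry: manager name -> builder of the submission argv.
SUBMITTERS: Dict[str, Callable[[str], List[str]]] = {
    "slurm": slurm_argv,
    "htcondor": htcondor_argv,
}


def submit_argv(manager: str, job_file: str) -> List[str]:
    """Return the argv that submits job_file to the named workload manager."""
    try:
        builder = SUBMITTERS[manager]
    except KeyError:
        raise ValueError(f"unsupported workload manager: {manager!r}") from None
    return builder(job_file)
```

The returned list could be passed to `subprocess.run` on a login or edge node. Supporting another manager then means adding one entry to the table rather than changing the caller.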

Primary authors: Mike Hildreth (University of Notre Dame (US)), Kenyi Paolo Hurtado Anampa (University of Notre Dame (US)), Tibor Simko (CERN), Mr Cody Kankel (University of Notre Dame), Dr Paul Brenner (University of Notre Dame), Mr Scott Hampton (University of Notre Dame), Ms Irena Johnson (University of Notre Dame)

Hildreth_DPF_2019.pdf
