4–8 Nov 2019
Adelaide Convention Centre
Australia/Adelaide timezone

Advancing physics simulation and analysis workflows from customized local clusters to Cori - the HPC-optimized, sub-million-core system at NERSC

7 Nov 2019, 12:00
15m
Riverbank R1 (Adelaide Convention Centre)

Oral Track 9 – Exascale Science

Speaker

Jan Balewski (Lawrence Berkeley National Lab. (US))

Description

Abstract: Over the last few years, many physics experiments have migrated their computations from customized, locally managed computing clusters to orders-of-magnitude larger multi-tenant HPC systems, often optimized for highly parallelizable, long-runtime computations. Historically, physics simulation and analysis workflows were designed for single-core CPUs with abundant RAM, plenty of local storage, direct control of the software stack and job scheduler, exclusive access to physically localized hardware, and predictable, steady throughput. We will discuss what had to change in terms of data pipeline organization, software, and user habits when computations are executed at scale on Cori, where none of those assumptions holds anymore.

The STAR experiment at BNL was one of the first to take on this challenge. We will discuss efficient solutions for sustainable processing of experimental data at an HPC system 5,000 miles away from the experiment, with two-way, just-in-time data transfer. Because administrative privileges on HPC machines are limited, Docker/Shifter became one of the main vehicles for transporting custom, vetted code to the HPC environment, supplemented by the CVMFS software delivery system mounted via DVS servers that provide a local cache. The DayaBay, LZ, ATLAS, and Majorana experiments followed this journey of transformation by developing schemes for injecting short-lived, single-core tasks into multi-node, thousands-of-cores, long-runtime jobs scheduled on Cori - the most efficient way to compete with other tenants for CPU cycles. The high variability of scheduling required assembling 'convoy' jobs, in which many nodes are dedicated to task execution and one node carries a read-only clone of the database. The computing capability of one 'convoy' is comparable to that of the whole PDSF cluster, and a single user can schedule tens of such jobs to run concurrently on Cori. We will also discuss fine-tuning the task concurrency per node to maximize output per node charge-hour, given a rather low RAM/CPU ratio, while benefiting from the 2- or 4-way hardware threading capability. A minimal sketch of this per-node packing idea is given below.
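To make the packing idea concrete, here is a minimal Python sketch, not the authors' production code: a driver assumed to run once on each node of a 'convoy' job picks its task concurrency from the hardware-thread count and the per-task RAM footprint, then launches single-core payloads inside a Shifter container. The node parameters, the image name `myexperiment/reco:latest`, the payload `reco.sh`, and the `CONVOY_DB_NODE` variable are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative per-node task-packing sketch for a 'convoy' job (assumptions only)."""
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical node parameters, roughly in the spirit of a Cori Haswell node:
# 32 physical cores, 2-way hardware threading, 128 GB RAM.
PHYSICAL_CORES = 32
HW_THREADS_PER_CORE = 2          # exploit 2- (or 4-) way threading where it pays off
RAM_GB_PER_NODE = 128
RAM_GB_PER_TASK = 2.5            # hypothetical footprint of one single-core task

def tasks_per_node() -> int:
    """Concurrency per node: bounded either by hardware threads or by the
    low RAM/CPU ratio, whichever limit is reached first."""
    by_threads = PHYSICAL_CORES * HW_THREADS_PER_CORE
    by_memory = int(RAM_GB_PER_NODE / RAM_GB_PER_TASK)
    return min(by_threads, by_memory)

def run_task(task_spec: str) -> int:
    """Launch one short-lived, single-core task inside a Shifter container.
    The image and payload command are placeholders."""
    cmd = [
        "shifter", "--image=docker:myexperiment/reco:latest",  # hypothetical image
        "reco.sh", task_spec,                                  # hypothetical payload
    ]
    # Point the task at the convoy node holding the read-only database clone
    # (CONVOY_DB_NODE is an assumed variable set by the job script).
    env = dict(os.environ, DB_HOST=os.environ.get("CONVOY_DB_NODE", "localhost"))
    return subprocess.call(cmd, env=env)

if __name__ == "__main__":
    # Hypothetical work list for this node, e.g. one entry per raw-data file.
    work = [f"file_{i:04d}.daq" for i in range(200)]
    n = tasks_per_node()
    print(f"packing {n} concurrent tasks on this node")
    with ThreadPoolExecutor(max_workers=n) as pool:
        exit_codes = list(pool.map(run_task, work))
    print(f"{exit_codes.count(0)}/{len(work)} tasks succeeded")
```

In a real allocation such a driver would typically be started once per node by the batch system, with the database node's address exported to the workers; the numbers above only illustrate how the memory bound, rather than the thread count, can end up setting the per-node concurrency.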

Consider for promotion No

Primary author

Jan Balewski (Lawrence Berkeley National Lab. (US))

Co-authors

Mr Matthew Kramer
Mustafa Mustafa (Lawrence Berkeley National Laboratory)
Rei Lee
Jeff Porter
Vakho Tsulaia (Lawrence Berkeley National Lab. (US))

Presentation materials