CHEP 2016 Conference, San Francisco, October 8-14, 2016

Name: CHEP 2016 Conference, San Francisco, October 8-14, 2016
Start: 2016-10-10T08:00:00-07:00
End: 2016-10-14T18:00:00-07:00
Location: San Francisco Marriott Marquis

Integration of the Titan supercomputer at OLCF with the ATLAS Production System

13 Oct 2016, 15:15

15m

GG C2 (San Francisco Marriott)

Oral, Track 7: Middleware, Monitoring and Accounting

Sergey Panitkin (Brookhaven National Laboratory (US))

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment.
PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year.
While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide.
Additional computing and storage resources are required.
Therefore, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers.
In this talk we will describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF).
The current approach uses a modified PanDA Pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan.
The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency.
We will discuss details of the implementation, current experience with running the system, as well as future plans aimed at improvements in scalability and efficiency.
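
The idea of a lightweight MPI wrapper around single-node workloads can be illustrated with a minimal sketch. This is not the PanDA Pilot code: the function names, the per-rank directory layout, the `PAYLOAD_RANK` variable, and the use of the mpi4py package are all assumptions made for illustration. Each MPI rank simply launches one independent payload in its own working directory, so a single batch job can occupy many nodes.

```python
import os
import subprocess
import sys
import tempfile


def get_rank_and_size():
    """Return (rank, size) from MPI, or (0, 1) if mpi4py is unavailable.

    Assumption: mpi4py is the MPI binding; the abstract does not name one.
    """
    try:
        from mpi4py import MPI
    except ImportError:
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


def run_payload(rank, base_dir, command):
    """Run one independent payload in a per-rank working directory.

    Hypothetical layout: <base_dir>/rank_00000, rank_00001, ...
    Returns the working directory and the payload's exit code.
    """
    workdir = os.path.join(base_dir, f"rank_{rank:05d}")
    os.makedirs(workdir, exist_ok=True)
    # Expose the rank to the payload (hypothetical variable name).
    env = dict(os.environ, PAYLOAD_RANK=str(rank))
    log_path = os.path.join(workdir, "payload.log")
    with open(log_path, "w") as log:
        result = subprocess.run(
            command, cwd=workdir, env=env,
            stdout=log, stderr=subprocess.STDOUT,
        )
    return workdir, result.returncode


def main(command):
    """Entry point intended to be started once per node by the MPI launcher."""
    rank, size = get_rank_and_size()
    base_dir = tempfile.mkdtemp(prefix="wrapper_")
    workdir, code = run_payload(rank, base_dir, command)
    print(f"rank {rank}/{size} finished in {workdir} with exit code {code}")
    return code
```

In this sketch the ranks never communicate with each other; MPI serves only as a launcher that fans the same payload out across the nodes of one batch allocation, which is what lets an ordinary single-node job fit a backfill slot of many nodes.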

Primary Keyword (Mandatory)	High performance computing
Secondary Keyword (Optional)	Distributed workload management

Sergey Panitkin (Brookhaven National Laboratory (US))

highlights-194.pdf

Oral-194_v2.pdf
