ACAT 2014

Name: ACAT 2014
Start: 2014-09-01T08:00:00+02:00
End: 2014-09-05T18:00:00+02:00
Location: Faculty of Civil Engineering

Secretary: acat2014@particle.cz

Integration of PanDA workload management system with Titan supercomputer at OLCF

2 Sept 2014, 08:00

Faculty of Civil Engineering, Czech Technical University in Prague, Thakurova 7/2077, Prague 166 29, Czech Republic

Board: 113

Poster | Computing Technology for Physics Research | Poster session

Sergey Panitkin (Brookhaven National Laboratory (US))

Experiments at the Large Hadron Collider (LHC) face unprecedented computing challenges. Heterogeneous resources are distributed worldwide, thousands of physicists analyzing the data need remote access to hundreds of computing sites, the volume of processed data is beyond the exabyte scale, and data processing requires billions of hours of computing time per year. The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently uses more than 100,000 cores at well over 100 Grid sites with a peak performance of 0.3 petaFLOPS, the next LHC data-taking run will require more resources than Grid computing can possibly provide. The Worldwide LHC Computing Grid (WLCG) infrastructure will be sufficient for the planned analysis and data processing, but insufficient for Monte Carlo (MC) production and any extra activities. Additional computing and storage resources are therefore required. To address these challenges, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. This activity, in turn, drives the evolution of the PanDA WMS. We will describe a project aimed at integrating the PanDA WMS with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA pilot framework for job submission to Titan's batch queue and for local data management, with lightweight MPI wrappers to run single-threaded workloads in parallel on Titan's multi-core worker nodes. It also gives PanDA a new capability to collect, in real time, information about unused worker nodes on Titan, which allows the size and duration of jobs submitted to Titan to be defined precisely according to the available free resources. This capability can reduce job wait time and improve Titan's utilization efficiency. The implementation was tested with Monte Carlo simulation jobs and is suitable for deployment on many other supercomputing platforms.
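
The idea of sizing a job to fit currently unused worker nodes can be illustrated with a minimal Python sketch. It is not the PanDA pilot code. The `BackfillSlot` and `JobRequest` records, the `size_job` function, and the default limits (`min_walltime`, `max_nodes`, `safety_margin`) are hypothetical names and values chosen for illustration. The sketch assumes the batch system can report how many nodes are idle and for how long, and that one single-threaded payload runs per core on a 16-core Titan node.

```python
from dataclasses import dataclass
from typing import List, Optional

CORES_PER_NODE = 16  # Titan compute nodes have 16 CPU cores each


@dataclass(frozen=True)
class BackfillSlot:
    """Hypothetical record of idle resources reported by the batch system."""
    free_nodes: int
    available_minutes: int  # how long these nodes are expected to stay free


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of the job the pilot would submit."""
    nodes: int
    walltime_minutes: int
    ranks: int  # one single-threaded payload per core


def size_job(slots: List[BackfillSlot],
             min_walltime: int = 30,    # assumed shortest useful payload, minutes
             max_nodes: int = 300,      # assumed per-job node limit
             safety_margin: int = 5) -> Optional[JobRequest]:
    """Pick the slot that yields the most node-minutes and shape a job to fit it."""
    best: Optional[JobRequest] = None
    for slot in slots:
        # Leave a margin so the job finishes before the free window closes.
        walltime = slot.available_minutes - safety_margin
        nodes = min(slot.free_nodes, max_nodes)
        if nodes < 1 or walltime < min_walltime:
            continue  # slot is too small or too short to be useful
        candidate = JobRequest(nodes, walltime, nodes * CORES_PER_NODE)
        if best is None or (candidate.nodes * candidate.walltime_minutes
                            > best.nodes * best.walltime_minutes):
            best = candidate
    return best


if __name__ == "__main__":
    slots = [BackfillSlot(10, 20), BackfillSlot(500, 65), BackfillSlot(100, 240)]
    print(size_job(slots))
```

Because the request never exceeds what is idle at that moment, such a job can in principle start without waiting in the queue, which is the mechanism behind the reduced wait time mentioned in the abstract.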
