The second generation of the ATLAS Production System, called ProdSys2, is a
distributed workload manager that runs hundreds of thousands of jobs daily,
from dozens of different ATLAS-specific workflows, across more than a
hundred heterogeneous sites. It achieves high utilization by combining
dynamic job definition based on many criteria, such as input and output
size, memory requirements and CPU consumption, with manageable scheduling
policies and by supporting different kinds of computational resources, such
as Grid, clouds, supercomputers and volunteer computers. The system
dynamically assigns a group of jobs (task) to a group of geographically
distributed computing resources. Dynamic assignment and utilization of
resources is one of the major features of the system; it did not exist in the
earliest versions of the production system, where the Grid resource topology
was predefined using national and/or geographical patterns.
The Production System has a sophisticated job fault-recovery mechanism, which
allows multi-terabyte tasks to run efficiently without human intervention.
We have implemented a train model and open-ended production, which allow
tasks to be submitted automatically as soon as a new set of data is available
and allow physics-group data processing and analysis to be chained with
central production run by the experiment.
ProdSys2 simplifies life for ATLAS scientists by offering a flexible web
user interface, which provides a user-friendly environment for the main ATLAS
workflows, e.g. a simple way of combining different data flows, and real-time
monitoring optimised to present a huge amount of information.
We present an overview of the ATLAS Production System and the features and
architecture of its major components: task definition, web user interface
and monitoring. We describe the important design decisions and lessons
learned from operational experience during the first years of LHC Run 2.
We also report the performance of the designed system and how various
workflows, such as data (re)processing, Monte Carlo and physics group
production, and user analysis, are scheduled and executed within one
production system on heterogeneous computing resources.
|Primary Keyword (Mandatory)||Distributed workload management|
|Secondary Keyword (Optional)||Distributed data handling|
|Tertiary Keyword (Optional)||High performance computing|