Oct 10 – 14, 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs

Oct 10, 2016, 3:00 PM
15m
GG C2 (San Francisco Marriott Marquis)

Oral Track 3: Distributed Computing

Speaker

Anna Elizabeth Woodard (University of Notre Dame (US))

Description

CRAB3 is a workload management tool used by more than 500 CMS physicists every month to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). CRAB3 allows users to analyze a large collection of input files (datasets), splitting the input into multiple Grid jobs depending on parameters provided by users.

Manually specifying exactly how a large project is divided into jobs is tedious and often results in sub-optimal splitting, because the right choice depends on the performance of the user code and the content of the input dataset. Poor splitting causes two types of problems. Jobs that are too big have excessive runtimes and do not distribute the work across all of the available nodes. Conversely, splitting the project into a large number of very small jobs is also inefficient, as each job adds overhead that increases the load on the scheduling infrastructure.

In this work we present a new feature called “automatic splitting” which removes the need for users to manually specify job splitting parameters. We discuss how HTCondor DAGMan can be used to build dynamic Directed Acyclic Graphs (DAGs) on the fly to optimize the performance of large CMS analysis jobs on the Grid.
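As a rough illustration of what building a DAG on the fly can look like, the sketch below writes a small DAGMan input file from Python. It uses the standard DAGMan keywords JOB, SUBDAG EXTERNAL, and PARENT ... CHILD. The node names (probe, processing, tail) and all file names are hypothetical and chosen only for this example; the abstract does not describe the actual DAG layout used by CRAB3.

```python
def build_dag(probe_submit="probe.sub",
              processing_dag="processing.dag",
              tail_dag="tail.dag"):
    """Return the text of a DAGMan input file with three chained stages.

    All node and file names are illustrative assumptions, not CRAB3's.
    """
    lines = [
        # A single short job that measures the user code.
        f"JOB probe {probe_submit}",
        # Nested DAGs; in this sketch they are assumed to be written
        # by an earlier stage, once the splitting is known.
        f"SUBDAG EXTERNAL processing {processing_dag}",
        f"SUBDAG EXTERNAL tail {tail_dag}",
        # Ordering: probe, then main processing, then leftover work.
        "PARENT probe CHILD processing",
        "PARENT processing CHILD tail",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dag(), end="")
```

Expressing the later stages as nested DAGs is one way to let their contents depend on what earlier nodes measured, since the nested DAG files can be generated while the outer DAG is already running.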

We use DAGMan to dynamically generate interconnected DAGs that first estimate the time per event of the user code and then run a set of jobs of preconfigured runtime to analyze the dataset. If some jobs terminate before completion, the unfinished portions are assembled into smaller jobs and resubmitted to the worker nodes.
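The arithmetic behind the three stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CRAB3 implementation: the function names, the use of event ranges as the unit of work, and the numbers in the usage example are all hypothetical.

```python
def events_per_job(probe_seconds, probe_events, target_runtime_s):
    """Estimate how many events fit into a job of the target runtime."""
    time_per_event = probe_seconds / probe_events
    return max(1, int(target_runtime_s / time_per_event))


def split(total_events, per_job):
    """Split [0, total_events) into half-open ranges of at most per_job events."""
    return [(start, min(start + per_job, total_events))
            for start in range(0, total_events, per_job)]


def unfinished(jobs, done_counts):
    """Return the unprocessed part of every job that stopped early."""
    return [(start + done, end)
            for (start, end), done in zip(jobs, done_counts)
            if start + done < end]


def resplit(ranges, per_job):
    """Cut leftover ranges into smaller jobs of at most per_job events."""
    return [(start, min(start + per_job, end))
            for lo, end in ranges
            for start in range(lo, end, per_job)]


if __name__ == "__main__":
    # Hypothetical probe: 100 events took 50 s; aim for one-hour jobs.
    n = events_per_job(50.0, 100, 3600)
    jobs = split(20000, n)
    # Hypothetical outcome: the second job stopped after 5000 events.
    leftovers = unfinished(jobs, [7200, 5000, 5600])
    print(n, jobs, resplit(leftovers, 1000))
```

Using smaller jobs for the leftover work follows the description in the abstract; the size chosen for them here (1000 events) is arbitrary.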

Secondary Keyword (Optional) Computing middleware
Primary Keyword (Mandatory) Distributed workload management

Primary authors

Marco Mascheroni (Fermi National Accelerator Lab. (US)); Matthias Wolf (University of Notre Dame (US))

Co-authors

Anna Elizabeth Woodard (University of Notre Dame (US)); Brian Paul Bockelman (University of Nebraska (US)); Eric Vaandering (Fermi National Accelerator Lab. (US)); Jose Hernandez (CIEMAT); Stefano Belforte (Universita e INFN, Trieste (IT))
