CHEP 2016 Conference, San Francisco, October 8-14, 2016

Name: CHEP 2016 Conference, San Francisco, October 8-14, 2016
Start: 2016-10-10T08:00:00-07:00
End: 2016-10-14T18:00:00-07:00
Location: San Francisco Marriott Marquis

America/Los_Angeles timezone

Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs

10 Oct 2016, 15:00

15m

GG C2 (San Francisco Marriott Marquis)

Oral, Track 3: Distributed Computing

Speaker: Anna Elizabeth Woodard (University of Notre Dame (US))

CRAB3 is a workload management tool used by more than 500 CMS physicists every month to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). CRAB3 allows users to analyze a large collection of input files (datasets), splitting the input into multiple Grid jobs depending on parameters provided by users.

Manually specifying exactly how a large project is divided into jobs is tedious and often results in sub-optimal splitting, because the best choice depends on the performance of the user code and the content of the input dataset. Two types of problems result. Jobs that are too big have excessive runtimes and do not distribute the work across all of the available nodes. Splitting the project into a large number of very small jobs is also inefficient, because each job adds overhead that increases the load on the scheduling infrastructure.
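The trade-off can be made concrete with a toy cost model. This is only a sketch under simplifying assumptions: a constant processing time per event and a fixed per-job overhead. The function name and all parameters are hypothetical and do not come from CRAB3.

```python
import math


def splitting_cost(total_events, events_per_job, sec_per_event, overhead_sec, nodes):
    """Toy model of one splitting choice (all inputs are illustrative assumptions).

    Returns (n_jobs, job_runtime_sec, overhead_fraction, busy_nodes).
    """
    n_jobs = math.ceil(total_events / events_per_job)
    # Runtime of a full-size job: useful work plus the fixed per-job overhead.
    job_runtime = events_per_job * sec_per_event + overhead_sec
    useful = total_events * sec_per_event
    total_overhead = n_jobs * overhead_sec
    overhead_fraction = total_overhead / (useful + total_overhead)
    # No more nodes can be kept busy than there are jobs.
    busy_nodes = min(n_jobs, nodes)
    return n_jobs, job_runtime, overhead_fraction, busy_nodes


# Two huge jobs: negligible overhead, but only 2 of 100 nodes are used.
too_big = splitting_cost(1_000_000, 500_000, 1, 100, 100)
# Ten thousand tiny jobs: every node is busy, but half the time is overhead.
too_small = splitting_cost(1_000_000, 100, 1, 100, 100)
```

In this model neither extreme is attractive: the first choice leaves most nodes idle and produces very long jobs, while the second spends as much time on overhead as on useful work.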

In this work we present a new feature called “automatic splitting” which removes the need for users to manually specify job splitting parameters. We discuss how HTCondor DAGMan can be used to build dynamic Directed Acyclic Graphs (DAGs) on the fly to optimize the performance of large CMS analysis jobs on the Grid.
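As a rough illustration of how such a graph might be described to DAGMan, the sketch below renders the text of a three-stage DAG input file: a probe job, a main processing stage, and a tail stage. It uses only the `JOB`, `SUBDAG EXTERNAL`, and `PARENT ... CHILD` keywords of the DAGMan input file syntax. The node names, file names, and the `render_dag` helper are hypothetical and are not taken from CRAB3. To our understanding, the file for an external sub-DAG only has to exist when its node starts, which is what would let an earlier stage write it.

```python
def render_dag(probe_submit, processing_dag, tail_dag):
    """Return the text of a DAGMan input file with three chained stages.

    All node and file names are illustrative assumptions.
    """
    lines = [
        # Stage 1: a short job that measures the user code.
        f"JOB probe {probe_submit}",
        # Stages 2 and 3: sub-DAGs whose files can be produced by earlier stages.
        f"SUBDAG EXTERNAL processing {processing_dag}",
        f"SUBDAG EXTERNAL tail {tail_dag}",
        # Dependencies: probe -> processing -> tail.
        "PARENT probe CHILD processing",
        "PARENT processing CHILD tail",
    ]
    return "\n".join(lines) + "\n"


dag_text = render_dag("probe.sub", "processing.dag", "tail.dag")
```

Writing `dag_text` to a file and passing it to `condor_submit_dag` would be the natural next step; that part is left out here because it needs a running HTCondor pool.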

We use DAGMan to dynamically generate interconnected DAGs that first estimate the time per event of the user code, then run a set of jobs of preconfigured runtime to analyze the dataset. If some jobs terminate before processing all of their input, the unfinished portions are assembled into smaller jobs and resubmitted to the worker nodes.
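The three steps above (estimate, split by target runtime, re-split the unfinished tail) can be sketched as follows. This is a minimal illustration, assuming the input can be addressed as half-open event ranges and that the time per event is roughly constant. The function names are hypothetical and this is not the CRAB3 implementation.

```python
def estimate_sec_per_event(probe_events, probe_wall_sec):
    """Estimate the time per event from a short probe run."""
    return probe_wall_sec / probe_events


def plan_jobs(total_events, sec_per_event, target_runtime_sec):
    """Split [0, total_events) into ranges sized to hit the target runtime."""
    events_per_job = max(1, int(target_runtime_sec / sec_per_event))
    return [
        (first, min(first + events_per_job, total_events))
        for first in range(0, total_events, events_per_job)
    ]


def plan_tail(unfinished, tail_events_per_job):
    """Cut the unfinished ranges into smaller jobs for resubmission."""
    tail = []
    for first, last in unfinished:
        for start in range(first, last, tail_events_per_job):
            tail.append((start, min(start + tail_events_per_job, last)))
    return tail


# Probe: 1000 events took 500 s, so 0.5 s per event.
rate = estimate_sec_per_event(1000, 500.0)
# Main stage: aim for 2000 s per job over 10000 events.
jobs = plan_jobs(10_000, rate, 2000)
# Suppose the second job stopped at event 5000; re-split what is left of it.
tail_jobs = plan_tail([(5000, 8000)], 1000)
```

With these numbers the main stage consists of the ranges (0, 4000), (4000, 8000), and (8000, 10000), and the unfinished part of the second job is resubmitted as three 1000-event jobs.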

Primary Keyword (Mandatory)	Distributed workload management
Secondary Keyword (Optional)	Computing middleware

Authors: Marco Mascheroni (Fermi National Accelerator Lab. (US)), Matthias Wolf (University of Notre Dame (US))

Co-authors: Anna Elizabeth Woodard (University of Notre Dame (US)), Brian Paul Bockelman (University of Nebraska (US)), Eric Vaandering (Fermi National Accelerator Lab. (US)), Jose Hernandez (CIEMAT), Stefano Belforte (Universita e INFN, Trieste (IT))

Presentation materials: Highlights-494.pdf, Oral-494.pdf