10–14 Oct 2016
San Francisco Marriott Marquis
America/Los_Angeles timezone

Design and Execution of make-like distributed Analyses based on Spotify’s Pipelining Package luigi

12 Oct 2016, 11:45
15m
GG A+B (San Francisco Marriott Marquis)


Oral, Track 5: Software Development

Speaker

Marcel Rieger (Rheinisch-Westfaelische Tech. Hoch. (DE))

Description

In particle physics, workflow management systems are primarily used as tailored solutions in dedicated areas such as Monte Carlo production. However, physicists performing data analyses usually have to steer their individual workflows manually, which is time-consuming and often leaves the relations between particular workloads undocumented.
We present a generic analysis design pattern that copes with the sophisticated demands of end-to-end HEP analyses and provides a make-like execution environment. It is based on the open-source pipelining package luigi, which was developed at Spotify and enables the definition of arbitrary workloads, so-called Tasks, and the dependencies between them in a lightweight and scalable structure. Further features are multi-user support, automated dependency resolution and error handling, central scheduling, and web-based status visualization.
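The make-like behaviour described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not luigi itself: the `Task` base class below is a simplified stand-in that borrows the method names luigi tasks use (`requires`, `output`, `run`, `complete`), while the `Selection` and `Histograms` tasks, the `build` helper, and the file names are hypothetical and chosen purely for illustration.

```python
import os
import tempfile


class Task:
    """Simplified stand-in for a luigi-style task (not the luigi API itself)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one can run.
        return []

    def output(self):
        # Path of the file whose existence marks this task as done.
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def complete(self):
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class Selection(Task):
    # Hypothetical first analysis step.
    def run(self):
        with open(self.output(), "w") as f:
            f.write("selected events\n")


class Histograms(Task):
    # Hypothetical second step that consumes the output of Selection.
    def requires(self):
        return [Selection(self.workdir)]

    def run(self):
        with open(self.requires()[0].output()) as f:
            content = f.read()
        with open(self.output(), "w") as f:
            f.write("histograms of " + content)


def build(task, executed=None):
    """Run missing requirements depth-first, then the task; return names of tasks run."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed  # output exists, nothing to do (as with make)
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


workdir = tempfile.mkdtemp()
first = build(Histograms(workdir))   # runs Selection, then Histograms
second = build(Histograms(workdir))  # all outputs exist, so nothing runs
print(first, second)
```

Requesting only the final task is enough: the missing requirements are resolved and executed first, and a second request does no work because all outputs already exist.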
In addition to the built-in features for remote jobs and file systems such as Hadoop and HDFS, we added support for WLCG infrastructure such as LSF and CREAM job submission, as well as remote file access through the Grid File Access Library (GFAL2). Furthermore, we implemented automated resubmission functionality, software sandboxing, and a command line interface with auto-completion for a convenient working environment.
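As a rough illustration of what automated resubmission can look like, the following sketch retries failed jobs up to a fixed number of attempts. It is an assumption-laden toy, not the authors' implementation and not an LSF or CREAM interface: `run_with_resubmission`, the `submit` callable, `max_attempts`, and the fake backend are all hypothetical names introduced here.

```python
def run_with_resubmission(submit, job_ids, max_attempts=3):
    """Submit each job and resubmit failures until max_attempts is reached.

    `submit` is a hypothetical callable that takes a job id and returns
    True on success. Returns (attempts, failed): the number of attempts
    made per job and the list of jobs that never succeeded.
    """
    attempts = {job_id: 0 for job_id in job_ids}
    failed = []
    pending = list(job_ids)
    while pending:
        retry = []
        for job_id in pending:
            attempts[job_id] += 1
            if submit(job_id):
                continue  # job succeeded, nothing more to do
            if attempts[job_id] < max_attempts:
                retry.append(job_id)  # schedule a resubmission
            else:
                failed.append(job_id)  # give up on this job
        pending = retry
    return attempts, failed


# Fake backend for demonstration: "b" fails twice, "c" always fails.
failures_left = {"a": 0, "b": 2, "c": 99}


def fake_submit(job_id):
    if failures_left[job_id] > 0:
        failures_left[job_id] -= 1
        return False
    return True


attempts, failed = run_with_resubmission(fake_submit, ["a", "b", "c"])
print(attempts, failed)
```

In a real setup the `submit` callable would wrap the batch system or grid middleware, and the attempt limit would typically be configurable per task.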
For the implementation of a ttH cross section measurement with CMS, we created a generic Python interface that provides programmatic access to all external information such as datasets, physics processes, statistical models, and additional files and values. In summary, the setup enables the execution of the entire analysis in a parallelized and distributed fashion with a single command.

Primary Keyword: Data processing workflows and frameworks/pipelines
Secondary Keyword: Distributed workload management

Primary authors

Benjamin Fischer (Rheinisch-Westfaelische Tech. Hoch. (DE))
Marcel Rieger (Rheinisch-Westfaelische Tech. Hoch. (DE))
Martin Erdmann (Rheinisch-Westfaelische Tech. Hoch. (DE))
Robert Fischer (Rheinisch-Westfaelische Tech. Hoch. (DE))
