19–25 Oct 2024
Europe/Zurich timezone

Event Workflow Management System - A SaaS Solution for Massively Divisible and Distributed Workflows

15m
Poster Track 4 - Distributed Computing Poster session

Speaker

Ric Evans (Wisconsin IceCube Particle Astrophysics Center)

Description

How does one take a workload, consisting of millions or billions of tasks, and group it into tens of thousands of jobs? Partitioning the workload into a workflow of long-running jobs minimizes the use of scheduler resources; however, smaller, more fine-grained jobs allow more efficient use of computing resources. When the runtime of a task averages a minute or less, severe scaling challenges due to scheduling overhead can surface. Employing jobs that run for several hours, each with a large input file comprising a bundle of tasks, is effective in ideal situations. However, given the heterogeneity of available distributed resources and limited control of task-job matching, runtimes can vary widely.
The Event Workflow Management System (EWMS) augments HTCondor to solve this issue. EWMS implements a pilot-based paradigm in which each worker, running inside an HTCondor execution point, connects to a message broker and executes many individual fine-grained tasks. This adaptive design increases task throughput while incorporating additional fail-safe features. In addition, EWMS manages workflow scheduling, enables real-time worker scaling, and exposes a public-facing interface for users. Here, we outline the EWMS technique, detail science-driver workflows from the IceCube experiment, and provide system usage metrics.
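
The trade-off described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not EWMS code: the function names `static_bundles` and `pilot_pull`, the worker speeds, and the task durations are all hypothetical, and an in-order Python list stands in for the message broker's queue. It compares pre-bundling tasks into one job per worker against pilot-style workers that pull the next task whenever they become idle.

```python
import heapq


def static_bundles(task_times, speeds):
    """Pre-assign tasks round-robin, one bundle per worker; return the makespan.

    task_times: nominal task durations in seconds (hypothetical values).
    speeds: relative speed of each worker (1.0 = nominal), modelling
            heterogeneous resources.
    """
    n = len(speeds)
    finish_times = []
    for i, speed in enumerate(speeds):
        bundle = task_times[i::n]  # every n-th task goes to worker i
        finish_times.append(sum(bundle) / speed)
    # The workflow ends only when the slowest bundle ends.
    return max(finish_times)


def pilot_pull(task_times, speeds):
    """Idle workers pull the next task from a shared queue; return the makespan.

    The in-order iteration over task_times stands in for a broker queue.
    """
    # Min-heap of (time at which the worker becomes idle, worker index).
    idle_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle_at)
    makespan = 0.0
    for t in task_times:
        free_at, i = heapq.heappop(idle_at)  # earliest idle worker
        done = free_at + t / speeds[i]
        makespan = max(makespan, done)
        heapq.heappush(idle_at, (done, i))
    return makespan


if __name__ == "__main__":
    tasks = [60.0] * 120            # 120 one-minute tasks (assumed)
    speeds = [1.0, 1.0, 0.5, 2.0]   # one slow and one fast worker (assumed)
    print("static bundles:", static_bundles(tasks, speeds))
    print("pilot pull:    ", pilot_pull(tasks, speeds))
```

With these assumed inputs, the static split is held back by the slow worker's bundle, while the pull-based workers stay busy until the queue drains. The sketch ignores scheduling overhead, network latency, and failures, all of which matter in a real deployment.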

Primary author

Ric Evans (Wisconsin IceCube Particle Astrophysics Center)

Co-authors

Benedikt Riedel (University of Wisconsin-Madison)
Brian Aydemir (Morgridge Institute for Research)
Brian Paul Bockelman (University of Wisconsin Madison (US))
David Schultz (University of Wisconsin-Madison)
Miron Livny
