9–13 Jul 2018
Sofia, Bulgaria
Europe/Sofia timezone

Improving efficiency of analysis jobs in CMS

10 Jul 2018, 14:30
15m
Hall 7 (National Palace of Culture)


Presentation: Track 3 – Distributed computing

Speakers

Todor Trendafilov Ivanov (University of Sofia (BG)); Jose Hernandez (CIEMAT)

Description

Hundreds of physicists analyse data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) using the CMS Remote Analysis Builder (CRAB) and the CMS GlideinWMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collaboration is committed to minimizing time to insight for every scientist by pushing for the fewest possible access restrictions to the full data sample and for freedom to choose the application to run. Supporting such varied workflows while preserving efficient resource usage poses special challenges, such as: scheduling jobs in a multicore/pilot model, where several single-core jobs with an undefined runtime run inside pilot jobs with a fixed lifetime; balancing the use of every available CPU against the use of CPUs close to the data; preventing too many concurrent reads from the same storage from pushing jobs into I/O wait and leaving CPU cycles idle; and monitoring user activity to detect low-efficiency workflows and steer them toward smarter use of the resources.

In this paper we report on two complementary approaches adopted in CMS to improve the scheduling efficiency of user analysis jobs: automatic job splitting and automatic tuning of the estimated job running time. Both aim at finding an appropriate value for the scheduling runtime, a number that tells how much walltime the user job needs and that is used during scheduling to fit users' jobs into pilots that have enough lifetime. With the automatic splitting mechanism, the runtime of the jobs is estimated upfront so that an appropriate value can be set for the scheduling runtime. With the automatic time tuning mechanism, instead, the scheduling runtime is dynamically modified by analyzing the real runtime of jobs after they finish. We also report on how we used the flexibility of the global computing pool to tune the number, kind, and running locations of jobs allowed to run using remote access to the input data.
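
The role of the scheduling runtime in the two mechanisms above can be illustrated with a minimal Python sketch. It is not the CRAB or GlideinWMS implementation: the function names, the idea of timing a short probe run, the target runtime, and the safety factor are all assumptions made for illustration only.

```python
import math


def fits_in_pilot(scheduling_runtime_min, pilot_lifetime_left_min):
    """A job is placed only in a pilot whose remaining lifetime covers
    the job's scheduling runtime (both in minutes)."""
    return scheduling_runtime_min <= pilot_lifetime_left_min


def split_by_target_runtime(total_events, probe_events, probe_runtime_min,
                            target_runtime_min):
    """Upfront estimate (hypothetical): scale the measured cost of a short
    probe run to choose how many events each job should process so that
    it runs for roughly target_runtime_min."""
    # Integer arithmetic avoids floating-point rounding surprises.
    events_per_job = max(1, (target_runtime_min * probe_events) // probe_runtime_min)
    n_jobs = -(-total_events // events_per_job)  # ceiling division
    return events_per_job, n_jobs


def tuned_scheduling_runtime(finished_runtimes_min, safety_factor=1.2):
    """After-the-fact tuning (hypothetical rule): take the longest observed
    runtime of finished jobs and pad it by an assumed safety factor."""
    return math.ceil(max(finished_runtimes_min) * safety_factor)


# Upfront: a probe of 1000 events took 2 minutes; aim for 8-hour jobs.
events_per_job, n_jobs = split_by_target_runtime(1_000_000, 1000, 2, 480)

# Afterwards: finished jobs took 50, 62 and 58 minutes.
new_runtime = tuned_scheduling_runtime([50, 62, 58])

# The tuned value decides which pilots the remaining jobs can use.
can_use_long_pilot = fits_in_pilot(new_runtime, 120)
can_use_short_pilot = fits_in_pilot(new_runtime, 60)
```

In this sketch, an overestimated scheduling runtime would exclude pilots that could in fact have finished the job, while an underestimate would risk the job outliving its pilot; that is the trade-off both mechanisms aim to manage.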

We discuss the concepts, details, and operational experience of these strategies, highlighting their pros and cons, and we show how these efforts helped improve the computing efficiency in CMS.

Primary authors

Stefano Belforte (Universita e INFN Trieste (IT)); Matthias Wolf (University of Notre Dame (US)); Todor Trendafilov Ivanov (University of Sofia (BG)); Marco Mascheroni (Univ. of California San Diego (US)); Antonio Perez-Calero Yzquierdo (Centro de Investigaciones Energéticas Medioambientales y Tecno); James Letts (Univ. of California San Diego (US)); Justas Balcas (California Institute of Technology (US)); Anna Elizabeth Woodard (University of Notre Dame (US)); Brian Paul Bockelman (University of Nebraska Lincoln (US)); Diego Davila Foyo (Autonomous University of Puebla (MX)); Diego Ciangottini (Universita e INFN, Perugia (IT))
