
HTCondor Workshop Autumn 2024 in Amsterdam

Europe/Amsterdam
Colloquium room (Nikhef)

Nikhef Science Park 105, 1098 XG Amsterdam
Helge Meinhard (CERN), Todd Tannenbaum (University of Wisconsin Madison (US)), Chris Brew (Science and Technology Facilities Council STFC (GB)), Christoph Beyer, Mary Hester
Description

We are very pleased to announce that the 2024 European HTCondor Workshop will be held from Tuesday 24th September to Friday 27th September, at Nikhef in Amsterdam, The Netherlands.

The meeting will start on Tuesday morning and run until lunchtime on the Friday.

The workshop will be an excellent occasion for learning about HTCondor from the experts (the developers!), exchanging experiences and plans with your colleagues, and providing your feedback to the experts.

The HTCondor Compute Entrypoint (CE) will be covered, as will token authentication (currently a hot topic), along with general use and administration of HTCondor.

Participation is open to all organisations (including companies) and persons interested in HTCondor (and by no means restricted to particle physics and/or academia!). If you know potentially interested persons, don't hesitate to make them aware of this opportunity.

The workshop will cover both using and administering HTCondor; topics will be chosen to best match participants' interests.

We would very much like to hear about your use of HTCondor in your project, your experience and your plans. Hence you are warmly encouraged to propose a short presentation.

In addition, we would like to thank our Platinum and Gold sponsors for their support with this event!

Platinum Sponsor

Gold Sponsors

 

If you have any questions, please contact hepix-2024condorworkshop-support@hepix.org.

We are looking forward to a rich, productive workshop.

Chris Brew (STFC - RAL) and Christoph Beyer (DESY), Co-Chairs of the Organising Committee.

Mary Hester (Nikhef), Chair of the Local Organising Committee 

Todd Tannenbaum, HTCondor Technical Lead, U Wisconsin, Madison, USA

 

Participants
  • Andrew Owen
  • Antonio Delgado Peris
  • Ben Jones
  • Brian Bockelman
  • Carlos Acosta Silva
  • Chris Brew
  • Christoph Beyer
  • Clicia Dos Santos Pinto
  • Cole Bollig
  • David Cohen
  • David Groep
  • David Rebatto
  • Dirk Sammel
  • Enrique Ugedo Egido
  • Filip Neubauer
  • Francesco Prelz
  • Helge Meinhard
  • Irakli Chakaberia
  • Jeff Templon
  • Jyothish Thomas
  • Luc GUYARD
  • Luca Tabasso
  • Luuk Uljee
  • Michael Hubner
  • Michel Jouvin
  • Oliver Freyermuth
  • R. Florian von Cube
  • Stefano Dal Pra
  • Steven Noorts
  • Thomas Birkett
  • Todd Tannenbaum
  • Vishambhar Nath Pandey
  • plus 36 further participants
Zoom Meeting ID
66574835068
Host
Helge Meinhard
    • 09:00 12:00
      INTERNAL BOARD MEETING: Programme committee meeting for final preparation (Colloquium room)
    • 08:30 09:00
      Registration (Colloquium room)
    • 09:00 10:30
      Workshop Session: Introductions and Welcomes (Colloquium room)
      • 09:00
        Welcome, Introduction and Housekeeping 10m
        Speakers: Christoph Beyer, Mary Hester
      • 09:10
        Nikhef Welcome 20m
      • 09:30
        Philosophy and Architecture: What the Manual Won't tell You 40m
        Speaker: Miron Livny
      • 10:10
        Round the room introductions 20m

        Who are you, where are you from and what do you hope to get out of the workshop?

    • 10:30 11:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 11:00 12:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 11:00
        Troubleshooting: What to do when things go wrong 30m
        Speaker: Andrew Owen
      • 11:35
        Practical considerations for GPU Jobs 30m
        Speaker: Andrew Owen
    • 12:30 14:00
      Lunch 1h 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 14:00 15:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 14:00
        Abstracting Accelerators Away 20m

        More and more frameworks are appearing that offload compute to accelerators or accelerate ML/AI workloads using CPU accelerators or GPUs. However, right now users still need to figure out for themselves which execution library or acceleration system is best for running their workloads.

        How can we best model this abstraction in HTCondor, so that the overhead of using acceleration is minimised for our users?

        Speaker: Emily Kooistra
      • 14:25
        An ATLAS researcher's experience with HTCondor 20m

        A new user's experience of switching to HTCondor

        Speaker: Zef Wolffs (Nikhef National institute for subatomic physics (NL))
      • 14:45
        Monte Carlo simulations of extensive air showers at NIKHEF 20m

        This presentation will show how the Cosmic Rays group at Nikhef is using HTCondor in their analysis workflows on the local pool.

        Speaker: Kevin Cheminant (Radboud University / NIKHEF)
      • 15:10
        HTCondor + Nikhef - A History of Productive Collaboration 20m
        Speaker: Miron Livny
    • 15:30 16:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 16:00 17:30
      Town Hall Discussion Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 18:00 20:00
      Social Event (Reception): Welcome Reception at Poesiat & Kater

      Brouwerij Poesiat & Kater
      Polderweg 648
      1093 KP Amsterdam
      https://poesiatenkater.nl/

      https://osm.org/go/0E6VHHtKg?node=4815845616

    • 09:00 10:30
      Workshop Session: Your Data and Condor Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 09:00
        Dealing with sources of Data: Choices and the Pros/Cons 30m
        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
      • 09:35
        Managing Storage at the EP 30m
        Speaker: Cole Bollig
      • 10:10
        NetApp DataOps Toolkit for data management 20m

        The NetApp DataOps Toolkit is a Python library that makes it easy for developers, data scientists and data engineers to perform various data management tasks. These tasks include provisioning new data volumes or development workspaces almost instantaneously, which improves flexibility in managing development environments. In this presentation, we will go over some examples and showcase how these libraries can be leveraged for different data management use cases.

        Speaker: Didier Gava (NetApp)
    • 10:30 11:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 11:00 12:30
      Workshop Session: Your Data and Condor cont. Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 11:00
        Storage Solutions with AI workloads 20m

        Various AI workloads, such as Deep Learning, Machine Learning, Generative AI or Retrieval Augmented Generation, require capacity, compute power or data transfer performance. This presentation will show how a simple hardware/software stack deployment, driven by Ansible scripts, can leverage and/or become part of an AI infrastructure. In addition, I will discuss two use cases, one on video surveillance and the second on real-time language processing, powered by an AI infrastructure setup.

        Speaker: Didier Gava
      • 11:25
        CHTC Vision: Compute and Data Together 15m
        Speaker: Miron Livny
      • 11:45
        Pelican Intro 20m
        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
      • 12:05
        PANEL and Discussion - Pelican and Condor: Flying Together, Birds of a Feather, Don't drop your data! 25m
        Speakers: Brian Paul Bockelman (University of Wisconsin Madison (US)), Miron Livny, Todd Tannenbaum
    • 12:30 14:00
      Lunch 1h 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 14:00 15:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 14:00
        Dynamic resource integration with COBalD/TARDIS 20m

        With the continuing growth of data volumes and computational demands, compute-intensive sciences rely on large-scale, diverse computing resources for running data processing, analysis tasks, and simulation workflows.
        These computing resources are often made available to research groups by different resource providers, resulting in a heterogeneous infrastructure.
        To make efficient use of those resources, we are developing COBalD/TARDIS, a resource management system for dynamic and transparent integration.

        COBalD/TARDIS provides an abstraction layer of resource pools and sites and takes care of scheduling and requesting those resources, independent of the sites' local resource management systems.
        Through the use of adapters, COBalD/TARDIS is able to interface with a range of resource providers, including OpenStack, Kubernetes, and others, as well as support different overlay batch systems, with current implementations for HTCondor and SLURM.
        In this contribution we present the general concepts of COBalD/TARDIS and several setups in different university groups as well as at WLCG sites, with a focus on those using HTCondor.

        Speaker: Florian Von Cube (KIT - Karlsruhe Institute of Technology (DE))
      • 14:25
        Adapting Hough Analysis workflow to run on IGWN resources 20m

        The computing workflow of the Virgo Rome Group for the CW search based on Hough Analysis has been performed for several years using storage and computing resources mainly provisioned by INFN-CNAF and strictly tied to its specific infrastructure. Starting with O4a, the workflow has been adapted to be more general and to integrate with computing centres in the IGWN community. We discuss our work toward this integration, the problems encountered, our solutions and the further steps ahead.

        Speaker: Stefano Dal Pra (Universita e INFN, Bologna (IT))
      • 14:45
        Kubernetes ↔ HTC 20m

        Operating HTCondor with Kubernetes

        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
      • 15:10
        Fun with Condor Print Formats 20m

        During the 20-year history of the Torque batch system at Nikhef, we constructed several command-line tools providing various overviews of what was going on in the system. An example: a tool that could tell us "what are the 20 most recently started jobs?"

        mrstarts | tail -20
        

        With HTCondor we wanted the same kind of overviews. Much of this can be accomplished using the HTCondor "print formats" associated with the condor_q, condor_history, and condor_status commands. In this talk I'll present and discuss some examples, advantages and disadvantages of the approach, and along the way present some HTCondor mysteries we haven't solved.

        Speaker: Jeff Templon (Nikhef National institute for subatomic physics (NL))
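
        For illustration, a minimal sketch of the kind of custom print-format file such overviews can build on, assuming condor_q's -print-format (-pr) option; the selected attributes, headers and widths are examples, not the actual tools from the talk:

        # recent-jobs.cpf -- illustrative custom print format, not the talk's tooling
        SELECT
           ClusterId    AS "CLUSTER"  WIDTH 8
           ProcId       AS "PROC"     WIDTH 5
           Owner        AS "OWNER"    WIDTH -14
           JobStatus    AS "ST"       PRINTAS JOB_STATUS
           JobStartDate AS "STARTED"  PRINTAS DATE
           Cmd          AS "CMD"      WIDTH -30
        SUMMARY NONE

        # usage (the file name is hypothetical):
        #   condor_q -allusers -pr recent-jobs.cpf
        # a similar overview without a format file, via autoformat:
        #   condor_q -allusers -af ClusterId ProcId Owner JobStatus JobStartDate Cmd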
    • 15:30 16:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 16:00 17:20
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 17:20 18:00
      Office Hours Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam

      Arrange to discuss your questions with members of the Condor Team

    • 09:00 10:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 09:00
        DAGMan: I didn't know it could do that! 45m
        Speaker: Cole Bollig
      • 09:50
        Final project update 20m

        This year has been eventful for our research lab: new hardware brought along a host of challenges. We will share our network, our architecture and the recent challenges we are facing.
        It's all about scale.

        Speaker: David Handelman
      • 10:10
        Integrating an IDE with HTCondor 20m

        Graphical code editors such as Visual Studio Code (VS Code) have gained a lot of momentum in recent years among young researchers. To ease their workflows, we have developed a VS Code entry point to harness the resources of an HTC cluster from within their IDE.

        This entry point allows users to have a "desktop-like" experience within VS Code when editing and testing their code while working in batch job environments. Furthermore, VS Code extensions such as Jupyter notebooks and Julia packages can directly leverage cluster resources.

        In this talk we will explain the use case of this entry point, how we implemented it and show some of the struggles we encountered along the way. The developed solution can also scale out to federated HTCondor pools.

        Speaker: Michael Hubner (University of Bonn (DE))
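
        For orientation only, a generic sketch of how an IDE wrapper can obtain and attach to cluster resources with stock HTCondor commands; this is not the entry point described above, and it assumes a pool that allows condor_ssh_to_job (resource values and the job id are placeholders):

        # ide-session.sub -- placeholder job that just sleeps, reserving resources
        executable     = /bin/sleep
        arguments      = 8h
        request_cpus   = 4
        request_memory = 16 GB
        queue

        # submit it, note the printed cluster id, and attach once it runs
        # (1234.0 is a placeholder job id):
        #   condor_submit ide-session.sub
        #   condor_ssh_to_job 1234.0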
    • 10:30 11:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 11:00 12:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 12:30 14:00
      Lunch 1h 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 14:00 15:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 14:00
        Opportunities and Challenges Courtesy Linux Cgroups Version 2 25m
        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
      • 14:25
        AMD INSTINCT GPU CAPABILITY AND CAPACITY AT SCALE 20m

        The adoption of AMD Instinct™ GPU accelerators in several of the major high-performance computing sites is a reality today, and we'd like to share the pathway that led us here. We'll focus on characteristics of the hardware and the ROCm software ecosystem, and how they were tuned to match the required compute density and programmability to make this adoption successful, from the discrete GPU to the supercomputers that tightly integrate massive numbers of these devices.

        Speaker: Samuel Antao (AMD)
      • 14:50
        GPUs in the Grid 20m

        In this presentation we will go over the GPU deployment at the NL SARA-MATRIX Grid site. An overview of the setup is shown, followed by some rudimentary performance numbers. Finally, user adoption and how the GPUs are used are discussed.

        Speaker: Dr Lodewijk Nauta (SURF)
      • 15:10
        Lenovo’s Cooler approach to HTC Computing 20m

        Breakthroughs in computing systems have made it possible to tackle immense obstacles in simulation environments. As a result, our understanding of the world and universe is advancing at an exponential rate. Supercomputers are now used everywhere—from car and airplane design, oil field exploration, and financial risk assessment, to genome mapping and weather forecasting.

        Lenovo’s High-Performance Computing (HPC) technology offers substantial benefits for High Transaction Computing (HTC) by providing the necessary computational power and efficiency to handle large volumes of transactions. Lenovo’s HPC solutions, built on advanced hardware such as the ThinkSystem and ThinkAgile series, deliver exceptional processing speeds and reliability. These systems are designed to optimize data throughput and minimize latency, which are critical factors in transaction-heavy environments like financial services, e-commerce, and telecommunications. The integration of Lenovo’s HPC technology into HTC environments enhances the ability to process transactions in real-time, ensuring rapid and accurate data handling. This capability is crucial for maintaining competitive advantage and operational efficiency in industries where transaction speed and accuracy are paramount. Additionally, Lenovo’s focus on energy-efficient computing ensures that these high-performance systems are also sustainable, aligning with broader environmental goals.

        By leveraging Lenovo’s HPC technology, organizations can achieve significant improvements in transaction processing capabilities, leading to better performance, scalability, and overall system resilience. According to TOP500.org, Lenovo is the world's #1 supercomputer provider, including some of the most sophisticated supercomputers ever built. With over a decade of liquid-cooling expertise and more than 40 patents, Lenovo leverages experience in large-scale supercomputing and AI to help organizations deploy high-performance AI at any scale.

        Speaker: Mr Rick Koopman
    • 15:30 16:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 16:00 17:30
      Lightning Talks/Show your toolbox Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 17:30 21:30
      Social Event (Dinner): House of Bird Diemerbos

      https://osm.org/go/0E6U8W6og

    • 09:00 10:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 09:00
        WLCG Token Transition Update (incl the illustrious return of x509) 20m
        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
      • 09:25
        Practical experience with an interactive-first approach to leverage HTC resources 20m

        Development and execution of scientific code requires increasingly complex software stacks and specialized resources such as machines with huge system memory or GPUs. Such resources have been present in HTC/HPC clusters and used for batch processing for decades, but users struggle with adapting their software stacks and their development workflows to those dedicated resources. Hence, it is crucial to enable interactive use with a low-threshold user experience, i.e. offering an SSH-like experience to enter development environments or start JupyterLab sessions from a web browser.

        With a few knobs turned, HTCondor unlocks these interactive use cases of HTC and HPC resources, leveraging the resource control functionality of a workload manager, wrapping execution within unprivileged containers and even enabling the use of federated resources crossing network boundaries without loss of security.

        This talk presents the positive experience with an interactive-first approach, hiding the complexities of containers and different operating systems from the users, enabling them to use HTC resources in an SSH-like fashion and with their JupyterLab environments. It also provides a short outlook on scaling this approach to a federated infrastructure.

        Speaker: Oliver Freyermuth (University of Bonn (DE))
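
        The configuration knobs involved are site-specific; purely as an illustration of the SSH-like user experience on stock HTCondor, a minimal sketch of requesting an interactive slot (the resource values are examples, not the site's setup):

        # interactive.sub -- example resource requests only
        request_cpus   = 2
        request_memory = 8 GB
        request_gpus   = 1
        queue

        # an SSH-like shell on a matching execute node:
        #   condor_submit -interactive interactive.sub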
      • 09:45
        HTCondor setup @ ORNL, an ALICE T2 site 20m

        The ALICE experiment at CERN runs a distributed computing model and is part of the Worldwide LHC Computing Grid (WLCG). WLCG uses a tiered distributed grid model. As part of the ALICE experiment's computing grid we run two Tier-2 (T2) sites in the US, at Oak Ridge National Laboratory and Lawrence Berkeley National Laboratory. Computing resource usage and delivery are accounted for through OSG via GRATIA probes. This information is then forwarded to the WLCG. With the OSG software update and the deprecation of some GRATIA probes, we had to update the setup for the OSG accounting. To do so we have recently started to move our existing setup to an HTCondor-based workflow and new GRATIA accounting. I will present the setup for our T2 sites and our HTCondor configuration escapade.

        Speaker: Irakli Chakaberia (Lawrence Berkeley National Lab. (US))
      • 10:10
        Implementing OSDF Cache in SURF - MS4 Service 20m

        In this presentation there will be a brief mention of the environment that hosts the OSDF Cache, the setup, and the software suitable for the MS4 service. The presentation will lay out in a bit more depth the process of installing the OSDF cache and the challenges that arose during the installation.

        Speaker: Jasmin Colo
    • 10:30 11:00
      Coffee 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
    • 11:00 12:30
      Workshop Session Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam
      • 11:00
        HPC use case through PIC 20m

        In this contribution, I will present an HPC use case facilitated through gateways deployed at PIC. The selected HPC resource is the Barcelona Supercomputing Center, where we encountered some challenges, particularly in the CMS case, which required meticulous and complex work. We had to implement new developments in HTCondor, specifically enabling communication through a shared file system. This contribution will detail the setup process and the scale we have been able to achieve so far.

        Speaker: Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))
      • 11:20
        HTCondor in Einstein Telescope 20m

        The Einstein Telescope (ET) is currently in the early development phase
        for its computing infrastructure. At present, the only officially
        provided service is the distribution of data for Mock Data Challenges
        (using the Open Science Data Federation + CVMFS-for-data), with GitLab
        used for code management. While the data distribution infrastructure is
        expected to be managed by a Data Lake using Rucio, the specifics of the
        data processing infrastructure and tools remain undefined. This
        exploratory phase allows for a detailed evaluation of different solutions.
        Drawing from the experiences of 2nd-generation gravitational wave
        experiments LIGO and Virgo, which began with modest computational needs
        and expanded into distributed computing models using HTCondor, ET aims
        to build upon these foundations. LIGO and Virgo adopted, for their
        offline data analyses, the LHC grid computing model through a common
        computing infrastructure called IGWN (International Gravitational-Wave
        Observatory Network), incorporating systems like glideinWMS, which works
        on top of HTCondor, to handle high-throughput computing (HTC) tasks.
        Despite this, challenges such as the reliance on shared file systems
        have limited the migration to grid-based workflows, with only 20% of
        jobs currently running on the IGWN grid.
        For ET, the plan is to adapt and evolve from the IGWN grid computing
        model, making sure workflows are grid-compatible. This includes
        exploring Snakemake, a framework for reproducible data analysis, to
        complement HTCondor. Snakemake offers the ability to run jobs on diverse
        computing resources, including grid, Slurm clusters, and cloud-based
        infrastructures. This approach aims to ensure flexibility, scalability,
        and reproducibility in ET’s data processing workflows, while overcoming
        past limitations.

        Speaker: Luca Tabasso
      • 11:40
        Transitioning the CMS pools to ALMA9 20m

        The Submission Infrastructure team of the CMS experiment at the LHC operates several HTCondor pools, comprising more than 500k CPU cores on average, for the experiment's different user groups. The jobs running in those pools include crucial experiment data reconstruction, physics simulation and user analysis. The computing centres providing the resources are distributed around the world and dynamically added to the pools on demand.

        Uninterrupted operation of those pools is critical to avoid losing valuable physics data and ensure the completion of computing tasks for physics analyses. With the announcement of the end-of-life of CentOS 7, the CMS collaboration decided to transition their infrastructure, running essential services for the successful operation of the experiment, to ALMA 9.

        In this contribution, we outline CMS's federated HTCondor pools and share our experiences of transitioning the infrastructure from CentOS 7 to ALMA 9, while keeping the system operational.

        Speaker: Florian Von Cube (KIT - Karlsruhe Institute of Technology (DE))
      • 12:00
        Heterogeneous Tier2 Cluster and Power Efficiency Studies at ScotGrid Glasgow 20m

        With the latest addition of 4k ARM cores, the ScotGrid Glasgow facility is a pioneering example of a heterogeneous WLCG Tier2 site. The new hardware has enabled large-scale testing by experiments and detailed investigations into ARM performance in a production environment.

        I will present an overview of our computing cluster, which uses HTCondor as the batch system combined with ARC-CE as the front-end for job submission, authentication, and user mapping, with particular emphasis on the dual queue management. I will also touch on our monitoring and central logging system, built on Prometheus, Loki, and Grafana, and describe the custom scripts we use to extract job information from HTCondor and pass it to the node_exporter collector.

        Moreover, I will highlight our research on power efficiency in HEP computing, showing the benchmarks and tools we use to measure and analyze power data. In particular, I will present a new figure-of-merit designed to characterize power usage during the execution of the HEP-Score benchmark, along with an updated performance-per-watt comparison extended to the latest x86 and ARM CPUs (Ampere Altra Q80 and M80, NVidia Grace, and recent AMD EPYC chips). Within this context, we introduce a Frequency Scan methodology to better characterize performance/watt trade-offs.

        Speaker: Emanuele Simili
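
        As an illustration of the kind of glue between HTCondor and the node_exporter textfile collector mentioned above (not the site's actual scripts), a minimal sketch that publishes per-state job counts; the output path and metric name are assumptions:

        #!/bin/bash
        # Sketch only: count jobs per JobStatus on this schedd and write them
        # in Prometheus textfile format (JobStatus: 1=idle, 2=running, 5=held).
        out=/var/lib/node_exporter/textfile/condor_jobs.prom   # assumed collector dir
        tmp="${out}.tmp"
        {
          echo '# HELP condor_jobs Jobs per state on this schedd'
          echo '# TYPE condor_jobs gauge'
          condor_q -allusers -af JobStatus | sort | uniq -c |
          while read -r count status; do
            case "$status" in
              1) state=idle ;;
              2) state=running ;;
              5) state=held ;;
              *) state="status_${status}" ;;
            esac
            echo "condor_jobs{state=\"${state}\"} ${count}"
          done
        } > "$tmp" && mv "$tmp" "$out"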
      • 12:20
        Workshop Wrap-Up and Goodbye 10m
        Speaker: Chris Brew (Science and Technology Facilities Council STFC (GB))
    • 12:30 14:00
      Lunch 1h 30m Colloquium room

      Colloquium room

      Nikhef

      Nikhef Science Park 105 1098 XG Amsterdam