Sep 21 – 25, 2020
HTCondor in Production: Seamlessly automating maintenance, OS and HTCondor updates, all integrated with HTCondor's scheduling

Sep 24, 2020, 3:35 PM

Oliver Freyermuth (University of Bonn (DE))


Our HTC cluster, which runs HTCondor, was set up at the University of Bonn in 2017/2018.
All infrastructure is fully puppetised, including the HTCondor configuration.

OS updates are fully automated, and reboots required for security patches are scheduled in a staggered fashion, with all draining nodes backfilled with short jobs to maximize throughput.
Draining can also be scheduled for planned maintenance periods (with optional backfilling), and tasks can be queued for execution before a machine is rebooted or shut down.
This is combined with a series of automated health checks that cover a wide range of temporary and long-term machine failures and overloads, with monitoring performed using Zabbix.
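
The staggered scheduling idea can be illustrated with a minimal Python sketch. This is not the tooling used at Bonn, which the abstract does not describe in detail. The function names, the node names, and the `max_concurrent` and `drain_window` parameters are all hypothetical, and the sketch assumes that nodes are drained in fixed-size waves and that a short job may backfill a draining node only if it finishes before the planned reboot.

```python
from datetime import datetime, timedelta


def plan_staggered_reboots(nodes, start, max_concurrent, drain_window):
    """Assign each node a drain start time in waves (hypothetical helper).

    At most `max_concurrent` nodes share a wave, and consecutive waves
    are separated by `drain_window`, so only part of the cluster is
    draining at any one time.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    plan = {}
    for index, node in enumerate(nodes):
        wave = index // max_concurrent  # 0 for the first wave, 1 for the next, ...
        plan[node] = start + wave * drain_window
    return plan


def fits_backfill(job_runtime, now, reboot_at):
    """Return True if a job started now would finish before the reboot."""
    return now + job_runtime <= reboot_at


# Illustrative values only: three made-up nodes, two per wave, 6-hour waves.
start = datetime(2020, 9, 24, 0, 0)
plan = plan_staggered_reboots(
    ["wn001", "wn002", "wn003"], start, max_concurrent=2,
    drain_window=timedelta(hours=6),
)
for node, drain_start in plan.items():
    print(node, drain_start.isoformat())
```

With these assumed values, the first two nodes start draining together and the third follows one drain window later. In a real setup the backfill decision would rest on the runtime limits that jobs declare to the scheduler rather than on a standalone check like `fits_backfill`.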

In the last year, heterogeneous resources with different I/O capabilities have been integrated and MPI support has been added. All jobs run inside Singularity containers, which also allows for interactive graphical sessions with GPU access.

Oliver Freyermuth (University of Bonn (DE))


Peter Wienemann (University of Bonn (DE))

