21–25 Sept 2020
(teleconference only)
Europe/Paris timezone

HTCondor in Production: Seamlessly automating maintenance, OS and HTCondor updates, all integrated with HTCondor's scheduling

24 Sept 2020, 15:35
20m
https://cern.zoom.us/j/97987309455

HTCondor user presentations Workshop session

Speaker

Oliver Freyermuth (University of Bonn (DE))

Description

Our HTCondor-based HTC cluster was set up at Bonn University in 2017/2018.
All infrastructure is fully puppetised, including the HTCondor configuration.

OS updates are fully automated, and the reboots required for security patches are scheduled in a staggered fashion, with all draining nodes backfilled with short jobs to maximize throughput.
Draining can also be scheduled for planned maintenance periods (with optional backfilling), and tasks to be executed before a machine is rebooted or shut down can be queued.
This is combined with a series of automated health checks that cover a wide range of temporary and long-term machine failures or overloads, and with monitoring performed using Zabbix.
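
The staggered draining with backfilling described above can be illustrated with a minimal sketch. It is not the setup presented in the talk and it does not call any HTCondor API. The names `Node`, `drain_minutes`, `stagger_reboots` and `fits_backfill` are hypothetical, and the inputs (remaining runtime per node, expected job length) are assumed to be known in advance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    # Assumed input: minutes until the longest job currently running
    # on this node finishes, i.e. how long draining will take.
    drain_minutes: int


def stagger_reboots(nodes: List[Node], max_parallel: int) -> List[List[Node]]:
    """Split nodes into reboot batches of at most max_parallel nodes.

    Nodes that will be empty soonest are placed in the earliest batch,
    so only a bounded part of the cluster is draining at any time.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    ordered = sorted(nodes, key=lambda n: n.drain_minutes)
    return [ordered[i:i + max_parallel]
            for i in range(0, len(ordered), max_parallel)]


def fits_backfill(job_minutes: int, node: Node) -> bool:
    """Return True if a short job would finish before the node is drained."""
    return job_minutes <= node.drain_minutes


# Hypothetical example data.
cluster = [Node("wn001", 120), Node("wn002", 30), Node("wn003", 60)]
batches = stagger_reboots(cluster, max_parallel=2)
batch_names = [[n.name for n in batch] for batch in batches]
```

In this sketch a 20-minute job would be accepted as backfill on `wn002`, while a 45-minute job would not, since it would delay the reboot.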

In the last year, heterogeneous resources with different I/O capabilities have been integrated and MPI support has been added. All jobs run inside Singularity containers, which also allow for interactive, graphical sessions with GPU access.

Desired slot length 20
Speaker release Yes

Primary author

Oliver Freyermuth (University of Bonn (DE))

Co-author

Peter Wienemann (University of Bonn (DE))
