HEPiX Autumn 2020 online Workshop

Name: HEPiX Autumn 2020 online Workshop
Start: 2020-10-12T08:00:00+02:00
End: 2020-10-16T21:00:00+02:00
Location: Online Workshop

12–16 Oct 2020

Europe/Paris timezone

Organisers

hepix-2020autumn-support@hepix.org

Operating a production HTCondor cluster: Seamlessly automating maintenance, OS and HTCondor updates with (almost) zero downtimes

13 Oct 2020, 11:20

20m

Computing and Batch Services

Speaker: Dr Oliver Freyermuth (University of Bonn (DE))

Our HTC cluster using HTCondor was set up at the University of Bonn in 2017/2018.
All infrastructure is fully puppetised, including the HTCondor configuration.

OS updates are fully automated, and the reboots required for security patches are scheduled in a staggered fashion, with all draining nodes backfilled with short jobs to maximize throughput.
Draining can also be scheduled for planned maintenance periods (with optional backfilling), and tasks to be executed before a machine is rebooted or shut down can be queued. This is combined with a series of automated health checks that cover a wide range of temporary and long-term machine failures and overloads, and with monitoring performed using Zabbix.
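
As a rough illustration of the staggered scheduling and backfilling described above, the following Python sketch assigns reboot times to nodes in batches and decides whether a short job still fits on a draining node. It is not the Bonn implementation: the function names, the batch size, the interval and the rule "accept a job only if its maximum runtime ends before the reboot" are assumptions made for this example.

```python
from datetime import datetime, timedelta


def stagger_reboots(nodes, start, batch_size, interval):
    """Assign a reboot time to each node (hypothetical helper).

    Nodes are grouped into consecutive batches of `batch_size`; each batch
    is rebooted `interval` after the previous one, so only a fraction of
    the cluster is draining at any given time.
    """
    schedule = {}
    for index, node in enumerate(nodes):
        batch = index // batch_size
        schedule[node] = start + batch * interval
    return schedule


def accepts_backfill(max_runtime, now, reboot_at):
    """Return True if a job with the given maximum runtime would finish
    before the node's scheduled reboot (assumed backfill rule)."""
    return now + max_runtime <= reboot_at


if __name__ == "__main__":
    nodes = ["wn01", "wn02", "wn03", "wn04", "wn05"]  # made-up node names
    plan = stagger_reboots(
        nodes,
        start=datetime(2020, 10, 13, 8, 0),
        batch_size=2,
        interval=timedelta(hours=6),
    )
    for node, when in plan.items():
        print(node, when.isoformat())
```

In a real HTCondor setup a rule like `accepts_backfill` would more likely live in the machines' job acceptance policy than in a separate script; the sketch only shows the scheduling logic.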

In the last year, heterogeneous resources with different I/O capabilities have been integrated and MPI support has been added. All jobs run inside Singularity containers, which also allows for interactive, graphical sessions with GPU access.

Combining increasingly heterogeneous resources and different data centre locations in one cluster allows operations with almost zero (full) downtime. This talk will present some examples of how the automation can be leveraged for different interventions and how the impact on users and on cluster CPU efficiency is minimized.

Primary author: Dr Oliver Freyermuth (University of Bonn (DE))

Co-author: Peter Wienemann (University of Bonn (DE))

Presentation materials: HTCondor_Bonn.pdf
