HTCondor Workshop Autumn 2020
(teleconference only)
The HTCondor Workshop Autumn 2020 was held as a virtual event via videoconferencing, due to the COVID-19 pandemic and the related travel restrictions. The workshop was the sixth edition of the series usually hosted in Europe, after the successful events at CERN in December 2014, ALBA in February 2016, DESY in June 2017, RAL in September 2018 and JRC in September 2019. The workshops are opportunities for novice and experienced users of HTCondor to learn, get help and exchange experience with each other and with the HTCondor developers and experts. They are open to everyone worldwide and consist of presentations, tutorials and "office hours" for consultancy, with the HTCondor-CE (Compute Element) covered prominently as well. They also feature presentations by users on their projects and experiences. The workshops address participants from academia and research as well as from commercial entities.
2:00 PM  Hallway time https://cern.zoom.us/j/94530716058
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Michel Jouvin (Université Paris-Saclay (FR)), Catalin Condurache (EGI Foundation), Gregory Thain (University of Wisconsin-Madison)
2. State of Distributed High Throughput Computing
Speaker: Miron Livny (University of Wisconsin-Madison)
3. A User's Introduction to HTCondor and Job Submission
Speaker: Christina Koch (University of Wisconsin-Madison)
4. Manage Workflows with HTCondor DAGMan
Speaker: Lauren Michael (UW Madison)
4:40 PM  Break
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Chris Brew (Science and Technology Facilities Council STFC (GB)), Christoph Beyer, Helge Meinhard (CERN)
5. HTCondor deployment at CC-IN2P3
In recent months, HTCondor has been the main workload management system for the Grid environment at CC-IN2P3. The computing cluster consists of ~640 worker nodes of various types, which deliver a total of ~27K execution slots (including hyperthreading). The system supports the LHC experiments (ALICE, ATLAS, CMS and LHCb) under the umbrella of the Worldwide LHC Computing Grid (WLCG), for which CC-IN2P3 is a Tier-1 site, as well as various other experiments and research groups under the umbrella of the European Grid Infrastructure (EGI). This presentation will give a brief description of the installation and the configuration aspects of the HTCondor cluster. In addition, we will present the use of the HTCondor-CE grid gateway at CC-IN2P3.
Speaker: Dr Emmanouil Vamvakopoulos (CCIN2P3/CNRS)
6. Replacing LSF with HTCondor: the INFN-T1 experience
CNAF started working with HTCondor during spring 2018, planning to move its Tier-1 Grid site, based on CREAM-CE and the LSF batch system, to HTCondor-CE and HTCondor. The phase-out of CREAM and LSF was completed by spring 2020. This talk describes our experience with the new system, with particular focus on HTCondor.
Speaker: Stefano Dal Pra (Universita e INFN, Bologna (IT))
7. HTCondor Philosophy and Architecture Overview
Speaker: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
6:05 PM  Hallway time https://cern.zoom.us/j/94530716058
2:00 PM  Hallway time https://cern.zoom.us/j/94530716058
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Helge Meinhard (CERN), Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno), Michel Jouvin (Université Paris-Saclay (FR))
8. What is new in HTCondor? What is upcoming?
Speaker: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
9. Installing HTCondor
Speaker: Mark Coatsworth (UW Madison)
10. Pslots, draining, backfill: Multicore jobs and what to do with them
Speaker: Gregory Thain (University of Wisconsin-Madison)
11. HTC at DESY
In 2016 the local (BIRD) and Grid batch facilities at DESY were migrated to HTCondor. This talk will cover some of the experiences and developments we have seen since then, as well as the plans for the future of HTC at DESY.
Speaker: Christoph Beyer
12. HTCondor at GRIF
GRIF is a distributed Tier-2 WLCG site grouping four laboratories in the Paris region (IJCLab, IRFU, LLR, LPNHE). Multiple HTCondor instances have been deployed at GRIF for several years. In particular, an ARC-CE + HTCondor system provides access to the computing resources of IRFU, and a distributed HTCondor pool, with CREAM-CE and HTCondor-CE gateways, gives unified access to the IJCLab and LLR resources. We propose a short talk (10 min max) giving a quick overview of the HTCondor installations at GRIF and some feedback from the GRIF grid administrators.
Speaker: Andrea Sartirana (Centre National de la Recherche Scientifique (FR))
4:25 PM  Break
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Christoph Beyer, Michel Jouvin (Université Paris-Saclay (FR)), Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
13. Archival, anonymization and presentation of HTCondor logs with GlideinMonitor
GlideinWMS is a pilot framework that provides uniform and reliable HTCondor clusters using heterogeneous and unreliable resources. The Glideins are pilot jobs that are sent to the selected nodes, test them, set them up as desired by the user jobs, and ultimately start an HTCondor startd that joins an elastic pool. These Glideins collect information that is very useful for evaluating the health and efficiency of the worker nodes, and invaluable for troubleshooting when something goes wrong. This includes local stats, the results of all the tests, and the HTCondor log files, and it is packed up and sent to the GlideinWMS Factory.
Access to these logs by developers requires a lengthy back-and-forth with Factory operators and manual digging into files. Furthermore, these files contain information such as user IDs, email addresses and IP addresses that we want to protect and limit access to.
GlideinMonitor is a web application that makes these logs more accessible and useful:
- it organizes the logs in an efficient compressed archive;
- it allows users to search, unpack and inspect them, all in a convenient and secure web interface;
- via plugins such as the log anonymizer, it can redact protected information while preserving the parts useful for troubleshooting.
Speaker: Marco Mambelli (University of Chicago (US))
14. Status and Plans of HTCondor Usage in CMS
The resource needs of high-energy physics experiments such as CMS at the LHC are expected to grow in terms of the amount of data collected and the computing resources required to process these data. Computing needs in CMS are addressed through the "Global Pool", a vanilla dynamic HTCondor pool created through the glideinWMS software. With over 250k cores, the CMS Global Pool is the biggest HTCondor pool in the world, living at the forefront of HTCondor's limits and facing unique challenges. In this contribution, we will give an overview of the Global Pool, focusing on the workflow managers connected to it and the unique HTCondor features they use. We will then describe the monitoring tools developed to make sure the pool works correctly. We will also analyze the efficiency and scalability challenges faced by the CMS experiment. Finally, plans and challenges for the future will be addressed.
Speaker: Marco Mascheroni (Univ. of California San Diego (US))
15. Classified Ads in HTCondor
Speaker: James Frey (University of Wisconsin Madison (US))
16. Job Submission Transformations
Speaker: John Knoeller (University of Wisconsin-Madison)
Office hour
17. Administrating HTCondor at a local site https://cern.zoom.us/j/92420227039
For system admins installing and/or configuring an HTCondor pool on their campus
18. General Office Hour Lobby https://cern.zoom.us/j/97987309455
For general questions, open discussions, getting started
19. HTCondor-CE, Grid, and Federation https://cern.zoom.us/j/98439799794
Questions about grid/cloud: CE, OSG, WLCG, EGI, bursting to HPC/Cloud, etc.
20. Using HTCondor https://cern.zoom.us/j/94530716058
For people who want to submit workflows and have questions about using the command line tools or developer APIs (Python, REST)
2:00 PM  Hallway time https://cern.zoom.us/j/94530716058
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Catalin Condurache (EGI Foundation), Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA), Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno)
21. HTCondor-CE Overview
Speaker: Brian Hua Lin (University of Wisconsin - Madison)
22. Replacing CREAM-CE with HTCondor-CE: the INFN-T1 experience
CNAF started working with the HTCondor Compute Element in May 2018, planning to move its Tier-1 Grid site, based on CREAM-CE and the LSF batch system, to HTCondor-CE and HTCondor. The phase-out of CREAM and LSF was completed by spring 2020. This talk describes our experience with the new system, with particular focus on HTCondor-CE.
Speaker: Stefano Dal Pra (Universita e INFN, Bologna (IT))
23. HTCondor-CE Configuration
Speaker: Brian Hua Lin (University of Wisconsin - Madison)
24. How I Learned to Stop Worrying and Love the HTCondor-CE
This contribution provides first-hand experience of adopting HTCondor-CE at the German WLCG sites DESY and KIT. Covering two sites plus a remote setup for RWTH Aachen, we share our lessons learned in pushing HTCondor-CE to production. With a comprehensive recap ranging from the technical setup, via a detour through surviving the ecosystem and accounting, to the practical dos and don'ts, this contribution is suitable for everyone who is considering, struggling with, or already successful in adopting HTCondor-CE.
Speaker: Max Fischer (Karlsruhe Institute of Technology)
4:25 PM  Break
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Gregory Thain (University of Wisconsin-Madison), Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno), Christoph Beyer
25. HTCondor-CE Live Installation
Speaker: Brian Hua Lin (University of Wisconsin - Madison)
26. HTCondor-CE Troubleshooting
Speaker: Brian Hua Lin (University of Wisconsin - Madison)
27. What is next for the HTCondor-CE?
Speaker: Brian Hua Lin (University of Wisconsin - Madison)
28. Running a large multi-purpose HTCondor pool at CERN
A review of how we run and operate a large multi-purpose HTCondor pool with grid submission, local submission and dedicated resources: using grid and local submission to drive utilisation of shared resources, and using transforms and routers to ensure that jobs end up on the correct resources and are accounted for correctly. We will review our automation and monitoring tools, together with the integration of externally hosted and opportunistic resources.
Speaker: Ben Jones (CERN)
29
Challenge of the Migration of the RP-Coflu-Cluster @ CERN
The Coflu Cluster, also known as the Radio-Protection (RP) Cluster, started as an experimental project at CERN involving a few standard desktop computers, in 2007. It was envisaged to have a job scheduling system and a common storage space so that multiple Fluka simulations could be run in parallel and monitored, utilizing a custom built and easy-to-use web-interface.
Abstract The infrastructure is composed of approximately 500 cores, and relies on HTCondor as an open-source high-throughput computing software framework for the execution of Fluka simulation jobs. Before the migration that was carried out over these last three months, nodes where running under Scientific Linux 6 and HT Condor mostly in the latest HT Condor 7 version. The web interface—based on JavaScript and PHP—allowing job submission was relying intensively on the Quill database hosted in CERN's “database on demand” infrastructure.
Abstract In this talk, we discuss the challenges of migrating HTCondor to its latest version on our infrastructure, which required solving different challenges: replacing the Quill database used intensively in the web interface for supporting the submission and management of jobs, updating a whole system with the least interruption of the production, by gradually migrating its components to both the latest version of HT Condor and Centos 7.
Abstract We then terminate this presentation by the project of migrating this infrastructure to the CERN HT Condor pool.
Speaker: Xavier Eric Ouvrard (CERN)
6:00 PM  Hallway time https://cern.zoom.us/j/94530716058
2:00 PM  Hallway time https://cern.zoom.us/j/94530716058
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA), Chris Brew (Science and Technology Facilities Council STFC (GB)), Helge Meinhard (CERN)
31. Combining cloud-native workflows with HTCondor jobs
The majority of physics analysis jobs at CERN run on high-throughput computing batch systems such as HTCondor. However, not everyone has access to computing farms (e.g. a theorist wanting to make use of CMS Open Data), and for reproducible workflows more backend-agnostic approaches are desirable. The industry standard here is containers orchestrated with Kubernetes, for which computing resources can easily be acquired on demand using public cloud offerings. This causes a disconnect between how current HEP physics analyses are performed and how they could be reused: when developing a fully "cloud-native" computing approach for physics analysis, one still needs access to the tens of thousands of cores available on classical batch systems to have sufficient resources for the data processing.
In this presentation, I will demonstrate how complex physics analysis workflows that are written and scheduled using a rather small Kubernetes cluster can make use of CERN's HTCondor installation. An "operator" is used to submit jobs to HTCondor and, once they have completed, collect the results and continue the workflow in the cloud. The audience will also learn about the important role that software containers and Kubernetes play in the context of open science.
Speaker: Clemens Lange (CERN)
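As a rough illustration of the submit-and-wait interaction such an operator relies on (not the operator from the talk), the sketch below uses recent versions of the public htcondor Python bindings to queue a placeholder job and poll until it leaves the queue; the executable, file names and polling interval are arbitrary examples.

```python
# Minimal sketch (not the operator from the talk): submit a job to the local
# HTCondor schedd with the Python bindings and poll until it leaves the queue.
import time
import htcondor  # HTCondor Python bindings

submit = htcondor.Submit({
    "executable": "/usr/bin/sleep",   # placeholder payload
    "arguments": "60",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
    "request_cpus": "1",
    "request_memory": "128MB",
})

schedd = htcondor.Schedd()            # connect to the local schedd
result = schedd.submit(submit)        # queue one job
cluster = result.cluster()

# Poll the queue; once the job no longer shows up, it has completed or been removed.
while schedd.query(constraint=f"ClusterId == {cluster}", projection=["JobStatus"]):
    time.sleep(30)
print(f"Cluster {cluster} has left the queue; collect outputs and continue the workflow.")
```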
32. HTCondor in Production: Seamlessly automating maintenance, OS and HTCondor updates, all integrated with HTCondor's scheduling
Our HTC cluster using HTCondor was set up at Bonn University in 2017/2018.
All infrastructure is fully puppetised, including the HTCondor configuration. OS updates are fully automated, and necessary reboots for security patches are scheduled in a staggered fashion, backfilling all draining nodes with short jobs to maximize throughput.
Additionally, draining can also be scheduled for planned maintenance periods (with optional backfilling), and tasks to be executed before a machine is rebooted or shut down can be queued.
This is combined with a series of automated health checks with large coverage of temporary and long-term machine failures or overloads, and with monitoring performed using Zabbix. In the last year, heterogeneous resources with different I/O capabilities have been integrated and MPI support has been added. All jobs run inside Singularity containers, also allowing for interactive, graphical sessions with GPU access.
Speaker: Oliver Freyermuth (University of Bonn (DE))
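As a loose illustration of the kind of scheduled draining described above (not the Bonn site's actual tooling), the sketch below queries the collector for machines advertising a hypothetical NeedsReboot attribute and drains them gracefully with condor_drain; the attribute name is invented for this example, and any backfill policy would live in the site's own START expressions.

```python
# Illustrative sketch only, not the Bonn tooling: find execute nodes that advertise
# a (hypothetical) NeedsReboot attribute and drain them gracefully via condor_drain.
import subprocess
import htcondor

collector = htcondor.Collector()
startds = collector.query(
    htcondor.AdTypes.Startd,
    constraint="NeedsReboot =?= True",   # hypothetical custom machine attribute
    projection=["Machine"],
)

machines = {ad["Machine"] for ad in startds}
for machine in sorted(machines):
    # Graceful draining lets running jobs finish before the node empties out.
    subprocess.run(["condor_drain", "-graceful", machine], check=False)
```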
33. HTCondor Annex: Bursting into Clouds
Speaker: Todd Lancaster Miller (University of Wisconsin Madison (US))
34. CHTC Partners with Google Cloud to Make HTCondor Available on the Google Cloud Marketplace
We're excited to share the launch of the HTCondor offering on the Google Cloud Marketplace, built by Google software engineer Cheryl Zhang with advice and support from the experts at the CHTC. Come see how quickly and easily you can start using HTCondor on Google Cloud with this new solution.
Speaker: Cheryl Zhang (Google Cloud)
4:30 PM  Break
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Catalin Condurache (EGI Foundation), Helge Meinhard (CERN), Gregory Thain (University of Wisconsin-Madison)
35. HTCondor Offline: Running on isolated HPC Systems
Speaker: James Frey (University of Wisconsin Madison (US))
36. HEPCloud use of HTCondor to access HPC Centers
HEPCloud is working to integrate isolated HPC centers, such as Theta at Argonne National Laboratory, into the pool of resources made available to its user community. Major obstacles to using these centers include limited or no outgoing networking and restrictive security policies. HTCondor has provided a mechanism to execute jobs in a manner that satisfies the constraints and policies. In this talk we will discuss the various ways we use HTCondor to collect and execute jobs on Theta.
Speaker: Anthony Richard Tiradani (Fermi National Accelerator Lab. (US))
37
HPC backfill with HTCondor at CERN
The bulk of computing at CERN consists of embarrassingly parallel HTC use cases (Jones, Fernandez-Alavarez et al), however for MPI applications for e.g. Accelerator Physics and Engineering, a dedicated HPC cluster running SLURM is used.
In order to optimize utilization of the HPC cluster, idle nodes in SLURM cluster are backfilled with Grid HTC workloads. This talk will detail the CondorCE setup that enables backfill to the SLURM HPC cluster with pre-emptable Grid jobs.Speaker: Pablo Llopis Sanmillan (CERN) -
38
HTCondor monitoring at ScotGrid Glasgow
Our Tier2 cluster (ScotGrid, Glasgow) uses HTCondor as batch system, combined with ARC-CE as front-end for job submission and ARGUS for authentication and user mapping.
On top of this, we have built a central monitoring system based on Prometheus that collects, aggregates and displays metrics on custom Grafana dashboards. In particular, we extract jobs info by regularly parsing the output of 'condor_status' on the condor_manager, scheduler, and worker nodes.
A collection of graphs gives a quick overlook of cluster performance and helps identify rising issues. Logs from all nodes and services are also collected to a central Loki server and retained over time.Speaker: Emanuele Simili (University of Glasgow)
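For readers unfamiliar with this pattern, the sketch below shows one minimal way to expose HTCondor pool metrics to Prometheus; it is a generic illustration using the htcondor Python bindings and the prometheus_client library, not the ScotGrid setup (which parses condor_status output), and the metric names and port are arbitrary.

```python
# Minimal sketch (not the ScotGrid setup): expose a couple of pool-level metrics
# to Prometheus using the htcondor Python bindings and prometheus_client.
import time
import htcondor
from prometheus_client import Gauge, start_http_server

total_slots = Gauge("condor_slots_total", "Number of slot ads in the pool")
claimed_slots = Gauge("condor_slots_claimed", "Number of slots in Claimed state")

def scrape() -> None:
    collector = htcondor.Collector()          # defaults to the local pool
    ads = collector.query(htcondor.AdTypes.Startd, projection=["State"])
    total_slots.set(len(ads))
    claimed_slots.set(sum(1 for ad in ads if ad.get("State") == "Claimed"))

if __name__ == "__main__":
    start_http_server(9118)                   # arbitrary example port for /metrics
    while True:
        scrape()
        time.sleep(60)
```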
Office hour
39. Administrating HTCondor at a local site https://cern.zoom.us/j/92420227039
For system admins installing and/or configuring an HTCondor pool on their campus
40. General Office Hour Lobby https://cern.zoom.us/j/97987309455
For general questions, open discussions, getting started
41. HTCondor-CE, Grid, and Federation https://cern.zoom.us/j/98439799794
Questions about grid/cloud: CE, OSG, WLCG, EGI, bursting to HPC/Cloud, etc.
42. Using HTCondor https://cern.zoom.us/j/94530716058
For people who want to submit workflows and have questions about using the command line tools or developer APIs (Python, REST)
2:00 PM  Hallway time https://cern.zoom.us/j/94530716058
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Chris Brew (Science and Technology Facilities Council STFC (GB)), Gregory Thain (University of Wisconsin-Madison), Catalin Condurache (EGI Foundation)
43. HTCondor at Nikhef
The Physics Data Processing group at Nikhef is developing a Condor-based cluster, after a 19-year absence from the HTCondor community. This talk will discuss why we are developing this cluster and present our plans and the results so far. It will also spend a slide or two on the potential to use HTCondor for other services we provide.
Speaker: Jeff Templon (Nikhef National institute for subatomic physics (NL))
44. HTCondor's Python API - The Python Bindings
Speaker: Jason Patton (UW Madison)
45. HTMap: Pythonic High Throughput Computing
Speaker: Todd Tannenbaum (Univ of Wisconsin-Madison, Wisconsin, USA)
46. Lightweight Site-Specific Dask Integration for HTCondor at CHTC
Dask is an increasingly popular tool for both low-level and high-level parallelism in the scientific Python ecosystem. I will discuss efforts at the Center for High Throughput Computing at UW-Madison to enable users to run Dask-based work on our HTCondor pool. In particular, we have developed a "wrapper package", based on existing work in the Dask ecosystem, that lets Dask spawn workers in the CHTC pool without users needing to be aware of the infrastructure constraints we are operating under. We believe this approach is useful as a lightweight alternative to dedicated, bespoke infrastructure like Dask Gateway.
Speaker: Mr Matyas Selmeci (University of Wisconsin - Madison)
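For orientation, this is roughly the kind of workflow such an integration enables; the sketch below uses the community dask-jobqueue package rather than the CHTC-specific wrapper described in the talk, and the resource values and worker count are arbitrary examples.

```python
# Generic sketch using the community dask-jobqueue package (not the CHTC wrapper
# from the talk): start Dask workers as HTCondor jobs and run a trivial computation.
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(cores=1, memory="2 GB", disk="2 GB")  # one worker per HTCondor job
cluster.scale(jobs=4)                                            # submit four worker jobs

client = Client(cluster)
futures = client.map(lambda x: x ** 2, range(100))               # fan work out to the workers
print(sum(client.gather(futures)))                               # gather results on the submit side
```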
47. REST API to HTCondor
Speaker: Matyas Selmeci (University of Wisconsin - Madison)
4:15 PM  Break
Workshop session https://cern.zoom.us/j/97987309455
Conveners: Jose Flix Molina (Centro de Investigaciones Energéticas Medioambientales y Tecno), Christoph Beyer, Chris Brew (Science and Technology Facilities Council STFC (GB))
48. HTCondor Security: Philosophy and Administration Changes
Speakers: Zach Miller, Brian Paul Bockelman (University of Wisconsin Madison (US))
49. From Identity-Based Authorization to Capabilities: SciTokens, JWTs, and OAuth
In this presentation, I will introduce the SciTokens model (https://scitokens.org/) for federated capability-based authorization in distributed scientific computing. I will compare the OAuth and JWT security standards with X.509 certificates, and I will discuss ongoing work to migrate HTCondor use cases from certificates to tokens.
Speaker: Jim Basney (University of Illinois)
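Since a SciToken is a JSON Web Token, its claims can be inspected with any generic JWT library; the sketch below is an illustrative example using PyJWT 2.x to dump a token's claims without verifying the signature, which is useful for debugging what a token asserts but must never substitute for real verification. The claim names shown are standard JWT/SciTokens claims; the token file path in the comment is only an example.

```python
# Illustrative sketch: decode a bearer token's claims WITHOUT signature verification
# (debugging only, never for authorization decisions).
import jwt  # PyJWT

def show_claims(token: str) -> None:
    claims = jwt.decode(token, options={"verify_signature": False})
    print("issuer :", claims.get("iss"))    # who minted the token
    print("subject:", claims.get("sub"))    # whom/what it was issued to
    print("scope  :", claims.get("scope"))  # capabilities granted (e.g. read:/data)
    print("expires:", claims.get("exp"))    # POSIX timestamp

# Example usage with a bearer token stored in a file:
# show_claims(open("bearer_token_file").read().strip())
```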
50. Allow HTCondor jobs to securely access services via OAuth token workflow
Speakers: Jason Patton (UW Madison), Zach Miller
5:55 PM  Hallway time https://cern.zoom.us/j/94530716058