Second K8s-HEP Meetup

America/Chicago
Virtual (Zoom)

Lincoln Bryant
Description

This is the second "meetup" of folks dealing with the challenges of applying Kubernetes to computing in high energy physics. The first (in-person) meeting was held at UChicago in January 2020: https://indico.cern.ch/event/882955/. The meetup offers an opportunity to share experience, expertise, and tips for Kubernetes and cloud-native technologies from both application and infrastructure perspectives. While the context is high energy physics, many contributions may be generally applicable to scientific computing and the creation of cyberinfrastructure. All are welcome to attend.

You must register to attend; we will send out Zoom connection details on the morning of the event.

Live notes: https://docs.google.com/document/d/1s0KAl-LNnn1vvkH-Twiu909sLsaQdj6wo0vWJ9WUv-E/edit?usp=sharing

Registration
Participants
  • Alessandra Forti
  • Andrew Eckart
  • Anthony Richard Tiradani
  • Armen Vartapetian
  • Benjamin Galewsky
  • Brandon White
  • Brian Lin
  • Brian Paul Bockelman
  • Burt Holzman
  • Carl Lundstedt
  • Ceyhun Uzunoglu
  • Christophe Bonnaud
  • Christopher Hollowell
  • Christopher Weaver
  • Costin Caramarcu
  • Daniele Spiga
  • David Jordan
  • Diego Ciangottini
  • Doug Benjamin
  • Federica Legger
  • Fernando Harald Barreiro Megino
  • Fernando Meireles
  • Glenn Cooper
  • Gordon Watts
  • Horst Severini
  • Humaira Abdul Salam
  • Ilija Vukotic
  • Ivo Jimenez
  • Jason Stidd
  • Jayjeet Chakraborty
  • Jeff LeFevre
  • Joe Breen
  • John Graham
  • John Thiltges
  • Judith Stephen
  • Kenyi Paolo Hurtado Anampa
  • Lincoln Bryant
  • Lindsey Gray
  • Lorena Lobato Pardavila
  • Luis Fernandez Alvarez
  • Maria Acosta Flechas
  • Mark Neubauer
  • Matyas Selmeci
  • Michael Schuh
  • Muhammad Akhdhor
  • Muhammad Imran
  • Oksana Shadura
  • Panos Paparrigopoulos
  • Pascal Paschos
  • Philippe Laurens
  • Ricardo Brito Da Rocha
  • Robert William Gardner Jr
  • Ryan Taylor
  • Shawn McKee
  • Sinclert Perez Castano
  • Soundar Rajendran
  • Spyridon Trigazis
  • Thomas George Hartland
  • Todd Tannenbaum
  • Tommaso Tedeschi
  • Valentin Y Kuznetsov
  • William Strecker-Kellogg
  • Xin Zhao
  • Zhifei Yang
  • Tuesday, 1 December
    • 09:00–11:00
      Block 1: Presentations
      • 09:00
        Welcome! What's this? 10m
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • 09:10
        Fermilab Experience with OKD (OpenShift) 20m

        Fermilab has made the strategic decision to deploy OKD, the open source version of Red Hat OpenShift, for Kubernetes container management. We will discuss our experience so far with OKD and describe some of the challenges we faced deploying a variety of applications.

        Speaker: Anthony Richard Tiradani (Fermi National Accelerator Lab. (US))
      • 09:30
        Debugging Kubernetes pod throughput with Calico CNI 20m

        This talk explores how the kubelet, with Calico as the CNI plugin, depends on the performance of the Kubernetes API server in order to start pods quickly.

        Speaker: Mr Thomas George Hartland (CERN)
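
        For context, the kubelet's client-side rate limits toward the API server are one of the knobs that surface in this kind of investigation. A minimal sketch of a KubeletConfiguration raising them (values are illustrative only, not recommendations from the talk):

        # kubelet config sketch: the client-side API rate limits can throttle
        # pod startup at scale (defaults around this era were QPS 5 / burst 10)
        apiVersion: kubelet.config.k8s.io/v1beta1
        kind: KubeletConfiguration
        kubeAPIQPS: 50
        kubeAPIBurst: 100
        serializeImagePulls: false   # pull images in parallel
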
      • 09:50
        K8s autoscaling based on custom metrics: two example applications, CMSWEB and HTCondor in the CMS Analysis Facility@INFN 20m

        Out of the box, Kubernetes supports horizontal pod autoscaling based only on predefined pod metrics (CPU and memory usage). Therefore, to achieve a truly green, elastic cloud model that optimizes resource usage, a key step is to integrate autoscaling based on custom metrics, which requires third-party components.
        In this work we demonstrate horizontal pod autoscaling based on custom metrics: metrics are collected by a Prometheus server, then manipulated and made available to the Kubernetes-native Horizontal Pod Autoscaler (HPA) resources.
        We show how we apply this feature to two HEP use cases: in the first, the solution is applied to the CMSWEB (CMS web services) infrastructure; in the second, it is used to enhance the elasticity of an analysis facility prototype on INFN-Cloud by automatically scaling HTCondor instances.

        Speaker: Tommaso Tedeschi (Universita e INFN, Perugia (IT))
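
        For reference, a minimal sketch of such a custom-metric HPA, assuming a Prometheus Adapter exposes a hypothetical condor_idle_jobs metric for a hypothetical htcondor-worker Deployment (all names are illustrative, not taken from the talk):

        apiVersion: autoscaling/v2beta2
        kind: HorizontalPodAutoscaler
        metadata:
          name: htcondor-worker
        spec:
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: htcondor-worker        # hypothetical worker Deployment
          minReplicas: 1
          maxReplicas: 50
          metrics:
            - type: Pods
              pods:
                metric:
                  name: condor_idle_jobs   # hypothetical metric via Prometheus Adapter
                target:
                  type: AverageValue
                  averageValue: "10"       # target ~10 idle jobs per pod
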
      • 10:10
        Lightweight integration of Kubernetes clusters for ATLAS batch processing 20m

        The PanDA team has evaluated native Kubernetes job submission as a way to process ATLAS workloads and to enable immediate integration of major cloud computing providers. This model also offers a novel way to set up lightweight compute sites, without the need to deploy a full Grid stack.

        During the last year we have been running several queues on clusters set up by institutes associated with ATLAS (ASGC, CERN, University of Chicago, University of Victoria) and on cloud providers (Amazon and Google), focusing on increasing stability and efficiency.

        This contribution will discuss the advantages and challenges we have encountered, and briefly introduce ongoing work to integrate less trivial (not pleasingly parallel) workloads.

        Speaker: Fernando Harald Barreiro Megino (University of Texas at Arlington)
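
        As an illustration of the model, submitting a pilot as a native Kubernetes Job amounts to a manifest along these lines (image, queue name, and resource figures are hypothetical, not the actual PanDA/Harvester configuration):

        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: pilot-
        spec:
          backoffLimit: 0                 # do not let Kubernetes retry a failed pilot
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: pilot
                  image: example.org/atlas/pilot:latest   # hypothetical pilot image
                  env:
                    - name: PANDA_QUEUE
                      value: SOME_QUEUE                   # hypothetical queue name
                  resources:
                    requests:
                      cpu: "8"
                      memory: 16Gi
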
      • 10:30
        Lazy Image Pulling with Stargz 20m

        Container images provide reproducible environments, and container orchestration lets users parallelize and create elaborate workflows with tools like Argo or plain Kubernetes Jobs. It is easy to create very large images, and when parallelizing jobs the time and cost of pulling container images can increase significantly. Go developers proposed the seekable tar.gz format (stargz) to address this issue for their CI by downloading files from the container registry only when they are needed. This presentation describes the current state of lazy image loading with containerd and stargz, and presents the results of our benchmarks.

        Speaker: Spyridon Trigazis (CERN)
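
        On the cluster side the pod spec itself is unchanged; lazy pulling happens inside containerd, which must be configured to use the stargz snapshotter on each node, and the image must be converted to the eStargz format (e.g. with the ctr-remote tool from the stargz-snapshotter project). A rough sketch with a hypothetical image name:

        apiVersion: v1
        kind: Pod
        metadata:
          name: stargz-demo
        spec:
          containers:
            - name: app
              # hypothetical image previously converted to eStargz; files are
              # fetched from the registry lazily as the container reads them
              image: example.org/analysis:latest-esgz
              command: ["python", "run.py"]
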
    • 11:00–11:20
      Coffee break 20m
    • 11:20–13:00
      Overflow & Open Discussion
      • 11:20
        Reproducible and Scalable workflows for SkyhookDM experimentation on Kubernetes 20m

        Preparing a systems experiment environment requires setting up infrastructure, baselining it, installing dependencies and tools, running experiments, and plotting results, which, if done by hand, is cumbersome and error-prone. The same applies to researchers starting to experiment with Ceph or SkyhookDM, an extension of Ceph for running queries on tabular datasets stored as objects. To address this, we used Popper, a container-native workflow engine, to build scalable and reproducible workflows that automate an end-to-end pipeline for experimenting with Ceph and SkyhookDM deployed on Kubernetes via Rook.

        Speaker: Jayjeet Chakraborty (University of California, Santa Cruz)
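
        For flavor, a Popper workflow is a small YAML file whose steps each run in a container; a minimal sketch in the spirit of the pipeline described above (images and commands are placeholders, not the actual SkyhookDM workflow):

        steps:
          - id: deploy-rook
            uses: docker://example/kubectl:latest          # placeholder image
            runs: [sh, -c, "kubectl apply -f rook/"]
          - id: run-benchmark
            uses: docker://example/skyhook-bench:latest    # placeholder image
            runs: [sh, -c, "./run_queries.sh"]
          - id: plot-results
            uses: docker://example/python:3.8              # placeholder image
            runs: [python, plot_results.py]
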
      • 11:40
        Discussion 1h 20m
  • Wednesday, 2 December
    • 09:00–11:45
      Block 3: Presentations
      • 09:00
        Multi Cluster / Cloud Kubernetes for GPU Evaluation 20m

        GPUs are scarce resources in many of our centers, including CERN.

        This talk will briefly describe a multi-cloud deployment whose goal is to evaluate the performance of different workloads on all GPU types offered by GCP, Azure and AWS.

        It will include some details about setting up clusters and GPUs in each of these clouds, along with some preliminary results.

        Speaker: Ricardo Brito Da Rocha (CERN)
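
        Across providers the workload side stays uniform: a pod requests the nvidia.com/gpu extended resource and a node label selects the GPU model. A generic sketch (the label shown is the GKE one; Azure and AWS use their own, and the benchmark image is hypothetical):

        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-benchmark
        spec:
          restartPolicy: Never
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-v100   # GKE-style label
          containers:
            - name: bench
              image: example.org/gpu-bench:latest                 # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1   # one GPU via the NVIDIA device plugin
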
      • 09:20
        Running a multi-tenant Kubernetes with GitOps 20m

        Starting in October 2020, the PATh project is making a concerted effort to transition the centrally-run OSG services (such as websites, software repositories, information services) from ad-hoc deployment models to Kubernetes.

        To do so, we needed a Kubernetes "home" and an operational model! In this talk, we'll give an overview of the work going on in the Tiger cluster at Morgridge, our current GitOps-based workflow with Flux, and how we see things fitting into the larger ecosystem of distributed services and federated Kubernetes.

        Speaker: Brian Paul Bockelman (University of Wisconsin Madison (US))
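
        In a Flux v2-style GitOps setup, the cluster reconciles itself from a Git repository through a pair of custom resources; a minimal sketch (repository URL and path are hypothetical, not the actual Tiger configuration):

        apiVersion: source.toolkit.fluxcd.io/v1beta1
        kind: GitRepository
        metadata:
          name: services
          namespace: flux-system
        spec:
          interval: 5m
          url: https://github.com/example/cluster-config   # hypothetical repo
          ref:
            branch: main
        ---
        apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
        kind: Kustomization
        metadata:
          name: services
          namespace: flux-system
        spec:
          interval: 10m
          path: ./clusters/tiger     # hypothetical path in the repo
          prune: true                # remove resources deleted from Git
          sourceRef:
            kind: GitRepository
            name: services
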
      • 09:40
        Overview of CMSWEB Cluster in Kubernetes 20m

        The CMS experiment relies heavily on the CMSWEB cluster to host critical services for its operational needs. The cluster is deployed on virtual machines (VMs) from the CERN OpenStack cloud and is maintained manually by operators and developers. The release cycle is composed of several steps: building RPMs, deployment, validation, and integration tests. To enhance the sustainability of the CMSWEB cluster, CMS decided to migrate it to a containerized solution based on Docker, orchestrated with Kubernetes (k8s). This allows us to significantly shorten the release upgrade cycle, standardize the end-to-end deployment procedure, and reduce operational costs.

        Recently, we migrated some CMSWEB services from the VM cluster to Kubernetes. This talk gives an overview of the current CMSWEB cluster. We describe the new architecture of the CMSWEB cluster in Kubernetes and its implementation strategy. We'll discuss how we create Docker images of the services and deploy them in this cluster following the service deployment cycle, and cover the monitoring of these services. Finally, we'll present our future plans for this cluster.

        Speaker: Muhammad Imran (National Centre for Physics (PK))
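
        The deployment unit for each such service is the usual Deployment-plus-Service pair; a generic sketch (service name, image, and port are placeholders, not the actual CMSWEB manifests):

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: example-service              # placeholder service name
        spec:
          replicas: 2
          selector:
            matchLabels:
              app: example-service
          template:
            metadata:
              labels:
                app: example-service
            spec:
              containers:
                - name: example-service
                  image: registry.example.org/cmsweb/example-service:v1.0.0
                  ports:
                    - containerPort: 8443
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: example-service
        spec:
          selector:
            app: example-service
          ports:
            - port: 8443
              targetPort: 8443
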
      • 10:00
        Experience with K8s at Coffea-Casa AF@UNL 20m

        In this contribution we share our experience designing an Analysis Facility for columnar analysis using the Coffea package at the University of Nebraska-Lincoln, and describe our experience deploying different workloads and services on the UNL Kubernetes cluster (JupyterHub with Traefik integration, HTCondor, ServiceX and other infrastructure deployments).

        Speaker: Carl Lundstedt (University of Nebraska Lincoln (US))
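
        As a flavor of how such services are wired together, a JupyterHub deployment exposed through Traefik might be configured along these lines (a sketch against the zero-to-jupyterhub Helm chart; image, hostname, and values are illustrative, not the actual coffea-casa configuration):

        # values.yaml sketch for the zero-to-jupyterhub Helm chart
        singleuser:
          image:
            name: example.org/coffea-base   # hypothetical analysis image
            tag: latest
        ingress:
          enabled: true
          annotations:
            kubernetes.io/ingress.class: traefik
          hosts:
            - coffea.example.org            # hypothetical hostname
        proxy:
          service:
            type: ClusterIP                 # traffic enters via the Traefik ingress
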
      • 10:20
        Test REANA Deployment at BNL 20m

        In this presentation we'll discuss our experiences deploying a test REANA instance on a k8s cluster at BNL.

        Speaker: Christopher Henry Hollowell (Brookhaven National Laboratory (US))
      • 10:40
        What's new with SLATE? 20m

        We will review progress with SLATE over the past year, including new containerized applications, a storage provisioner, and security policies for federated operations.

        Speaker: Lincoln Bryant (University of Chicago (US))
      • 11:00
        Kubernetes at UVic 20m

        I will describe Kubernetes cluster deployment at UVic, including batch computing and APEL accounting for ATLAS.

        Speaker: Ryan Taylor (University of Victoria (CA))
      • 11:20
        Packaging and using services in Kubernetes 20m

        Lessons learned in OSG from distributing service container images, and experiences contributing to and deploying services with SLATE.

        Speaker: Brian Hua Lin (University of Wisconsin - Madison)
    • 11:45–12:05
      Coffee break 20m
    • 12:05–13:45
      Overflow & Open Discussion
      • 12:05
        Discussion 55m