HEPiX Autumn 2022 Workshop

Name: HEPiX Autumn 2022 Workshop
Start: 2022-10-31T09:00:00+01:00
End: 2022-11-03T18:00:00+01:00
Location: Clarion hotel Umeå

31 Oct 2022, 09:00 → 3 Nov 2022, 18:00 Europe/Amsterdam

Clarion hotel Umeå

Storgatan 36, Umeå, Sweden

Peter van der Reest, Tony Wong

Description

HEPiX Autumn 2022 at Umeå, Sweden

Reconvening for an in-person meeting after the COVID hiatus

The HEPiX forum brings together worldwide Information Technology staff, including system administrators, system engineers, and managers from High Energy Physics and Nuclear Physics laboratories and institutes, to foster a learning and sharing experience between sites facing scientific computing and data challenges.

Participating sites include BNL, CERN, DESY, FNAL, IHEP, IN2P3, INFN, IRFU, KEK, LBNL, NDGF, NIKHEF, PIC, RAL, SLAC, TRIUMF, many other research labs and numerous universities from all over the world.

This workshop is hosted by NeIC - the Nordic e-Infrastructure Collaboration together with HPC2N - High Performance Computing Center North at Umeå University.

Organisers

hepix-conference-support@hepix.org

LOC - Mattias Wadenstein

+46707969462

Monday 31 October
- 09:00 → 10:00
  
  Registration 1h
- 10:00 → 10:30
  Miscellaneous
  
  Conveners: Peter van der Reest, Tony Wong
  - 10:00
    
    Welcome 15m
    
    Speaker: Peter van der Reest
    
    HEPiX_Opening_Autumn2022.pdf
  - 10:15
    
    Logistics 15m
    
    Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
    
    20221031-PracticalMatters.pdf
    
    20221101-PracticalMatters.pdf
- 10:30 → 11:30
  Site Reports
  
  Conveners: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE)), Dr Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR))
  - 10:30
    
    NDGF Site Report 15m
    
    News and development at NDGF-T1
    
    Speaker: Erik Mattias Wadenstein (University of Umeå (SE))
    
    20221031-NDGF-SiteReport.odp
    
    20221031-NDGF-SiteReport.pdf
  - 10:45
    
    ASGC Site Report 15m
    
    Site report and status update of Academia Sinica Grid Computing Centre (ASGC) in Taiwan.
    
    Speaker: Eric Yen (ASGC)
    
    ASGC-HEPiXFall2022-v2.pdf
  - 11:00
    
    IHEP Site Report 15m
    
    The presentation reports the running status and development progress at the IHEP site since the last HEPiX workshop.
    
    Speaker: Lu Wang (Computing Center, Institute of High Energy Physics, CAS)
    
    Hepix Fall 2022 v4.pdf
    
    Hepix Fall 2022 v4.pptx
  - 11:15
    
    INFN-T1 Site report 15m
    
    A short update on what's new at INFN-T1 Data Center.
    
    Speaker: Andrea Rendina
    
    20221031_InfnT1_site_report.pdf
- 11:30 → 12:00
  
  Coffee break 30m
- 12:00 → 13:15
  Site Reports
  
  Conveners: Dr Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR)), Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))
  - 12:00
    
    CERN site report 15m
    
    News from CERN since the last HEPiX workshop. This talk gives a general update from services in the CERN IT department.
    
    Speaker: Jarek Polok (CERN)
    
    CERN site report - HEPiX 2022 Autumn.pdf
  - 12:15
    
    KEK site report 15m
    
    In this talk, we report updates on the KEK site since the previous HEPiX workshop, mainly focusing on the Grid system configuration, status, and the newly introduced IAM system.
    
    Speaker: Tomoaki Nakamura
    
    2022-10-31_TomoakiNakamura.pdf
  - 12:30
    
    CC-IN2P3 Site Report 15m
    
    We will present the site updates since the last site report, given at HEPiX Fall 2019.
    
    Speaker: Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR))
    
    CC-IN2P3 HEPiX 22 site report.pdf
  - 12:45
    
    Diamond Light Source Site Report 15m
    
    Diamond Light Source is the UK's national synchrotron. Based on the RAL site, Diamond provides researchers with access to X-Ray techniques which are free at the point of use for academics.
    
    This talk updates HEPiX on the latest developments at Diamond such as a new HPC scheduler, Kubernetes developments, MFA for SSH and staff changes.
    
    Speaker: James Thorne (Diamond Light Source)
    
    Diamond Light Source HEPiX Site Report Autumn 2022.pdf
    
    Diamond Light Source HEPiX Site Report Autumn 2022.pptx
    
    Hepix-DLS-k8s-slide.pdf
    
    Hepix-DLS-k8s-slide.pptx
- 13:15 → 14:45
  
  Lunch break 1h 30m
- 14:45 → 16:00
  End-user Services, Operating Systems
  
  Conveners: Andreas Haupt (Deutsches Elektronen-Synchrotron (DESY)), Georg Rath (Lawrence Berkeley National Laboratory)
  - 14:45
    
    Building HA stateful services using anycast 25m
    
    RCauth.eu was set up as a PKIX translation service for AARC with high availability in mind. This was implemented by using anycast, so the same service is delivered by separate organisations.
    
    This demonstrates that even stateful services can be anycasted; it provides a reference HA infrastructure for EOSC core services and shows that building IP anycasted services does not have to be complex.
    
    This talk was originally given at the recent EUGridPMA meeting.
    
    Speaker: Dennis van Dok (Nikhef)
    
    Building-stateful-HA-services-for-RCauth.eu-HEPiX-202210.pdf
    
    Building-stateful-HA-services-for-RCauth.eu-HEPiX-202210.pptx
  - 15:10
    
    Kubernetes operators for web hosting at CERN: experience 25m
    
    An ambitious plan to modernize the web hosting infrastructure at CERN started in 2019, aiming at consolidating the various hosting technologies on a shared Kubernetes/Openshift platform.
    In this talk I will present the design of the web hosting infrastructure resulting from this project and how we leveraged the Kubernetes operator pattern for site provisioning and configuration management.
    I will also cover our choices and design decisions regarding the user interface for site management, the use of upstream components from the Kubernetes ecosystem and the integration of this Kubernetes-based infrastructure with CERN SSO and more generally the CERN computing environment.
    
    Speaker: Alexandre Lossent (CERN)
    
    CodiMD slides
    
    kubernetes-operators-web-hosting.pdf
  - 15:35
    
    CERN Linux Updates and Outlook 25m
    
    The CERN IT Linux team, together with core services in CERN IT, has been reviewing the Red Hat ecosystem for our production services. In this presentation we will share our current assessment of the situation regarding Stream’s suitability, in particular with regard to stability and the handling of security vulnerabilities. Together with the state of the Enterprise Linux rebuilds, the RHEL license situation, and the community feedback, this summary will feed into the plan to move forward to the next platform underpinning services provided by CERN IT.
    
    Speaker: Alex Iribarren (CERN)
    
    20221031-hepix-cern-linux.pdf
- 16:00 → 16:30
  
  Coffee break 30m
- 16:30 → 17:30
  End-user Services, Operating Systems
  
  Conveners: Georg Rath (Lawrence Berkeley National Laboratory), Andreas Haupt (Deutsches Elektronen-Synchrotron (DESY))
  - 16:30
    
    BoF: Recent experiences with CentOS Stream 1h
    
    Birds of a Feather session on experiences with CentOS Stream during the last year
    
    Speakers: Alex Iribarren (CERN), Thomas Hartmann (Deutsches Elektronen-Synchrotron (DE))
    
    20221031-hepix-linux-bof.pdf
    
    Experiences with AlmaLinux.pdf
Tuesday 1 November
- 09:30 → 10:00
  
  Registration 30m
- 10:30 → 10:50
  Miscellaneous
  - 10:30
    
    What is a "Show us your Toolkit" session 15m
    
    Speakers: Christoph Beyer, Peter van der Reest
    
    HEPiX_WhatIsToolbox.pdf
- 10:50 → 11:15
  Computing and Batch Services
  
  Conveners: Michel Jouvin (Université Paris-Saclay (FR)), Dr Michele Michelotto (Universita e INFN, Padova (IT))
  - 10:50
    BigHPC: A Management Framework for Consolidated Big Data and High-Performance Computing 25m
    
    The BigHPC project is bringing together innovative solutions to improve the monitoring of heterogeneous HPC infrastructures and applications, the deployment of applications and the management of HPC computational and storage resources. It also aims to alleviate the current storage performance bottleneck of HPC services.
    
    The BigHPC project will address these challenges with a novel management framework, for Big Data and parallel computing workloads, that can be seamlessly integrated with existing HPC infrastructures and software stacks.
    The BigHPC project has the following main goals:
    
    - Improve the monitoring of heterogeneous HPC infrastructures and applications;
    - Improve the deployment of applications and the management of HPC computational and storage resources;
    - Alleviate the current storage performance bottleneck of HPC services.
    
    The BigHPC platform is composed of the following main components:
    
    - Monitoring Framework
    - Virtualization Manager
    - Storage Manager
    
    For the BigHPC project, the main mission of the Monitoring Framework component is to give users a better understanding of their job workloads and to help system admins predict possible malfunctions or misbehaving applications. BigHPC will provide a novel distributed monitoring software component, targeted at Big Data applications, that advances the state of the art over previous solutions by:
    
    - supporting Big Data specific metrics, namely disk and GPU;
    - being non-intrusive, i.e., it will not require the re-implementation or re-design of current HPC cluster software;
    - efficiently monitoring the resource usage of thousands of nodes without significant overhead in the deployed HPC nodes;
    - being able to store long-term monitoring information for historical analysis purposes;
    - providing real-time analysis and visualization of the cluster environment.
    
    The Virtual Manager (VM) is the BigHPC component that aims to stage and execute application workloads optimally on one of a variety of HPC systems. It consists mainly of two subcomponents: the VM Scheduler and the VM Repository.
    The Virtual Manager Scheduler provides an interface to submit and monitor application workloads, coordinate the allocation of computing resources on the HPC systems, and optimally execute workloads by matching the workload resource requirements and QoS specified by the user with the available HPC clusters, partitions and QoS reported by the BigHPC Monitoring and Storage Manager components respectively.
    Additionally, the Virtual Manager Repository provides a platform to construct and store, as container images, the software services and applications that support BigHPC workloads. It then serves those images programmatically when a workload request is submitted to the Virtual Manager Scheduler for execution.
    
    Storage performance has become a pressing concern in these infrastructures, due to high levels of performance variability and I/O contention generated by multiple applications executing concurrently. To address this challenge, storage resources are managed following a design based on Software-Defined Storage (SDS) principles, namely through a control plane and a data plane. With an architecture tailored to the requirements of data-centric applications running on modern HPC infrastructures, it is possible to improve I/O performance and manage I/O interference of HPC jobs with no or only minor code changes to applications and HPC storage backends.
    
    To keep all development tasks on a common path, good practices are needed to shorten the development life cycle and to provide continuous delivery and deployment while maintaining software quality. All these BigHPC components are being tested on two different testbeds: development and preview. In the development testbed there is a workflow to test each platform component, where a pipeline automates all required tasks related to software quality and validation. Afterwards, the components are tested on real infrastructure using the preview testbed, where the integration and performance tests take place.
    The implementation of software quality and continuous deployment adopts a GitOps set of practices that allow the delivery of infrastructure as code and application configurations using git repositories. In this work we are creating the git workflow adopted for application development and bringing together the tools that address the three components of GitOps: infrastructure as code, merging changes together and deployment automation.
    
    In this presentation, we will give a brief introduction to the BigHPC project, focusing on the main challenges we encountered in reconciling the goals of the project with the reality of HPC and Big Data environments, in particular concerning the integration tasks.
    
    Speaker: Mr Samuel Bernardo (LIP)
    
    BigHPC_Hepix_Autumn_2022.pdf
- 11:15 → 11:45
  
  Coffee break 30m
- 11:45 → 13:25
  Computing and Batch Services
  
  Conveners: Michel Jouvin (Université Paris-Saclay (FR)), Dr Michele Michelotto (Universita e INFN, Padova (IT))
  - 11:45
    
    Scalable Machine Learning with Kubeflow at CERN 25m
    
    Machine learning (ML) has been shown to be an excellent method for improving performance in high-energy physics (HEP). Applications of ML in HEP are expanding, ranging from jet tagging with graph neural networks to fast simulations with 3DGANs and numerous classification algorithms in beam measurements. ML algorithms are expected to improve in performance as more data are collected during Run 3 and the high luminosity upgrade.
    
    Computing infrastructure is required to support this new paradigm by providing a scalable ML platform for a myriad of users with existing and future use cases. In this talk, we present a general-purpose Kubeflow-based machine learning platform deployed at CERN. We present the platform features such as pipelines, hyperparameter optimization, distributed training, and model serving. We discuss infrastructure details, and the integration of accelerators and external resources. We discuss the existing use cases for the platform, along with a demonstration of the core functionalities.
    
    Speaker: Dejan Golubovic (CERN)
    
    Hepix-Kubeflow-2022-11-01.pdf
  - 12:10
    
    Automation of (Remediation) procedures for Batch Services in CERN with StackStorm 25m
    
    We will update on current status and activities in the CERN batch infrastructure, before concentrating on an introduction to remediation automation. Stackstorm is an open source product which has allowed us to automate activities such as health checks, alarm handling and provide enriched data to L2/L3 support for incidents. We will explain a little of the deployment, the architecture of the product and how it integrates with other products such as our Monitoring stack - Monit, our state manager - BrainSlug and our Cloud infrastructure.
    
    Speaker: Ankur Singh (CERN)
    
    CodiMD Slides
    
    [HEPIX Fall 2022] Automation of (Remediation) procedures for Batch Services in CERN with StackStorm - CodiMD.pdf
  - 12:35
    
    Summary of the European HTC week 25m
    
    Short summary of the European HTC (Condor) week held in Cuneo (10 - 14 October)
    
    Speaker: Christoph Beyer
    
    202210_HTCondorWS_Highlights.pdf
  - 13:00
    
    HEPscore benchmark 25m
    
    A report on the progress made in the last six months to define the WLCG HEPscore benchmark. This is joint work of the HEPiX Benchmark Working Group and the WLCG HEPscore deployment Task Force.
    
    Speaker: Domenico Giordano (CERN)
    
    HEPiX-Workshop-01-11-2022-giordano.pdf
- 13:25 → 13:35
  
  Photo session 10m
- 13:35 → 14:55
  
  Lunch break 1h 20m
- 15:20 → 15:45
  Computing and Batch Services
  
  Conveners: Michel Jouvin (Université Paris-Saclay (FR)), Dr Michele Michelotto (Universita e INFN, Padova (IT))
  - 15:20
    
    Review of HTC scheduling strategies for HEP at GridKa Tier 1 25m
    
    Scheduling at large WLCG sites has to account for several peculiarities of the HEP usage profile: Prominently, the split into only 1-core and 8-core requests is known to lead to fragmentation. In addition, sites have to satisfy long-term and short-term fairshare, efficient job packing, internal flexibility and various other goals. Over the years, various strategies have been proposed in the community and implemented by sites at their own discretion.
    
    We present a review of the strategies previously and currently used at the GridKa Tier 1 to tackle the HEP usage profile. We cover defragmentation, static versus dynamic partitioning, subgroups and more as well as their interplay. As a large grid site supporting several VOs and using the common HTCondor resource manager, we expect our experience to be applicable or at least educational for many sites.
    
    Speaker: Max Fischer (Karlsruhe Institute of Technology)
    
    2022_11_GridKa_Scheduling.pdf
- 15:45 → 16:10
  Site Reports
  
  Conveners: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE)), Dr Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR))
  - 15:45
    
    NSC site report 25m
    
    News from National Supercomputer Centre (NSC) at Linköping University, Sweden.
    
    Speaker: Thomas Bellman (NSC, Linköping University)
    
    nsc-sitereport-20221101.pdf
    
    nsc-sitereport-20221101.tex
- 16:10 → 16:40
  
  Coffee break 30m
- 16:40 → 18:10
  Site Reports
  
  Conveners: Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE)), Dr Sebastien Gadrat (CCIN2P3 - Centre de Calcul (FR))
  - 16:40
    
    MUST site report (IN2P3) 15m
    
    The MUST datacenter (Mesocentre de l'université Savoie Mont Blanc et du CNRS) hosts the university's research computing resources and the IN2P3-LAPP WLCG Tier-2 site. It is also part of the EGI European federation. LHC computing activities of ATLAS and LHCb are supported at production level and remain the driving force of the MUST datacenter. At the same time, the computing resources, including CPU/GPU and storage, are used by university researchers who develop and submit their own calculations from an interactive portal.
    
    This site report will present the achieved and ongoing improvements in line with WLCG specifications for the HL-LHC era. The focus will then turn to studies bridging the gap between the MUST datacenter and cloud environments, related to infrastructure as code, the evolution of AAI (OpenID, SAML2, eduGAIN)… Some open questions may be raised as well.
    
    Speaker: Jean MULTIGNER
    
    202211-HEPiX MUST-v2.pdf
  - 16:55
    
    BNL Site Report 15m
    
    An update on BNL activities since the Spring 2022 workshop
    
    Speaker: Costin Caramarcu
    
    BNL Scientific Data and Computing Center (SDCC) Site Report (2).pdf
  - 17:10
    
    DESY Site Report 15m
    
    News from the lab
    
    Speaker: Andreas Haupt (Deutsches Elektronen-Synchrotron (DESY))
    
    DESY-Site-Report.pdf
  - 17:25
    
    RAL Site Report 15m
    
    Round up from RAL
    
    Speaker: Martin Bly (STFC-RAL)
    
    2022-10 HEPiX Umea - RAL Site Report.pdf
  - 17:40
    
    Nikhef site report 15m
    
    I was asked to do the Nikhef site report.
    
    Speaker: Bart van der Wal (Nikhef)
    
    site report 2022.pdf
- 19:00 → 22:00
  
  Conference Dinner 3h
  
  Tonka Strandgatan
  
  Östra Strandgatan 24, 903 33 Umeå, Sweden
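The fragmentation issue raised in the GridKa scheduling talk above (Tuesday, 15:20) can be illustrated with a short sketch. It is not the policy used at GridKa or any HTCondor mechanism; the function names `multicore_slots` and `pick_drain_node` and the "fewest cores left to drain" heuristic are assumptions made purely for illustration. The point is that the same number of free cores may host very different numbers of 8-core jobs depending on how those cores are spread over nodes.

```python
from typing import List, Optional


def multicore_slots(free_cores: List[int], width: int = 8) -> int:
    """Count how many `width`-core jobs fit, given the free cores on each node.

    A multi-core job must run on a single node, so free cores on
    different nodes cannot be combined.
    """
    return sum(free // width for free in free_cores)


def pick_drain_node(free_cores: List[int], width: int = 8) -> Optional[int]:
    """Pick the node that needs the fewest additional cores drained
    before it can host one `width`-core job (hypothetical heuristic).

    Returns the node index, or None if every node already has room.
    """
    candidates = [
        (width - free, index)
        for index, free in enumerate(free_cores)
        if free < width
    ]
    if not candidates:
        return None
    # min() compares the missing core count first, then the index.
    return min(candidates)[1]


if __name__ == "__main__":
    fragmented = [4] * 8   # 32 free cores, 4 on each of 8 nodes
    compact = [8] * 4      # 32 free cores, 8 on each of 4 nodes
    print(multicore_slots(fragmented), multicore_slots(compact))
    print(pick_drain_node([4, 6, 1]))
```

In the fragmented layout no 8-core job can start although 32 cores are idle, which is why sites drain selected nodes or partition resources, as discussed in the talk.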
Wednesday 2 November
- 09:30 → 10:00
  
  Registration 30m
- 10:00 → 11:15
  Grids, Clouds and Virtualisation
  
  Conveners: Ian Collier (Science and Technology Facilities Council STFC (GB)), Tomoaki Nakamura
  - 10:50
    
    Cloud Infrastructure Update: Operations, Campaigns, and Evolution 25m
    
    CERN IT’s OpenStack-based cloud service offers more than 300,000 cores to over 3,500 users via virtual machines or directly as bare metal servers. We will give an update on recent service developments, e.g. the automation of bare metal provisioning or the integration of ARM servers, on current campaigns, such as the cold-migration of more than 4,000 VMs to a new network control plane, and an outlook on upcoming activities, such as the upgrade of the compute control plane or the update of the O/S of the hypervisor fleet.
    
    Speaker: Domingo Rivera Barros (CERN)
    
    Cloud_infrastructure_update.pdf
- 11:15 → 11:45
  
  Coffee break 30m
- 11:45 → 13:00
  Networking and Security
  
  Conveners: David Kelsey (Science and Technology Facilities Council STFC (GB)), Shawn Mc Kee (University of Michigan (US))
  - 11:45
    
    Transferring data from ALICE to the Computer Center - 2.4Tbps capacity with DWDM lines 25m
    
    Network setup and design of the connection between the CERN computer center and the ALICE container hosting the InfiniBand setup. The infrastructure provides a total of 2.4Tbps capacity by using two DWDM (Dense Wavelength-Division Multiplexing) lines.
    
    Speaker: Daniele Pomponi (CERN)
    
    DWDM_ALICE_DC_HEPIX_2022.pdf
  - 12:10
    
    Enforcing Two-Factor Authentication at CERN: A Technical Report on Our Experiences with User Migration 25m
    
    During 2022 CERN introduced permanent Two-Factor Authentication (2FA) for accounts having access to critical services. The new login flow requires users to always log in with a 2FA token (either TOTP or WebAuthn), introducing a significant security improvement for the individual and the laboratory. In this paper we will discuss the rationale behind the 2FA deployment, as well as the technical setup of 2FA in CERN's Single Sign-On system, Keycloak. We will share statistics on how users are responding to the change, and concrete actions we have taken thanks to their feedback. Finally, we briefly cover our custom extensions to Keycloak for specific use cases, which include persistent cookies and our Kerberos setup.
    
    Speaker: Adeel Ahmad (CERN)
    
    2fa-hepix.pdf
  - 12:35
    
    Computer Security Landscape Update 25m
    
    This presentation provides an update on the global security landscape since the last HEPiX meeting. It describes the main vectors of risks and compromises in the academic community including lessons learnt, presents interesting recent attacks while providing recommendations on how to best protect ourselves.
    
    Speaker: Christos Arvanitis (CERN)
    
    Computer_Security_Update_Autumn_2022.pdf
- 13:00 → 14:35
  
  Lunch break 1h 35m
- 14:35 → 15:50
  Networking and Security
  
  Conveners: Shawn Mc Kee (University of Michigan (US)), David Kelsey (Science and Technology Facilities Council STFC (GB))
  - 14:35
    
    Update from the HEPiX IPv6 working group 25m
    
    The HEPiX IPv6 working group continues to track and encourage the deployment of dual-stack IPv4/IPv6 services. We also recommend dual-stack clients (worker nodes etc). Monitoring of data transfers shows that many are happening today over IPv6 but it is still true that many are not! Our long-term aim is to move to an IPv6-only WLCG, so we need to discourage the use of IPv4 for data transfers. This talk will present our recent activities including new investigations into the reasons behind ongoing use of IPv4 as well as planning for the move to an IPv6-only core WLCG.
    
    Speaker: David Kelsey (Science and Technology Facilities Council STFC (GB))
    
    kelsey2nov22.pdf
    
    kelsey2nov22.pptx
  - 15:00
    
    Update on the Global perfSONAR Network Monitoring and Analytics Framework 25m
    
    WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues, including connection failures, congestion and traffic routing. The IRIS-HEP/OSG-LHC Networking Area is a partner of the WLCG effort and is focused on being the primary source of networking information for its partners and constituents. We will report on the changes and updates that have occurred since the last HEPiX meeting.
    
    We will cover the status of, and plans for, the evolution of the WLCG/OSG perfSONAR infrastructure, as well as the new, associated applications that analyze and alert upon the metrics that are being gathered.
    
    Speaker: Shawn Mc Kee (University of Michigan (US))
    
    HEPiX Network Monitoring Update Fall 2022.pdf
    
    Update on the Global perfSONAR Monitoring Framework
  - 15:25
    
    Research Networking Technical WG Status and Plans 25m
    
    The high-energy physics community, along with the WLCG sites and Research and Education (R&E) networks are collaborating on network technology development, prototyping and implementation via the Research Networking Technical working group (RNTWG). As the scale and complexity of the current HEP network grows rapidly, new technologies and platforms are being introduced that greatly extend the capabilities of today’s networks. With many of these technologies becoming available, it’s important to understand how we can design, test and develop systems that could enter existing production workflows while at the same time changing something as fundamental as the network that all sites and experiments rely upon.
    
    In this talk we’ll give an update on the Research Networking Technical working group activities, challenges and recent updates. In particular we’ll focus on the flow labeling and packet marking technologies (scitags), tools and approaches that have been identified as important first steps for the work of the group.
    
    Speaker: Marian Babik (CERN)
    
    HEPiX Research Networking Technical WG Update (1).pdf
- 15:50 → 16:15
  IT Facilities and Business Continuity
  
  Conveners: Peter Gronbech (University of Oxford (GB)), Wayne Salter (CERN)
  - 15:50
    
    Saving Energy at DESY 25m
    
    Germany is severely hit by energy price increases and an uncertain availability of power.
    This also has consequences for the DESY laboratory.
    We will show the overall situation of DESY, and reflect on some ideas for saving energy - with a focus on IT related topics.
    
    Speaker: Dr Yves Kemp (Deutsches Elektronen-Synchrotron (DE))
    
    HEPIX-Fall2022-EnergySaving.pdf
- 16:15 → 16:45
  
  Coffee break 30m
- 16:45 → 18:00
  IT Facilities and Business Continuity
  
  Conveners: Wayne Salter (CERN), Peter Gronbech (University of Oxford (GB))
  - 16:45
    
    CERN IT Hardware Procurement activities in 2020-2022 25m
    
    This talk will give an overview of the hardware procurement activities carried out in the IT department in the last 2-3 years, highlighting the results achieved in the acquisition of computing and storage equipment during LS2 and moving to the preparation for PCC.
    Together with the figures and the data, the talk will also cover how the recent issues in the global supply chain affected the procurement process, the infrastructure planning and the support for production hardware already deployed.
    Finally, current and future hardware trends will be presented both in the context of the usage at CERN, for HEP and for IT services, as well as in the general market.
    
    Speaker: Luca Atzori (CERN)
    
    HEPiX_2022_Procurement_CERN.pdf
  - 17:10
    
    Managing CERN data centers with OpenDCIM 25m
    
    In the last two years at CERN we have moved from spreadsheets / notes / printouts to an integrated environment allowing follow-up of the complete lifecycle of data center assets in a coherent and streamlined way and also providing assistance in planning for future hardware installations.
    The talk will present the current usage of the OpenDCIM platform at CERN and outline its integration with other systems: Enterprise Asset Management, Network Database, Power and Environment monitoring, Ticket Management system and CERN Cloud environment. Future development ideas will be presented and (time permitting) a short live demonstration will be shown.
    
    Speaker: Jarek Polok (CERN)
    
    CERN-OpenDCIM-HEPiX-2022-Autumn.pdf
  - 17:35
    
    High density computing room, 15 years of experience 25m
    
    In 2007 HPC2N at Umeå University built a new computer room for air cooled high density computing (40+ kW/rack), and presented this over several HEPiX meetings.
    
    This is a follow-up talk where we go over the design, how it turned out in practice, upgrades, experiences over 15 years in production, thoughts for the future, etc.
    
    Speakers: Erik Mattias Wadenstein (University of Umeå (SE)), Niklas Edmundsson (HPC2N, Umeå University)
    
    20221102-HPC2N-room.pdf
- 18:00 → 20:00
  
  Board Meeting (closed session) 2h
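The TOTP tokens mentioned in the two-factor authentication talk above (Wednesday, 12:10) follow a public algorithm, which the sketch below reproduces from RFC 6238 (TOTP) and RFC 4226 (HOTP). It is a generic illustration only: it says nothing about how CERN's Keycloak setup is configured, and the 30-second step, SHA-1 digest and 6-digit default are simply the common defaults of the RFCs, assumed here.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password as defined in RFC 4226."""
    # HMAC-SHA1 over the counter encoded as an 8-byte big-endian integer.
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = digest[-1] & 0x0F
    # Take 4 bytes at that offset and clear the top bit.
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the time-step count."""
    return hotp(secret, int(unix_time) // step, digits)


if __name__ == "__main__":
    # Shared secret from the RFC test vectors, not a real credential.
    rfc_secret = b"12345678901234567890"
    print(totp(rfc_secret, 59, digits=8))
```

Because the code depends only on a shared secret and the current time step, the server can verify a token without any message from the device that generated it.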
Thursday 3 November
- 08:30 → 09:00
  
  Registration 30m
- 09:00 → 10:20
  
  Basic IT Services/Show Us Your ToolBox
  
  Conveners: Erik Mattias Wadenstein (University of Umeå (SE)), Jingyan Shi (IHEP)
- 10:20 → 10:40
  Storage and Filesystems
  - 10:20
    
    DPM storage migration and EOL 20m
    
    DPM storage support is gradually declining and will be discontinued in the coming years. Computing sites with this grid storage must decide what to use as their future storage technology, and each migration strategy comes with different requirements for site administrator expertise, operational effort and expected downtime. We will describe the dCache migration tool distributed with recent versions of DPM, which provides a quick and easy way to make a one-to-one grid storage replacement that is transparent to client applications, with less than a day of downtime. Several production storage endpoints have already been migrated to dCache using this method. Recently EGI started a "DPM migration and decommission" GGUS campaign, because we are getting close to the DPM EOL.
    
    Speaker: Petr Vokac (Czech Technical University in Prague (CZ))
    
    dpm_migration_and_EOL.pdf
- 10:40 → 11:15
  
  Coffee break 35m
- 11:15 → 12:30
  Storage and Filesystems
  
  Conveners: Peter van der Reest, Ofer Rind (Brookhaven National Laboratory)
  - 11:15
    
    Evolving storage services at INFN-T1 25m
    
    INFN CNAF is the National Center of INFN (National Institute for Nuclear Physics) for research and development in the field of information technologies applied to high-energy physics experiments. CNAF hosts the largest INFN data center, which also includes a WLCG Tier1 site.
    We describe the technologies adopted at CNAF for Data Management and Data Transfer, namely XrootD and StoRM (with its StoRM WebDAV service), and the way our services are evolving in a worldwide context of new protocols and authorization approaches for bulk data transfers between WLCG sites.
    In particular, we report on the challenging transition from the gsiftp to the HTTP protocol, which has been implemented via StoRM WebDAV for several experiments hosted at CNAF, and on the ongoing transition from X.509 certificates to JSON Web Tokens (JWT), allowing users to access the resources in a more fine-grained way.
    Also, we detail a few issues that our daily management of storage services has brought to light.
    
    Speaker: Andrea Rendina
    
    Evolving storage services at INFN-T1.pdf
  - 11:40
    
    CERNBox : sync, share and science 25m
    
    CERNBox aims to bring the ease of a file sync and share service to scientific data processing at CERN. It provides a simple and uniform way to access over 15PB of research, administrative and engineering data across more than 2 billion files.
    
    In this contribution we report on our experience and the challenges of taking an upstream sync and share service (ownCloud) and integrating it into CERN’s scientific services and workflows.
    
    The result is a highly capable platform which allows access to scientific data through numerous protocols, applications (e.g. SWAN) and large-scale processing farms (e.g. lxbatch).
    
    We report on recent evolution including privacy enhancements, prototype integrations with Rucio and CERN’s HPC system, evolving the existing integration with FTS for CMS’s Asynchronous Stage Out and exposure of CERN’s ATLAS group disk via CERNBox web.
    
    We close with some observations on future work including synergies with other HEP sites and the federation potential of the system.
    
    Speaker: Diogo Castro (CERN)
    
    presentation.pdf
  - 12:05
    
    Storage for the LHC: operations during Run 3 25m
    
    The CERN IT Storage group operates multiple distributed storage systems to support all CERN data storage requirements. Among these requirements, the storage and distribution of physics data generated by LHC and non-LHC experiments is one of the biggest challenges the group has to take on during Run-3.
    EOS, the CERN distributed disk storage system, is playing a key role in LHC data-taking. Since the beginning of 2022, the experiments have written more than 380PB and read 2.4EB. With the start of Run-3, the requirements in terms of data storage will be higher than what has been delivered so far.
    The year 2022 was also marked by the decommissioning of the CASTOR service and its successful replacement by the CERN Tape Archive (CTA). This distributed tape storage system offers low-latency tape file archival and retrieval. CTA currently stores over 375 PB and this is expected to increase to more than 1 EB during Run-3.
    The large-scale distribution of the data of the LHC stored on EOS and CTA across the WLCG is mainly ensured by the File Transfer Service (FTS). It offers a reliable, flexible and smart way of initiating data transfers and managing data archives between different storage endpoints all around the world.
    In this presentation, we will show how all these different components interact with each other and the architecture and workflows in place to deliver the Run-3 resources in terms of data storage and provision.
    
    Speaker: Cedric Caffy (CERN)
    
    HEPIX Autumn 2022 - Storage for the LHC operations during Run 3.pdf
- 12:30 → 14:05
  
  Lunch break 1h 35m
- 14:05 → 15:45
  Storage and Filesystems
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Peter van der Reest
  - 14:05
    
    Lessons learned from an atlas data movement stress test for the Lancaster WLCG Tier-2's new XROOTD/CEPHFS storage. 25m
    
    To stress test and gauge the throughput potential of the new XRootD-fronted CephFS storage element at the Lancaster WLCG Tier 2, a large-scale ATLAS transfer of 100TB of data was initiated. This volume of data was pushed to the site over a period of 3 days and revealed unexpected bottlenecks and problems, none of which was raw network bandwidth. The most notable issue was the timely calculation of the transferred files' checksums. In response to these results, the site design has changed to expand our XRootD infrastructure horizontally. This talk will detail the tests, our observations and conclusions, and the site's plans after taking the results into account.
    
    Speaker: Matthew Steven Doidge (Lancaster University (GB))
    
    AdHocAtlasDataChallange-HEPiX22-MattDoidge.pdf
  - 14:30
    
    Multi-experiment Storage service at BNL 25m
    
    Part of the mission of the Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory is to provide storage services to a diverse range of HEP scientific experiments, e.g. LHC-ATLAS, Belle2, DUNE. An aggregate of 122M files and 76PB of data is distributed across and managed by independent storage instances, one for each VO. The underlying technology used to support this storage is dCache[1].
    An overview of the BNL storage service archive, challenges and recent developments will be provided in this talk.
    
    [1] https://www.dcache.org/
    
    Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
    
    BNL dCache FALL HEPIX 2022 .pdf
  - 14:55
    
    Using dCache Storage Events in NextCloud and On-Site Experiments 25m
    
    Since release 4.0, dCache has offered its transfer logs in the form of a Kafka message stream. Since then, the DESY dCache operations team has made heavy use of this stream for monitoring our installations as well as for analytics in general. However, it has so far not been used for any of our production services.
    One of our dCache instances provides the underlying storage for our NextCloud instance. Because of the way NextCloud operates, it only recognizes files written by itself unless a specific scan is made. This prevents making use of dCache's wide array of access protocols.
    This talk shows how the storage events can be used to register files with NextCloud and how we can adapt this workflow to suit the needs of other experiments undertaken at DESY.
    
    Speaker: Christian Voss
    
    HEPiX-Autumn-2022.pdf
  - 15:20
    
    A Scalable and Efficient Staging System between Dcache and HPSS at BNL 25m
    
    BNL recently implemented an efficient staging system between dCache and the backend HPSS tape storage. The system has three modules: the ENDIT Provider Plugin, which originated at NDGF; the ENDIT HPSSRetriever, developed by BNL; and BNL's optimized HPSS Batch (ERADAT) system. This solution addresses performance bottlenecks in staging identified during the WLCG tape challenges. The system significantly improves overall tape staging performance through its asynchronous design, which alleviates heavy loads on stage hosts, and through optimized tape batch retrieval methods. Since the system has demonstrated its effectiveness on the USATLAS dCache, we plan to extend it to multiple production dCache instances.
    
    Speaker: Zhenping Liu (Brookhaven National Laboratory)
    
    Zhenping_Liu_ENDIT.pdf
- 15:45 → 16:15
  
  Coffee break 30m
- 16:15 → 16:45
  Miscellaneous
  - 16:15
    
    Closing remarks 30m
    
    Speaker: Peter van der Reest
    
    HEPIX_Wrap-Up_Autumn2022.pdf
