HEPiX Fall 2024 Workshop

Name: HEPiX Fall 2024 Workshop
Start: 2024-11-04T08:00:00-06:00
End: 2024-11-08T17:00:00-06:00
Location: No location set

4–8 Nov 2024

US/Central timezone

Contribution List

39. Welcome and Workshop Logistics

Horst Severini (University of Oklahoma (US))

04/11/2024, 09:15

Welcome and Workshop Logistics
40. Research Computing and Storage Strategy at the University of Oklahoma

Henry Neeman (OU), Horst Severini (University of Oklahoma (US))

04/11/2024, 09:25

Welcome and Workshop Logistics
41. Welcome Address from the Physics Department

Phillip Gutierrez (University of Oklahoma (US)), Horst Severini (University of Oklahoma (US))

04/11/2024, 10:10

Welcome and Workshop Logistics
44. SWT2 site report

Zachary Booth (University of Texas at Arlington)

04/11/2024, 11:00

Site Reports

The Southwest Tier-2 (SWT2) consortium is comprised of two data centers
operated at the University of Texas at Arlington (UTA) and at the
University of Oklahoma (OU). SWT2 provides distributed computing
services in support of the ATLAS experiment at CERN. In this
presentation we will describe the resources at each site (CPU cycles and
data storage), along with other associated...
Go to contribution page
16. Site report for AGLT2

Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)

04/11/2024, 11:20

Site Reports

Site Reports

AGLT2 has a few updates to report since the last HEPix meeting in Spring 2024.
1) We transitioned from Cobbler and Satellite plus Capsule server for RHEL provision
2) we transitioned from CFengine to Ansible for configuration management for the RHEL9 nodes.
3) In order to improve the occupancy of the HTcondor cluster, we started tuning of HTCondor and also new developments of scripts to...
Go to contribution page
9. PIC report

Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))

04/11/2024, 11:40

Site Reports

Site Reports

PIC report to HEPIX Fall 2024.
Go to contribution page
18. RAL Site Report

Martin Bly (STFC-RAL)

04/11/2024, 13:30

Site Reports

Site Reports

An update on activities at the RAL datacentre.
Go to contribution page
2. KEK Site Report

Go Iwai (KEK)

04/11/2024, 13:50

Site Reports

Site Reports

The KEK Central Computer System (KEKCC) is the KEK's largest-scale computer system and provides several services such as Grid and Cloud computing.

Following the procurement policy for the large-scale computer system requested by the government, we have taken a multiple-year contract and replaced the entire system at the end of every contract year. The new system has been in production since...
Go to contribution page
35. Operating the 200 Gbps IRIS-HEP Demonstrator for ATLAS

David Jordan (University of Chicago (US))

04/11/2024, 14:10

Computing & Batch Services

The ATLAS experiment is currently developing multiple analysis frameworks which leverage the Python data science ecosystem. We describe the setup and operation of the infrastructure necessary to support demonstrations of these frameworks. One such demonstrator aims to process the compact ATLAS data format PHYSLITE at rates exceeding 200 Gbps. Integral to this study was the analysis of network...
Go to contribution page
23. AUDITOR: An Accounting tool for Grid Sites and Opportunistic Resources

Matthias Jochen Schnepf

04/11/2024, 14:40

Computing & Batch Services

Computing & Batch Services

More and more opportunistic resources are provided to the Grid. Often behind one Compute Element several opportunistic computing resource provider exists or are additional to the pledged resources of a Grid site. For such use cases and others, we have developed a most flexible multipurpose accounting ecosystem AUDITOR (AccoUnting DatahandlIng Toolbox for Opportunistic Resources).  

AUDITOR...
Go to contribution page
25. HEPiX Benchmarking WG: Status Report

Matthias Jochen Schnepf

04/11/2024, 15:45

Computing & Batch Services

HEPScore23 has been the official benchmark for WLCG sites since April 2023.
Since then, we have included community feedback and demand. The Benchmarking WG has started a new development effort to expand the Benchmark Suite with modules that can measure server utilization metrics (load, frequency, I/O, power consumption) during the execution of the HEPScore benchmark.
This enables a closer...
Go to contribution page
33. Atmospheric Visibility Estimation From Single Camera Images: A Deep Learning Approach

Prof. Anderw Fagg (University of Oklahoma)

04/11/2024, 16:15

Miscellaneous

Atmospheric Visibility Estimation From Single Camera Images: A Deep Learning Approach
Go to contribution page
21. S3DF: SLAC Shared Science Data Facility

Adeyemi Adesanya (SLAC)

05/11/2024, 09:00

Site Reports

Site Reports

A site report on the infrastructure and services that underpin SLAC's data-intensive processing pipelines. The SLAC Shared Science Data Facility hosts the Rubin Observatory DF, LCLS-II and many other experimental and research workflows. Networking and Storage form the core of S3DF with hardware deployed in a modern Stanford datacenter.
Go to contribution page
32. Science Cloud based on WLCG Core Technology

Eric Yen (Academia Sinica (TW))

05/11/2024, 09:20

Operating systems, clouds, virtualisation, grids

This presentation will focus on two topics: 1) status of ATLAS T2 site in Taiwan, and 2) experiences of supporting broader scientific computing over the cloud based on WLCG technology.
Go to contribution page
22. A Cloud-Native Control Plane for Infrastructure and Platform Management

Mr Dino Conciatore (CSCS (Swiss National Supercomputing Centre))

05/11/2024, 09:50

Operating Systems, Cloud & Virtualisation, Grids

Operating systems, clouds, virtualisation, grids

Crossplane is a cloud-native control plane for declarative management of infrastructure and platform resources using Kubernetes-native APIs.
It enables the integration of infrastructure-as-code practices by reusing existing tools such as Ansible and Terraform, while providing flexible, instanceable "compositions" for defining reusable resource configurations. This approach allows...
Go to contribution page
45. Nebraska Coffea-Casa Analysis Facility Update

Garhan Attebury (University of Nebraska Lincoln (US))

05/11/2024, 10:45

Operating systems, clouds, virtualisation, grids

The CMS Coffea-Casa analysis facility at the University of Nebraska-Lincoln provides researchers with Kubernetes based Jupyter environments and access to CMS data along with both CPU and GPU resources for a more interactive analysis experience than traditional clusters provide. This talk will cover updates to this facility within the past year and recent experiences with the 200 Gbps challenge.
Go to contribution page
34. dCache on Kubernetes

Elia Luca Oggian (ETH Zurich (CH))

05/11/2024, 11:15

Operating Systems, Cloud & Virtualisation, Grids

Operating systems, clouds, virtualisation, grids

dCache is composed by a set of components running in Java Virtual Machines (JVM) and a storage backend, Ceph in this case. CSCS moved these JVMs into containers and developed an Helm Chart to deploy them on a Kubernetes cluster. This cloud native approach makes the deployments and management of new dCache instances easier and faster.

Encountered challenges and future developments will be...
Go to contribution page
42. Summary of the Joint Xrootd and FTS Workshop

Wei Yang (SLAC National Accelerator Laboratory (US))

05/11/2024, 11:45

Operating systems, clouds, virtualisation, grids

The 2nd Joint Xrootd and FTS Workshop at STFC in September 2024 covered many interesting topics. This presentation will summarize the discussion on state of affairs of FTS and Xrootd, plan on FTS4, WLCG token support in FTS, future plan on CERN Data Management Client, The Pelican project and Xrootd/Xcache, Xrootd monitoring, etc. It will cover some of the feedback by experiments, especially...
Go to contribution page
11. Cost Comparison of On-Premises Storage with S3 Interfaces

Mr Nathan Thompson (Spectra Logic)

05/11/2024, 13:30

Storage & Filesystems

Storage & Filesystems

Abstract: To evaluate the cost of various on-premises storage solutions with traditional and S3 interfaces, including flash, disk, and tape.

This presentation compares the costs, factors of flash, disk, and tape-based storage systems, including systems that are compatible with AWS S3. Key metrics to be considered include purchase price, power consumption, cooling requirements, product...
Go to contribution page
1. Optimising Data Access Analytics: Integrating dCache BillingDB with PIC’s Scalable Big Data Platform

Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES)), Mr Marc Santamaria Riba (PIC)

05/11/2024, 14:00

Storage & Filesystems

Storage & Filesystems

PIC has developed CosmoHub, a scientific platform built on top of Hadoop and Apache Hive, which facilitates scalable reading, writing and managing huge astronomical datasets. This platform supports a global community of scientists, eliminating the need for users to be familiar with Structured Query Language (SQL). CosmoHub officially serves data from major international collaborations,...
Go to contribution page
12. Using AI/ML for Data Placement Optimization in a Multi-Tiered Storage System within a Data Center

Qiulan Huang (Brookhaven National Laboratory (US))

05/11/2024, 14:30

Storage & Filesystems

Storage & Filesystems

Scientific experiments and computations, particularly in High Energy Physics (HEP) programs, are generating and accumulating data at an unprecedented rate. Effectively managing this vast volume of data while ensuring efficient data analysis poses a significant challenge for data centers. This paper aims to introduce machine learning algorithms to enhance data storage optimization across...
Go to contribution page
37. Stories from the TSM to HPSS Migration at KIT

Andreas Petzold (KIT - Karlsruhe Institute of Technology (DE))

05/11/2024, 15:30

Storage & Filesystems

In 2020 we started the migration from our TSM-based tape system to HPSS which was finally finished in the summer of 2024. I'll present lessons learned, pitfalls and also the necessary in-house software developments.
Go to contribution page
8. CERN site report

Elvin Alin Sindrilaru (CERN)

05/11/2024, 16:00

Site Reports

Site Reports

News from CERN since the last HEPiX workshop. This talk gives a general update from services in the CERN IT department.
Go to contribution page
26. Jefferson Lab Site Report and HPDF Introduction

Bryan Hess

05/11/2024, 16:20

Site Reports

Site Reports

I will give an report on the Scientific Computing program at Jefferson Lab and a brief introduction to HPDF, the High Performance Data Facility.
Go to contribution page
43. IHEP Site Report

Siqi Hou

05/11/2024, 16:40

Site Reports

The progress and status of IHEP site since last Hepix.
Go to contribution page
50. Impetus and Drivers

Dr David Crooks (UKRI STFC)

06/11/2024, 09:00

Topical Session: Security Operations Center (SOC)
51. Mini Intro: Zeek

Aashish Sharma (LBNL)

06/11/2024, 09:15

Topical Session: Security Operations Center (SOC)
52. Mini-intro: MISP

James Acris (STFC)

06/11/2024, 09:25

Topical Session: Security Operations Center (SOC)
53. Mini-Intro: pDNSSOC

Romain Wartel (CERN)

06/11/2024, 09:35

Topical Session: Security Operations Center (SOC)
54. WLCG SOC at U Chicago

David Jordan (University of Chicago (US))

06/11/2024, 09:45

Topical Session: Security Operations Center (SOC)
55. Discussion

Dr David Crooks (UKRI STFC), Liviu Valsan (CERN)

06/11/2024, 10:00

Topical Session: Security Operations Center (SOC)

We need to have a discussion about what sites and possible users need and expect.
The goal is to both clarify details and get guidance for what we should focus on during the afternoon sessions today.
Go to contribution page
3. Computer Security Update

Stefan Lueders (CERN)

06/11/2024, 11:25

Networking & Security

Networking & Security

This presentation aims to give an update on the global security landscape from the past year. The global political situation has introduced a novel challenge for security teams everywhere. What's more, the worrying trend of data leaks, password dumps, ransomware attacks and new security vulnerabilities does not seem to slow down.
We present some interesting cases that CERN and the wider HEP...
Go to contribution page
56. Topic 1: Issues capturing ALL traffic with Zeek?

Aashish Sharma (LBNL), Dr David Crooks (UKRI STFC), David Jordan (University of Chicago (US)), Shawn Mc Kee (University of Michigan (US))

06/11/2024, 13:30

SOC Hackathon

We have some sites that have question/potential issues concerning the traffic measurements from Zeek vs SNMP.
- Should be expect that the Zeek traffic estimate should be close to the SNMP counters from the corresponding switch ports?
- Is some kind of NIC/hardware offloading hiding traffic from Zeek?
- Do we have best practice recommendations regarding configurations?
- What should sites...
Go to contribution page
57. Topic 2: Building a good Zeek alert

Aashish Sharma (LBNL), Dr David Crooks (UKRI STFC), Romain Wartel (CERN)

06/11/2024, 14:00

SOC Hackathon

What does it take to craft a good Zeek alert? Can we work through an example or two? What is the suggested guidance for doing this?
Go to contribution page
59. Topic 3: How to deploy pDNSSOC (part 1)

Romain Wartel (CERN)

06/11/2024, 14:30

SOC Hackathon

How to deploy pDNSSOC
Example deployment
Working session
Go to contribution page
58. Topic 1: Configuring sending alerts for various tools

Dr David Crooks (UKRI STFC), Liam Atherton, Romain Wartel (CERN)

06/11/2024, 15:30

SOC Hackathon

How to enable alerts using webhooks and various applications.
Sending to SLACK
Sending to Mattermost
What about Keybase?

Why not email?
Go to contribution page
60. Tool options and choices

Dr David Crooks (UKRI STFC), Stefan Lueders (CERN)

06/11/2024, 16:00

SOC Hackathon

Zeek, MISP, pDNSSOC, Elasticsearch, Opensearch, Elastiflow, ElastiAlert, other information sources, other tools?

Advantages, capabilities, limitations, concerns....

Let's discuss
Go to contribution page
15. Exploring the Carbon Compromises

David Britton (University of Glasgow (GB))

07/11/2024, 09:00

IT Facilities, Business Continuity and Green IT

Topical Session: Carbon & Sustainability in Data Centers

Minimising carbon associated with computing will require compromise. In this presentation I will present the results from simulating a Grid site where the compute is run at reduced frequency when the predicted carbon intensity rises above some threshold. The compromise is a reduction in throughput in exchange for an increased carbon-efficiency for the work that is completed. The presentation...
Go to contribution page
47. Carbon costs of storage: a UK perspective

Samuel Cadellin Skipsey

07/11/2024, 09:25

Topical Session: Carbon & Sustainability in Data Centers

In order to achieve the higher performance year on year required by the 2030s for future LHC up- grades at a sustainable carbon cost to the environment, it is essential to start with accurate measurements of the state of play. Whilst there have been a number of studies of the carbon cost of compute for WLCG workloads published, rather less has been said on the topic of storage, both nearline...
Go to contribution page
36. Case Study: AI Training Power Demand on a GPU-Accelerated Node

Imran Latif (Brookhaven National Laboratory), Shigeki Misawa (Brookhaven National Laboratory (US))

07/11/2024, 09:45

IT Facilities, Business Continuity and Green IT

Topical Session: Carbon & Sustainability in Data Centers

Data center sustainability, a phenomenon that has grown in focus due to the continuing evolution of Artificial intelligence (AI)/High Performance Computing (HPC) systems; furthermore, the rampant increase in carbon emissions resulted in an unprecedented rise in Thermal Design Power (TDP) of the computer chips at the Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory...
Go to contribution page
14. Smart Procurement Utility

David Britton (University of Glasgow (GB))

07/11/2024, 10:05

Topical Session: Carbon & Sustainability in Data Centers

The Smart Procurement Utility is a tool that allows the visualisation of HEPScore/Watt vs HEPScore/unit-cost to guide procurement choices and the compromise between cost and carbon. It uses existing benchmarking data and allows the entry of new benchmarking data. Costs can be entered as relative numbers (percentages relative to a chosen baseline) to generate the cost-related plots.
Go to contribution page
49. Natural job drainage and power reduction in PIC Tier-1 using HTCondor

Jose Flix Molina (CIEMAT - Centro de Investigaciones Energéticas Medioambientales y Tec. (ES))

07/11/2024, 10:15

IT Facilities, Business Continuity and Green IT

Topical Session: Carbon & Sustainability in Data Centers

I will present some preliminary studies and ideas to understand natural job drainage and power reduction in PIC Tier-1, which is using HTCondor. Based on the historical batch system logs, we are simulating natural drainage and understanding how we can modulate the PIC farm without killing jobs.
Go to contribution page
48. Discussion

07/11/2024, 10:25

Topical Session: Carbon & Sustainability in Data Centers
20. Purdue CMS Analysis Facility

Dmitry Kondratyev (Purdue University (US))

07/11/2024, 11:00

Operating Systems, Cloud & Virtualisation, Grids

Operating systems, clouds, virtualisation, grids

The Purdue Analysis Facility (Purdue AF) is an advanced computational platform designed to support high energy physics (HEP) research at the CMS experiment. Based on a multi-tenant JupyterHub server deployed on a Kubernetes cluster, Purdue AF leverages the resources of the Purdue CMS Tier-2 computing center to provide scalable, interactive environments for HEP workflows. It supports a full HEP...
Go to contribution page
38. Transitioning from RHEV to Openshift

Robert Hancock

07/11/2024, 11:30

Operating systems, clouds, virtualisation, grids

A description of our experience deploying Openshift both for container orchestration as well as a replacement for Redhat Enterprise Virtualization.
Go to contribution page
4. One year into the CERN Cyber-Security Audit

Stefan Lueders (CERN)

07/11/2024, 13:30

Networking & Security

Networking & Security

This talk presents the findings of the 2023 cybersecurity audit undertaken at CERN, and the resulting plans/progress/accomplishment the Organization experienced in the past 9 months while implementing their recommendations.
Go to contribution page
7. Firewall under attack: operational security rollercoaster

Patrick Storm, Romain Wartel (CERN)

07/11/2024, 14:00

Networking & Security

Networking & Security

This talk will walk you through the challenges the ESnet security team faced during an attack against one of its firewalls. It covers the struggle and drama to access the data we needed and, in the end, highlights how nothing quite beats good old-fashioned, down-and-dirty system forensics.
Go to contribution page
27. HELP! I have DataCenter Nightmares

Stefan Lueders (CERN)

07/11/2024, 14:30

Networking & Security

With the growing complexity of the IT hardware and software stack, with a move from bare-metal to virtual machines & containers, with the prelevant usage of shared central computing resources for Internet-facing services, provisioning of (internal) user services but also the need for serving industrial control systems (OT) in parallel, the design of data centre architectures and in particular...
Go to contribution page
24. Networking Topics for WLCG

Shawn Mc Kee (University of Michigan (US))

07/11/2024, 15:30

Networking & Security

Networking & Security

We will describe the current activities and plans in WLCG networking, including details about SciTags, the WLCG perfSONAR deployment and the related activities to monitor and analyze our networks. We will also described the related efforts to plan for the upcoming WLCG Network Data Challenge through a series of mini-challenges that incorporate our tools and metrics.
Go to contribution page
19. Getting closer to an IPv6-only WLCG – update from the HEPiX IPv6 Working Group

Martin Bly (STFC-RAL)

07/11/2024, 16:00

Networking & Security

Networking & Security

The HEPiX IPv6 Working Group has been encouraging the deployment of IPv6 in WLCG for many years. At the last HEPiX meeting in Paris we reported that the LHC experiment Tier-2 storage services are now close to 100% IPv6-capable. We had turned our attention to WLCG compute and launched a GGUS ticket campaign for WLCG sites to deploy dual-stack computing elements and worker nodes. At that time...
Go to contribution page
17. Network tests at CZ Tier-2

Jiri Chudoba (Czech Academy of Sciences (CZ))

07/11/2024, 16:30

Networking & Security

Networking & Security

The CZ Tier-2 in Prague (the Czech Republic) joined the WLCG Data Challenge 24 and managed to receive and sent more than 2 PB during the second week of the DC24. Since than we upgraded our network connection to LHCONE from 100 to 2x100 Gbps. The LHCONE link uses GEANT connection, which was also upgraded to 2x100 Gbps. During July 2024 we executed dedicated network stress tests between Prague...
Go to contribution page
13. How AI networking Fabrics are different from today's Data Center fabrics

Paul Gilbert (Arista Networks)

07/11/2024, 17:00

Networking & Security

Networking & Security

This presentation looks at what is different about building and deploying AI fabrics. I can if needed remove the Arista logo's from the presentation. I don't see a place to attach the presentation?
Go to contribution page
6. pkcli: A Framework for Scripts to Manage Applications

evan carlin (RadiaSoft LLC)

08/11/2024, 09:00

Basic and End-User IT Services

Basic and end-user IT services

System administrators and developers need a way to call application code and other tasks through command line interfaces (CLIs). Some examples include user management (creation, deletion, moderation, etc) or seeding the database for development. We have developed an open source Python framework, [pykern.pkcli][1], that simplifies the creation of these application-specific CLIs. In this talk, I...
Go to contribution page
5. CI4FPGA: Continuous Integration for FPGA/SoC Projects

Carmen Marcos

08/11/2024, 09:30

Basic and End-User IT Services

Basic and end-user IT services

As the complexity of FPGA and SoC development grows, so does the need for efficient and automated processes to streamline testing, building, and collaboration, particularly in large-scale scientific environments such as CERN. This initiative focuses on providing CI infrastructure that is tailored for FPGA development and pre-configured Docker images for essential EDA tools, keeping the...
Go to contribution page
46. CRISP: Collaborative Tools for the ePIC Experiment

Ofer Rind (Brookhaven National Laboratory)

08/11/2024, 10:00

Basic and end-user IT services

This talk describes a project to develop a set of collaborative tools for the upcoming ePIC experiment at the BNL Electron-Ion Collider (EIC). The "Collaborative Research Information Sharing Platform" (CRISP) is built upon an extensible, full-featured membership directory, with CoManage integration and a customized InvenioRDM document repository. The CRISP architecture will be presented, along...
Go to contribution page
28. Shifting Hardware Landscape

Shigeki Misawa (Brookhaven National Laboratory (US))

08/11/2024, 11:00

Miscellaneous

Advances in computing hardware are essential for future HEP and NP experiments. These advances are seen as incremental improvements in performance metric over time, i.e. everything works the same, just better, faster, and cheaper. In reality, hardware advances and changes in requirements can result in the crossing of thresholds that require a re-evaluation of existing practices. The HEPiX...
Go to contribution page

Choose timezone

HEPiX Fall 2024 Workshop