# SONIC (+bonus AI hardware show & tell)

### Javier Campos, Yongbin Feng, Nhan Tran Fermilab May 22, 2024



# What you've heard already

### **Computing Challenge in High Energy Physics**

- Large-scale Monte Carlo (MC) simulations
- Data analysis from particle detectors
- Complex computational models (e.g., Lattice QCD)

[Figure: total CPU fractions for HL-LHC (2031, no R&D improvements); ATLAS Preliminary, CMS Public.]








### CPU vs GPU: Chip Structure

### CPU

- Small number of powerful cores (~10)
  - Branch prediction, out-of-order execution, etc.
- Large caches
- Single instruction on multiple data (SIMD)

### GPU

- Large number of cores (≳1000)
  - Each core is much simpler than a CPU core
- Small caches
- Single instruction on multiple threads (SIMT)



In practice, GPUs can outperform CPUs on parallelizable and relatively simple algorithms.





# Computing technology













| Operation           | Energy |
|---------------------|--------|
| 8b Add              | 0.0    |
| 16b Add             | 0.0    |
| 32b Add             | 0.1    |
| 16b FP Add          | 0.4    |
| 32b FP Add          | 0.9    |
| 8b Mult             | 0.2    |
| 32b Mult            | 3.1    |
| 16b FP Mult         | 1.1    |
| 32b FP Mult         | 3.7    |
| 32b SRAM Read (8KB) | 5      |
| 32b DRAM Read       | 64     |
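To make the table easier to read, here is a minimal sketch that compares a few of its rows as ratios. The values are copied from the table as shown on the slide and are in the table's own units, which the slide does not state. The helper name `energy_ratio` is an illustrative assumption, not something from the talk.

```python
# Energy per operation, copied from a few rows of the table above.
# Units are whatever the table uses; only the ratios matter here.
ENERGY = {
    "8b Mult": 0.2,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b FP Add": 0.9,
    "32b SRAM Read (8KB)": 5.0,
}


def energy_ratio(op_a, op_b):
    """Return how many times more energy op_a costs than op_b."""
    return ENERGY[op_a] / ENERGY[op_b]


if __name__ == "__main__":
    comparisons = [
        ("32b FP Mult", "8b Mult"),              # reduced precision
        ("32b FP Mult", "16b FP Mult"),          # half precision
        ("32b SRAM Read (8KB)", "32b FP Add"),   # memory access vs arithmetic
    ]
    for op_a, op_b in comparisons:
        print(f"{op_a} / {op_b}: {energy_ratio(op_a, op_b):.1f}x")
```

Under these table values, lower-precision arithmetic and fewer memory reads both reduce energy per operation, which is part of the motivation for custom and reduced-precision hardware.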



# Accelerated compute

### Embedded Systems

Embedded in our experiments; often (hard) real-time latency constraints, custom architectures

### Coprocessors

Traditional datacenter-scale compute; throughput-driven; general purpose architectures

## Embedded hardware demo






### CMS Experiment

- 40 MHz collision rate
- ~1B detector channels
- 10x average internet traffic in North America (2021)







adapted from Vladimir Loncar

### MLCommons launches machine learning benchmark for devices like smartwatches and voice assistants

by Ben Wodecki 6/16/2021



[...] consortium behind the MLPerf benchmark test, has launched a new measurement suite aimed [...] performance of embedded devices and models

### Fermilab, UCSD, and Columbia teamed up with AMD/Xilinx on IoT submissions to the MLCommons benchmarks



Catapult AI NN brings together hls4ml, an open-source package for machine learning hardware acceleration, and Siemens' Catapult™ software for High-Level Synthesis. Developed in close collaboration with Fermilab, a U.S. Department of Energy Laboratory, and other leading contributors to hls4ml, Catapult AI NN addresses the unique requirements of machine learning accelerator design for power, performance, and area on custom silicon.



# Accelerated compute



### Coprocessors

Traditional datacenter-scale compute; throughput-driven; general purpose architectures



### A New Golden Age for Computer Architecture






**IBM NorthPole**

Inspired by the brain, NorthPole stores memory near compute, with no centralized or off-chip memory, mitigating the von Neumann bottleneck (unlike contemporary architectures).

[Figure: compute and memory layout of NorthPole compared with other contemporary AI architectures: TPU, CPU (Zen 3), GPU (A100).]















































































# **Coprocessors for Science**

- We are not the primary customers of these chips
- How can we leverage advancements in industry?
  - Be flexible: map the right architecture to the right application
  - Build benchmarks for physics workloads
  - **Programmability** is important for accessibility, which is why GPUs are leading the pack

### **SONIC** Services for Optimized Network Inference on Coprocessors




- SW frameworks
- Abstract software AND hardware with modern containerization tools
- Scalable, Flexible, Adaptable, Non-disruptive

#### SONIC: tools for inference-as-a-service within experimental

#### Domain Algo as a Service (aaS)

















### aaS



#### gRPC is a cross-platform open source high performance remote procedure call framework...

It uses HTTP/2 for transport, Protocol Buffers as the interface description language, and provides features such as authentication, bidirectional streaming and flow control, blocking or nonblocking bindings, and cancellation and timeouts. It generates cross-platform client and server bindings for many languages.

https://en.wikipedia.org/wiki/GRPC
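The request/response pattern behind "as a service" can be sketched without any network at all. The example below is an in-process stand-in built on Python's standard `concurrent.futures`, not gRPC and not SONIC code. It only illustrates clients submitting non-blocking requests to a shared service and waiting for results with a timeout. All names (`ToyInferenceService`, `toy_model`, `run_clients`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyInferenceService:
    """In-process stand-in for a shared inference server (no network, no gRPC)."""

    def __init__(self, model, n_workers=1):
        self._model = model
        # n_workers plays the role of the number of coprocessors behind the service.
        self._pool = ThreadPoolExecutor(max_workers=n_workers)

    def infer_async(self, inputs):
        """Queue a request and return a Future, like a non-blocking call."""
        return self._pool.submit(self._model, inputs)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def toy_model(values):
    """Hypothetical 'model': returns the mean of its inputs."""
    return sum(values) / len(values)


def run_clients(n_clients=4, timeout_s=5.0):
    """Several clients share one service; each waits with a timeout."""
    service = ToyInferenceService(toy_model)
    try:
        futures = [service.infer_async([i, i + 1, i + 2]) for i in range(n_clients)]
        return [f.result(timeout=timeout_s) for f in futures]
    finally:
        service.shutdown()


if __name__ == "__main__":
    print(run_clients())
```

In a real deployment the call would cross the network, so the client would also need to handle connection errors and serialization, which gRPC and Protocol Buffers take care of.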

## aaS vs direct connect

aaS pros:

- scalable algorithms
- scalable to the grid/cloud
- heterogeneity (mixed hardware)

Direct connect pros:

- less system complexity
- no network latency

### SONIC Services for Optimized Network Inference on Coprocessors

- Scalable: Not bound to coprocessors directly connected to CPUs at a given node
- Flexible: Can be used on any hardware and can "right-size" the number of coprocessors needed based on the task
- Adaptive: Abstracts server software stack including compiling latest TF or Torch, custom libraries or languages
- Non-disruptive: builds off current experimental infrastructure, can offload as needed without changing current paradigm

## A brief history

- First deployed on FPGAs for CMS in collaboration with Microsoft Brainwave
- Then deployed on GPUs for LHC and neutrino applications
- Demonstrated for gravitational-wave (GW) experiments and now explored in many other areas
  - See recent SONIC mini-workshop: https://indico.cern.ch/event/1372201

- https://arxiv.org/abs/1904.08986
- https://arxiv.org/abs/2009.04509
- https://en.wikipedia.org/wiki/Amdahl%27s_law



| Wall time (s) | ML module | non-ML modules | Total |
|---------------|-----------|----------------|-------|
| CPU only      | 220       | 110            | 330   |
| CPU + GPUaaS  | 13        | 110            | 123   |






## State-of-the-art

The most popular and flexible workflows currently run on GPUs and use the NVIDIA Triton Inference Server.
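As a rough sketch of what a single client request to a Triton server can look like, assuming the `tritonclient` Python package (gRPC client) and NumPy are installed and a server is already running. The server address, model name, and tensor names and shapes are hypothetical placeholders, not taken from the slides or the tutorial.

```python
def infer_once(url="localhost:8001", model_name="my_model"):
    """Send one request to a running Triton server over gRPC.

    Assumptions (hypothetical, not from the slides): the server address,
    the model name, and the tensor names and shapes "INPUT0" / "OUTPUT0".
    """
    # Imported lazily so this file can be loaded without the client installed.
    import numpy as np
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url)
    if not client.is_server_live():
        raise RuntimeError(f"Triton server at {url} is not live")

    # Dummy input; a real model defines its own input name, shape, and dtype.
    data = np.zeros((1, 16), dtype=np.float32)
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = grpcclient.InferRequestedOutput("OUTPUT0")

    result = client.infer(model_name=model_name, inputs=[inp], outputs=[out])
    return result.as_numpy("OUTPUT0")
```

The client only needs the server address and the model's input and output names, so the same code can target a local GPU or a remote one.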








## Highlight: first CMS paper



- CMS-PAS-MLG-23-001: First CMS paper systematically studying the computing performance on GPUs, and also first study of the inference as-a-service approach
- Scales up well with large number of CPUs and GPUs
- Similar performance for servers running at different sites. The impact of network overload is small!



## Outlook

- SONIC is one of the most promising approaches to using coprocessors for accelerating computing workloads in science
- A lot of interesting areas to work on: benchmarking & tech survey, scale-out, resource orchestration, etc.

Now you have a chance to try the SONIC/Triton tools out yourselves!

# **Tutorial time**