# Front-End RDMA over Converged Ethernet, real-time firmware simulation

BERGNOLI Antonio<sup>1</sup>, BORTOLATO Damiano<sup>2</sup>, <u>BORTOLATO Gabriele<sup>1,a,b</sup></u>, MENGONI Daniele<sup>1,a</sup>, MIGLIORINI Matteo<sup>1,a</sup>, MONTECASSIANO Fabio<sup>1</sup>, PAZZINI Jacopo<sup>1,a</sup>, TRIOSSI Andrea<sup>1,a</sup>, VENTURA Sandro<sup>1</sup>, ZANETTI Marco<sup>1,a</sup>

 $^1$ INFN sezione Padova,  $^2$ INFN LNL,  $^a$ Università degli studi di Padova,  $^b$ CERN













#### Introduction on RDMA and RoCE

In a DAQ system a large fraction of CPU resources is engaged in networking rather than in data processing; common network stacks that take care of network traffic usually manipulate data through several copies.



Remote Direct Memory Access (RDMA), as the name suggests, allows read and write operations directly in the memory of the target machine(s). The OS is not involved in the data path, which enables high-throughput and low-latency applications.

This requires RDMA-enabled NICs (RNICs) on both ends to perform the DMA, reducing the CPU load.



#### Many RDMA flavours are available:

- InfiniBand: requires IB-capable switches
- RoCEv1: introduces Ethernet framing, enabling the use of commodity switches
- RoCEv2: adds the UDP/IP transport protocol

| InfiniBand |        |                         |            |      |      |
|------------|--------|-------------------------|------------|------|------|
| LRH        | IB GRH | IB BTH<br>(+ RETH/AETH) | IB Payload | ICRC | VCRC |

- LRH, GRH: Local and Global Route Headers
- BTH, RETH/AETH: Base and Extended Transport Headers

| RoCEv1           |        |                         |            |      |      |
|------------------|--------|-------------------------|------------|------|------|
| Eth L2<br>Header | IB GRH | IB BTH<br>(+ RETH/AETH) | IB Payload | ICRC | FCS  |

- Ethernet L2 header instead of LRH





- RoCEv2 drops the use of the Global ID (GID) in favour of IP addressing: the transport headers travel inside a UDP datagram with destination port 4791
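To make the header stack concrete, here is a minimal sketch of how the two InfiniBand transport headers carried inside the RoCEv2 UDP payload could be packed. The field layout follows the InfiniBand transport header definitions (12-byte BTH, 16-byte RETH); the function names `pack_bth` and `pack_reth` are hypothetical and are not taken from the ETH stack.

```python
import struct

ROCEV2_UDP_DPORT = 4791      # UDP destination port reserved for RoCEv2
RC_RDMA_WRITE_ONLY = 0x0A    # Reliable Connection opcode: RDMA WRITE Only


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False,
             solicited=False, mig_req=False, pad_count=0, tver=0):
    """Pack the 12-byte Base Transport Header (hypothetical helper)."""
    # Byte 1: SolicitedEvent(1) | MigReq(1) | PadCount(2) | TVer(4)
    flags = (int(solicited) << 7) | (int(mig_req) << 6) \
        | ((pad_count & 0x3) << 4) | (tver & 0xF)
    # Bytes 4-7: one reserved byte followed by the 24-bit destination QP
    qp_word = dest_qp & 0xFFFFFF
    # Bytes 8-11: AckReq(1) | reserved(7) | 24-bit packet sequence number
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def pack_reth(virtual_address, rkey, dma_length):
    """Pack the 16-byte RDMA Extended Transport Header (hypothetical helper)."""
    return struct.pack("!QII", virtual_address, rkey, dma_length)
```

A WRITE ONLY packet would then be `pack_bth(RC_RDMA_WRITE_ONLY, qp, psn) + pack_reth(va, rkey, length) + payload`, followed by the ICRC and wrapped in UDP/IP/Ethernet.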





RoCEv2 is the only industry-standard Ethernet-based RDMA solution with a multi-vendor ecosystem. For this reason it has been chosen as the target protocol.

#### Honourable mention

- iWARP: a congestion-aware protocol, but with higher complexity





There is a constant trend towards producing larger and larger datasets in almost every experimental physics field. New requirements arise from that:

- High throughput, low latency
- Efficient data movement



Such requirements lead to clever ideas and features:

- Zero-copy protocols such as InfiniBand or RoCE
- Move the network protocol directly into the front-end electronics (FPGA)
- Scalability across 1/10/100 Gb/s to target different scenarios
- Multi-vendor ecosystem: Xilinx/Microchip/Altera



What can we achieve?

- Front-end initiates the RDMA transfer
- No point-to-point connection between front-end and back-end
- Dynamic switching and routing with COTS hardware (lowering cost and maintenance)



#### What is FERoCE?



Back-end boards are required to get the data and send it to the computing farms. This requires multiple custom cards and custom boards.

Front-end boards send data already packaged within an Ethernet frame, allowing switching and routing.

Choosing the proper protocol allows the use of COTS switches.

The ETH RDMA network stack library has been chosen for the first prototype. Some of its characteristics:

- Entirely written in HLS (Vivado 2019.1)
- Targets Xilinx FPGAs with a PCIe connection
- 10/100 Gb/s speeds
- Supports UDP, TCP and RDMA






Why is a dynamic firmware simulation needed?



With a conventional stimulus-driven testbench:

- Narrow test case, limited by the stimulus
- Difficult to evaluate the RoCE stream produced
- Easy to set up



With a dynamic simulation:

- Explore a wider test-case phase space
- Feed/get Ethernet frames directly to/from the code
- Simulate the HDL produced starting from the HLS code
- Capture frames with third-party programs (e.g. Wireshark)
- Possibility to treat it as a device and send frames to Soft-RoCE or to a physical RNIC





Start from the ETH network stack, entirely developed in HLS. Its functionalities and features must be understood: hence a real-time firmware simulation with real network traffic.

- Works on Linux machines, using Tun/Tap devices
- Makes use of the DPI-C interface of SystemVerilog: C code in our testbench!
- The Tap device exchanges raw Ethernet frames between the simulation and the Linux network stack
- Such frames can be captured and studied

The simulation runs on Synopsys VCS. The XGMII interface comes directly from the Xilinx MAC.
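The Tap mechanism used by the testbench can be illustrated with a short sketch. In the actual setup the equivalent calls live in C code reached through DPI-C; the Python version below only shows the same Linux kernel interface (`/dev/net/tun` plus the `TUNSETIFF` ioctl). The device name `tap0` and the helper names are assumptions made for illustration, and the 40-byte `ifreq` size holds for 64-bit Linux.

```python
import fcntl
import os
import struct

TUNSETIFF = 0x400454CA   # ioctl request number from <linux/if_tun.h>
IFF_TAP = 0x0002         # layer-2 device: frames include the Ethernet header
IFF_NO_PI = 0x1000       # do not prepend the packet-information header
IFNAMSIZ = 16            # size of the interface-name field in struct ifreq


def build_ifreq(name, flags=IFF_TAP | IFF_NO_PI):
    """Build the struct ifreq argument for TUNSETIFF (hypothetical helper)."""
    encoded = name.encode("ascii")
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    # 16-byte name, 16-bit flags in native byte order, zero padding to 40 bytes
    return struct.pack("16sH22x", encoded, flags)


def open_tap(name="tap0"):
    """Attach to a Tap device and return its file descriptor.

    Needs CAP_NET_ADMIN (typically root) unless the device was
    pre-created and assigned to the current user.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name))
    return fd
```

Once the descriptor is open, each `os.read(fd, 2048)` returns one raw Ethernet frame sent by the Linux stack and each `os.write(fd, frame)` injects one frame into it, which is the exchange the simulation relies on.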



Capture and analyze packets: are they malformed? Are the RoCE parameters sent correctly?

Once the stack has been verified, the firmware can finally be built (resources? performance? is timing closure reached?).




Soft-RoCE is used to capture the data sent and store it in memory. This enables fast verification of the stack without going through synthesis/implementation every time.




## Changes implemented

Some changes had to be made to the stack to enable the use of the AXI-stream port:

- Update the FSM for RDMA WRITE:
  - The AXI-stream port did not work properly with RDMA WRITE: the FSM gets stuck if the message is too big. Only WRITE ONLY was supported; WRITE FIRST, MIDDLE and LAST are needed as well!
- Re-enable and update the iCRC computation:
  - By default the iCRC was disabled
  - The mask for its computation was wrong
  - A timing violation still needs to be solved here (time-multiplex the computation?)
- Add a prefix to each IP:
  - The simulator does not like IPs with the same name but different functionality.
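As a software reference for the iCRC mask mentioned above, the following sketch shows one way the invariant CRC of an IPv4 RoCEv2 packet could be computed. It assumes the masking rules of the RoCEv2 specification (variant fields replaced by ones, 64 bits of ones prepended in place of the IB LRH) and an IPv4 header followed directly by UDP; it is not the computation implemented in the ETH stack. The byte order of the emitted value is an assumption and should be checked against a reference capture.

```python
import struct
import zlib


def mask_for_icrc(packet):
    """Return the bytes covered by the ICRC, with variant fields masked.

    `packet` starts at the IPv4 header and ends at the last payload
    byte (the 4 ICRC bytes themselves are excluded).
    """
    pkt = bytearray(packet)
    ihl = (pkt[0] & 0x0F) * 4               # IPv4 header length in bytes
    pkt[1] = 0xFF                           # IPv4 DSCP/ECN (ToS)
    pkt[8] = 0xFF                           # IPv4 TTL
    pkt[10:12] = b"\xff\xff"                # IPv4 header checksum
    pkt[ihl + 6:ihl + 8] = b"\xff\xff"      # UDP checksum
    pkt[ihl + 8 + 4] = 0xFF                 # BTH reserved byte (FECN/BECN)
    return b"\xff" * 8 + bytes(pkt)         # ones stand in for the IB LRH


def icrc_bytes(packet):
    """CRC-32 (Ethernet polynomial) over the masked packet.

    Assumption: the value is emitted least-significant byte first.
    """
    crc = zlib.crc32(mask_for_icrc(packet)) & 0xFFFFFFFF
    return struct.pack("<I", crc)
```

A useful sanity check of any mask is that changing the TTL, the ToS byte or the checksums of a packet must leave the iCRC untouched, while changing a payload byte must not.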



#### **RoCE**

RoCEv2 is a complex protocol, but not all its features are required for this project. RoCE supports many operations such as: RDMA SEND, RDMA WRITE, RDMA READ, ATOMIC OPERATIONS.



The goal is only to push data and initiate the RDMA transfer; for this reason only RDMA WRITE is considered.
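Since only RDMA WRITE is needed, the packet sequence for one message is easy to sketch: a message that fits in one path MTU goes out as WRITE ONLY, a larger one as WRITE FIRST, zero or more WRITE MIDDLE, and WRITE LAST. The opcodes below are the Reliable Connection values from the InfiniBand specification; the function name and the default MTU of 1024 bytes are assumptions for illustration.

```python
RC_RDMA_WRITE_FIRST = 0x06
RC_RDMA_WRITE_MIDDLE = 0x07
RC_RDMA_WRITE_LAST = 0x08
RC_RDMA_WRITE_ONLY = 0x0A


def segment_rdma_write(length, pmtu=1024):
    """Split a WRITE of `length` bytes into (opcode, payload_size) packets."""
    if length < 0 or pmtu <= 0:
        raise ValueError("length must be >= 0 and pmtu must be > 0")
    if length <= pmtu:
        # The whole message fits in a single packet
        return [(RC_RDMA_WRITE_ONLY, length)]
    packets = [(RC_RDMA_WRITE_FIRST, pmtu)]
    remaining = length - pmtu
    while remaining > pmtu:
        packets.append((RC_RDMA_WRITE_MIDDLE, pmtu))
        remaining -= pmtu
    # The last packet carries whatever is left (between 1 and pmtu bytes)
    packets.append((RC_RDMA_WRITE_LAST, remaining))
    return packets
```

Only the FIRST and ONLY packets carry the RETH with the remote address, key and total DMA length; the PSN increases by one for every packet of the sequence.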



## ETH TX engine details

The QP context and connection info contain:

- QP numbers
- Remote and local PSNs
- Remote key
- Virtual address
- Remote IP address



#### TX metadata contains:

- Operation type
- QP number
- Remote address
- Local address
- DMA length
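The two records listed above can be written down as plain data structures. This is only a sketch: the class and field names are hypothetical and do not mirror the ETH stack's HLS types, and the bit widths are the ones of the corresponding InfiniBand header fields (24-bit QP numbers and PSNs, 32-bit remote key and DMA length, 64-bit addresses).

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


def _check_width(name, value, bits):
    """Reject values that do not fit in an unsigned field of `bits` bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")


@dataclass(frozen=True)
class QpContext:
    """Per-connection state (hypothetical layout)."""
    local_qpn: int
    remote_qpn: int
    local_psn: int
    remote_psn: int
    rkey: int
    virtual_address: int
    remote_ip: IPv4Address

    def __post_init__(self):
        for name in ("local_qpn", "remote_qpn", "local_psn", "remote_psn"):
            _check_width(name, getattr(self, name), 24)
        _check_width("rkey", self.rkey, 32)
        _check_width("virtual_address", self.virtual_address, 64)


@dataclass(frozen=True)
class TxMeta:
    """Per-transfer request handed to the TX engine (hypothetical layout)."""
    op_type: str
    qpn: int
    remote_address: int
    local_address: int
    dma_length: int

    def __post_init__(self):
        _check_width("qpn", self.qpn, 24)
        _check_width("remote_address", self.remote_address, 64)
        _check_width("local_address", self.local_address, 64)
        _check_width("dma_length", self.dma_length, 32)
```

The split reflects the lists above: the context is set once per connection, while a metadata record is issued for every transfer.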





Wireshark was used to capture Ethernet frames coming out of the simulation. Fields visible in the captured frames:

- Queue Pair number
- RDMA OP Code
- IP addresses
- Memory addresses
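The same fields can also be pulled out of a captured frame programmatically, which helps when many frames have to be compared. The sketch below assumes an untagged Ethernet frame carrying IPv4/UDP with destination port 4791; the function name and the returned dictionary keys are hypothetical.

```python
import struct
from ipaddress import IPv4Address

ROCEV2_UDP_DPORT = 4791
OPCODES_WITH_RETH = (0x06, 0x0A, 0x0B)   # RC RDMA WRITE First / Only / Only+Imm


def parse_rocev2_frame(frame):
    """Return the RoCEv2 fields of an Ethernet frame, or None if not RoCEv2."""
    if len(frame) < 14 + 20:
        return None
    if struct.unpack_from("!H", frame, 12)[0] != 0x0800:   # IPv4 only, no VLAN
        return None
    ip = 14
    ihl = (frame[ip] & 0x0F) * 4
    if frame[ip + 9] != 17:                                # not UDP
        return None
    udp = ip + ihl
    bth = udp + 8
    if len(frame) < bth + 12:
        return None
    if struct.unpack_from("!H", frame, udp + 2)[0] != ROCEV2_UDP_DPORT:
        return None
    fields = {
        "src_ip": str(IPv4Address(frame[ip + 12:ip + 16])),
        "dst_ip": str(IPv4Address(frame[ip + 16:ip + 20])),
        "opcode": frame[bth],
        "dest_qp": int.from_bytes(frame[bth + 5:bth + 8], "big"),
        "psn": int.from_bytes(frame[bth + 9:bth + 12], "big"),
    }
    # The RETH (remote memory address, key, length) follows the BTH
    # only in the first packet of a WRITE message
    if fields["opcode"] in OPCODES_WITH_RETH and len(frame) >= bth + 28:
        va, rkey, dma_length = struct.unpack_from("!QII", frame, bth + 12)
        fields.update(virtual_address=va, rkey=rkey, dma_length=dma_length)
    return fields
```

Frames read from the Tap device or exported from a capture can be passed to the function as raw bytes.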





# Summary and Outlook

#### Summary

- Developed a dynamic simulation
- Tested and verified ETH network stack
- Fed simulation RoCE data to Soft-RoCE end-point



#### Outlook

- Cut down the ETH library to reduce the FPGA resource footprint
- Move from Xilinx HLS to a more vendor-agnostic HLS and/or rewrite a subset of the stack in HDL (only RDMA WRITE)
- Deploy the light-RoCE in a Microchip FPGA

October 3 2023







# References

- Systems @ ETH Zürich network stack repository, https://github.com/fpgasystems/fpga-network-stack
- Systems @ ETH Zürich Distributed OS, https://github.com/fpgasystems/davos
- Modified network stack repository (work in progress), https://github.com/Gabriele-bot/fpga-network-stack
- CRC mask fix network stack repository, https://github.com/Nayib/fpga-network-stack

