Input & Weights

Tree 000000 Results

Backup 00000000000

# Tree Tensor Network inference on FPGA 1st FPGA Developers Forum

## University of Padua

#### L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

12 June 2024









University of Padua

Tree Tensor Network inference on FPGA

・ロト・日本 きょうきょうへん L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

1 / 33

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 00000000000 |
|              |                 |        |         |             |

**2** Input & Weights

**3** Tree





University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

(E) (E)

< 17 ▶

Tree Tensor Network inference on FPGA

æ

| Introduction | Input & Weights | <b>Tree</b> | Results | Backup      |
|--------------|-----------------|-------------|---------|-------------|
| ●00          | 00000           | 000000      | 000000  | 00000000000 |
|              |                 |             |         |             |

**2** Input & Weights

**3** Tree

4 Results

6 Backup

- ▲ロト ▲団ト ▲臣ト ▲臣ト 三回 ろんの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

Input & Weights

Tree 000000 Results 000000 Backup 00000000000

#### Tree Tensor Networks

**Tensor Networks** have first been developed to investigate quantum many-body systems on classical computers by efficiently representing quantum wavefunction  $|\psi\rangle$  and Hamiltonians H [1]. Typically used for energy minimization or time evolution simulations, they can also be exploited in **Machine Learning** contexts.





They are the result of the factorization of very large tensors into networks of smaller tensors. Several types of decompositions are possible (MPS, MPO etc.): their approximation can be tuned by modifying *bond dimensions*[2].



University of Padua

#### Tree Tensor Networks: Machine Learning

**TTNs** can be trained as **ML Classifiers** following the decision function:

 $f(x) = W \cdot \Phi(x).$ 

The information of an N-features dataset can be encoded in the network W by contracting it with each data sample  $\Phi(x)$  and iteratively updating all the inner tensors [3].



In this project:

- **Task**: binary classification, scalar result.
- Datasets: Iris[4], Titanic[5] and LHCb[1].



**Inference** can be performed by contracting the whole TTN with each sample: the resulting vector stores the classification probabilities for each label of the dataset.

- Architectures: 4, 8, 16 input features.
- **Parallelism**: full parallel and partial parallel implementations.
- **Training** in software, **inference** in hardware (FPGA KCU1500).

▲ □ ▷ ▲ (관 ▷ ▲ 볼 ▷ ▲ 볼 ▷ 볼 ♡ 익 ( L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

University of Padua

| Introduction | Input & Weights | Tree   | Results | Backup     |
|--------------|-----------------|--------|---------|------------|
| 000          | ●0000           | 000000 | 000000  | 0000000000 |
|              |                 |        |         |            |

## **2** Input & Weights

**3** Tree

4 Results



・ロト・西ト・ヨト・ヨー わえの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti



All the values in firmware are represented as <u>16</u> bit fixed-point numbers, devoting 1 bit for the Sign, 1 bit for the Integer part and 14 bits for the Fractional part, corresponding to [-2,+2] as total representation range, with precision  $6.103 \cdot 10^{-5}$ .



Following the quantum approach, to perform classification each dataset feature  $x_i$  is encoded by a local feature map  $\Phi(x_i) = [\cos \frac{\pi x_i}{2}, \sin \frac{\pi x_i}{2}]$ . In this way, each sample represents a separable state  $|\psi\rangle$  resulting from the tensor products of 2-dim vectors. The spinorial mapping encloses input values in the range [-1,+1] and guarantees their representability.

Input data are sent to the FPGA via **AXI-Stream** protocol and the feature map is implemented in hardware. Since the original features live in  $\mathbb{R}$ , they first need to be rescaled in  $[0, \frac{\pi}{2}]$  in software.

University of Padua

・ロト・日本 日本 モラト モラト モーラー モークへの L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction   | Input & Weights | Tree   | Results | Backup      |
|----------------|-----------------|--------|---------|-------------|
| 000            | 00●00           | 000000 | 000000  | 00000000000 |
| Input: Feature | map             |        |         |             |

 $\sin(x)$  and  $\cos(x)$  functions are implemented in hardware with Vivado IP Block Memory Generator. Each BRAM is configured during implementation (sin.coe, cos.coe) and fixed in firmware.



BRAM: lat=2 clk, width=16, depth=65536, corresponding to 131 kB/BRAM. The number of necessary BRAMs is always twice the number of features N.

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 000●0           | 000000 | 000000  | 00000000000 |
| Weights      |                 |        |         |             |

<u>TTN architecture fixed</u> in firmware by setting: number of features N, input dimension D, bond dimensions  $X_i$ , output dimension O.

Weights loaded from host PC: try different networks and perform quantization tests.



| Ν  | D | $X_1$ | $X_2$ | $X_3$ | 0 | Params | Mem.  | Reg. | Blocks |
|----|---|-------|-------|-------|---|--------|-------|------|--------|
| 4  | 2 | 4     | 0     | 0     | 1 | 48     | 96 B  | 24   | 1      |
| 4  | 2 | 8     | 0     | 0     | 1 | 128    | 256 B | 64   | 1      |
| 8  | 2 | 4     | 4     | 0     | 1 | 208    | 416 B | 104  | 1      |
| 8  | 2 | 4     | 8     | 0     | 1 | 384    | 768 B | 192  | 1      |
| 16 | 2 | 4     | 8     | 8     | 1 | 1728   | 3 kB  | 864  | 2      |
| 16 | 2 | 4     | 8     | 16    | 1 | 2944   | 5 kB  | 1472 | 3      |

The weights are loaded on FPGA via **AXI Lite** protocol: read-write from host PC, read-only from TTN. Blocks of 512x32bit registers generated as custom Vivado IP **AXI Peripheral**.

University of Padua

・ロト・日本 日本 モラト モラト モーシー モークへの L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup       |
|--------------|-----------------|--------|---------|--------------|
| 000          | 0000●           | 000000 | 000000  | 000000000000 |
| Weights      |                 |        |         |              |

Once the architecture is fixed and the FPGA is programmed, the weights registers can be written by host PC and read back for verification. During inference, these values remain static and are only read by the TTN component.



- Crossbar: receives AXLL ite information and switches between different register blocks, according to base address value.
- Register Slice: slices vectors  $([65:0] \rightarrow 2\times [31:0])$  and registers the values

- Reg. Block: 512x32b registers, 1024×16b weights. Forwards W vector to TTN
- Timing: timing constraints must be relaxed. Area is too big but the values are static.

イロト 不得 トイヨト イヨト L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

#### University of Padua

Tree Tensor Network inference on FPGA

3

| Introduction | Input & Weights | <b>Tree</b> | Results | Backup      |
|--------------|-----------------|-------------|---------|-------------|
| 000          | 00000           | ●00000      | 000000  | 00000000000 |
|              |                 |             |         |             |

### **2** Input & Weights

**3** Tree





・ ・ ロ ・ ・ 日 ・ ・ 日 ・ ・ 日 ・ うへの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti



Single node contraction is the basic building block operation, which in formulae (Einstein notation) is  $z^{\mu} = V^{\mu}_{\nu,\rho} x^{\nu} y^{\rho}$ , considering 2 vectors x and y of dimension [D] and one tensor V of dimension [D, D, X].

Vivado IP DSP Macro for multiplications, with variable number of registers  $\Delta t_{DSP}$  and intrinsic latency. Two different degrees of parallelization: Full Parallel and Partial Parallel.





3-factor multiplication, in hardware we split it into the following stages:

**Mult1**: *x* and *y* cartesian product.

**Mult2**: multiply results of mult1 by corresponding weights *V*.

**Sum**: *X* parallel sums to compute final vector components.

University of Padua

・ロト・日本 日本 モラト モラト モーラー モークへの L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti







Example for D = 2 and X = 2.

Full Parallel computation: maximize resources and minimize latency.

1 DSP for each multiplication:  $D^2$  at mult1 and  $XD^2$  at mult?

Adder tree for sum stage.

Tree DSPs

$$\sum_{i=1}^{L} \frac{N}{2^{i}} X_{i-1}^{2} (X_{i}+1)$$

Tree latency:

$$\Delta t_{DSP} \sum_{i=1}^{L} 2 + \log_2(X_{i-1}^2)$$

イロト イボト イヨト イヨト L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

University of Padua

#### Node Contraction, Partial Parallel





Example for D = 2 and X = 2.

**Partial Parallel** computation: reduce resources and increase latency.

1 DSP at mult1,  $D^2$  DSP at mult2 and X serial sums.

Pipelined computation.

Tree DSPs:

(日)



Tree latency:  $\Delta t_{DSP} \sum_{i=1}^{L} X_{i-1}^{2} + X_{i} + 1$ 

University of Padua

| Introduction | Input & Weights | Tree   | Results | Backup     |
|--------------|-----------------|--------|---------|------------|
| 000          | 00000           | oooo●o | 000000  | 0000000000 |
| Node, Laye   | r, Tree         |        |         |            |

#### Implementation:

- In library file tensors.vhd we fix the parameters N, D<sub>0</sub>, X<sub>0</sub>, O.
- A VHDL function derives the architecture of the TTN with options:

fixed:  $X_i = X_0$ minimal:  $X_i = min(X_0, D_0^{2^i})$ maximal: $X_i = D^{2^i}$ .

- layer.vhd file generates <sup>N</sup>/<sub>2<sup>i</sup></sub> nodes for layer *i*.
- tree.vhd file generates  $L = \log_2(N)$  layers.



| N  | $D_0$ | $X_0$ | Impl.                     | DSP   | Latency [clk] |
|----|-------|-------|---------------------------|-------|---------------|
| 8  | 2     | 4     | FP                        | 272   | 64            |
| 8  | 2     | 4     | PP                        | 105   | 192           |
| 16 | 2     | 8     | FP                        | 2016  | 104           |
| 16 | 2     | 8     | PP                        | 501   | 692           |
|    |       |       | <ul> <li>↓ □ ▶</li> </ul> | 4 A 1 | ENVEN E       |

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

University of Padua

| Introduction<br>000 | Input & Weights<br>00000 | Tree<br>00000● | Results<br>000000 | Backup<br>000000000000 |
|---------------------|--------------------------|----------------|-------------------|------------------------|
| Firmware            |                          |                |                   |                        |
|                     |                          |                |                   |                        |





- FIFO only for PP implementation.
- Project developed on <u>KCU 1500</u> <u>Kintex Ultrascale</u>.
- Board plugged in host PC with <u>PCle communication</u>.
- Configurable registers for weights: AXI Lite.
- TTN input and output values: AXI Stream.
- AXI Stream clck: 250 MHz. OOC-TTN can reach 500 MHz.

University of Padua

Tree Tensor Network inference on FPGA

< 17 ▶

| Introduction | Input & Weights | Tree   | Results | Backup     |
|--------------|-----------------|--------|---------|------------|
| 000          | 00000           | 000000 | ●00000  | 0000000000 |
|              |                 |        |         |            |

### **2** Input & Weights

**3** Tree





・ ・ ロ ・ ・ 日 ・ ・ 日 ・ ・ 日 ・ うへの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti





TTN architecture N=8,  $X_i$ =[2,4,8,1], 100 samples.



TTN architecture N=16,  $X_i = [2,4,8,8,1],500$  samples.

<ロ> < 団> < 団> < 豆> < 豆> < 豆> < 豆> < < つ) <</p>

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

University of Padua





- ▲日を▲聞を▲開を▲開を 通し ろんの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti





University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

# Thank you for your attention.

University of Padua

<ロト < 回 > < 回 > < 回 > L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

Tree Tensor Network inference on FPGA

3



 A. Giannelle D. Zuliani T. Felser D.Lucchesi S.Montangero M. Trenti, L. Sestini.

Quantum-inspired machine learning on high-energy physics data. 2021.

[2] Tensor network.

https://tensornetwork.org.

- [3] E. Miles Stoudenmire and David J. Schwab.
   Supervised learning with quantum-inspired tensor networks. 2017.
- [4] Iris dataset.

https://www.kaggle.com/datasets/uciml/iris.

[5] Titanic dataset.

https://www.kaggle.com/c/titanic/data.

・ 同 ト ・ ヨ ト ・ ヨ ト

| Introduction | Input & Weights | Tree   | Results | Backup       |
|--------------|-----------------|--------|---------|--------------|
| 000          | 00000           | 000000 | 000000  | ●00000000000 |
|              |                 |        |         |              |

**2** Input & Weights

**3** Tree

4 Results



・ ・ロト・(団ト・)目・ (団ト・)目・ のへの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti





- ▲日を▲聞を▲回を▲回を 回 ろんの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 00●00000000 |
| Frequency    |                 |        |         |             |



・ロト・日本・モト・モー シック

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 000●0000000 |
| DSP          |                 |        |         |             |





シック・ 叫 ・ ・ エッ・ ・ 日 ・ シック

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 0000●000000 |
| DSP          |                 |        |         |             |





シック・ 叫 ・ エリ・ トロ・ トーロ・

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 00000●00000 |
| Latency      |                 |        |         |             |





University of Padua

<ロト < 回 > < 回 > < 回 > L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

Tree Tensor Network inference on FPGA

3

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 000000●0000 |
| Latency      |                 |        |         |             |





・ロト・日下・日下・日 りくの

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti







- ・ロト・西ト・西ト・西ト・ 同一 ろくぐ

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | <b>Tree</b> | Results | Backup        |
|--------------|-----------------|-------------|---------|---------------|
| 000          | 00000           | 000000      | 000000  | 0000000000000 |
| TTN: corre   | lation matrices |             |         |               |





#### - ・ロト ・団ト ・ヨト ・ヨー つくぐ

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup        |
|--------------|-----------------|--------|---------|---------------|
| 000          | 00000           | 000000 | 000000  | 0000000000000 |
| TTN: ROC cur | ves             |        |         |               |





・ロト・西ト・ヨト・ヨー りゅつ

University of Padua

L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

| Introduction | Input & Weights | Tree   | Results | Backup      |
|--------------|-----------------|--------|---------|-------------|
| 000          | 00000           | 000000 | 000000  | 00000000000 |
| TTN: entropy |                 |        |         |             |





メロト メロト メヨト メヨト L.Borella, A.Coppi, J.Pazzini, A.Stanco, A.Triossi, M.Zanetti

University of Padua

Tree Tensor Network inference on FPGA

æ