## hls4ml Tutorial

## Part I. Introduction

## Introduction

- In this session you will get hands-on experience with the hls4ml package
  - Translate pre-trained models into FPGA code
  - Explore the different handles provided by the tool to optimize the inference
    - Latency, throughput, resource usage
  - Run inference on an FPGA with AWS
  - Make our inference more computationally efficient with pruning
- But first...

Field Programmable Gate Arrays (FPGAs) are reprogrammable integrated circuits

- Contain many different building blocks ('resources') which are connected together as you desire
- Originally popular for prototyping ASICs, but now also for high performance computing
- 'Computing in space as well as time'

[Figure: FPGA diagram]

- Logic cells / Look Up Tables (LUTs) perform arbitrary functions on small bitwidth inputs (2-6 bits)
  - These can be used for boolean operations, arithmetic, memory
- Flip-Flops (FFs) register data in time with the clock pulse

[Figures: FPGA diagram, logic cell]

- **DSPs** are specialized units for multiplication and arithmetic
  - Faster and more efficient than using LUTs for these types of operations
  - And for Neural Nets, DSPs are often the most precious resource

[Figure: DSP diagram]

- **BRAMs** are small, fast memories: RAMs, ROMs, FIFOs (18Kb each in Xilinx)
  - Again, memories using BRAMs are more efficient than using LUTs
  - A big FPGA has nearly 100Mb of BRAM, chained together as needed

[Figure: FPGA diagram]

Also contain embedded components:

- Digital Signal Processors (DSPs): logic units used for multiplications
- Random-access memories (RAMs): embedded memory elements

In addition, there are specialised blocks for I/O, making FPGAs popular in embedded systems and HEP triggers

- High speed transceivers with Tb/s total bandwidth
- PCIe, (Multi) Gigabit Ethernet, Infiniband

**AND:** Support <u>highly
parallel</u> algorithm implementations. **Low power per Op** (relative to CPU/GPU)

[Figure: FPGA diagram. Digital Signal Processors (DSPs): logic units used for multiplications; Random-access memories (RAMs): embedded memory elements; Flip-flops (FF) and look up tables (LUTs) for additions]

# Why are FPGAs Fast?

- Fine-grained / resource parallelism
  - Use the many resources to work on different parts of the problem simultaneously
  - Allows us to achieve low latency
- Most problems have at least some sequential aspect, limiting how low the latency can go
  - But we can still take advantage of it with...
- Pipeline parallelism
  - Use the register pipeline to work on different data simultaneously
  - Allows us to achieve high throughput

Like a production line for data...

# How are FPGAs programmed?

#### **Hardware Description Languages**

HDLs are programming languages which describe electronic circuits

#### **High Level Synthesis**

- Compile from C/C++ to VHDL
- Pre-processor directives and constraints are used to optimize the design
- <u>Drastic decrease in firmware development time</u>!

Today we'll use Xilinx Vivado HLS [\*]

[\*] https://www.xilinx.com/support/documentation/sw\_manuals/xilinx2014\_1/ug902-vivado-high-level-synthesis.pdf

# Jargon

- **LUT**: Look Up Table, aka 'logic'. Generic functions on small bitwidth inputs. Combine many to build the algorithm
- **FF**: Flip Flops. Control the flow of data with the clock pulse. Used to build the pipeline and achieve high throughput
- **DSP**: Digital Signal Processor. Performs multiplication and other arithmetic in the FPGA
- **BRAM**: Block RAM. Hardened RAM resource.
More efficient memories than using LUTs for more than a few elements
- **HLS**: High Level Synthesis. Compiler for C, C++, SystemC into FPGA IP cores
- **HDL**: Hardware Description Language. Low level language for describing circuits
- **RTL**: Register Transfer Level. The very low level description of the function and connection of logic gates
- **Latency**: time between starting processing and receiving the result
  - Measured in clock cycles or seconds

## Neural network inference

$$N_{\text{multiplications}} = \sum_{n=2}^{N} L_{n-1} \times L_n$$

where $L_n$ is the number of nodes in layer $n$ and $N$ is the number of layers.

[Figure: network diagram with labels "layer m", "addition", "logic cells"]

### Today you are going to implement a NN on an FPGA with this package:

**hls4ml**: high level synthesis for machine learning

# Efficient NN design for FPGAs

#### FPGAs provide huge flexibility

Performance depends on how well you take advantage of this

#### Constraints:

- Input bandwidth
- FPGA resources
- Latency

Today you will learn how to optimize your project through:

- compression: reduce the number of synapses or neurons
- quantization: reduce the precision of the calculations (inputs, weights, biases)
- parallelization: tune how much to parallelize, trading inference speed against FPGA resources

# Today's hls4ml hands on

- First part:
  - get familiar with the package, its functionalities and design synthesis by running it with one of the provided trained NNs
  - learn how to read out an estimate of FPGA resources and latency for a NN after synthesis
  - learn how to optimize the design with quantization and parallelization
- Second part:
  - learn how to run the design on Amazon Web Services FPGAs with SDAccel
  - timing and resource studies after running on a real FPGA
- Third part:
  - learn how to do model compression and see its effect on the FPGA resources/latency

## hls4ml Tutorial

## Part I. Introduction - Hands On

# Efficient NN design: quantization

`ap_fixed<width,integer>`: in `0101.1011101010` the bits before the binary point are the integer bits, the bits after it are the fractional bits, and the width is the total number of bits

- Quantify the performance of the classifier with the AUC
- Expected AUC = AUC achieved by 32-bit floating point inference of the neural network

[Figure: Scan integer bits. Fractional bits fixed to 8]

[Figure: Scan fractional bits. Integer bits fixed to 6]

# Efficient NN design: parallelization

- Trade-off between latency and FPGA resource usage determined by the parallelization of the calculations in each layer
- Configure the "reuse factor" = number of times a multiplier is used to do a computation

Reuse factor: how much to parallelize operations in a hidden layer

# Parallelization: DSP usage

# Parallelization: Timing

#### Latency of layer m

$$L_m = L_{\text{mult}} + (R - 1) \times II_{\text{mult}} + L_{\text{activ}}$$

where $R$ is the reuse factor, $L_{\text{mult}}$ and $II_{\text{mult}}$ are the latency and initiation interval of the multiplier, and $L_{\text{activ}}$ is the latency of the activation function.

# The config.yml file

- The model to translate
- Some test vectors for simulation (check precision)
- Output directory / name
- Target FPGA, clock speed
- Model data precision and parallelisation
- More fine grained data precision and parallelisation
  - Per-layer, or per-layer type

```
KerasJson: keras/KERAS_3layer.json
KerasH5:   keras/KERAS_3layer_weights.h5
#InputData: keras/KERAS_3layer_input_features.dat
#OutputPredictions: keras/KERAS_3layer_predictions.dat
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart: xcku115-flvb2104-2-i
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 1
  LayerType:
    Dense:
      ReuseFactor: 2
      Strategy: Resource
      Compression: True
```

# Hands On - Setup

- https://github.com/holzman/course\_material/blob/fml/part0\_setup.md
- Download the ssh key from: https://www.dropbox.com/s/yd5yiov0onva6qt/fastml\_rsa?dl=0
- Find your IP from: https://tinyurl.com/hls4ml-demo
- Then in a terminal:
  - `chmod 600 fastml_rsa`
  - `ssh -i fastml_rsa <your assigned IP>` (password)
- You are now connected to a VM with the Xilinx Vivado HLS suite installed

## Exercise

- Follow the step-by-step instructions at:
  - https://github.com/FPGA4HEP/course\_material/blob/fml/part1\_hls4ml\_intro.md
- For the final part "Change precision of calculations and reuse factor":
  - Everybody pick a Precision and Reuse Factor from the spreadsheet
  - Put your name in the column, pick one that isn't already assigned
  - https://docs.google.com/spreadsheets/d/1xrFf3\_-6G10wmYnZ8zuDM3SfCfUMe0\_KOB8mW6S-E2E/edit#gid=0
  - Put your results in the spreadsheet!
  - Plots are generated on the 'Plots' sheet

# Other Examples

The FPGA workflow can take a long time, so here are some results from pre-compiled models...

# Large MLP

- `Strategy: Resource` for larger networks and higher reuse factor
  - Uses a slightly different HLS implementation of the dense layer to compile faster and better for large layers
- We use a different reuse factor on the first layer (`dense1`) for the best partitioning of arrays
- A model trained on the MNIST digits classification dataset
  - Architecture: 784 x 128 x 128 x 128 x 10
  - Model accuracy: ~97%
- Can you calculate the number of DSPs it will use? (Don't cheat and look ahead)

```
KerasJson: keras/MNIST_model.json
KerasH5:   keras/MNIST_model_weights.h5
#InputData: keras/MNIST_model_input_features.dat
#OutputPredictions: keras/MNIST_model_predictions.dat
OutputDir: my-hls-test
ProjectName: myproject
XilinxPart: xcku115-flvb2104-2-i
ClockPeriod: 5

IOType: io_parallel # options: io_serial/io_parallel
HLSConfig:
  Model:
    Precision: ap_fixed<16,6>
    ReuseFactor: 128
    Strategy: Resource
  LayerName:
    dense1:
      ReuseFactor: 112
```

# Large MLP

- It takes a while to synthesise, so here's one I made earlier...
- The DSPs should be: (784 x 128) / 112 + (2 x 128 x 128 + 128 x 10) / 128 = 1162

Summary:

| Name                | BRAM_18K | DSP48E1 | FF      | LUT    |
|---------------------|----------|---------|---------|--------|
| DSP                 | -        | -       | -       | -      |
| Expression          | -        | -       | 0       | 3144   |
| FIFO                | 1394     | -       | 28998   | 46116  |
| Instance            | 568      | 1162    | 140203  | 166361 |
| Memory              | -        | -       | -       | -      |
| Multiplexer         | -        | -       | -       | 7002   |
| Register            | -        | -       | 778     | -      |
| Total               | 1962     | 1162    | 169979  | 222623 |
| Available SLR       | 2160     | 2760    | 663360  | 331680 |
| Utilization SLR (%) | 90       | 42      | 25      | 67     |
| Available           | 4320     | 5520    | 1326720 | 663360 |
| Utilization (%)     | 45       | 21      | 12      | 33     |

# Binary & Ternary Neural Networks

- Constrain the weights (and optionally activations) to ±1 (binary) or ±1 & 0 (ternary)
- Can use a few LUTs to perform 'activation \* weight' products instead of DSPs
- Consider it a form of model compression
  - ...but typically need to increase model size to retain performance
  - So in hls4ml this goes hand in hand with `Strategy: Resource`
- There are example jet tagging B/TNNs in the hls4ml repo:
  - `KERAS_3layer_binary_smaller{.json, _weights.h5}`: binary dense layer & our optimized 'Batch Normalization + Binary Tanh' layer (1 bit weights & activations)
  - `KERAS_3layer_binarydense_relu_max{.json, _weights.h5}`: binary dense layer, batch normalization and 'clipped ReLU' activation (~few bits activations)
  - `KERAS_3layer_ternary_small{.json, _weights.h5}`: ternary dense layer, batch normalization and ternary tanh (2 bit weights & activations)

# Binary MNIST Model

- Now a model trained on the MNIST digits, with the same architecture as the last one, but with 1 bit weights: 93% accuracy
- Now we use 0 DSPs!
The LUTs look a bit high, but note that these always go down a lot after the later stages of the Xilinx compilation flow (goes down to 14%)

Summary:

| Name                | BRAM_18K | DSP48E1 | FF      | LUT    |
|---------------------|----------|---------|---------|--------|
| DSP                 | -        | -       | -       | -      |
| Expression          | -        | -       | 0       | 3144   |
| FIFO                | 1394     |         | 28998   |        |
| Instance            | 59       | 0       | 65662   | 209840 |
| Memory              | -        | -       | -       | -      |
| Multiplexer         | -        | -       | -       | 7002   |
| Register            | -        | -       | 778     | -      |
| Total               | 1453     | 0       |         | 259632 |
| Available SLR       | 2160     |         | 663360  |        |
| Utilization SLR (%) | 67       | 0       | 14      | 78     |
| Available           | 4320     | 5520    | 1326720 | 663360 |
| Utilization (%)     | 33       |         | 7       |        |
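
For comparison with the 0 DSPs of the binary model, the DSP count quoted for the `ap_fixed<16,6>` large MLP can be reproduced with a short calculation. The sketch below is only an illustration: the function names `n_multiplications` and `dsp_estimate` are hypothetical (they are not part of hls4ml), and it assumes one DSP per multiplication divided by the layer's reuse factor, rounded up when the division is not exact. The real number comes from the synthesis report.

```python
from typing import List, Sequence


def n_multiplications(layer_sizes: Sequence[int]) -> List[int]:
    """Multiplications per dense layer: L_{n-1} * L_n for consecutive layers."""
    return [a * b for a, b in zip(layer_sizes, layer_sizes[1:])]


def dsp_estimate(layer_sizes: Sequence[int], reuse_factors: Sequence[int]) -> int:
    """Rough DSP estimate (assumption): multiplications / reuse factor, per layer."""
    mults = n_multiplications(layer_sizes)
    if len(reuse_factors) != len(mults):
        raise ValueError("need exactly one reuse factor per dense layer")
    # -(-m // r) is integer ceiling division; rounding up is an assumption here
    return sum(-(-m // r) for m, r in zip(mults, reuse_factors))


if __name__ == "__main__":
    sizes = [784, 128, 128, 128, 10]  # MNIST MLP architecture from the slides
    reuse = [112, 128, 128, 128]      # dense1 uses 112, the other layers use 128
    print(sum(n_multiplications(sizes)))  # total multiplications: 134400
    print(dsp_estimate(sizes, reuse))     # 1162, matching the DSP48E1 total
```

Changing the entries of `reuse` shows the trade-off discussed earlier: a lower reuse factor means more DSPs in parallel, a higher one means fewer DSPs at the cost of latency.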