Description
Given recent advances in machine learning, the Large Hadron Collider (LHC) at CERN is incorporating deep learning (DL) models, such as DeepCalo, to enhance the quality of data analysis in particle experiments. However, the need for in-time inference to keep up with data generation rates, together with the dynamics of the experiments, requires data processing with short latency and the flexibility to quickly deploy different DL models. The LHC plans to use FPGAs (Field Programmable Gate Arrays) to provide timely data analysis through highly parallel dataflow-based processing and the short latency enabled by customized logic. A high-level synthesis tool, hls4ml, is also adopted to facilitate the design and synthesis of a fully on-chip dataflow architecture that avoids long-latency DRAM accesses. However, the current hls4ml framework has limited support for very large CNN models due to suboptimal data streaming schemes and inefficient processing architectures. The dataflow architecture also requires proper data quantization to efficiently utilize the limited resources on the FPGA.

In this paper, we present the first automated design and optimization workflow based on hls4ml for implementing DeepCalo models on FPGAs. The DeepCalo framework is extended and integrated with QKeras layers to perform quantization-aware training, minimizing resource consumption while retaining good model quality. A comprehensive exploration of the key design factors is performed, and the observations are summarized as design guidelines for future applications. With the proposed workflow, we show that a design on a Xilinx Alveo U50 FPGA significantly outperforms implementations on a Ryzen 5600H CPU and a Tesla V100 GPU, by up to 14.1x and 7.9x respectively, while meeting the latency requirement of the HLT (High-Level Trigger) within the particle experiment.
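As a rough illustration of the workflow described above, the sketch below builds a small quantization-aware model with QKeras layers and converts it to a streaming hls4ml design. This is not the authors' actual code: the toy topology, the input shape, the 8-bit quantizer settings, and the target part string are all assumptions made for the example; only the standard QKeras and hls4ml APIs (quantized_bits, config_from_keras_model, convert_from_keras_model) are taken as given.

```python
# Minimal sketch, assuming a toy DeepCalo-style CNN; shapes and bit widths
# are illustrative, not the paper's actual configuration.
import hls4ml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu

# Quantization-aware model: weights, biases, and activations are constrained
# to fixed-point precision during training (8 bits here, an assumption).
model = Sequential([
    Input(shape=(56, 11, 4)),                     # hypothetical calorimeter image
    QConv2D(16, (3, 3), padding='same',
            kernel_quantizer=quantized_bits(8, 0, alpha=1),
            bias_quantizer=quantized_bits(8, 0)),
    QActivation(quantized_relu(8)),
    Flatten(),
    QDense(1,
           kernel_quantizer=quantized_bits(8, 0, alpha=1),
           bias_quantizer=quantized_bits(8, 0)),
])
model.compile(optimizer='adam', loss='mse')
# model.fit(x_train, y_train, ...)  # quantization-aware training step

# Convert to an FPGA design. io_type='io_stream' streams activations between
# layer engines fully on chip, avoiding long-latency DRAM accesses.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_stream',            # dataflow streaming architecture
    part='xcu50-fsvh2104-2-e',      # Alveo U50 device, the paper's target
)
hls_model.compile()                 # C-simulation build for functional checks
# hls_model.build(synth=True)       # run HLS synthesis to get latency/resources
```

The io_type='io_stream' setting is what produces the fully on-chip dataflow architecture the description refers to, with activations passed between layers through FIFOs rather than staged in external memory.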