The first running FPGA implementation of a histogram-based trigger primitive generator for the CMS Drift Tubes at the HL-LHC is presented. This project is driven by the need to generate a trigger by processing the charge collection times, acquired by means of a TDC and asynchronously shipped to the back-end. We review the design of the bunch crossing evaluation, its implementation on FPGAs of the Xilinx UltraScale family by means of High-Level Synthesis (HLS), and the performance of a demonstrator board for such a trigger.
Here we present the first running FPGA implementation of the bunch-crossing identification block of a proposed histogram-based Level-1 Trigger Primitives Generator (L1 TPG) for the Drift Tubes (DT) chambers of the Compact Muon Solenoid (CMS) experiment at the High-Luminosity LHC. The planned upgrade of the front-end foresees asynchronous data shipping to the back-end, and therefore requires a novel approach to the generation of the trigger primitives, since the current one directly samples the detector signals. We designed a Hough-Transform-based trigger whose first step is the identification of the parent bunch crossing, followed by the track parameter estimation. The bunch-crossing identification algorithm performs a statistical selection among several parent bunch-crossing hypotheses by picking the most voted one in a histogram built in real time.
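The voting scheme described above can be illustrated with a minimal software sketch. All timing constants below are illustrative assumptions (not the project's actual parameters), and the real algorithm operates on hit multiplets rather than single hits:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Illustrative constants (assumptions, not the project's actual values).
constexpr int kBxPeriodNs = 25;   // LHC bunch-crossing period
constexpr int kMaxDriftNs = 400;  // maximum physical drift time hypothesis
constexpr int kNumBx      = 64;   // BX hypotheses covered by the histogram

// Each hit votes for every parent-BX hypothesis whose implied drift time
// is physical, i.e. 0 <= t_hit - t_bx <= kMaxDriftNs. The most voted bin
// of the one-dimensional histogram identifies the parent bunch crossing.
int identify_bx(const std::vector<int>& hit_times_ns) {
    std::array<uint16_t, kNumBx> histo{};  // real-time-built histogram
    for (int t : hit_times_ns) {
        for (int bx = 0; bx < kNumBx; ++bx) {
            int drift = t - bx * kBxPeriodNs;
            if (drift >= 0 && drift <= kMaxDriftNs) ++histo[bx];
        }
    }
    return static_cast<int>(std::distance(
        histo.begin(), std::max_element(histo.begin(), histo.end())));
}
```

A hit with a small drift time and a hit with a large one, both from the same crossing, already single out one bin; additional hits only sharpen the selection.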
The bunch-crossing identification block was deployed and run on FPGAs of the Xilinx UltraScale family. The design was made compliant with High-Level Synthesis tools in order to minimise the hardware implementation effort while closely matching the software emulation of the algorithm within the official framework provided by the experiment. The design was optimised by balancing the large area required to implement high-granularity multi-dimensional histograms against the processing time: input data multiplets are processed in a highly parallelised way, and the histograms are reduced to one dimension.
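Under HLS, an area-versus-latency trade-off of this kind is typically steered with unroll and array-partition directives. The following is a hedged sketch of a fully parallel per-hit histogram update, with assumed structure and constants (not the project's actual code), expressed in 3.125 ns time counts:

```cpp
#include <cstdint>

// Illustrative constants (assumptions): times in 3.125 ns counts.
constexpr int kNumBx    = 64;   // BX hypotheses covered by the histogram
constexpr int kBxPeriod = 8;    // 25 ns / 3.125 ns = 8 counts per BX
constexpr int kMaxDrift = 124;  // illustrative maximum drift, in counts

// One hit updates all histogram bins in the same clock cycle: the array
// partition maps each bin to its own register, and the unrolled loop
// instantiates one comparator/adder per bin, trading area for latency.
void vote_hit(int hit_time, uint16_t histo[kNumBx]) {
#pragma HLS ARRAY_PARTITION variable=histo complete
    for (int bx = 0; bx < kNumBx; ++bx) {
#pragma HLS UNROLL
        int drift = hit_time - bx * kBxPeriod;
        if (drift >= 0 && drift <= kMaxDrift) ++histo[bx];
    }
}
```

Compiled as plain C++ the pragmas are ignored and the function behaves identically, which is what allows the hardware design to closely match its software emulation.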
We evaluated the performance of the implemented algorithm in terms of required area and latency. The minimal processing unit, called a macro-cell, groups together 18 read-out channels: for such a unit, the amount of resources needed is approximately 16000 Look-Up Tables (LUT), and the best latency obtained so far is approximately half a microsecond, corresponding to 20 LHC bunch crossings, when covering about a full CMS DT super-layer (21 macro-cells) on a Xilinx VCU440, speed grade -3, clocked at 200 MHz. Hence the current design, while still provisional and open to improvements, requires an amount of resources comparable to the ASIC processors of the current L1 DT TPG and is compatible with the L1 decision latency budget. The very same design was also tested on a Kintex KU115, speed grade -2, clocked at 160 MHz, requiring 57% of the LUTs and no hard-wired multiplier blocks (DSPs) at all. The time precision used throughout the algorithm is 3.125 ns, slightly better than the single-hit resolution obtained in offline reconstruction and 4 times better than that of the current L1 DT TPG.
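The quoted 3.125 ns precision corresponds to splitting the 25 ns bunch-crossing period into 8 counts. A trivial sketch of the implied time quantisation (the function name and structure are hypothetical):

```cpp
// Assumed quantisation: the 25 ns BX period split into 8 counts,
// giving a least significant bit of 25 / 8 = 3.125 ns.
constexpr double kBxNs        = 25.0;
constexpr int    kCountsPerBx = 8;
constexpr double kLsbNs       = kBxNs / kCountsPerBx;  // 3.125 ns

// Convert a time in nanoseconds to integer counts of 3.125 ns.
int quantize_ns(double t_ns) { return static_cast<int>(t_ns / kLsbNs); }
```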
The comparison between the software emulation of the algorithm and the results of the hardware execution shows negligible mismatch on the very same set of realistically simulated muon tracks, including inefficiencies and pile-up. We also report on the results of a test with cosmic muons collected with a clone of a DT chamber available at the INFN National Laboratories in Legnaro, and on future improvements of this project.