Description
The HL-LHC has motivated a widespread upgrade of the electronics systems across all experiments. In the new electronics architecture of the CMS Drift Tubes detector, trigger generation moves from on-detector ASICs to the back-end, where it is carried out by high-end FPGAs. The new algorithm aims to deliver full-resolution, offline-grade performance in the reconstruction of muon segments. To achieve this objective while meeting the latency and data-rate requirements, a high-speed, highly pipelined FPGA design with several optimizations has been developed. This work describes the architecture and performance of this algorithm, as well as the challenges encountered during implementation and the solutions adopted.
Summary (500 words)
The CMS Drift Tubes (DT) Phase-2 trigger generation will be performed by the Barrel Muon Trigger system. The board's Virtex UltraScale+ VU13P FPGA processes 8 chambers (two sectors), each with three superlayers, two for phi and one for theta. The DTs' maximum drift time is 16 bunch crossings (BXs), so wire pulses (hits) cannot be assigned a priori to individual events. All possible combinations with neighboring hits arriving in a ±16 BX interval must be evaluated, and the best fit delivered. Moreover, the laterality (left/right of the wire) ambiguity must be resolved. This leads to a combinatorial explosion of possible segments in case of high background levels or high hit multiplicity (radiating muons or noise). The algorithm has been designed to absorb combinatorial peaks of around 100 segments, which must be processed almost immediately (within a few BXs), because the <1 microsecond latency budget prevents queuing the computational load. In order to optimize the available resources, the algorithm was designed for a clock frequency of 480 MHz (12x the LHC frequency), which, given the algorithm's complexity, is very challenging. The algorithm's stages are:
- Grouping: receives 1 hit per clock cycle (CC), keeps a history, and delivers hit groups at a rate of 1 per CC, stopping the combinatorial expansion when the maximum latency is reached.
- Prediction: a low-footprint ML module that formulates hypotheses on laterality and discards the majority of unviable groupings, to avoid wasting cycles of the resource-expensive fitter.
- Segment fitter: a linear regression, used both at superlayer level (4 hits) and after the matcher (8 hits). It takes 15 CCs to deliver a segment's T0, position, slope and chi-square. It is highly optimized and uses ROMs to store pre-calculated parameters, reducing the computational load (a sketch follows this list).
- Filtering: ensures each hit is used only once, in the best-quality segment. Each incoming segment can be killed by any of the 144 stored segments (24 BX depth x 6 segments width) and, if it survives, can kill any of them before being stored. This must complete before the next segment arrives, at a rate of 1 per CC (also sketched after the list).
- Matching: for phi, segments from one superlayer are matched with those from the other that arrived in the current and past BXs. The best candidates among the 108 possible combinations are selected for re-fitting and further filtering.
- Global coordinate conversion: the arctangent is computed by a module that implements a piecewise-linear approximation of an arbitrary function in 4 CCs with <1 LSB error (also sketched after the list).
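For illustration, the following Python sketch models the fitter idea of pre-calculating the least-squares coefficients per laterality pattern and storing them in a ROM. All names and constants here (fit_segment, LAYER_Z, V_DRIFT) are illustrative assumptions; the fixed-point arithmetic and pipelining of the actual firmware are not modelled.

```python
# Behavioural sketch of a laterality-resolved linear fit with ROM-stored coefficients.
# Constants are approximate/illustrative, not the firmware's actual values.
import numpy as np

V_DRIFT = 0.0545                             # assumed drift velocity, mm/ns (approximate)
LAYER_Z = np.array([0.0, 13.0, 26.0, 39.0])  # assumed layer heights, mm (illustrative)

def precompute_coefficients(lateralities):
    """Least-squares coefficients for one laterality pattern; in firmware these
    would be read from a ROM indexed by the pattern."""
    A = np.column_stack([np.ones(4), LAYER_Z, np.array(lateralities) * V_DRIFT])
    return np.linalg.pinv(A)                 # 3x4 matrix giving (x0, slope, T0)

# "ROM": one coefficient matrix per 4-hit laterality combination.
ROM = {lats: precompute_coefficients(lats)
       for lats in [tuple(1 if (i >> k) & 1 else -1 for k in range(4))
                    for i in range(16)]}

def fit_segment(wire_x, tdc_time, lateralities):
    """Solve x_wire + l*v*(t - T0) = x0 + slope*z for (x0, slope, T0)
    and return the fit parameters plus a chi-square-like residual sum."""
    coeffs = ROM[tuple(lateralities)]
    lat = np.asarray(lateralities, dtype=float)
    b = np.asarray(wire_x, dtype=float) + lat * V_DRIFT * np.asarray(tdc_time, dtype=float)
    x0, slope, t0 = coeffs @ b
    residuals = b - (x0 + slope * LAYER_Z + lat * V_DRIFT * t0)
    return x0, slope, t0, float(residuals @ residuals)
```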
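The filtering rule can be summarized with the behavioural sketch below; the Segment fields, the quality ordering and the flat buffer are assumptions used only to convey the hit-sharing logic, not the actual firmware data structures.

```python
# Behavioural sketch of the hit-sharing filter: an incoming segment survives only
# if no better stored segment shares a hit with it, and then evicts any stored
# segment it overlaps with. Names and the ranking convention are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    hits: frozenset      # identifiers of the hits used in the fit
    quality: int         # assumed convention: higher is better
    chi2: float = 0.0

class SegmentFilter:
    def __init__(self, capacity=144):          # firmware: 24 BX depth x 6 segments width
        self.stored = []
        self.capacity = capacity

    @staticmethod
    def _better(a, b):
        # Assumed ranking: quality first, smaller chi-square as tie-breaker.
        return (a.quality, -a.chi2) > (b.quality, -b.chi2)

    def push(self, seg):
        # Killed if any stored segment that shares a hit is better than the new one.
        if any(seg.hits & o.hits and self._better(o, seg) for o in self.stored):
            return False
        # Otherwise it kills every overlapping stored segment and is stored itself.
        self.stored = [o for o in self.stored if not (seg.hits & o.hits)]
        self.stored.append(seg)
        del self.stored[:-self.capacity]        # crude stand-in for the 24-BX ageing
        return True
```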
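The coordinate-conversion approach, a generic piecewise-linear approximation, is illustrated below in Python. The number of pieces, the input range, and the end-point interpolation are assumptions; the firmware works on fixed-point values and meets the 4-CC / <1 LSB figures quoted above.

```python
# Sketch of a piecewise-linear arctangent: one (slope, offset) pair per interval is
# pre-computed, as a firmware LUT would be, and evaluation is a single multiply-add.
import math

N_PIECES = 64          # assumed table size; the real module may differ
X_MAX = 4.0            # assumed tabulated input range [0, X_MAX)
STEP = X_MAX / N_PIECES

# Line through each interval's end points: y = slope * x + offset.
TABLE = []
for i in range(N_PIECES):
    x0, x1 = i * STEP, (i + 1) * STEP
    slope = (math.atan(x1) - math.atan(x0)) / STEP
    TABLE.append((slope, math.atan(x0) - slope * x0))

def pwl_atan(x):
    """Piecewise-linear atan; negative inputs use the odd symmetry atan(-x) = -atan(x)."""
    sign = -1.0 if x < 0 else 1.0
    x = min(abs(x), X_MAX * (1 - 1e-12))     # clamp into the tabulated range
    slope, offset = TABLE[int(x / STEP)]
    return sign * (slope * x + offset)
```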
The chamber trigger module closes timing at 2 ns in out-of-context runs, but integrating 8 instances plus the infrastructure is a remarkable challenge. The quality of the placement scales poorly with the size of the design, and the critical path gets close to 3 ns despite providing enough area and taking care of interface timing. We have developed a Python script to perform arbitrarily fine-grained, versatile and maintainable hierarchical placement; a sketch of this approach is shown below. The improvements yielded by this methodology will be discussed.
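As a hypothetical example of such a script, the Python sketch below turns a hierarchical placement description into Vivado Tcl/XDC pblock constraints. The module paths, pblock names and clock-region ranges are invented for illustration and do not reflect the actual floorplan.

```python
# Sketch of a placement helper: it walks a {cell path: device region} description and
# emits standard Vivado pblock constraints. All paths and regions below are invented.

PLACEMENT = {
    "payload/chamber_trigger_0": "CLOCKREGION_X0Y0:CLOCKREGION_X1Y1",
    "payload/chamber_trigger_1": "CLOCKREGION_X0Y2:CLOCKREGION_X1Y3",
    # ... one entry per instance; entries can go arbitrarily deep in the hierarchy
}

def emit_pblock_constraints(placement, out_path="placement.xdc"):
    """Generate create_pblock / resize_pblock / add_cells_to_pblock commands."""
    lines = []
    for cell, region in placement.items():
        name = "pb_" + cell.replace("/", "_")
        lines.append(f"create_pblock {name}")
        lines.append(f"resize_pblock [get_pblocks {name}] -add {{{region}}}")
        lines.append(f"add_cells_to_pblock [get_pblocks {name}] [get_cells {{{cell}}}]")
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    emit_pblock_constraints(PLACEMENT)
```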
The implementation and performance will be presented, emphasizing the technically challenging and creative solutions, as well as problems that might be of interest to other developers.