

## The FPGA design for the LO trigger of the RICH detector of the NA62 Experiment at CERN SPS

# 

Mattia Barbanera (INFN Perugia) <u>mattia.barbanera@pg.infn.it</u> FDF Seminar 26.11.2024

### Table of Contents







PERFORMANCE



CONCLUSIONS

2

### **Some Context: The NA62 Experiment**







TI2

ions

AD



### **NA62 Detector Layout and Principles**



Measure the  $K^+ \rightarrow \pi^+ \nu \bar{\nu}$  BR to 10% precision collecting O(100) events

• Decay occurring with 10<sup>-12</sup> probability

- Identify particles
  - High-performance
     EM calorimeter
  - Redundant particle ID for e/μ/π
- Hermetic photon vetoes Measure crossing time
  - Resolution better than 100 ps
- High-rate 10 MHz events
   Precision tracking



### **Trigger and Data Acquisition**

- Three-level trigger to reduce the amount of stored data
  - 10 MHz events  $\rightarrow$  10 kHz
  - Level-0 (L0): hardware trigger; reduction factor 10
  - Level-1 and Level-2: software triggers; reduction factor 10 each
- L0: a subset of detectors produces fast and small-sized information
  - Trigger primitives to be sent to L0TP
    - Timestamp information: ~100 ps resolution
    - Primitive ID: type of event detected (specific for each detector)
- L0TP produces L0 triggers if predefined conditions are satisfied
  - Some detectors using TEL62 have dedicated GW creating trigger primitives





### Data Acquisition: TEL62 Read-Out Board



### Common read-out board shared by many detectors: TEL62

- 4 Daughter Card Slots
  - TDCB (Time-To-Digital Converter Board)
  - Dedicated FEE (Front-End Electronics) for each subdetector
- 4 Pre-Processing Stratix-III FPGAs
  - Receive data from the 4 TDCBs
  - 2 GB data buffer
- 1 Sync-Link Stratix-III FPGA
  - Combines the inputs from the 4 PPs
  - Sends data to PC farm and trigger processor
- 1 Credit-Card PC
  - Configuration and monitoring interface



### **RICH-LO GW: Working Principles**

- TDCBs produce timing information of particles interacting in the RICH
  - 40-bit timestamp
    - ~100 ps LSb (resolution), ~60 s MSb
- GOAL: group hits and provide a precise time reference for trigger
  - Exploit the full ~100 ps resolution of the timestamp
  - Hits belong to the same Cherenkov circle
- PPs perform a preliminary clustering
- SL merges the clusters from the PPs
  - And from InterTEL, if applicable
- SL produces primitives to send to the L0 Trigger Processor





### **RICH-LO GW: Requirements and Working Principles**

| REQUIREMENTS                                                                                                                                                                                                                               | WORKING PRINCIPLES                                                                                                                   |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Primitives shall be <u>ordered in time</u> down to 100 ps LSb</li> <li>TDCBs do not produce time-ordered data</li> </ul>                                                                                                          | <ul> <li>Exploit 25-ns ordering of read-out modules</li> </ul>                                                                       |
| <ul> <li><u>Minimize delay</u> in primitives' production</li> <li>Maximum 32 μs, or 5 time-frames of 6.4-μs</li> <li>6.4 μs: basic time-division of the experiment</li> <li>Translates in: throughput of 1 word per clock cycle</li> </ul> | <ul> <li>All modules shall use a <u>common data format</u> as input<br/>and output</li> <li>Reusability and compatibility</li> </ul> |
| <ul> <li>Minimize logic occupation</li> <li>Already high due to read-out modules</li> </ul>                                                                                                                                                | <ul> <li>PPs and SL shall reuse the same clustering module</li> <li><u>Reusability</u> and compatibility</li> </ul>                  |
| <ul> <li>The design shall be <u>InterTEL ready</u></li> <li>16-bit bus</li> <li>Daisy-chain to improve trigger accuracy and robustness</li> </ul>                                                                                          |                                                                                                                                      |

### **RICH-LO GW: Block Diagram**





- ✓ Minimize delay in primitives' production
  - ✓ If modules process 1 word per clock cycle

### **Common RICH Format for Primitives and Clusters**



- 2 types of 32-bit words
  - Timestamp: Count time in 28 bits
    - Resolution of 400 ns, up to ~60 minutes
  - Data: Describe cluster information
    - Fine-time timestamp: resolution of 100 ps
    - Number of hits
    - Cluster Time-Sum
- Separable 2x16-bit words for *InterTEL* 
  - Each word identified by 2x2 bits

#### Data converter

- Reads TDCBs data and converts them into *RICH format* 
  - Interpreted as clusters
  - $N_{hits} = 1, CTS = 0$
- 16 timestamps per 6.4 µs time slot
  - If no data in a timestamp, produces fake clusters (speed-data)

$$N_{hits} = 0, CTS = 0$$

| 30<br>31      | 24<br>25<br>26<br>27<br>27<br>28<br>28        | 16<br>17<br>18<br>19<br>20<br>21<br>22<br>22   | 14<br>15      | 12<br>13                                                                   | 8<br>9<br>10<br>11                                                         | 4 7 6 V | ω N − O |
|---------------|-----------------------------------------------|------------------------------------------------|---------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|---------|---------|
| T.S.<br>1/2   | Timestamp (27:14)<br>MSb ~= 60 s, LSb =400 ns |                                                |               | Timestamp (27:14)<br>MSb ~= 60 s, LSb =400 ns                              |                                                                            |         |         |
| Data<br>1 1/2 | N <sub>hits</sub> (7:2)                       | Cluster Time-Sum (7:0)(signed)<br>LSb = 100 ps | Data<br>1 2/2 | N <sub>hits</sub><br>(1:0)                                                 | N <sub>hits</sub><br>(1:0) Fine Time (11:0)<br>LSb = 100 ps (up to 400 ns) |         |         |
| Data<br>2 1/2 | N <sub>hits</sub> (7:2)                       | Cluster Time-Sum (7:0)(signed)<br>LSb = 100 ps | Data<br>2 2/2 | N <sub>hits</sub><br>(1:0) Fine Time (11:0)<br>LSb = 100 ps (up to 400 ns) |                                                                            |         | ns)     |

✓ All modules shall use a common data format ✓ The design shall be *Inter-TEL* ready

### **SL Input Data Merger**

- <u>SL data merger</u> merges the clusters from two sources
  - Purely combinatorial
  - Clocked at the same frequency of the other modules (160 MHz)
- Waits that both its input FIFO are not-empty
  - Consistency in data ordering
  - WARNING: <u>Starvation of data in FIFOs</u>
- Produces data sorted in frames of 25 ns
  - Same as PPs input
  - No replication of TS or speed-data









### **Clustering Module: Working Principles**





#### 16 rows

- Each row takes in input hits belonging to a specific 25-ns frame
  - Enough rows to avoid waiting for an empty slot

4 cells per row

- Each cell creates a cluster of hits (or clusters) if closer than a programmable time value
- Discard clusters if more than 4 in a timestamp
  - Instantaneous clusters rate of 160 MHz

### **Clustering Module: Data Distributor**

- Rearrange input data into
  - 25-ns timestamp, 32 bits
  - 100-ps fine time, 8 bits
- Deliver data to the proper row
  - Splits in 25-ns time slots
- Handle corner cases of clusters spanning across two 25-ns time slots
  - 1 additional bit to signal this occurrence
  - 100-ps fine time uses 9 bits
    - 9th bit is 1 if cluster overflowed, 0 otherwise





Compared to



### **Clustering Module: Cell**

When a cell receives a hit (or cluster)

- If empty
  - Store  $\mathtt{Time}_{\mathtt{0}}$  as seed of the cluster
- If not-empty and inputs do NOT match
  - Pass the cluster to the adjacent cell
- If not-empty and input Time<sub>1</sub> matches Time<sub>0</sub>
  - Merge the two clusters
    - $CTS_0 += N_1(Time_1 Time_0) + CTS_1$
    - $N_0 += N_1$
  - Programmable time-matching window

Each computing block embeds a DSP multiplier

• FIFO to compensate for multiplier delays





### **Clustering Module: Cell**

Data distributor flushes a  $(n-2)^{th}$  row when filling the  $n^{th}$ 

- In flush mode, row acts as a shift register
  - Cells output  $N_{hits}$ , T, and CTS to the right
  - Data go to the Data collector through a FIFO

To sort clusters, cells have 8-bit position field

- Used by Data collector to quickly sort clusters
- To sort clusters:
  - Matching cluster in input:
    - Do nothing
  - Greater cluster in input:
    - Increase its position field before forwarding to next cell
  - Smaller cluster in input:
    - Increase the position of the stored cluster
    - WARNING: What if the cluster overflows the 4 cells?







### **Clustering Module: Data Collector**



- Data retriever
  - Reads data from rows' FIFOs
  - Checks empty signals of two consecutive rows
    - Goal is to read 1 word per clock cycle
- Sorter
  - Writes clusters by addressing a RAM with the cells' position field
  - 8 RAMs of 4 positions each

- Cluster Discard
  - Discard clusters with multiplicity out of a programmable range
  - Reduce noise from events with multiplicity at extremes (low or high)
- Formatter
  - Formats data in RICH format



### **Clustering Module: Sizing**





- GOAL: throughput of clustering module must be 1 word per clock cycle
- 16 rows design:
  - Two rows concurrently filled
  - Once the second row is completed, the first can be read-out
  - The latency of a row is given by
  - $L = 2 \cdot \langle row depth \rangle + \langle multiplier latency \rangle + \langle FIFO latency \rangle$
  - $L = 2 \cdot 4 + 3 + 3 = 14$

### **Average Calculator**



#### CTS field to compute geometric average of cluster time

- Each time the clustering module has a data word in input, the sum is updated:
  - $CTS_{new} = CTS_{old} + N_{new}(t_{seed} t_{input})$
- The average calculator computes the reference time as
  - Seed<sub>new</sub> = Seed<sub>old</sub> +  $CTS_{old}/N_{old}$
  - N<sub>new</sub> = N<sub>old</sub>
  - CTS<sub>new</sub> = Ø
- Last module in PPs
  - Pre-clusters with more precise reference time
  - Avoid overflows of CTS field in the SL
- Last module in SL
  - Compute the cluster reference time for primitives



### **Multiplicity: Gateware vs Offline**



Dedicated runs to verify design

- Data triggered by other detectors
- Correlations between primitives' time and offline-computed hits time
- Inefficiency: 1.2%
  - L0 mult. = 0, offline mult. != 0
- False positive: 0.005%
  - L0 mult. != 0, offline mult. = 0



19

### **Delay in Primitives' Production**

Delay computed as

- Time written in a primitive vs Time of production of the primitive
- Stable and between 2 and 3 time-frames of 6.4  $\mu s$  each



inar **20** 

### **Performance in Other Detectors LO**



No detector-specific information: GW used also in L0 of CHOD detector

- Same results in efficiency
- With a higher rate (~15%), the delay diminishes



ır **2** 

### **Ready for use with** *InterTEL* **Boards**



RICH format and architecture allows to use InterTEL Boards

- Picture shows a possible use-case
- In the daisy-chain, only the last TEL62 sends primitives to the L0TP



### Conclusions



RICH-L0 GW produces cluster of hits providing a precise time reference for the L0 trigger of the NA62 experiment

| ACHIEVEMENTS                                                                             | COMMENTS                                                                                           |
|------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| Stands the full rate of detectors                                                        | Actual rate twice the one of simulations                                                           |
|                                                                                          | Efficiency of 98.76%                                                                               |
| Primitives time-ordered down to 100 ps                                                   |                                                                                                    |
| Maximum delay of 3 time-frames <ul> <li>Lower than the required 5 time-frames</li> </ul> | The higher the rate, the lower the delay <ul> <li>Up to the saturation of the GbE links</li> </ul> |
| GW fits the PP and SL FPGAs                                                              | No timing violations, full-scale clock frequency                                                   |
| No detector-specific information used                                                    | Employed also in the L0 of CHOD detector                                                           |
| InterTEL ready                                                                           | All modules use the common RICH format                                                             |