#### Real-Time FPGA Design for the L0 Trigger of the RICH Detector of the NA62 Experiment at CERN SPS

Mattia Barbanera, Francesco Gonnella





# The NA62 experiment



#### NA62 detector layout and principles

Measure the $K^+ \rightarrow \pi^+ \nu \bar{\nu}$ BR to 10% precision, collecting O(100) events



# Trigger and Data AcQuisition



- Three-level trigger to reduce the amount of stored data, from 10 MHz to 100 kHz of events:
  - Level-0: hardware trigger; reduction factor 10
  - Level-1 and Level-2: software triggers; reduction factor 10
- Some sub-detectors participate in the Level-0 trigger, producing fast, small-sized information to be sent to the L0TP: "trigger primitives"
  - Primitives contain time information (~100 ps LSB) and a primitive ID containing reduced information that differs for every detector
- The L0TP produces the L0 trigger if predefined conditions are satisfied

#### Some detectors using TEL62 have dedicated FW creating trigger primitives

#### RICH L0-FW: General Working Principle



- The aim of the RICH L0 primitive-generating firmware is to group together hits belonging to the same Cherenkov circle, creating time clusters (very general)
- In the PP a preliminary clustering is performed, and in the SL the clusters coming from the 4 PPs are merged together
- In the final stage of the SL, clusters are used to produce primitives to be sent to the L0 Trigger Processor



#### RICH L0-FW: General Working Principle

- All modules are clocked at 160 MHz
- The delay in the production of primitives must be less than 5 time frames of 6.4 µs (the basic time division of the experiment)
- Inside the firmware a common data-format (RICH format) is used
  - All the modules must accept RICH format as input and output format, so that they can be freely moved inside the firmware
- A common clustering module is implemented, able to accept the RICH format as input and output
- It is possible to use a multiple-TEL62 setup by daisy-chaining all the SLs with Inter-TEL boards (as foreseen by the collaboration for the coming years)
  - In this case, only the last board sends primitives to the L0TP



TWEPP 2016 - Mattia Barbanera 29 September 2016

#### RICH L0-FW: Working Scheme





# Data Converter: RICH format

- Reads TDC data sorted in frames of 25 ns and converts them into RICH format: each hit becomes a cluster with N<sub>hits</sub>=1 and CTS=0 (Cluster Time-Sum)
  - The format carries a 400 ns timestamp, the cluster fine time, and the time-sum of the differences between the cluster time and the time of every hit belonging to the cluster
  - Because the sum is signed, its value is small on average even if the cluster contains a significant number of hits
- It produces 16 400-ns timestamps (TS) per 6.4 µs frame
  - If there are no data corresponding to a TS, a fake cluster (speed-data) with N<sub>hits</sub>=0 and CTS=0 is produced
- This module can handle 1024 − 16 = 1008 words per 6.4 µs frame
- Every word can be split into two 16-bit words to be sent through the Inter-TEL bus
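The word budget quoted above is consistent with one word per clock cycle at the 160 MHz clock stated for all modules. Under that assumption (the slides do not spell out the derivation), the numbers work out as:

$$
N_{\text{words}} = 6.4\,\mu\text{s} \times 160\,\text{MHz} = 1024, \qquad
N_{\text{TS}} = \frac{6.4\,\mu\text{s}}{400\,\text{ns}} = 16, \qquad
N_{\text{words}} - N_{\text{TS}} = 1008 .
$$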

| Bits 31:30    | Bits 29:24                               | Bits 23:16                                       | Bits 15:14    | Bits 13:12                            | Bits 11:0                                        |
|---------------|------------------------------------------|--------------------------------------------------|---------------|---------------------------------------|--------------------------------------------------|
| T.S.<br>1/2   | Timestamp (27:14) 400 ns (bits 29:16)    |                                                  | T.S.<br>2/2   | Timestamp (13:0) 400 ns (bits 13:0)   |                                                  |
| Data<br>1 1/2 | N<sub>hits</sub> (7:2)                   | Cluster Time-Sum (7:0) (signed)<br>LSB = 100 ps  | Data<br>1 2/2 | N<sub>hits</sub> (1:0)                | Fine Time (11:0)<br>LSB = 100 ps (up to 400 ns)  |
| Data<br>2 1/2 | N<sub>hits</sub> (7:2)                   | Cluster Time-Sum (7:0) (signed)<br>LSB = 100 ps  | Data<br>2 2/2 | N<sub>hits</sub> (1:0)                | Fine Time (11:0)<br>LSB = 100 ps (up to 400 ns)  |
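As an illustration of the data-word layout in the table, the following Python sketch packs and unpacks one cluster. The field positions follow the table; the 2-bit tag values in bits 31:30 and 15:14 are not given in the slides, so `TAG_DATA_HI` and `TAG_DATA_LO` are hypothetical placeholders, as are the function names.

```python
# Hypothetical tag encodings: the slides show a tag field in bits 31:30 and
# 15:14 but do not give its values, so these are placeholders only.
TAG_DATA_HI = 0b10
TAG_DATA_LO = 0b11


def pack_data_word(n_hits, cts, fine_time):
    """Pack one cluster into a 32-bit RICH-format data word."""
    if not 0 <= n_hits < 256:
        raise ValueError("n_hits must fit in 8 bits")
    if not -128 <= cts <= 127:
        raise ValueError("cts must fit in 8 signed bits")
    if not 0 <= fine_time < 4096:
        raise ValueError("fine_time must fit in 12 bits")
    # Upper 16-bit half: tag, N_hits(7:2), CTS(7:0) in two's complement.
    hi = (TAG_DATA_HI << 14) | ((n_hits >> 2) << 8) | (cts & 0xFF)
    # Lower 16-bit half: tag, N_hits(1:0), fine time(11:0).
    lo = (TAG_DATA_LO << 14) | ((n_hits & 0x3) << 12) | fine_time
    return (hi << 16) | lo


def unpack_data_word(word):
    """Return (n_hits, cts, fine_time) from a 32-bit data word."""
    hi, lo = word >> 16, word & 0xFFFF
    n_hits = (((hi >> 8) & 0x3F) << 2) | ((lo >> 12) & 0x3)
    cts = hi & 0xFF
    if cts >= 128:  # restore the sign of the 8-bit time-sum
        cts -= 256
    return n_hits, cts, lo & 0xFFF


def split_halves(word):
    """Split a 32-bit word into the two 16-bit halves sent over Inter-TEL."""
    return word >> 16, word & 0xFFFF
```

Because each 16-bit half carries its own tag, the two halves can be recognised independently after being sent separately over the Inter-TEL bus.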


# SL Input Data merger



- "SL data merger" is purely combinatorial and merges the clusters from two sources
  - Clocked at the same frequency as the other modules (160 MHz)
  - Waits until both of its input FIFOs are non-empty
  - Compares the type of the words (TS, data, speed-data) and their time
  - Produces data sorted in frames of 25 ns (like the input of the PP), with no replication of TS or speed-data
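The merging rule can be illustrated with a small behavioural model in Python. This is a software sketch, not the combinatorial logic of the firmware: words are modelled as hypothetical `(timestamp, kind, fine_time)` tuples, each input is assumed to be already time-ordered, and a speed-data word is kept only when neither source has data for that timestamp.

```python
import heapq

# Ordering of word kinds inside one timestamp (an assumption of this model).
KIND_ORDER = {"TS": 0, "DATA": 1, "SPEED": 2}


def merge_streams(a, b):
    """Merge two time-ordered word streams without replicating TS or speed-data."""
    merged = heapq.merge(a, b, key=lambda w: (w[0], KIND_ORDER[w[1]], w[2]))
    out = []
    current_ts = None
    has_payload = False
    for ts, kind, fine in merged:
        if ts != current_ts:
            # First word seen for this timestamp: emit a single TS word.
            current_ts, has_payload = ts, False
            out.append((ts, "TS", 0))
        if kind == "DATA":
            out.append((ts, "DATA", fine))
            has_payload = True
        elif kind == "SPEED" and not has_payload:
            # Keep one speed-data word only if the timestamp is otherwise empty.
            out.append((ts, "SPEED", 0))
            has_payload = True
    return out


source_a = [(0, "TS", 0), (0, "DATA", 120), (1, "TS", 0), (1, "SPEED", 0)]
source_b = [(0, "TS", 0), (0, "SPEED", 0), (1, "TS", 0), (1, "SPEED", 0)]
print(merge_streams(source_a, source_b))
```

In this example timestamp 0 keeps its single data word and timestamp 1, empty in both sources, yields one TS word followed by one speed-data word.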




### SL average calculator

- The arithmetic mean of the hit times is obtained using the CTS field
  - Each time the clustering module receives a hit (or a pre-cluster, in the SL) in input, the sum is updated (here $N_{new}$ is the multiplicity of the incoming hit or pre-cluster):
    - $CTS_{new} = CTS_{old} + N_{new}(t_{input} - t_{seed})$
  - The average calculator computes the reference time as follows:
    - $Seed_{new} = Seed_{old} + CTS_{old}/N_{old}$
  - The multiplicity remains the same and the sum is reset to 0:
    - $N_{new} = N_{old}$
    - $CTS_{new} = 0$
- Used in the PPs in order to have clusters with more precise reference time and to avoid overflows of CTS field in the SL
- The final time is computed with the calculator in the SL
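A minimal Python sketch of the average calculator follows. It assumes the sign convention in which CTS accumulates (hit time minus seed time), which is the one that makes Seed + CTS/N the mean hit time. The function names and the use of floor division are assumptions; the slides do not state the rounding used in firmware.

```python
def cluster_time_sum(seed, hit_times):
    """CTS: signed sum of (hit time - seed), all in 100 ps units."""
    return sum(t - seed for t in hit_times)


def recompute_seed(seed, n, cts):
    """Move the seed to the mean hit time, keep N, reset the sum to 0."""
    if n == 0:
        return seed, n, 0  # fake (speed-data) cluster: nothing to average
    # Floor division is an assumption of this sketch.
    return seed + cts // n, n, 0


hits = [1000, 1004, 1008]
seed = hits[0]
cts = cluster_time_sum(seed, hits)           # 0 + 4 + 8 = 12
print(recompute_seed(seed, len(hits), cts))  # (1004, 3, 0)
```

Resetting the sum after each recomputation keeps the stored value small, which is what protects the 8-bit CTS field from overflowing further down the chain.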



#### Clustering-module: working principle



- 4 cells handling a 25 ns time slot correspond to an instantaneous cluster rate of 160 MHz
- 16 rows of cells are used to guarantee a throughput of 1, while handling many time frames
- Rows are used cyclically: in case a cluster must be formed with events in two adjacent frames, the data distributor sends the hit to the previous row




#### Clustering-module: Data Distributor

- Rearranges the input data into TS@25 ns (32 bit) and fine time@100 ps (8 bit)
- Delivers data to the proper row, splitting it into TS@25 ns (32 bit) and fine time@100 ps (8+1 bit)
- Handles clusters split across two adjacent TS@25 ns by sending them to the proper row
  - The 9th bit is set to one if the cluster belongs to the row used at the moment, to zero otherwise
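The time splitting can be sketched in Python as below. The sketch only shows how a time in 100 ps units divides into a 25 ns timestamp and an 8-bit fine time. The round-robin choice of row (timestamp modulo 16) is an assumption based on the rows being used in cycle, and the border handling with the 9th bit is deliberately not modelled.

```python
TICKS_PER_SLOT = 250  # 25 ns / 100 ps
N_ROWS = 16           # rows of clustering cells


def distribute(t_ticks):
    """Split a time in 100 ps ticks into (row, ts_25ns, fine_time)."""
    ts_25ns = t_ticks // TICKS_PER_SLOT
    fine_time = t_ticks % TICKS_PER_SLOT  # 0..249, fits in 8 bits
    row = ts_25ns % N_ROWS                # assumed round-robin row choice
    return row, ts_25ns, fine_time


print(distribute(250 * 17 + 42))  # (1, 17, 42)
```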



# Clustering-module: Cell

- Each cell stores the first *Time*<sub>0</sub> (9 bit) received
- If the Time<sub>1</sub> of a subsequent cluster matches the stored Time<sub>0</sub> within a programmable time window, the cell merges the 2 clusters as follows:
  - $CTS_0 \mathrel{+}= N_1 \cdot (Time_1 - Time_0) + CTS_1$
  - $N_0 \mathrel{+}= N_1$
- Divided into 2 blocks separated by a FIFO
  - The matching block, handling the comparison between Time<sub>0</sub> and Time<sub>i</sub>
  - The computing block, handling the computations (containing an FPGA-embedded multiplier)
- When the flush mode is enabled, it acts as a shift register, giving the stored cluster as output
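The matching and merging step of a cell can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions: the `Cluster` class, the `try_merge` name and the symmetric window test are illustrative, and the update uses (Time_1 minus Time_0) so that CTS stays referred to the stored seed.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    time: int  # reference (seed) time, 100 ps units
    n: int     # multiplicity
    cts: int   # signed cluster time-sum, referred to `time`


def try_merge(stored, incoming, window):
    """Merge `incoming` into `stored` if the times match within `window`.

    Returns True when the merge happened, False otherwise.
    """
    dt = incoming.time - stored.time
    if abs(dt) > window:  # assumed symmetric, programmable window
        return False
    # Re-refer the incoming time-sum to the stored seed, then accumulate.
    stored.cts += incoming.n * dt + incoming.cts
    stored.n += incoming.n
    return True


cell = Cluster(time=100, n=2, cts=4)
print(try_merge(cell, Cluster(time=110, n=3, cts=-3), window=20), cell)
```

In the example the incoming cluster contributes 3 × 10 − 3 = 27 to the sum, so the cell ends with N = 5 and CTS = 31.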



# Clustering-module: Sorting

- Dedicated electronics for the sorting of the clusters
- Each cell has an internal position field, that is appended to the output cluster
- The position field increases when a cluster with time bigger than the cluster seed passes through the cell
  - All the seeds that cannot fit in the row (i.e. from the 4<sup>th</sup> seed in a row) are subtracted from each cell





#### Clustering-module: Data Collector

Retrieves the data produced by the 16 rows of clustering cells, sorts them and converts them back into RICH format

- Reader of the rows: in order to read 1 word per clock cycle, it reacts to the empty signals of two consecutive rows
- Sorter: sorts the clusters by addressing a RAM with the position field computed by the cells. To sustain the full rate, there are 8 RAMs
- Cluster Discard: discards the clusters whose multiplicity is outside a predefined range
- **Formatting** module: formats the cluster in the RICH format



#### Clustering-module: Performance



- The throughput of the clustering module (like any other module in the RICH firmware) must be kept at 1 word per clock cycle
- For this reason we need 16 rows of clustering:
  - Two rows are filled "at the same time" to take care of border effect
  - Once the second row is completed, the first can be read out
  - 16 rows are needed to compensate the latency of the cell:
    - The latency of one cell is given by: 2d + m + f where d is the depth of the row, m is the latency of the multiplier and f is the latency of the internal FIFO



In our case 2\*4 + 3 + 3 = 14 < 16, so there will always be an empty row to fill
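The latency budget can be checked with a short Python sketch. The function and parameter names are illustrative; the formula and the numbers are the ones quoted above.

```python
def cell_latency(depth, mult_latency, fifo_latency):
    """Latency of one cell: 2d + m + f."""
    return 2 * depth + mult_latency + fifo_latency


def rows_sufficient(n_rows, depth, mult_latency, fifo_latency):
    """True if there is always an empty row to fill (latency < rows)."""
    return cell_latency(depth, mult_latency, fifo_latency) < n_rows


print(cell_latency(4, 3, 3))         # 14
print(rows_sufficient(16, 4, 3, 3))  # True
```

With these numbers the margin is two rows, so a deeper row (d = 5) would already bring the latency to 16 and exhaust it.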

#### L0 multiplicity vs offline multiplicity



- Correlation between offline multiplicity and FW multiplicity
  - The line corresponding to FW multiplicity 0 represents the inefficiency of the FW algorithm



# Delay of primitive production

MTP Offset for RICH vs timestamp



Delay is stable and between 2 and 3 time-frames of 6.4 µs each
FEE sends data in time-frames of 6.4 µs



# Delay of primitive production

MTP Offset for CHOD vs timestamp



Because of its generality, the FW has also been employed for the L0 of the CHOD detector

With a higher rate (~15%), the delay tends to diminish



#### Conclusions

- A firmware for the RICH Level-0 has been developed and is working with an efficiency of 98.76%
- The system can sustain the full rate of the detector
  - The real rate of the detector is twice that of the MC: a single Gb-Ethernet link cannot sustain the primitive rate
- The maximum delay of primitive production is 3 time-frames of 6.4 µs each
  - The higher the rate, the faster the production (up to the saturation of the GbE link)
- Because of its generality, it has also been employed for the L0 of the CHOD detector, and it is ready for the use of Inter-TEL boards, foreseen by the NA62 collaboration for the years to come



## InterTEL configuration



Daisy-chain Heavy Neutrino trigger architecture




### RICH L0 multiplicity

Rich L0 Multiplicity





### **RICH L0 multiplicity**





#### Ultra rare kaon-decays

$K^+ \to \pi^+ \nu \bar{\nu}$: theoretically pure and almost experimentally unexplored




#### These processes are very sensitive probes for new physics:

- They are highly suppressed
- They are predicted with very high accuracy



#### In-flight kaon decay at 75 GeV/c

