

# AM + FPGA based L1 Track Trigger for CMS and INFN developments

Fabrizio Palla INFN Pisa

8<sup>th</sup> INFIERI Workshop FNAL - October 19, 2016



# Track Trigger





On-detector data reduction by a factor of ~20, mainly limited by nuclear interactions, conversions and loopers





Baseline L1 latency 12.5  $\mu$ s

**Data formatters**

- Track Trigger latency $< 5 \mu s$
- Trigger providing tracks with p<sub>T</sub> > 2 GeV at 40 MHz

**Track finders**

- Need time-multiplexing to be able to process ~50 Tb/s of incoming data
- Expect up to 300 candidate L1 tracks to Global Trigger





# Select only hits from "high-p<sub>T</sub>" tracks



Select "high-p<sub>T</sub>" tracks (>2 GeV) by correlating hits in 2 nearby sensors (stub)



R-Φ plane, "ideal" barrel layer



- F. Palla, G. Parrini, PoS VERTEX2007 (2007) 034, http://pos.sissa.it/archive/conferences/057/034/Vertex%202007_034.pdf
- J. Jones, A. Rose, C. Foudas, G. Hall, http://arxiv.org/pdf/physics/0510228v1.pdf

$$\Delta(R\phi) \approx 0.15 \, B \, \Delta R \, \frac{R}{p_T}$$

(B in T, R in m, p<sub>T</sub> in GeV; the displacement $\Delta(R\phi)$ has the same units as $\Delta R$)

**Large B field of CMS** beneficial!

- In the barrel, $\Delta R$ is given directly by the sensor spacing
- In the end-cap, it depends on the location of the detector
  - ⇒ End-cap configuration typically requires wider spacing (up to ~4 mm)
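The p<sub>T</sub> dependence of the hit displacement can be made concrete with a small sketch. It evaluates the small-angle relation displacement ≈ 0.15 · B · ΔR · R / p<sub>T</sub> (B in T, R in m, p<sub>T</sub> in GeV). The function names, the 1.6 mm spacing and the 0.5 m radius are illustrative assumptions, not values from the talk; 3.8 T is the nominal CMS solenoid field.

```python
def stub_displacement_um(pt_gev, radius_m, spacing_mm, b_field_t=3.8):
    """Approximate r-phi offset (micrometres) between the two hits of a stub.

    Small-angle approximation: 0.15 * B * dR * R / pT,
    with B in tesla, R in metres, pT in GeV.
    """
    if pt_gev <= 0:
        raise ValueError("pT must be positive")
    spacing_um = spacing_mm * 1000.0
    return 0.15 * b_field_t * spacing_um * radius_m / pt_gev


def passes_stub_window(pt_gev, radius_m, spacing_mm, pt_min_gev=2.0):
    """True if the track bends no more than a track at the pT threshold."""
    window_um = stub_displacement_um(pt_min_gev, radius_m, spacing_mm)
    return stub_displacement_um(pt_gev, radius_m, spacing_mm) <= window_um


# Hypothetical module: 1.6 mm sensor spacing at R = 0.5 m
print(stub_displacement_um(2.0, 0.5, 1.6))   # offset at the 2 GeV threshold
print(passes_stub_window(10.0, 0.5, 1.6))    # stiff track is accepted
print(passes_stub_window(1.0, 0.5, 1.6))     # soft track is rejected
```

Because the offset scales with B, a larger field widens the separation between low- and high-p<sub>T</sub> tracks for the same sensor spacing.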









### CMS 2S modules

**CBC2** architecture

Block diagram labels: FE amplifier and comparator; fast control; fast reset; test pulse; I2C refresh; pipeline shift register; 256-deep pipeline; 32-deep register

#### 2S(trip) sensor modules

- Sensors: 10 cm x 10 cm
- Read out by 8 CBCs on either side
- First discriminates signals by rejecting large clusters, then forms a coincidence between the two sensor planes
- Concentrator chip sends data from 8 chips to the GBT

Module diagram labels: 5 cm; nearest neighbour signals; sensors

Power budget:

- CMS Binary Chip (CBC), 2 x 8 chips: 1200 mW
- Power converter: 1000 mW
- GBT & opto package: 800 mW
- Concentrator: 2 x 200 mW
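The stub logic described for the 2S module (reject large clusters, then require a coincidence between the two sensor planes) can be sketched in software as follows. This is only an illustration under assumed parameters: `max_width` and `window` are hypothetical settings in strip units, and the cluster format is invented for the example; it is not the CBC implementation.

```python
def find_stubs(inner_clusters, outer_clusters, max_width=3, window=2.0):
    """Return (inner_centre, bend) pairs for accepted stubs.

    Each cluster is a (first_strip, width) tuple. Clusters wider than
    `max_width` strips are rejected (typical of low-pT or inclined tracks).
    A stub is formed when an outer-plane cluster lies within `window`
    strips of an inner-plane cluster; the closest one defines the bend.
    """
    def centres(clusters):
        return [first + (width - 1) / 2.0
                for first, width in clusters if width <= max_width]

    inner = centres(inner_clusters)
    outer = centres(outer_clusters)
    stubs = []
    for c_in in inner:
        offsets = [c_out - c_in for c_out in outer
                   if abs(c_out - c_in) <= window]
        if offsets:
            stubs.append((c_in, min(offsets, key=abs)))
    return stubs


inner = [(100, 2), (300, 6)]          # second cluster is too wide
outer = [(101, 1), (302, 2), (500, 1)]
print(find_stubs(inner, outer))       # one stub, from the narrow cluster
```

In hardware the window would depend on the module position and sensor spacing; here it is a single constant for simplicity.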



### CMS PS modules



### P(ixel)S(strip) module

 $\bigcirc$ strips = 100  $\mu$ m x 2.4 cm

 $\bigcirc$  pixels = 100  $\mu$ m x 1.5 mm

•Pixels are logically OR-ed for finding coincidence in the r- $\phi$  plane, and the precise z-coordinate is retained in the pixel storage and provided to the trigger processors.





# CMS L1 Track Trigger Demonstrator(s)



### Goal of the demonstrators

- Develop a hardware system capable of (efficiently) reconstructing tracks with p<sub>T</sub> > 3 GeV within a latency of 4 $\mu$s, using the current state-of-the-art technology, validating the simulation studies
- Use the simulation and emulation to then dimension the system with technology available for HL-LHC
- Evaluate the costs of the final system

### Three demonstrators under development

- AM + FPGA (this + Sergo's talks)
- Tracklet (Jorge's talk)
- TMT (Ian + Davide + Luigi's talks)



# L1 Track finding with AM + FPGA





Track fit with FPGA inside the Mezzanine (~1 ns/fit)

Latency < 4 $\mu$s (out of 12.5 $\mu$s)




### The demonstrator

L1 Track Trigger Tower





#### Data Source Board (DSB) shelf

- Emulates the output of ~400 modules
- 10 x Pulsar2b with RTM
- 100 QSFP+ fibers
  - 400 lanes @ 10Gbps

#### Pattern Recognition Board (PRB) shelf

- One Trigger Tower
- 10 x Pulsar2b with RTM
- Various PRM Mezzanines





# Time multiplexing



Since events from the LHC arrive every 25 ns and the time to reconstruct tracks is $4\mu$s, event processing has to be divided in time (time multiplexing).
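A minimal sketch of the arithmetic behind time multiplexing, assuming a simple round-robin assignment of bunch crossings to processing units. The function names are hypothetical; the multiplexing factor of 20 is the one quoted in the conclusions of this talk.

```python
import math

BUNCH_SPACING_NS = 25.0


def events_in_flight(latency_ns, spacing_ns=BUNCH_SPACING_NS):
    """Number of events that arrive while one event is being processed."""
    return math.ceil(latency_ns / spacing_ns)


def assign_node(bunch_crossing, tm_period):
    """Round-robin: each bunch crossing goes to one of tm_period nodes."""
    return bunch_crossing % tm_period


def node_event_interval_ns(tm_period, spacing_ns=BUNCH_SPACING_NS):
    """Time between two consecutive events seen by the same node."""
    return tm_period * spacing_ns


print(events_in_flight(4000.0))      # events arriving during a 4 us reconstruction
print(assign_node(41, 20))           # node handling bunch crossing 41
print(node_event_interval_ns(20))    # ns between events on one node
```

With a factor of 20, each node receives a new event every 500 ns, so the stages within a node still have to be pipelined to fit the 4 $\mu$s budget.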





### The AM at work







## Two step approach



1. Find low-resolution track candidates called "roads". This solves most of the pattern recognition.
2. Then fit tracks inside the roads. Thanks to the first step, this is much easier.
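The first step can be emulated in software as a lookup of coarse "superstrip" addresses against a pattern bank. The sketch below is a sequential illustration only: an AM chip compares all stored patterns in parallel as the hits arrive. The superstrip width, the bank contents and the function names are assumptions made for the example.

```python
def to_superstrip(strip, superstrip_width):
    """Coarsen a strip address into a superstrip ID."""
    return strip // superstrip_width


def match_roads(hits_per_layer, pattern_bank, superstrip_width=32,
                min_layers=None):
    """Return the indices of patterns ("roads") matched by the hits.

    hits_per_layer: one list of strip addresses per layer.
    pattern_bank:   list of tuples holding one superstrip ID per layer.
    min_layers:     layers that must match (default: all of them).
    """
    n_layers = len(hits_per_layer)
    if min_layers is None:
        min_layers = n_layers
    fired = [{to_superstrip(s, superstrip_width) for s in layer}
             for layer in hits_per_layer]
    roads = []
    for pattern_id, pattern in enumerate(pattern_bank):
        n_matched = sum(1 for layer, ss in enumerate(pattern)
                        if ss in fired[layer])
        if n_matched >= min_layers:
            roads.append(pattern_id)
    return roads


hits = [[10, 70], [40], [100]]            # three layers of strip hits
bank = [(0, 1, 3), (2, 1, 4), (5, 5, 5)]  # toy pattern bank
print(match_roads(hits, bank))                # all layers required
print(match_roads(hits, bank, min_layers=2))  # one missing layer allowed
```

Only the hits inside matched roads are passed on to the second step, which is why the subsequent fit has far fewer combinations to consider.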







AM chip + FPGA





# Dimensioning the system



#### **Number of patterns/Tower**

- ~ 0.5 M for the Barrel and Forward towers
- ~1 M for the intermediate  $\eta$  region (Hybrid) towers

4 (or 8) AM06 chips (128k patterns each, today's technology)
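The chip count follows from dividing the bank size by the chip capacity and rounding up. A short check, assuming "128k" means 2<sup>17</sup> patterns per AM06 chip (the helper name is hypothetical):

```python
import math

# Assumption: "128k" is read as 128 * 1024 patterns per AM06 chip
AM06_PATTERNS_PER_CHIP = 128 * 1024


def am_chips_needed(n_patterns, patterns_per_chip=AM06_PATTERNS_PER_CHIP):
    """Smallest number of AM chips able to hold n_patterns."""
    return math.ceil(n_patterns / patterns_per_chip)


print(am_chips_needed(500_000))    # barrel / forward tower
print(am_chips_needed(1_000_000))  # hybrid tower
```

Reading "128k" as 128 000 instead gives the same chip counts for both bank sizes.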

#### Matched roads & combinations in $\langle PU \rangle = 200$ + $t\bar{t}$

- ~20 matched roads on average (~60 at the 95th percentile)
- ~90 combinations on average (~250 at the 95th percentile)

### Fitting with "Principal Component Analysis"

Over narrow regions of the detector, equations linear in the local hit coordinates give resolutions on track parameters nearly as good as a time-consuming helical fit:

$$p_i = \sum_j A_{ij} x_j + B_i$$

- $p_i$ are the track parameters
- $x_j$ are the hit coordinates in the silicon layers
- $A_{ij}$ and $B_i$ are pre-stored constants determined from full simulation or real data tracks

Several ways to implement: ~20K constants
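To illustrate how pre-stored constants turn a fit into a single matrix multiplication, the sketch below uses a deliberately simple stand-in: a straight line z = z0 + r·cotθ in the r-z view, for which the least-squares constants have a closed form. This is not the PCA training used for the real constants (which are derived from simulated or real tracks); the radii and function names are assumptions made for the example.

```python
def linear_fit_constants(radii):
    """Constants A (2 x N) and B (2) so that (z0, cot_theta) = A z + B.

    Closed-form least-squares solution for a straight line
    z = z0 + r * cot_theta sampled at fixed layer radii.
    """
    n = len(radii)
    s_r = sum(radii)
    s_rr = sum(r * r for r in radii)
    det = n * s_rr - s_r * s_r
    a_z0 = [(s_rr - r * s_r) / det for r in radii]
    a_cot = [(n * r - s_r) / det for r in radii]
    return [a_z0, a_cot], [0.0, 0.0]


def apply_fit(A, B, x):
    """Evaluate p_i = sum_j A_ij x_j + B_i."""
    return [sum(a * xj for a, xj in zip(row, x)) + b
            for row, b in zip(A, B)]


radii = [0.20, 0.35, 0.50]                 # hypothetical layer radii (m)
A, B = linear_fit_constants(radii)         # computed once, then stored
z_hits = [0.03 + 1.2 * r for r in radii]   # track with z0 = 0.03, cot = 1.2
print(apply_fit(A, B, z_hits))
```

Once `A` and `B` are stored, each fit costs only multiplications and additions, which maps naturally onto FPGA DSP slices.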



### INFN Associative Memory chips



#### FNAL VIPRAM\_L1CMS talk by J. Hoff on Friday



| Version  | Year | Design                     | Tech.  | Area (cm²) | Patterns | Frequency (MHz) | Power (W) |
|----------|------|----------------------------|--------|------------|----------|-----------------|-----------|
| AM03     | 2004 | Std. cells                 | 180 nm | 1          | 5k       | 40              | 1.26      |
| AM04     | 2012 | Std. cells + Custom        | 65 nm  | 0.12       | 8k       | 100             | 3.70      |
| AM05     | 2013 | Std. cells + Custom + IP   | 65 nm  | 0.12       | 1k+2k    | 100             | <1        |
| AM06     | 2014 | Std. cells + Custom + IP   | 65 nm  | 1.7        | 128k     | 100             | 2-3       |
| **AM07** | 2016 | Std. cells + Custom        | 28 nm  | 0.1        | 16k      | 200             | 0.1       |

AM05: technology testing chip

AM06: production chip, used in FTK ATLAS track trigger

AM07: technology testing chip, under design



# 28 nm AM07 chip



#### AM07 architectural features

- Area: 10 mm<sup>2</sup>
- Memory depth: 16 kpatterns
- 400 bumps
- 4 independent cores
- LVDS or LVCMOS interface
- Working frequency: 200 MHz



1/4 area and power consumption w.r.t. AM06

#### AM07 goals

- Provide a working AM chip at 28 nm
- Test two different CAM designs
- Aim for pattern/unit area 4x AM06
- Design for 200+ MHz clock speed
- Lower energy/comparison/bit
- Include and validate LVDS I/O





### Pattern Recognition Engine flow







### Two different mezzanines





PRM05: first PRM developed for the CMS L1 track trigger, designed as a pilot board with the technology-test AM05 chip. We have been developing and testing the reconstruction FW on this board before porting it to PRM06.



PRM06: designed to be used in the demonstrator once the AM06 chip is available (Spring 2016). With its 1.5 million pattern bank it can cover a full trigger tower pattern bank (0.5-1M patterns). It benefits from an UltraScale FPGA.



# Pattern recognition mezzanines



#### INFN PRM05



#### INFN PRM06



| Logic blocks    | Number of resources (PRM05) | Number of resources (PRM06) |
|-----------------|-----------------------------|-----------------------------|
| AM Patterns     | 32 kpatterns                | 1.5 Mpatterns               |
| External Memory | 18 Mbit                     | 1.1 Gbit                    |
| Logic Cells     | 356 k                       | 726 k                       |
| Block RAM       | 25 Mb                       | 38 Mb                       |
| DSP Slices      | 1444                        | 2760                        |
| Transceivers    | 24 @ 8 Gbps                 | 28 @ 10.3 Gbps              |
| I/O Pins        | 300                         | 104, 416                    |



### PRM06 validation





Prototype delivery date:

Prototype validation date:

2 additional PRM06s:

Pulsar-PRM06 testing date:

June 2016

July 2016

September 2016

PRM06 hardware tested and working as expected

- GTH links: IBERT PRBS7 on all links
- AM06 communication and configuration: JTAG communication (each AM06 was tested before being mounted on the PCB: SerDes and bank memory with built-in test)
- SerDes links: PRBS on all links
- LVDS links going to FMC connector: loopback on evaluation board, static test
  - Update w.r.t. May: we tested all the links with the Pulsar and they are all OK
- RLDRAM3: reading and writing checked using Xilinx tools
- External flash memory: basic communication tried using OpenCores IP
- Test of the GTH links between PulsarIIb and PRM06





### **Firmware**







### Main FW steps



#### Data Organizer



- Stores up to 1k stubs/layer/event
- No fixed maximum on stubs per superstrip (SS)
- Supports Don't Care bits & missing layers
- Ping-pong operation with 2 instances
- 45-cycle read latency (first SSID in, first stub out), many stubs output per cycle, currently 450 MHz max
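The role of the Data Organizer (store stubs by superstrip ID, then return the stubs belonging to each matched road, with two instances alternating between writing and reading) can be modelled in a few lines. This is a behavioural sketch only: the class, its method names and the use of `None` for a missing layer are assumptions, and Don't Care bits are not modelled.

```python
from collections import defaultdict


class DataOrganizer:
    """Toy ping-pong stub store indexed by (layer, superstrip ID)."""

    def __init__(self, max_stubs_per_layer=1024):
        self.max_stubs_per_layer = max_stubs_per_layer
        self._banks = [defaultdict(list), defaultdict(list)]
        self._counts = [defaultdict(int), defaultdict(int)]
        self._write = 0  # index of the bank currently being filled

    def write_stub(self, layer, ssid, stub):
        """Store a stub; returns False if the layer is already full."""
        counts = self._counts[self._write]
        if counts[layer] >= self.max_stubs_per_layer:
            return False
        self._banks[self._write][(layer, ssid)].append(stub)
        counts[layer] += 1
        return True

    def swap(self):
        """End of event: filled bank becomes readable, other is cleared."""
        self._write ^= 1
        self._banks[self._write].clear()
        self._counts[self._write].clear()

    def read_road(self, road_ssids):
        """Stubs per layer for a road; None marks a missing layer."""
        bank = self._banks[self._write ^ 1]
        return [list(bank.get((layer, ssid), [])) if ssid is not None else []
                for layer, ssid in enumerate(road_ssids)]


do = DataOrganizer()
do.write_stub(0, 5, "stub_a")
do.write_stub(1, 7, "stub_b")
do.swap()                           # event complete, start reading it
print(do.read_road([5, 7, None]))   # third layer treated as missing
```

While one bank is read out for the matched roads, the other can already be filled with the next event, which is the point of the ping-pong arrangement.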



#### INFN PRM

- Seed using pairs (<12) of PS stubs
- Project into 2S, send out only the closest projected combination for a given road
- Helps tail events with many matched roads with many combinations
- ~120 to ~200 cycle latency, 200-300 MHz; can include track estimation
- More resource/latency intensive, but lightens load on downstream TF

Track Candidate Builder

#### Track Fitter



- Fmax of 500 MHz+
  - clock cycle of 2 ns or less
- Latency of 47 clock cycles
  - 94 ns @ 500 MHz
- One fit/clock cycle after that initial latency
  - 2 ns/fit @ 500 MHz
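For a fully pipelined fitter, the time to fit N candidates is the initial latency plus one clock cycle per additional candidate. A small sketch of that arithmetic, assuming the first result appears after exactly 47 cycles (the function name is hypothetical; the candidate counts are illustrative inputs):

```python
def fitter_time_ns(n_fits, f_mhz=500.0, latency_cycles=47):
    """Time until the last of n_fits results leaves a pipelined fitter."""
    if n_fits <= 0:
        return 0.0
    cycle_ns = 1000.0 / f_mhz
    return (latency_cycles + n_fits - 1) * cycle_ns


print(fitter_time_ns(1))     # a single fit: the pipeline latency
print(fitter_time_ns(90))    # 90 candidates back to back
print(fitter_time_ns(250))   # 250 candidates back to back
```

The latency term is paid once per burst, so for large bursts the cost per fit approaches one clock cycle.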



### Track Candidate Combiner



#### It selects one combination for each set of stubs of a matched road

- Use the innermost PS modules to build seeds, which are then extrapolated to the outer layers, where compatible stubs are searched for and the one closest to the extrapolation is retained
- Very efficient (98.5% in tt+PU200); it provides an excellent determination of the track parameters, which is used in the PCA track fitter
- After TCB, only 5% of the original stubs are retained, 70% of which belong to a primary particle



Current seed stub



### Track fitter: INFN PCA



Principal component analysis

Track parameters: $p_i = \sum_j A_{ij} x_j + B_i$

- Fit separately the r-$\phi$ and r-z views
- r-z: 20 bins in $\eta$ (size of 0.05), using only the precise PS module information
  - 20 sets of constants (also including 2/3)
  - z0 resolution better than 1 mm
- r-$\phi$: 2 (charge) x 7 (p<sub>T</sub>) bins (will be more), from the TCB
  - 98 sets of constants (also including 5/6)



**6/6**

| Plane    | Binning      | Quantity          | Value                                         |
|----------|--------------|-------------------|-----------------------------------------------|
| r-z      | 20 bins in η | Δη                | 0.0024                                        |
| r-z      | 20 bins in η | $\Delta z_0$ (cm) | 0.089                                         |
| r-φ      | 7 bins in pT | Δφ (rad)          | 0.00022 to 0.0018                             |
| r-φ      | 7 bins in pT | Δc/pT             | 0.8% to 4.1%; 0.8% to 3.7% (without tower 27) |

**5/6**

| Plane    | Binning      | Quantity          | Value                                          |
|----------|--------------|-------------------|------------------------------------------------|
| r-z      | 20 bins in η | Δη                | 0.0024 (6) to 0.0045 (7)                       |
| r-z      | 20 bins in η | $\Delta z_0$ (cm) | 0.090 (6) to 0.17 (5)                          |
| r-φ      | 7 bins in pT | Δφ (rad)          | 0.00024 to 0.0018 (9); 0.00031 to 0.0019 (10)  |
| r-φ      | 7 bins in pT | Δc/pT             | 0.8% to 4.2% (9); 1.2% to 6.4% (10)            |



### PRM05 FW features



- Includes entire reconstruction chain (all blocks integrated)
- Operates with 16 AM05 chips.
- Features:
  - Multiple clock domains: DO and TF @ 200 MHz, TCB @ 100 MHz
  - Per layer DC bits
  - Partial trigger tower coverage (32k patterns though covering ~30% of the full tower)
  - Missing layer (5/6) handling
- **Resources**: about 50% of the Kintex7 resources have been used. Expect to have better performance once ported to PRM06 (Kintex Ultrascale 060)
- Power consumption with the loaded FW and AM chip configured: less than 50 W



| Resource | Utilization | Available | Utilization % |
|----------|-------------|-----------|---------------|
| LUT      | 157413      | 298600    | 52.72         |
| LUTRAM   | 23745       | 108600    | 21.86         |
| FF       | 277107      | 597200    | 46.40         |
| BRAM     | 669         | 955       | 70.05         |
| DSP      | 503         | 1920      | 26.20         |
| IO       | 88          | 380       | 23.16         |
| GT       | 24          | 28        | 85.71         |
| BUFG     | 13          | 32        | 40.63         |
| MMCM     | 1           | 8         | 12.50         |



### Testing events with PRM05



- Tested using a Virtex6 evaluation board (limitations on GTX speed and FMC connectors 1 HPC, 1 LPC)
- Bank file: 32k patterns of the most probable patterns out of 0.5M patterns for barrel TT (18)
- Event root files: single muon, and complex events
- Simulated track parameters are in agreement with what we obtain from the actual implementation of the hardware. Small differences between software simulation and FW results (level of %) due to integer-float representation differences between the simulated and implemented TF.
- Validation of the firmware and simulations is ongoing.





# PRM05 processing time



- Processing time measured for a complex event in the test stand (32k pattern bank), sampling at 200 MHz
- Goal of the FW was to deal with big pattern banks and with multiple chips, processing time reduction is considered a secondary goal:
  - Inefficient stub transmission (8 layers multiplexed in 4 serial links due to evaluation board limitations)
  - Clocks: no clock domain optimization, slowed-down clocks for testing
  - No optimization of the resources: 12 TCB, relaxed delay parameters
  - Presence of special patterns/signals to control the AM chip response

Processing time, from the first road out of the AM05 chips to the last track out: ~2.1 $\mu$s, reducible with faster clocks

1 busy tt+PU140 event, max(stubs/layer)=79, matched roads from multiple chips

Timing diagram labels: AM stub transmission; delay before AM pattern recognition; AM pattern-matching latency (not optimized); first stubs out of DO (lay0); first TC out of TCB (ch0); tracks out of TF





### Processing time in ModelSim simulation, APPN



- Processing time measured for a complex event in simulation (ModelSim) with the same banks and chip configuration
- Extrapolation of the processing time in case of faster clocks. Case study:
  - AM chip: same frequency as AM05
  - Data Organizer: 400 MHz
  - TCB: 300 MHz
  - Track Fitter: 500 MHz
- The same event as in the previous slide has been sent to the simulated FW/HW

Processing time, from the first road out of the AM05 chips to the last track out, reduces to $0.8~\mu s$

#### 1 busy tt+PU140 event, max(stubs/layer)=79, matched roads from multiple chips





### Conclusions



- The Associative Memory + FPGA based demonstrator for the Level-1 Track Trigger of CMS is based on partitioning the tracker into $6(\eta) \times 8(\phi)$ towers, with a factor 20 time-multiplexing
  - Each tower requires between 500k and 1M patterns, corresponding to 4 to 8 AM06 chips
  - Pattern Recognition Mezzanines have been developed by INFN, aiming to demonstrate the ability to reconstruct tracks with the full-tower number of patterns using state-of-the-art technology (AM06 chip and KU060 FPGA)
  - The full FW has been developed, integrated and tested with 32k patterns; it is being ported to the 1.5M-pattern mezzanine and is well on track for the demonstration
  - The latency measured in the HW indicates that the target of $4\mu$s reconstruction is well within reach
- The next month will be crucial for demonstrating the full-size pattern behaviour

Backup material



### Status of PRM06 FW



- Basic firmware with basic interfaces is ready
  - Serial links (GTH)
  - Basic memory interface
  - I2C and Flash interfaces
- Ongoing developments:
  - Data Organizer ⇔ external-memory interface
  - PRM06 communication using IPBus: master sitting on Evaluation board FPGA and slave registers in the PRM FPGA. It is currently working fine on PRM05.
- Firmware migration (DO, TCB, TF, FSMs,...):
  - Move the current modules with the internal pattern memory (limited number of patterns: 32k patterns)
  - Use the external memory (full trigger-tower bank coverage)
  - Increase the frequency of the clock domains. Target freq. in ref. [1]



### PRM data distribution



### Data input



### Data output





Efficiency

# p<sub>T</sub> module performance







#### Prototype module test beam at DESY (2-4 GeV e<sup>+</sup>)



