# Fine-grained hierarchical placement constraining for timing closure (and more)

# Álvaro Navarro-Tobar











# Presentation and outline

CIEMAT has been part of CMS since its construction (myself for "just" 15 years), doing, among many other things, FPGA stuff

- Drift Tubes construction: readout frontend and backend (Altera, Spartan-IIe)
- Phase1 upgrade: readout backend (Spartan 6, Virtex 7)
- Phase2 upgrade: detector frontend (Polarfire) and trigger backend (Virtex UltraScale+)

#### Outline

- Python-generated hierarchical placement constraints
  - Introduction and motivation
  - Goals
  - Description
  - Results
- [VHDL record serializer/deserializer methodology]
- [clock\_strobes entity to help with data handover across related clocks]

# CMS Muon Trigger Primitive generation Phase-2 upgrade

- <u>Analytical Method</u> (proposed by CIEMAT) has been developed to implement reconstruction of the barrel (DT+RPC) trigger primitives in HL-LHC
- Drift Tube chamber hits are received asynchronously and need to be combined on each Superlayer to form a track.
- Laterality uncertainty and time drift uncertainty (400 ns)
- Exploits maximum achievable resolution, bringing the hw system closer to offline performance capabilities. [10.1016/j.nima.2023.168103]

In collaboration with UAM and Univ. Oviedo. Univ. Oviedo also participates in OMTF.



# Need for speed

<u>Throughput</u>, *throughput*, **throughput**: our algorithm suffers from combinatorial explosion. Squeezing out the last few percent of efficiency requires substantially increasing the number of hit combinations that can be analyzed within a fixed latency

Due to system constraints (experiment, hardware, firmware framework), the most "natural" frequencies to use are 240, 360 and 480 MHz (6x, 9x and 12x LHCclk)

480 MHz seemed like a good idea, challenging but achievable. Target 2.079 ns
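The period targets quoted here follow directly from the LHC clock multiples. A quick check, assuming a nominal LHC bunch-crossing clock of 40.079 MHz (that value is not stated in the slides, and the function names are only illustrative):

```python
# Nominal LHC bunch-crossing clock in MHz; assumed value, not from the slides.
LHC_CLK_MHZ = 40.079


def period_ns(multiplier: int, base_mhz: float = LHC_CLK_MHZ) -> float:
    """Clock period in ns for an integer multiple of the base clock."""
    return 1e3 / (multiplier * base_mhz)


def freq_mhz(period: float) -> float:
    """Frequency in MHz for a clock period given in ns."""
    return 1e3 / period


for mult in (6, 9, 12):
    print(f"{mult:2d}x LHCclk = {mult * LHC_CLK_MHZ:6.1f} MHz -> {period_ns(mult):.3f} ns")
```

With this base clock, the 12x case gives the 2.079 ns target, and `freq_mhz(1.7)` gives roughly 588 MHz.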



# Not so fast

The DT AM algorithm can be split in ~15 relatively small sub-units (entities)

My bottom-up approach

1. Design each entity so that it is able to run **very fast**, closing timing with a big margin $(1.7-1.8 \text{ ns})$ $\rightarrow$ challenging, 100 ps at this scale is a big achievement
   - out-of-context runs
   - 1.7 ns = 588 MHz; global clock buffers, BRAM and DSP performance is in the range 650-750 MHz
2. Integrate. With proper piping between modules, the margin from the first step should be enough to absorb the difficulty of putting the whole design together
3. Fail: I ended up closing timing at 360 MHz

# The Usual Suspects

Problems identified in my design and workflow

- SLR boundary crossings
- Not enough piping between modules (e.g. some modules output directly from unregistered RAM outputs)
- High-fanout control signals not sufficiently piped (reset, BX, enables...). Sometimes this is not so easy to detect, as the violation may not appear in the high-fanout signal itself but in the entity's inner RTL, which is being pulled too tightly.
- Vivado doesn't do a good enough job placing a big design (500k LUTs/FFs) with many paths close to critical timing spread over a big area; the complexity is too high. Rolling the dice a million times almost guarantees you get a bunch of violations. They don't appear consistently in the same paths, or in the same instances of the same entities.





# Approaching the placement problem

Floorplanning Techniques

Consider gate-level floorplanning for a design that has never met timing, and in which changing the netlist or the constraints are not good options.

**Recommended:** Try hierarchical floorplanning before considering gate level floorplanning.



Vivado Design Suite User Guide: Design Analysis and Closure Techniques

Device AMD VU13P (4 SLRs)  $\Rightarrow$ try 2 chambers/SLR (2nd-level depth)  $\rightarrow$  Fail!!  $\rightarrow$  need finer grained placement, but...

- How deep will I end up needing? All the way down?
- What would my constraints file look like, iterating over repeated blocks with a 6-levels-deep inner structure?
- How maintainable will it be when the algorithm evolves?
- How easy is it to quickly draft and test pblocks for small instances?


# I wish I had a placement helper framework which...

- Is user-friendly, backend does the heavy lifting
- Is **aware of resources and utilization**, automatically resizes pblocks if needed, and reports on utilization with configurable safety margins
- Allows for **top-down workflow**: start with coarse placement, when it doesn't work, just subdivide a pblock to have inner structure, and result is offset to be placed where the original pblock was instantiated
- Allows for **bottom-up workflow**: when need arises to repeatedly test smaller, critical portions of design, it can generate constraints only for it, then run OOC, then progressively integrate on bigger portions of the design
- Reasonably easy to evolve with the design, **maintainable**, not a pain to remake everything when one of the modules in the middle of the algorithm is redesigned and grows or shrinks









# Backend: X-Y coordinates and FPGA definition



```
#### ********************
## VU13P definitions

## Horizontal
## Each character (s, d, r, u) represents a column: Slices, DSP, RAM, UltraRAM
## Placement numeration starts in X0
## A few of the "S" columns sometimes are replaced by "Laguna" blocks, basically
## top and bottom of each SLR
COLUMNS = ...  # column string not reproduced here

## Vertical
## VU13P has 4 SLR, each one of them has 4 clock distribution rows. Each row
## contains 60 Slices, 24 DSP, 12 RAM36, 16 UltraRAM
N_SLRS = 4
VATOMS_PER_SLR = 16
VATOM = {'s':15, 'd':6, 'r':3, 'u':4} # Vertical ATOM
LUT_PER_SLICE = 8
FF_PER_SLICE = 16
```

Horizontally, each character in the COLUMNS string represents a fabric column

- Slices, RAM, DSP
- Others are still missing (PCIE, CFGIO...), they affect routability
- Non uniform distribution, pblocks that use RAM or DSPs cannot really be offset horizontally
- Slice-only pblocks can be offset, with care

Vertically, divided in "vertical atoms", the minimum vertical step without fractional number of RAM and/or DSP

- Then stacked up to form SLR, then the full FPGA
- Turns out to be too coarse, favors generation of high-aspect-ratio, vertical pblocks, which is bad for routing.
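As a minimal sketch of how such a column string and the vertical atoms can be turned into resource counts for a rectangular area: the function name `available_resources`, the inclusive `(x_min, x_max, v_min, v_max)` convention and the short example column string are assumptions made for illustration, not the framework's actual code or the real VU13P column map.

```python
# Constants as given in the VU13P definitions of the slides.
VATOM = {'s': 15, 'd': 6, 'r': 3, 'u': 4}  # sites per column per vertical atom
LUT_PER_SLICE = 8
FF_PER_SLICE = 16


def available_resources(columns: str, boundaries: tuple) -> dict:
    """Count the resources inside a rectangle of the fabric.

    columns    -- one character per fabric column ('s', 'd', 'r', 'u')
    boundaries -- (x_min, x_max, v_min, v_max), all inclusive (assumed convention)
    """
    x_min, x_max, v_min, v_max = boundaries
    n_vatoms = v_max - v_min + 1
    span = columns[x_min:x_max + 1].lower()
    # Sites of each kind = columns of that kind * sites per atom * atoms spanned.
    sites = {kind: span.count(kind) * per_atom * n_vatoms
             for kind, per_atom in VATOM.items()}
    return {
        'lut': sites['s'] * LUT_PER_SLICE,
        'ff': sites['s'] * FF_PER_SLICE,
        'd': sites['d'],
        'r': sites['r'],
        'u': sites['u'],
    }


# Hypothetical 8-column fragment, not the real VU13P column map.
EXAMPLE_COLUMNS = 'ussdssss'
print(available_resources(EXAMPLE_COLUMNS, (0, 7, 0, 1)))
```

For six slice columns, one DSP column and one UltraRAM column over two vertical atoms this gives 1440 LUTs, 2880 FFs, 12 DSPs and 8 UltraRAMs.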

# User interface: pblocks and boxes (bonus: top-down workflow example)



```
this_thing_name = {
    'pblock': {
        'boundaries': (48,220,0,7),
        'options': {},
        'name': 'this_thing_pblock_name',
    },
    'contents': [
        {'resources': {'lut': 100,'ff': 200,'r': 0,'u':0,'d': 0},
         'path': 'gen_label[*]/sub_entity_A',},
        {'resources': {'lut': 100,'ff': 200,'r': 0,'u':0,'d': 0},
         'desc': 'just add again the resources without paths',},
        {'resources': {'lut': 500,'ff': 500,'r': 3,'u':0,'d': 3},
         'path': 'sub_entity_B',},
    ],
}
```

pblock

- Defines boundaries; if one edge is left undefined, it is auto-sized
- Options: IS\_SOFT, CONTAIN\_ROUTING, other features...
- Contents, each item:
  - Defines resource utilization
  - Defines paths (if path field present, otherwise just add the resources)

```
this_thing_name = [
    {'thing': sub_entity_A, 'name_prefix': 'SEA0_', 'path_prefix': 'gen_label[0]/', 'offset': (0,0)},
    {'thing': sub_entity_A, 'name_prefix': 'SEA1_', 'path_prefix': 'gen_label[1]/', 'offset': (0,3)},
    {'thing': sub_entity_B, 'name_prefix': '', 'path_prefix': '', 'offset': (0,0)},
]
```

box

- Organizer, contains a list of "things" (pblocks or other boxes)
- 3 alterations recursively applied down the hierarchy for each thing:
  - Prefix the name of inner pblocks
  - Prefix each path in all inner pblocks
  - Offset the fabric placement of each inner pblock
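The three recursive alterations can be sketched as a small flattening function. This is only an illustration under assumptions: the key names `name_prefix` and `path_prefix`, the reading of `offset` as (columns, vertical atoms), and the function name `flatten` are guesses based on the example, not the framework's real implementation.

```python
def flatten(thing, name_prefix='', path_prefix='', offset=(0, 0)):
    """Recursively expand a box into a flat list of pblock dicts."""
    if isinstance(thing, dict):  # a pblock: apply the accumulated alterations
        x0, x1, v0, v1 = thing['pblock']['boundaries']
        dx, dv = offset
        return [{
            'pblock': {**thing['pblock'],
                       'name': name_prefix + thing['pblock']['name'],
                       'boundaries': (x0 + dx, x1 + dx, v0 + dv, v1 + dv)},
            'contents': [
                {**item, 'path': path_prefix + item['path']} if 'path' in item else dict(item)
                for item in thing['contents']
            ],
        }]
    flat = []
    for entry in thing:  # a box: a list of entries wrapping pblocks or other boxes
        dx, dv = entry.get('offset', (0, 0))
        flat += flatten(entry['thing'],
                        name_prefix + entry.get('name_prefix', ''),
                        path_prefix + entry.get('path_prefix', ''),
                        (offset[0] + dx, offset[1] + dv))
    return flat


# Hypothetical pblock and box, modelled on the examples in the slides.
sub_entity_A = {
    'pblock': {'boundaries': (48, 55, 0, 1), 'options': {}, 'name': 'A'},
    'contents': [{'resources': {'lut': 100, 'ff': 200, 'r': 0, 'u': 0, 'd': 0},
                  'path': 'sub_entity_A'}],
}
box = [
    {'thing': sub_entity_A, 'name_prefix': 'SEA0_', 'path_prefix': 'gen_label[0]/', 'offset': (0, 0)},
    {'thing': sub_entity_A, 'name_prefix': 'SEA1_', 'path_prefix': 'gen_label[1]/', 'offset': (0, 3)},
]
for pb in flatten(box):
    print(pb['pblock']['name'], pb['pblock']['boundaries'], pb['contents'][0]['path'])
```

Because the function returns new dictionaries instead of modifying its input, the same pblock definition can be instantiated several times, which is what makes repeated blocks cheap to describe.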

### pblock output on console and tcl

- Shows columns for current pblock and neighboring towards left and right
- Shows slice shape and boundaries
- Shows used vs available resources
  - Warns when over-utilized or above safety margins
- Outputs pblocks to constraints file
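The used-versus-available check can be sketched as follows. The `contents` list mirrors the pblock example from the slides; the capacity figures, the `margin` threshold and the function names are hypothetical choices for illustration.

```python
def total_resources(contents):
    """Sum the 'resources' dicts of all content items of a pblock."""
    total = {}
    for item in contents:
        for kind, amount in item['resources'].items():
            total[kind] = total.get(kind, 0) + amount
    return total


def utilization_report(used, available, margin=0.7):
    """Return (resource, used, available, fraction, status) rows.

    margin is a hypothetical safety threshold: rows above it are flagged
    'WARN', rows above 100 % are flagged 'OVER'.
    """
    rows = []
    for kind in sorted(used):
        avail = available.get(kind, 0)
        if avail == 0:
            frac = 0.0 if used[kind] == 0 else float('inf')
        else:
            frac = used[kind] / avail
        status = 'OVER' if frac > 1 else 'WARN' if frac > margin else 'ok'
        rows.append((kind, used[kind], avail, frac, status))
    return rows


# Contents as in the pblock example of the slides.
contents = [
    {'resources': {'lut': 100, 'ff': 200, 'r': 0, 'u': 0, 'd': 0}, 'path': 'gen_label[*]/sub_entity_A'},
    {'resources': {'lut': 100, 'ff': 200, 'r': 0, 'u': 0, 'd': 0}, 'desc': 'resources only'},
    {'resources': {'lut': 500, 'ff': 500, 'r': 3, 'u': 0, 'd': 3}, 'path': 'sub_entity_B'},
]
# Hypothetical capacity of the target area.
available = {'lut': 1560, 'ff': 3120, 'r': 2, 'u': 0, 'd': 12}

used = total_resources(contents)
for kind, u, a, frac, status in utilization_report(used, available):
    print(f"{kind:4s} {u:5d} / {a:5d} ({frac:7.1%}) {status}")
```

In this made-up case the block RAM request exceeds the capacity, so that row is flagged `OVER`, the situation in which a pblock would need to be resized.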

`SSSDSSRSSDSSSSS USSDSSSS SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS`

Slice shape: 6x30, boundaries=(48,55,0,1)

Resource utilization:

| Resource | Used | Available | Utilization |
|----------|------|-----------|-------------|
| dsp      | 0    | 12        | 0.0%        |
| ff       | 700  | 2880      | 24.3%       |
| lut      | 950  | 1440      | 66.0%       |
| bram     | 0    | 0         | 0.0%        |
| ultraram | 4    | 8         | 50.0%       |

ortadelos input sorter

`SDSSSSSUSSDSSSS 5 55 SSSS SDSSRSS SDSSR SSDSSSSSSSSSSSS`

Slice shape: 13x15, boundaries=(56,72,0,0)

Resource utilization:

| Resource | Used | Available | Utilization |
|----------|------|-----------|-------------|
| dsp      | 0    | 12        | 0.0%        |
| ff       | 1000 | 3120      | 32.1%       |
| lut      | 700  | 1560      | 44.9%       |
| bram     | 8    | 6         | 133.3%      |
| ultraram | 0    | 4         |             |



create\_pblock sl\_filter
resize\_pblock sl\_filter -add { SLICE\_X88Y0:SLICE\_X116Y59 DSP48E2\_X12Y0:I
set\_property IS\_SOFT FALSE [get\_pblocks sl\_filter]
set\_property CONTAIN\_ROUTING TRUE [get\_pblocks sl\_filter]
add\_cells\_to\_pblock [get\_pblocks sl\_filter] [get\_cells sl\_filter\_inst ]
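A sketch of how Tcl constraints like these could be generated from a column string and pblock boundaries. The Vivado commands are the ones shown in the output; everything else is an assumption: the helper names, the `(x_min, x_max, v_min, v_max)` convention, and in particular the mapping from fabric column to site X index (counting only columns of the same kind, starting at X0), which ignores Laguna columns and other irregularities of the real device. RAMB18 sites are left out for brevity.

```python
VATOM = {'s': 15, 'd': 6, 'r': 3, 'u': 4}  # sites per column per vertical atom
SITE_PREFIX = {'s': 'SLICE', 'd': 'DSP48E2', 'r': 'RAMB36', 'u': 'URAM288'}


def site_ranges(columns, boundaries):
    """Translate (x_min, x_max, v_min, v_max) into Vivado site ranges."""
    x_min, x_max, v_min, v_max = boundaries
    ranges = []
    for kind, prefix in SITE_PREFIX.items():
        xs = [x for x in range(x_min, x_max + 1) if columns[x] == kind]
        if not xs:
            continue
        # Assumed mapping: the site X index counts only columns of the same kind.
        first = columns[:xs[0]].count(kind)
        last = first + len(xs) - 1
        y0 = v_min * VATOM[kind]
        y1 = (v_max + 1) * VATOM[kind] - 1
        ranges.append(f"{prefix}_X{first}Y{y0}:{prefix}_X{last}Y{y1}")
    return ranges


def pblock_tcl(name, cells, columns, boundaries, is_soft=False, contain_routing=True):
    """Build the Tcl commands that create and populate one pblock."""
    sites = ' '.join(site_ranges(columns, boundaries))
    cell_list = ' '.join(cells)
    return "\n".join([
        f"create_pblock {name}",
        f"resize_pblock {name} -add {{ {sites} }}",
        f"set_property IS_SOFT {str(is_soft).upper()} [get_pblocks {name}]",
        f"set_property CONTAIN_ROUTING {str(contain_routing).upper()} [get_pblocks {name}]",
        f"add_cells_to_pblock [get_pblocks {name}] [get_cells {{{cell_list}}}]",
    ])


# Hypothetical 8-column fragment, not the real VU13P column map.
print(pblock_tcl('sl_filter', ['sl_filter_inst'], 'ussdssss', (0, 7, 0, 1)))
```

Keeping the coordinate translation in one function means the user-facing description stays in abstract column and vertical-atom units, and only the backend needs to know the device's site naming.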

# Bottom-up: single entity. w/o contain routing



- Single entity plus input/output registers of wrapper
- Closes timing, seems ok
- But uses routing resources outside of pblock
- Will not work due to congestion in final design





Work in progress

# Attained results





(\*) RTL for SLR crossings and piping optimized thanks to fine-grained placement, then removed placement

Álvaro Navarro Tobar - 12/6/2024


# [VHDL record serializer/deserializer methodology]


You've got your data nicely structured in a VHDL record and want to convert it to a std\_logic\_vector and back again (e.g. for a FIFO or for transmission over a serial link). You &-concatenate. You slice the output. You change the record. You take a deep breath. You feel miserable. You *bug*.

```
-- Flatten/Unflatten mydata_t
-- flatten_unflatten_mydata (not shown here) is called with either obj or slv
-- and returns a mydata_combo_t holding both representations.
type mydata_combo_t is record
    slv : std_logic_vector(MYDATA_SIZE-1 downto 0);
    obj : mydata_t;
end record;

function flatten_mydata(obj : mydata_t) return std_logic_vector is
begin
    return flatten_unflatten_mydata( obj => obj ).slv;
end function;

function unflatten_mydata(slv : std_logic_vector) return mydata_t is
begin
    return flatten_unflatten_mydata( slv => slv ).obj;
end function;
```
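The point of routing both directions through a single helper is that the field layout is written down in exactly one place. The following Python model is not the VHDL methodology itself, only a sketch of that idea, which could also serve as a bit-accurate reference in a testbench. The field names and widths are hypothetical.

```python
# Hypothetical field layout: (name, width in bits), most significant field first.
MYDATA_FIELDS = [('valid', 1), ('bx', 12), ('tdc', 10)]
MYDATA_SIZE = sum(width for _, width in MYDATA_FIELDS)


def flatten_mydata(obj: dict) -> int:
    """Pack the record fields into one integer, first field in the top bits."""
    slv = 0
    for name, width in MYDATA_FIELDS:
        value = obj[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        slv = (slv << width) | value
    return slv


def unflatten_mydata(slv: int) -> dict:
    """Unpack an integer into record fields, using the same layout table."""
    obj = {}
    for name, width in reversed(MYDATA_FIELDS):
        obj[name] = slv & ((1 << width) - 1)
        slv >>= width
    return obj


record = {'valid': 1, 'bx': 3000, 'tdc': 517}
word = flatten_mydata(record)
print(f"{word:0{MYDATA_SIZE}b}", unflatten_mydata(word) == record)
```

Adding or resizing a field only touches `MYDATA_FIELDS`; both conversion directions follow automatically, which is the property that removes the hand-slicing bugs described above.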




# [Clock phase strobes]

Working with a family of related clocks (e.g. LHC clock x1, x9, x12)

Sometimes code running in a fast clock domain needs to know its phase relationship to the parent clock:

- data passing between parent and its derived clock
- data passing between sibling clocks, forcing write and read on edges known to be safe
- ...

clock\_strobes.vhd takes the two clocks and generates an array of phase strobes (example N=4):



Occupies <5 LUTs and <20 FFs  $\Rightarrow$  don't bother distributing strobes, just instantiate a copy wherever you need it
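A behavioural sketch of such a strobe array, as a plain Python model rather than the VHDL entity. It assumes a one-hot convention in which strobe `i` is high during the `i`-th fast-clock cycle after a rising edge of the parent clock; the real `clock_strobes.vhd` may define the phases differently, and the function name is illustrative.

```python
def phase_strobes(n: int, fast_cycles: int, start_phase: int = 0):
    """Yield, for each fast-clock cycle, a one-hot list of n phase strobes.

    n           -- ratio between the fast clock and the parent clock
    fast_cycles -- number of fast-clock cycles to model
    start_phase -- phase of the first modelled cycle (assumed 0 = parent edge)
    """
    for cycle in range(fast_cycles):
        phase = (start_phase + cycle) % n
        yield [int(i == phase) for i in range(n)]


# Example with N=4 over two parent-clock periods.
for cycle, strobes in enumerate(phase_strobes(4, 8)):
    print(cycle, strobes)
```

Logic in the fast domain can then enable a write or a read only on the strobe whose phase is known to be safe for the handover.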

# Conclusions

- I thought it might be useful to have a framework to help with placement that makes it a little more friendly and I am working on it.
- It's very beta but already making my life easier with placement
- Placement helps a lot with identifying problems with RTL
- Placement can make a difference in getting those last 100 ps
- Wishlist/future work in backup
- Record (de)serialization and clock\_strobes also save me time (and bugs)

# Discussion

- Does it sound interesting?
- Any advice on how to make it better or what else to include?
- Or any tips on timing closure in general that you want to share?
- ...?


## Wishlist/future work, sorted by increasing unlikeliness

- refactoring for better quality code
- non-rectangular pblocks, could improve RAM/DSP resources usage
- increase the vertical granularity
- nested pblocks: could make sense to give space to an entity, and hand place a
  particularly critical subset of its registers. Currently achieved by just doing 2 pblocks,
  one that includes the other
- some columns have either slices or laguna depending on their vertical position on the SLR. currently just assuming they're always slices
- LUTs used for RAM are not taken into account. makes no distinction on different kinds of slices or usages for LUTs
- Provide feedback on costly routing: some horizontal/vertical paths have more dramatic effects on timing than others (crossing columns of DSPs or IO blocks) Could this info be incorporated to provide more info to the user doing the placement?
- automatic reading of vivado outputs to calculate resource utilization
- other FPGAs? other vendors?
- a nice GUI