## Digital Signal Processors: fundamentals & system design



### Lectures plan



M. E. Angoletta, "DSP fundamentals & system design – LECTURE 3", CAS 2007, Sigtuna 2/36



Chapter 8: RT design flow - analysis & optimisation

Chapter 9: RT design flow - system design

Chapter 10: RT design flow - system integration

Chapter 11: Putting it all together ...






- 8.1 Introduction
- 8.2 Optimiser ON
- 8.3 Analysis tools
- 8.4 Optimisation guidelines
- Summary




### 8.1 Introduction

- Optimisation targets: speed, memory usage, I/O bandwidth, power consumption.
- Steps: Debug  $\rightarrow$  set optimiser ON  $\rightarrow$  analyse & optimise (*if needed*).
- Debug & optimise: different & conflicting phases!

- Tuneable configurations.
  - Debug: debug features enabled.
  - Release: optimised (size/speed) version.
  - Custom.




### 8.2 Optimiser ON



- Many optimisation phases (levels): size vs. speed.
- Power consumption is often a critical factor, too!
- Careful: optimiser rearranges code!



Desired action may be modified:

BAD!

`unsigned int *ctrl;`

`while (*ctrl != 0xFF);`




OK: declare the control word `volatile`, so that the optimiser can neither remove nor cache the read.
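A minimal sketch of a polling loop that survives optimisation, assuming the control word is updated by hardware outside normal program flow. The function name `wait_ready`, the `CTRL_READY` macro and the poll limit are illustrative assumptions, not part of the original slide; only the `0xFF` value comes from it.

```c
/* Ready code: 0xFF as in the slide; the macro name is hypothetical. */
#define CTRL_READY 0xFFu

/* Polls a control word until it reads CTRL_READY or max_polls is exhausted.
 * 'volatile' tells the compiler that the value may change outside program
 * flow, so every iteration performs a real read from memory.
 * Returns 1 if the ready code was seen, 0 on timeout. */
int wait_ready(volatile unsigned int *ctrl, unsigned long max_polls)
{
    while (max_polls-- > 0) {
        if (*ctrl == CTRL_READY) {
            return 1;
        }
    }
    return 0;
}
```

The poll limit is a design choice added here so that a missing device cannot hang the caller; the slide's loop has no such limit.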

### 8.2 Optimiser ON [2]

- Recommended code development flow:
  - <u>Phase 1</u>: write C/C++ code.
  - <u>Phase 2</u>: optimise C/C++ code.
  - <u>Phase 3</u> (if needed): code time-critical areas in linear assembly.
  - <u>Phase 4</u> (if needed): code time-critical areas by hand in assembly.




### 8.3 Analysis tools

- Know what to optimise! 20% of the code does 80% of the work.
- Know when to stop!  $\rightarrow$  diminishing returns.
- TI tools:
  - Compiler consultant: recommendations to optimise performance.
  - Cache tune: graphical visualisation of memory reference patterns, to identify conflict areas.
  - Code size tune: optimises the trade-off between code size and cycle count.

Figure: enabling Compiler Consultant for a project.

NB: the tools have limitations when used with h/w. Use the simulator!


### 8.4 Optimisation guidelines

#### ... i.e. how to write more efficient code from the start

- Make the common case fast.
- Allocate memory wisely ( $\rightarrow$  linker!) & use DMA.
- Keep pipeline full.
  - Small code may fit in internal memory
  - Software pipelining: memory has edges!
- Native vs. emulated data types: faster execution on native data types (h/w vs. emulated arithmetic). → KNOW YOUR DSP!
- Function calls: pass few parameters (if no more registers are available, parameters are passed via the stack → slow!).


### 8.4 Optimisation guidelines [2]

- Data aliasing: multiple pointers may point to the same data → compiler cannot optimise → use compilation switches to state whether aliasing occurs (YES/NO).
- Loops:
  - Avoid function calls & control statements inside loops.
  - Move operations from inner to outer loops (compilers focus on inner loops).
  - Keep loop code small (local repeat optimisation).
  - Loop counter: int/unsigned int instead of long.
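The aliasing and loop guidelines above can be sketched in a few lines of C99. The function `scale_buffer` and its parameters are hypothetical. The `restrict` qualifier is the source-level counterpart of a "no aliasing" compilation switch: it is a promise made by the programmer, which the compiler does not check.

```c
/* Computes out[i] = in[i] * (gain / norm) for i in [0, n).
 * 'restrict' promises the compiler that out and in never overlap,
 * so loads and stores may be reordered and pipelined freely. */
void scale_buffer(float *restrict out, const float *restrict in,
                  unsigned int n, float gain, float norm)
{
    /* Loop-invariant division moved out of the loop. Note that this may
     * round slightly differently from in[i] * gain / norm. */
    const float k = gain / norm;
    unsigned int i;                /* native-width counter, not long */

    for (i = 0; i < n; i++) {
        out[i] = in[i] * k;        /* small body: no calls, no branches */
    }
}
```

Calling `scale_buffer` with overlapping buffers breaks the `restrict` promise and gives undefined behaviour, which is why the aliasing decision has to be a conscious one.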


### 8.4 Optimisation guidelines [3]

- Time-consuming operations:
  - Division: often no h/w support for single-cycle division. Use shifts when possible (power-of-two divisors).
  - cos, sin, atan (+ high resolution): often needed by accelerator systems!
    - → CERN LEIR LLRF: Taylor-expansion. Resolution comparable to VisualDSP++ emulated double floating point but faster!

| Function | CERN single-precision implementation [µs] | VisualDSP++ single-precision implementation [µs] | VisualDSP++ double-precision implementation [µs] |
|----------|-------------------------------------------|--------------------------------------------------|--------------------------------------------------|
| cosine   | 0.25 (for a sine/cosine couple)           | 0.59                                             | 5.5                                              |
| sine     | (included in the couple above)            | 0.59                                             | 5.3                                              |
| atan     | 0.4125                                    | 1.4                                              | 5.6                                              |

CERN LEIR LLRF: execution times of optimised & high-resolution function implementations.
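To illustrate the Taylor-expansion approach, here is a minimal single-precision sketch that returns a sine/cosine couple in one call. It is not the CERN LEIR LLRF code: the function name, the polynomial orders and the restriction to |x| ≤ π/4 (range reduction assumed to be done by the caller) are all assumptions made for this example.

```c
/* Sine/cosine couple from truncated Taylor series, intended for
 * |x| <= pi/4 (x in radians). Horner form keeps the work to a few
 * multiply-adds per function, which suits DSP MAC units. */
void sincos_taylor(float x, float *s, float *c)
{
    const float x2 = x * x;

    /* sin x ~ x - x^3/3! + x^5/5! - x^7/7! */
    *s = x * (1.0f + x2 * (-1.0f / 6.0f
             + x2 * (1.0f / 120.0f + x2 * (-1.0f / 5040.0f))));

    /* cos x ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! */
    *c = 1.0f + x2 * (-0.5f + x2 * (1.0f / 24.0f
             + x2 * (-1.0f / 720.0f + x2 * (1.0f / 40320.0f))));
}
```

Computing both functions together shares the `x2` term, which is one plausible reason why a sine/cosine couple can cost little more than a single function.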


### 8.4 Optimisation guidelines [4]

- Use libraries: optimisation done @algorithmic level (FFT, FIR, IIR...).
  - Sometimes data format not fully IEEE compatible for speed opt.
  - ADI Blackfin BF533 : IEEE-compliant vs. non IEEE-compliant library functions.

| Operation | fast-fp [cycles] | IEEE-fp [cycles] | Ratio (fast/IEEE) |
|-----------|------------------|------------------|-------------------|
| multiply  | 93               | 241              | 0.4               |
| add       | 127              | 264              | 0.5               |
| subtract  | 161              | 329              | 0.5               |
| divide    | 256              | 945              | 0.3               |
| pow       | 8158             | 17037            | 0.5               |

- Power optimisation: s/w plays big role!
  - Minimise access to off-chip memory.
  - Use power-management API (not task!).
  - RTOS can help.

Power management (PWRM) added to DSP/BIOS for 'C5x DSPs.




### Chapter 8 summary

- Code optimisation: size, speed, power.
- Compiler optimisation rearranges code  $\rightarrow$  turn optimisation ON after debugging!
- If compiler optimisation not enough  $\rightarrow$  linear / hand-coded assembly.
- Development environment provides analysis tools: compiler consultant, cache/code size tune.
- Write efficient code from the start  $\rightarrow$  optimisation guidelines.



- 9.1 Introduction: DSP & architecture choice.
- 9.2 DSP : fixed vs. floating point
- 9.3 DSP: benchmarking
- 9.4 Architecture: multi-processing option
- 9.5 Architecture: radiation effects
- 9.6 Architecture: interfaces
- 9.7 Code design: interrupt-driven vs. RTOS
- 9.8 Code design: good practices
- 9.9 General recommendations
- Summary


### 9.1 Intro: DSP & architecture

- DSP choice in industry: "4P" law (<u>Performance</u>, <u>Power consumption</u>, <u>Price</u>, <u>Peripherals</u>).
- DSP choice in accelerator sector: "Power consumption" factor negligible. Other factors:
  - Standardisation in laboratory.
  - System evolution / migration to other machines.
  - Possible synergies.
  - Existing know-how / tools / hardware.




### 9.2 DSP: fixed vs. floating point

Number format: influences DSP architecture.

| Number format (32 bits) | Dynamic range [dB] |
|-------------------------|--------------------|
| Fixed point             | ~180               |
| Floating point          | ~1500              |

#### Fixed point: integer arithmetic

- (−) Scaling operations needed (*but* DSP features help, e.g. saturation).
- (+) Fast (*but* scaling operations needed...).
- Algorithms (e.g. MPEG-4 compression) are bit-exact: made for fixed point.
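As an illustration of the scaling and saturation mentioned above, here is a minimal sketch of a fractional multiply in the common Q15 format. The type and function names are hypothetical, and the arithmetic right shift of a negative value is assumed to behave as on typical DSP and host compilers (it is implementation-defined in C). On a real DSP this whole function is usually a single saturating instruction.

```c
#include <stdint.h>

/* Q15: a 16-bit signed value v represents v / 32768, range [-1, 1). */
typedef int16_t q15_t;

/* Saturating Q15 multiply: the 32-bit product is in Q30, so it is
 * rounded, scaled back by a 15-bit right shift and clipped to Q15. */
q15_t q15_mul_sat(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product */

    p = (p + (1 << 14)) >> 15;             /* round, then rescale to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* reached only by (-1) * (-1) */
    if (p < INT16_MIN) p = INT16_MIN;      /* defensive lower clip */
    return (q15_t)p;
}
```

For example, `q15_mul_sat(16384, 16384)` multiplies 0.5 by 0.5 and yields 8192, i.e. 0.25; without the clip, (-1) × (-1) would wrap around to -1.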

#### Floating point: integer/real arithmetic

- (−) High power consumption & slower speed (*but* scaling NOT needed).
- (−) Expensive. DSP format often not fully IEEE-compliant (for speed).
- (+) High dynamic range helps many algorithms (e.g. FFT).

#### NB: Variable gap between numbers.

Large numbers  $\rightarrow$  large gaps; small numbers  $\rightarrow$  small gaps.


### 9.2 DSP: fixed vs. floating point [2]

Floating point: often the choice for accelerator systems, but ... CAREFUL!

LHC beam control (BC) example:

- Cavities @ ~400.78 MHz.
- F out:
  - format: unsigned int (16 bits)
  - resolution: 0.15 Hz
  - range: 10 kHz from 400.7819 MHz
- Single-precision floating point: number spacing > 1 Hz @ 400 MHz!
- TigerSHARC: h/w single-precision float, emulated double → beam loops calculations done as offset from 400.7819 MHz.

LHC beam control: zoom onto beam loops part.
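The numbers in this example can be checked with a short sketch, assuming IEEE-754 single precision on the machine running it. The helper names are hypothetical. At 400.78 MHz the gap between adjacent single-precision values is 32 Hz, whereas a 16-bit code spanning a 10 kHz offset range resolves about 0.15 Hz.

```c
#include <stdint.h>
#include <string.h>

/* Gap between x and the next larger single-precision value.
 * Assumes IEEE-754 binary32 and a positive, finite x. */
float float_spacing(float x)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &x, sizeof bits);    /* reinterpret float as integer */
    bits += 1u;                        /* next representable value above x */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}

/* Resolution of an unsigned 16-bit offset code spanning range_hz. */
double offset_resolution_hz(double range_hz)
{
    return range_hz / 65536.0;
}
```

Working with the offset from 400.7819 MHz keeps the values small, where single-precision spacing is far finer than the required 0.15 Hz.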




### 9.3 DSP: benchmarking

#### Performance commonly judged via metric set

| METRIC              | UNIT           |
|---------------------|----------------|
| Max clock frequency | [MHz]          |
| Power consumption   | [W or W/MIPS]  |
| Execution speed     | [MIPS, MOPS]   |
| Memory bandwidth    | [Mbytes/s]     |
| Memory latency      | [clock cycles] |

- Metrics often give peak/projected values. Difficult comparison!
  - Clock frequency can differ from instruction frequency.
  - MIPS: VLIW DSPs have simple instruction set → one instruction does less work.
  - MOPS: often based on MAC. Not included: control instructions & memory bottlenecks.


### 9.4 Architecture: multi-processor option

#### a) Multi-DSP

- Many DSPs collaborate to carry out processing.
- Essential: good application partition across DSPs.
- Inter-DSP communication channels: essential!
  - Cluster bus: resource sharing (ex: memory) & info broadcasting.
  - Point-to-point bus: direct communication among processing elements.

#### ADI SHARC DSP EXAMPLE




### 9.4 Architecture: multi-processor option [2]

#### b) Multi-core

- Multiple cores in same physical device: performance increase without architectural change.
  - Boosts effective performance: more gain for small core freq. increase.
  - Already-used philosophy: coprocessors (ex: Viterbi decoders).



- Two main flavours:
  - Symmetric Multi-Processing (SMP): similar/identical DSPs.
  - Asymmetric Multi-Processing (AMP): DSP + MCU.
- Architecture options:
  - Cores operate independently (DSP farm).
  - Core interaction for task completion.


### 9.4 Architecture: multi-processor option [3]

#### b) Multi-core

- Resource partitioning:
  - @board level (like single-core case)
  - @ device level (added complexity).

Example of multi-core bus & memory hierarchy.



- Inter-core communication must be available.
- Programming is complex: re-entrancy rules must be followed.
  - Needed to keep one core's processing from corrupting the data of another core's processing.
  - Single-core code follows re-entrancy rules for multitasking, too.
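A minimal sketch of what the re-entrancy rules mean in practice, using a hypothetical running-average routine. The first version keeps its state in static variables, so two callers (cores or tasks) silently share it; the second takes all of its state from the caller.

```c
/* NOT re-entrant: the running sum lives in static storage, so two cores
 * or tasks calling this function corrupt each other's result. */
float average_shared(float sample)
{
    static float sum = 0.0f;
    static unsigned int count = 0;

    sum += sample;
    count++;
    return sum / (float)count;
}

/* Re-entrant: all state is owned by the caller, one instance per user. */
typedef struct {
    float sum;
    unsigned int count;
} avg_state_t;

float average_reentrant(avg_state_t *st, float sample)
{
    st->sum += sample;
    st->count++;
    return st->sum / (float)st->count;
}
```

With one `avg_state_t` per core, interleaved calls from different cores cannot influence each other. Data that must truly be shared still needs explicit protection, which this sketch does not cover.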


### 9.4 Architecture: multi-processor option [4]




### 9.5 Architecture: radiation effects

- Single Event Upset (SEU): radiation-induced circuit alterations.
- General mitigation techniques:
  - Device level: extra doping layers to limit substrate charge collection.
  - Circuit level: decoupling resistors/diodes/transistors... for SRAM.
  - System level:
    - Error Detection & Correction (EDAC) circuitry.
    - Algorithm-based fault tolerance (ex: Weighted Checksum Code). → difficult with floating point!
- ADI/TI: no rad-hard DSP offered (third-party companies offer ADI/TI rad-hard versions).
- Application example: CERN LHC power supply controllers.
  - TI DSP 'C32 + MCU (non rad-hard).
  - EDAC circuit for SRAM protection.
  - Watchdog to restart system if it crashes.
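To make the EDAC idea concrete, here is a minimal software sketch of a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This is a textbook stand-in chosen for brevity: the source does not say which code the LHC power supply controllers use, and real EDAC for SRAM is done in hardware on wider words. All function names are hypothetical.

```c
#include <stdint.h>

/* Returns the bit at 1-based position pos of codeword w. */
static unsigned int bit_at(uint8_t w, unsigned int pos)
{
    return (w >> (pos - 1u)) & 1u;
}

/* Encodes 4 data bits into a 7-bit codeword.
 * Positions 1, 2, 4 hold parity; positions 3, 5, 6, 7 hold data. */
uint8_t hamming74_encode(uint8_t data)
{
    unsigned int d0 = data & 1u, d1 = (data >> 1) & 1u;
    unsigned int d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    unsigned int p1 = d0 ^ d1 ^ d3;   /* covers positions 3, 5, 7 */
    unsigned int p2 = d0 ^ d2 ^ d3;   /* covers positions 3, 6, 7 */
    unsigned int p4 = d1 ^ d2 ^ d3;   /* covers positions 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3)
                     | (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decodes a codeword, correcting one flipped bit if present.
 * *corrected (if not NULL) is set to 1 when a correction was made. */
uint8_t hamming74_decode(uint8_t cw, int *corrected)
{
    unsigned int s1 = bit_at(cw, 1) ^ bit_at(cw, 3) ^ bit_at(cw, 5) ^ bit_at(cw, 7);
    unsigned int s2 = bit_at(cw, 2) ^ bit_at(cw, 3) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned int s4 = bit_at(cw, 4) ^ bit_at(cw, 5) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned int syndrome = s1 | (s2 << 1) | (s4 << 2);

    if (syndrome != 0u) {
        /* The syndrome is the 1-based position of the flipped bit. */
        cw = (uint8_t)(cw ^ (1u << (syndrome - 1u)));
    }
    if (corrected != 0) {
        *corrected = (syndrome != 0u);
    }
    return (uint8_t)(bit_at(cw, 3) | (bit_at(cw, 5) << 1)
                     | (bit_at(cw, 6) << 2) | (bit_at(cw, 7) << 3));
}
```

A double bit flip is miscorrected by this code; practical EDAC schemes add an overall parity bit to detect that case.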


### 9.6 Architecture: interfaces

- DSP interfaces to define:
  - DSP-DSP
  - DSP-Master VME (control + diagnostics)
  - DSP-FPGA
  - DSP-daughtercards
  - Timing
- Don't hard-code addresses in DSP code: use the linker!
- Create data access libraries $\rightarrow$ modular approach (upgradeable!).
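A minimal sketch of a data access library, assuming a hypothetical DSP-FPGA register block. Register names, layout and function names are invented for illustration. On a real target the `fpga_regs` symbol would only be declared `extern` in C and placed at its physical address by the linker command file; a plain definition stands in for it here so that the sketch builds on a host.

```c
#include <stdint.h>

/* Register layout of a hypothetical DSP-FPGA interface block. */
typedef struct {
    volatile uint32_t control;
    volatile uint32_t status;
    volatile uint32_t data[4];
} fpga_regs_t;

/* Stand-in for a linker-placed symbol: no address appears in C code. */
fpga_regs_t fpga_regs;

/* Access library: application code calls these, never raw addresses. */
void fpga_write_control(uint32_t value)
{
    fpga_regs.control = value;
}

uint32_t fpga_read_status(void)
{
    return fpga_regs.status;
}

/* Returns the data word of channel ch, or 0 if ch is out of range. */
uint32_t fpga_read_data(unsigned int ch)
{
    return (ch < 4u) ? fpga_regs.data[ch] : 0u;
}
```

If the hardware map changes, only the linker file and this small library need updating, which is the upgradeability the slide refers to.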




### 9.7 Code design: interrupt-driven vs. RTOS

Fundamental choice. Depends on:

- System complexity
- Available resources

- Interrupt-driven : threads defined / triggered by interrupts.
  - Optimum resource use
  - OK for limited interrupt number.
- RTOS-based : RTOS manages threads + priority + trigger.
  - Some resources used by RTOS
  - Clean design + built-in checks






### 9.8 Code design: good practices

Figure: ADI SHARC emuclk registers and their use.


### 9.9 General recommendations

- Careful with new DSPs: they could be beta versions!
- Look @ DSP anomalies list.
- Gain s/w experience with development environment & simulator. ADI & TI give 90-day, fully functional free evaluations of their tools.
- Gain s/w + h/w experience with evaluation boards: easy prototyping. Helps to resolve technical uncertainties!



TI C6713 DSK: picture & block diagram.




### Chapter 9 summary

- DSP & architecture choice: influenced by many factors.
- Which DSP:
  - Fixed-point vs. floating point : LHC BC example
  - Benchmarking: careful, often misleading!
- System architecture:
  - Multi-DSP / multi-core.
  - Radiation effects.
- DSP: code design
  - Interrupt-driven vs. RTOS
  - Good practices
- General recommendations:
  - Beware of beta versions; check the anomalies list
  - Fully functional s/w evaluations
  - Use evaluation boards!




Chapter 10 topics

### RT design flow: system integration

- 10.1 Introduction
- 10.2 Good practices






### 10.1 Introduction



- Different developers (groups) involved: Instrumentation + Controls + Operation → coordination & specification work needed.
- Possibly slow: developers (groups) have different priorities.


### 10.2 Good practices

#### Work in parallel

- Start planning all layers asap $\rightarrow$ do not wait for low-level completion!

#### Interfaces

- Clear, documented & agreed upon.
- Useful: recipes on how to setup/interact etc.
- Keep documents updated & on server !

#### Spare parameters (in/out)

- Mapped DSP $\Rightarrow$ application program.
- Small upgrades / debug features can be added without new iterations.
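A minimal sketch of a parameter block with spare input and output slots, assuming a hypothetical layout shared between the DSP and the application program. Field names and the number of spares are invented for illustration.

```c
#include <stddef.h>

#define N_SPARE 4u   /* hypothetical number of spare slots per direction */

/* Parameter block mapped between the DSP and the application program. */
typedef struct {
    float loop_gain;            /* in: written by the application program */
    float spare_in[N_SPARE];    /* in: not yet assigned */
    float radial_position;      /* out: published by the DSP */
    float spare_out[N_SPARE];   /* out: not yet assigned */
} dsp_params_t;

/* Publishes a debug value through a spare output slot.
 * Returns 0 on success, -1 if the arguments are invalid. */
int publish_spare(dsp_params_t *p, unsigned int slot, float value)
{
    if (p == NULL || slot >= N_SPARE) {
        return -1;
    }
    p->spare_out[slot] = value;
    return 0;
}
```

Because the block layout is agreed once, a new debug signal can travel through a spare slot without changing the interface on either side.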

#### Code releases

- Save current release + description. Going back is easier in case of trouble.

#### Code validation

- Define data set + procedure for sub-system validation.






Putting it all together...

### A digital system example:

### **CERN LEIR LLRF**



i.e. how now you know more on DSP fundamentals & system design than two days ago!


### 11. Example: CERN LEIR LLRF



- Components: [$\rightarrow$ chapter 9]
  - DSP: beam loops implementation.
  - FPGA: fast processing + glue logic.
  - PowerPC: LLRF management & controls interface.
- Architecture: interrupt-driven + multi-DSP. [$\rightarrow$ chapter 9]
- DSP-DSP comms.: linkports + chained DMA. [$\rightarrow$ chapters 3, 4, 9]
- SRAM shared DSP-Master VME (FPGA access arbitration). [$\rightarrow$ chapter 4]




### 11. Example: CERN LEIR LLRF [2]

- **DSP boots** from on-board FLASH memory. [$\rightarrow$ chapter 4]
- Languages: C + assembly (ISR + shadow registers). [$\rightarrow$ chapter 6]
- Diagnostics buffers: four 1024-word buffers per DSP board. User-selectable decimation & signal (~50 per DSP board). [$\rightarrow$ chapter 9]



CERN LEIR LLRF system: radial position (red line) & B field (blue line) during a cycle.


### DSP fundamentals & system design: summary

- DSPs born early '80s: evolution in h/w + s/w tools.
- DSP architecture shaped by DSPing.
- DSP peripherals integrated & varied.
- RT design flow: s/w development.
  - Languages: assembly, C, C++, graphical. RTOS.
  - Code building process: compiler, assembler, linker.
- RT design flow: debugging.
  - Simulation / emulation.
- RT design flow: analysis & optimisation.
- RT design flow: system design & integration.
