## Bare-metal Programming on Zynq UltraScale+ for the FGC4 Power Converter Controller

Martin Cejp, Dariusz Zielinski. 3rd CERN SoC Workshop, 2023-10-04





- Project introduction
- Architecture
- Our needs and constraints
- Possible solutions
- Bare-metal bootloader and utilities

# FGC4 project





# FGC4 project – basic principle







#### We want to use Linux to:

- Reuse code with the FEC software.
- Take advantage of all available libraries, network stack, file system, etc.
- Make it easier to develop, debug and maintain our software.

#### But we also want to:

- Achieve around 100 kHz current regulation frequency.
- Be able to run hard-real-time code with absolute determinism.
- Be able to run lots of calculations in a short amount of time.



| Problem type              | What you need                                                                                              | Typical solution                                                                              |
|---------------------------|------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Real-time problem         | Determinism<br>Tasks have to finish by the deadline.                                                       | Microcontrollers, DSPs - simpler CPUs (e.g. Cortex-M, Cortex-R), FPGAs<br>Bare-metal, RTOS    |
| Number-crunching problem  | Throughput<br>With margin, even if a task overruns, it's still the same number of tasks per unit of time. | SoCs, servers - more complex CPUs (e.g. Cortex-A, x86-64, etc.)<br>Linux                      |
| High-frequency regulation | Determinism and throughput                                                                                 | ???                                                                                           |

### How to address these needs?



### Idea #1

# Let's just use Linux...



### Frequency aim for the FGC4:

100 kHz → 10 µs per iteration

Preliminary tests showed that a regulation iteration takes 4-6 µs (depending on the case).

> This gives us a worst-case margin of ~4 µs (not yet counting FPGA communication latency).
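The budget above is plain arithmetic. A minimal sketch, in which the function names are illustrative and not taken from the FGC4 code:

```cpp
#include <cstdio>

// Period of one regulation iteration in microseconds, for a rate in kHz.
constexpr double iteration_period_us(double rate_khz)
{
    return 1000.0 / rate_khz;
}

// Time left in the period once the regulation code has run.
constexpr double margin_us(double rate_khz, double compute_time_us)
{
    return iteration_period_us(rate_khz) - compute_time_us;
}

// Print the budget for the best and worst measured compute times (4 and 6 us).
inline void print_budget()
{
    std::printf("period:              %.1f us\n", iteration_period_us(100.0));
    std::printf("margin (best case):  %.1f us\n", margin_us(100.0, 4.0));
    std::printf("margin (worst case): %.1f us\n", margin_us(100.0, 6.0));
}
```

Any interruption longer than the worst-case margin makes the iteration miss its deadline, before FPGA communication latency is even considered.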

### Possible solutions - Idea #1

### Can we force Linux to be deterministic? We tried:

- Applying the real-time patch (RT patch).
- Setting CPU affinity for the process.
- Setting interrupt affinity.
- Memory locking.
- Kernel tweaking (isolcpus, nohz\_full, rcu\_nocbs, irq\_affinity).
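Two of the user-space measures from this list, CPU affinity and memory locking, can be sketched with standard Linux calls. This is an illustration under the assumption of a glibc-based system, not the code used in the tests; the helper names are made up here.

```cpp
#include <sched.h>      // sched_setaffinity, CPU_ZERO, CPU_SET (glibc)
#include <sys/mman.h>   // mlockall

// Pin the calling thread to a single CPU core. Returns true on success.
inline bool pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;   // pid 0 = caller
}

// Lock current and future pages into RAM to avoid page faults in the
// regulation loop. May fail without sufficient privileges (RLIMIT_MEMLOCK).
inline bool lock_memory()
{
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}
```

Pinning is typically combined with isolating the same core on the kernel command line, so that the scheduler does not place other tasks there.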

### $\rightarrow$ Best achieved interruption time: 4 µs.

...while the available margin is also only ~4 µs.

The kernel interrupts the process roughly every 4 ms (250 Hz). This is the kernel tick, which is present even in "tickless" kernels.
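One way to observe such interruptions from user space is to read a monotonic clock in a tight loop and look for the largest gap between consecutive readings. The sketch below illustrates the idea only; it is not the measurement method used for the figures above, and the function names are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Record n back-to-back timestamps (in ns) from a monotonic clock.
inline std::vector<std::int64_t> sample_timestamps_ns(std::size_t n)
{
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> t;
    t.reserve(n);   // allocate up front so the loop itself does not allocate
    for (std::size_t i = 0; i < n; ++i) {
        auto now = clock::now().time_since_epoch();
        t.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
    }
    return t;
}

// Largest gap between consecutive timestamps; 0 for fewer than two samples.
inline std::int64_t max_gap_ns(const std::vector<std::int64_t>& t)
{
    std::int64_t worst = 0;
    for (std::size_t i = 1; i < t.size(); ++i) {
        const std::int64_t gap = t[i] - t[i - 1];
        if (gap > worst) worst = gap;
    }
    return worst;
}
```

On an isolated core, a gap far above the typical loop time points to the process having been interrupted, for instance by the kernel tick.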

Conclusion

You cannot have hard real-time determinism on Linux



#### In summary we tried:

- **RT patch** $\rightarrow$ Not enough
- Kernel config tweaking $\rightarrow$ Not enough
- **Isolation patch** $\rightarrow$ Not ready yet (*hobby project*)
- Xenomai project $\rightarrow$ Quite complex (Jailhouse seems better)
- Jailhouse project $\rightarrow$ Not so popular, difficult set-up





### Idea #2

# Let's follow the Xilinx recommendation and use Cortex-R5...



#### We ran benchmarks on Cortex-A53:

| Scenario | FPGA access | Logging | Mean rate, NXP [kHz] | Mean rate, Zynq [kHz] | Slowest rate, NXP [kHz] | Slowest rate, Zynq [kHz] | Slowest iteration, NXP [µs] | Slowest iteration, Zynq [µs] |
|----------|-------------|---------|----------------------|-----------------------|-------------------------|--------------------------|-----------------------------|------------------------------|
| Idle     | No          | No      | 788                  | 284                   | 757                     | 273                      | 1.32                        | 3.66                         |
| Idle     | No          | Yes     | 285                  | 106                   | 240                     | 100                      | 4.16                        | 10.00                        |
| Idle     | Yes         | No      | 389                  | 249                   | 378                     | 242                      | 2.64                        | 4.12                         |
| Idle     | Yes         | Yes     | 215                  | 100                   | 195                     | 93                       | 5.12                        | 10.74                        |
| Direct   | No          | No      | 724                  | 261                   | 694                     | 253                      | 1.44                        | 3.94                         |
| Direct   | No          | Yes     | 335                  | 119                   | 266                     | 111                      | 3.76                        | 8.94                         |
| Direct   | Yes         | No      | 377                  | 230                   | 373                     | 222                      | 2.68                        | 4.50                         |
| Direct   | Yes         | Yes     | 238                  | 111                   | 208                     | 101                      | 4.80                        | 9.86                         |

Benchmark results.

We barely achieved 100 kHz on a Cortex-A53 running at 1.2 GHz. The Cortex-R5 has a simpler microarchitecture and runs at ~500 MHz.

Conclusion

Cortex-R5 is too slow.
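A rough back-of-the-envelope check of this conclusion is to scale a measured rate by the clock ratio alone. The sketch assumes equal work per clock cycle on both cores, which is optimistic for the simpler Cortex-R5; the function name is illustrative.

```cpp
// Scale a measured iteration rate by the clock-frequency ratio alone.
// Assumes the same work per cycle on both cores (an optimistic assumption).
constexpr double scaled_rate_khz(double measured_khz,
                                 double measured_clock_mhz,
                                 double target_clock_mhz)
{
    return measured_khz * target_clock_mhz / measured_clock_mhz;
}
```

Taking the 101 kHz slowest rate measured on the Zynq Cortex-A53 (Direct scenario, FPGA access and logging enabled), `scaled_rate_khz(101.0, 1200.0, 500.0)` gives roughly 42 kHz, well short of the 100 kHz target.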



### Idea #3

### Let's run bare-metal on Cortex-A53...

### Possible solutions - Idea #3

#### Bare-metal approach:

- Limit the **RAM** and the **number of CPU cores** visible in Linux (*by adjusting the Device Tree*).
- **Compile** the binary for a **bare-metal target**.
- Load the binary at a given memory address and start a CPU core at this address.



#### Challenges:

- Proper compilation of bare-metal applications.
- Handling different Exception Levels.
- Starting, interrupting, reloading and monitoring of the bare-metal program.
- Cache, MMU, and interrupts configuration.
- Inter-processor communication via shared memory.
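To illustrate the last challenge, inter-processor communication over shared memory is often built on a single-producer/single-consumer ring buffer. The sketch below is a generic example of that technique, not the protocol used in FGC4 or Bmboot. It assumes that both cores see the memory coherently and that the atomics are lock-free on the target.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer / single-consumer ring buffer. One core only calls push(),
// the other only calls pop(). Capacity must be a non-zero power of two.
template <typename T, std::size_t Capacity>
struct SpscRing
{
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a non-zero power of two");

    std::atomic<std::uint32_t> head{0};   // next slot to write (producer side)
    std::atomic<std::uint32_t> tail{0};   // next slot to read (consumer side)
    T slots[Capacity];

    bool push(const T& value)
    {
        const std::uint32_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == Capacity)
            return false;                                  // buffer is full
        slots[h & (Capacity - 1)] = value;
        head.store(h + 1, std::memory_order_release);      // publish the slot
        return true;
    }

    bool pop(T& out)
    {
        const std::uint32_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;                                  // buffer is empty
        out = slots[t & (Capacity - 1)];
        tail.store(t + 1, std::memory_order_release);      // free the slot
        return true;
    }
};
```

In a real system the structure would live at a fixed address inside the reserved memory range, with both sides compiled against the same layout and `T` restricted to trivially copyable types.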



### How to run a bare-metal app on Cortex-A53?

## Bare-metal apps on Cortex-A53



#### Naive approach:

1. Write and compile code
2. Copy it to memory (through Linux)
3. Clear reset bit of BM core
4. Celebrate your success



#### Just a few obstacles...

1. Code may be erroneous → application must be able to restart after a crash, while protecting the rest of the system
2. Cannot reliably stop application once started
3. Some peripherals are shared between Linux and bare metal (interrupt controller) and must be protected accordingly

#### Solution: OpenAMP?

#### **Our solution: Bmboot**

- A minimalist loader & monitor for bare-metal code
- Executes on a high privilege level (EL3) to protect itself from errors in user code
- Zero overhead while user code is executing, but can be called upon to intervene





### Bmboot

#### How to use it?

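```
#include <bmboot/payload_runtime.hpp>

#include <cstdio>

int main(int argc, char** argv)
{
    // Notify the monitor that the payload has started
    bmboot::notifyPayloadStarted();
    printf("Hello, world!\n");

    // There is nothing to return to on bare metal, so spin forever
    for (;;) {}
}
```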

root@diot:~# bmctl exec cpu3 hello\_world.bin
Hello, world!

- Write C++ code (being aware of limitations of the bare-metal environment)
- Link to the Bmboot SDK
- Launch monitor + user application from Linux (via CLI or API)

### Bmboot







- \$ bmctl startup cpu3
- \$ bmctl exec cpu3 hello\_world.bin
- \$ bmctl terminate cpu3





#### Lifecycle can also be managed by a user application on Linux via the API





#### **Communication via shared memory**



- Range of reserved memory determined ahead-of-time
- Cache coherence ensured by hardware Snoop Control Unit (no need to "flush" cache)
- No kernel driver necessary: access via the /dev/mem special device

### Bmboot



#### Other features

Crash handling (core dump)

root@diot:~# bmctl exec cpu1 real\_time\_app.bin
...
root@diot:~# bmctl status cpu1
crashed\_payload
root@diot:~# bmctl core cpu1
Writing to core.elf
root@diot:~# gdb real\_time\_app.elf core.elf

→ Inspect with GDB: registers, stack trace, memory snapshot

Useful also for post-mortems in operation

### Bmboot

#### Feature summary

- Execution environment with no run-time overhead
- Shared-memory communication
- Interrupt handler registration
  - Periodic *tick* callback (opt-in)
- Crash handling and recovery
- API and CLI for control from Linux

#### The bigger picture

- We feel that other groups must be solving a similar problem
- We would be happy to elevate this to a collaborative project
- We would like to hear from you







Thank you! Questions?