

# DAQ, Online, and Software Triggers summary




V.V. Gligorov, CERN, on behalf of the Trigger/Online/Offline/Computing preparatory group

ECFA HL-LHC workshop, Aix-les-Bains, 23/10/2014





# Talk overview

A summary of the DAQ and software trigger plans for the experiments in HL-LHC (n.b. LHCb/ALICE upgrades coming in Run3)

- 1) Overview of DAQ architectures
- 2) Common assumptions and technologies
- 3) Software reconstruction in the HL-LHC era
- 4) Software triggers and real-time data analysis

As Wesley already said, a big thank you to all the working group members whose slides/results I have stolen!

# What is a "software trigger"?

- => A trigger implemented in "COTS" commodity processors, generally CPUs but possibly with GPU/FPGA or other "coprocessors" to help
- => Generally taken to mean a trigger which can perform something close to a "full event reconstruction" even if it doesn't in practice.

Another way to say this : anything which is not fixed-latency custom electronics. It is important to realize, though, that in the multi-core era the actual underlying hardware may well be far from homogeneous.



# 

The basic approach of all four collaborations can be summarized as follows : put as much as DAQ will allow into software triggers.

Nevertheless "physics" and hardware constraints are leading to implementation differences.

# DAQ overview

|                                        | ALICE                        | LHCb                        | CMS                           | ATLAS                       |
|----------------------------------------|------------------------------|-----------------------------|-------------------------------|-----------------------------|
| Hardware<br>trigger                    | No                           | No                          | Yes                           | Yes                         |
| Software<br>trigger input<br>rate      | 50 kHz Pb-Pb<br>200 kHz p-Pb | 30 MHz                      | 500/750 kHz for<br>PU 140/200 | 0.4 MHz                     |
| Baseline<br>processing<br>architecture | CPU/GPU/FPGA/<br>Cloud&Grid  | CPU farm<br>(+coprocessors) | CPU farm<br>(+coprocessors)   | CPU farm<br>(+coprocessors) |
| Software<br>trigger output<br>rate     | 50 kHz Pb-Pb<br>200 kHz p-Pb | 20-100 kHz                  | 5-7.5 kHz                     | 5-10 kHz                    |

# **ALICE DAQ**

ALICE's online and offline data processing integrated into a single workflow

Aim is to compress events, not throw them away : driven by the fact that traditional "physics" probes have low S/B, hence event filtering is not an efficient approach.



ALICE performs event compression, not selection, in their software "trigger"

# **ALICE DAQ**

| Detector | Input to<br>Online<br>System<br>(GByte/s) | Peak Output to Local<br>Data Storage<br>(GByte/s) | Avg. Output to<br>Computing<br>Center (GByte/s) |
|----------|-------------------------------------------|---------------------------------------------------|-------------------------------------------------|
| TPC      | 1000                                      | 50.0                                              | 8.0                                             |
| TRD      | 81.5                                      | 10.0                                              | 1.6                                             |
| ITS      | 40                                        | 10.0                                              | 1.6                                             |
| Others   | 25                                        | 12.5                                              | 2.0                                             |
| Total    | 1146.5                                    | 82.5                                              | 13.2                                            |

Input rate : about 1 TByte/s

Goal is to achieve around 100x compression between the input and the average output to the computing center
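The compression goal can be checked against the table above. The sketch below only redoes the arithmetic on the quoted GByte/s figures; the dictionary layout and function names are illustrative, not part of any ALICE software.

```python
# Data rates in GByte/s, copied from the ALICE table above.
RATES = {
    "TPC":    {"input": 1000.0, "peak_out": 50.0, "avg_out": 8.0},
    "TRD":    {"input": 81.5,   "peak_out": 10.0, "avg_out": 1.6},
    "ITS":    {"input": 40.0,   "peak_out": 10.0, "avg_out": 1.6},
    "Others": {"input": 25.0,   "peak_out": 12.5, "avg_out": 2.0},
}


def total(key):
    """Sum one column of the table over all detectors."""
    return sum(row[key] for row in RATES.values())


def compression_factor(output_key):
    """Ratio of total input rate to the chosen total output rate."""
    return total("input") / total(output_key)


print(f"peak compression:    {compression_factor('peak_out'):.1f}x")
print(f"average compression: {compression_factor('avg_out'):.1f}x")
```

With these numbers the peak output corresponds to a factor of about 14 and the average output to a factor of about 87, which is the "around 100x" quoted on the slide.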

Later compression stages perform detector calibrations which are fed back into earlier stages. The compression explicitly preserves the ability to recalibrate offline.



The data compression begins separately within each subdetector (the First Level Processors) and then continues once the whole event is built within the Event Processing Node farm.




# LHCb DAQ

LHCb's DAQ network is built around a bidirectional event-building farm.

Note that about 80% of the CPU in the event-building PCs remains free for implementing the "low-level trigger" (selecting on muon and CALO primitives) and/or the first stages of the event reconstruction.

Low-level trigger to be implemented in software, will NOT act on the front-end. Must read all events out regardless.

Need to transport/build 40 Tbit/s
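As a rough cross-check of this bandwidth, the sketch below multiplies the mean event size by the event-building rate, using the Run 3 values from the backup parameter table (about 100 kB and 40 MHz). The function name is illustrative. The raw payload comes out somewhat below 40 Tbit/s; the slides do not say how the difference is accounted for.

```python
def event_building_bandwidth_tbit_s(event_size_kb, rate_mhz):
    """Raw payload bandwidth in Tbit/s for a mean event size in kB
    and an event-building rate in MHz (no protocol overhead included)."""
    bytes_per_second = event_size_kb * 1e3 * rate_mhz * 1e6
    return bytes_per_second * 8 / 1e12


run3 = event_building_bandwidth_tbit_s(100, 40)   # Run 3 values from the backup table
run12 = event_building_bandwidth_tbit_s(60, 1)    # Run 1 & 2 values, for comparison

print(f"Run 3:     {run3:.1f} Tbit/s")
print(f"Run 1 & 2: {run12:.2f} Tbit/s")
```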



LHCb's upgrade trigger aims to perform an offline-like event reconstruction/selection

# LHCb DAQ

A critical part of the DAQ is the ability to buffer events onto hard disks located in the Event Filter Farm (EFF) nodes ("deferred triggering").

Serves two purposes : multiply the available processing time, and allow real-time detector calibration/alignment.

Deployed in Run1 gaining 20% in HLT processing time, will be used more aggressively in Run2.
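A minimal sketch of why buffering multiplies the processing time, under a deliberately simple assumption that is not taken from the talk: the farm keeps processing buffered events for some hours outside beam time, and neither disk space nor disk bandwidth is limiting. The function and parameter names are hypothetical.

```python
def processing_time_gain(beam_hours, deferred_hours):
    """Factor by which the CPU time available per event grows when
    buffered events are also processed for `deferred_hours` outside
    beam time (assumes disk capacity and bandwidth are not limiting)."""
    if beam_hours <= 0:
        raise ValueError("beam_hours must be positive")
    if deferred_hours < 0:
        raise ValueError("deferred_hours must not be negative")
    return (beam_hours + deferred_hours) / beam_hours


# Using 2 extra hours of processing for every 10 hours of beam
# corresponds to a 20% gain in this simple model.
print(f"{processing_time_gain(10, 2):.2f}")
```

In this picture the 20% gain quoted for Run 1 would correspond to using inter-fill time equal to one fifth of the beam time; the real gain depends on the fill pattern and buffer size.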





# **CMS/ATLAS DAQ**

Hardware trigger aside, the CMS architecture is not far from what LHCb is planning. Note, however, that the L1 tracking trigger will provide seeds for the HLT reconstruction, which should significantly reduce the computing burden.

ATLAS plans for a slightly smaller HLT input rate because of its two-stage hardware trigger design.





[Cartoon: "I need help making unrealistic assumptions to support a business case for a bad idea."]



# Common assumptions and technologies

## Microprocessor Transistor Counts 1971-2011 & Moore's Law



# Actually a bit more complicated

|      | Architectural change  | Fabrication process | Microarchitecture | Codenames                  | Release date      | Processors: 8P/4P Server | Processors: 4P/2P Server/WS |
|------|-----------------------|---------------------|-------------------|----------------------------|-------------------|--------------------------|-----------------------------|
| Tick | Die shrink            | 65 nm               | P6, NetBurst      | Presler, Cedar Mill, Yonah | January 5, 2006   |                          |                             |
| Tock | New microarchitecture | 65 nm               | Core              | Merom                      | July 27, 2006     | Tigerton                 | Woodcrest, Clovertown       |
| Tick | Die shrink            | 45 nm               | Core              | Penryn                     | November 11, 2007 | Dunnington               | Harpertown                  |
| Tock | New microarchitecture | 45 nm               | Nehalem           | Nehalem                    | November 17, 2008 | Beckton                  | Gainestown                  |
| Tick | Die shrink            | 32 nm               | Nehalem           | Westmere                   | January 4, 2010   | Westmere-EX              | Westmere-EP                 |
| Tock | New microarchitecture | 32 nm               | Sandy Bridge      | Sandy Bridge               | January 9, 2011   | (Skipped)                | Sandy Bridge-EP             |
| Tick | Die shrink            | 22 nm               | Sandy Bridge      | Ivy Bridge                 | April 29, 2012    | Ivy Bridge-EX            | Ivy Bridge-EP               |
| Tock | New microarchitecture | 22 nm               | Haswell           | Haswell                    | June 2, 2013      |                          |                             |
| Tick | Die shrink            | 14 nm               | Haswell           | Broadwell                  | 2014              |                          |                             |

We are here : Haswell (2013) / Broadwell (2014).

## Stolen from Beat Jost



# Future microprocessor evolution?

|      | Architectural change  | Fabrication process | Microarchitecture | Codenames  | Release date | Processors: 8P/4P Server | Processors: 4P/2P Server/WS |
|------|-----------------------|---------------------|-------------------|------------|--------------|--------------------------|-----------------------------|
| Tick | Die shrink            | 14 nm               | Haswell           | Broadwell  | 2014         |                          |                             |
| Tock | New microarchitecture | 14 nm               | Skylake           | Skylake    | 2015         |                          |                             |
| Tick | Die shrink            | 10 nm               | Skylake           | Cannonlake | 2016         |                          |                             |
| Tock | New microarchitecture | 10 nm               |                   |            | 2017         |                          |                             |
| Tick | Die shrink            | 7 nm                |                   |            | 2018         |                          |                             |
| Tock | New microarchitecture | 7 nm                |                   |            | 2019         |                          |                             |
| Tick | Die shrink            | 5 nm                |                   |            | 2020         |                          |                             |
| Tock | New microarchitecture | 5 nm                |                   |            | 2021         |                          |                             |

Take-home message : expect tick-tock and die shrinking to continue for the coming years.

# Extrapolating to the future

Clearly a 25% performance improvement per year is not the same as doubling the performance every 2 years : at 25%/year the doubling time is more like 3 years ($\ln 2 / \ln 1.25 \approx 3.1$).




However also important to notice that this is a power law, so small changes in the assumed %/year lead to big differences on a 10-20 year timescale.
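The sensitivity to the assumed yearly gain can be made concrete with a few lines of arithmetic. The sketch below assumes nothing beyond compound growth at a constant yearly rate; the function names are illustrative.

```python
import math


def doubling_time_years(yearly_gain):
    """Years needed to double performance at a constant yearly gain (0.25 = 25%)."""
    return math.log(2) / math.log(1 + yearly_gain)


def growth_factor(yearly_gain, years):
    """Total performance gain after `years` of compound growth."""
    return (1 + yearly_gain) ** years


for gain in (0.20, 0.25, 0.30, 0.35):
    print(
        f"{gain:.0%}/year: doubles every {doubling_time_years(gain):.1f} years, "
        f"x{growth_factor(gain, 10):.1f} after 10 years, "
        f"x{growth_factor(gain, 20):.0f} after 20 years"
    )
```

At 25%/year performance doubles roughly every 3.1 years and grows by about a factor 9 in ten years; at 35%/year the doubling time is about 2.3 years and the ten-year factor is about 20, so a 10-point difference in the assumption more than doubles the projected gain.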



[Figure: CPU performance growth, shown as relative growth with respect to the 2010 HLT reference node.]


CMS and LHCb are somewhat more optimistic than CERN computing, backed up by observed performance improvements. But nobody is betting the farm on $\pm 5\%$.

## Critical point : must fully exploit the new many core architectures!



- look at the power of the HLT nodes
  - bought in 2008, 2011, 2012
  - and foreseen for 2015
- extrapolating to 2023 we could estimate an increase by a factor $\times 10$
- this still leaves a factor $\times 2$ ($\times 4$)

|                                        | ALICE    | LHCb     | ATLAS    | CMS      |
|----------------------------------------|----------|----------|----------|----------|
| Assumed online<br>performance<br>gains | 25%/year | 35%/year | 25%/year | 35%/year |

## CMS observed performance improvements



# Software event reconstruction



# What remains after Moore's law

Will need to make significant gains in computing performance on top of Moore's law projections, typically another factor 2-5.

This comes down to exploiting the many-core architectures more intelligently.

A personal comment : we often discuss absolute performance in terms of algorithm speed, but for software triggers latency is basically irrelevant. We should focus on physics/CHF.



# ALICE's GPU tracking



ALICE are fully committed to a GPU reconstruction for the TPC in particular. Already commissioned in Run I! Achieves a threefold increase in performance compared to CPU.


# LHCb's 30 MHz reconstruction

**Offline Tracking** 



LHCb's vertex detector outside the dipole magnet makes it a slightly special case

Reconstruction timing is basically linear with instantaneous luminosity/pileup. Because we want to catch low-momentum tracks crossing the full detector volume, it is not trivial to parallelize the track finding, although a lot of work is ongoing on GPU coprocessors.

# ATLAS/CMS reconstructions

Enormously challenging environment, and both experiments are significantly upgrading the tracking hardware to cope (not topic of this talk)





ATLAS/CMS software trigger tracking will be seeded by the L1 track trigger candidates

# ATLAS/CMS reconstructions

Already a lot of work for Run2, vectorizing code is a hot topic (also on LHCb/ALICE). Also lots of work on optimal tracking algos for pileup.

ATLAS reports x3 gain for CPU, CMS x2. Will need more gains like that going towards HL-LHC!








# **ATLAS/CMS reconstructions**

Also more aggressive ideas being studied, e.g. different tracking inside/outside the signal ROI.

Already used in RunI for brems/muon efficiency recovery. Expect to expand on these strategies.










# Software trigger menus and real-time analysis

# Big data, big opportunities

Input data rate of the LHCb upgrade post LS2 = 5 TB/second



## This means ~20000 PB of data every year
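The yearly volume follows from the input rate once a data-taking time is assumed. The slide does not state that time; dividing the two quoted numbers implies roughly $4 \times 10^6$ seconds of data taking per year, which is an inference from the slide's own figures rather than a number given in the talk. The function name is illustrative.

```python
def implied_live_seconds(volume_pb_per_year, rate_tb_per_s):
    """Seconds of data taking per year implied by a yearly volume (PB)
    and a constant input rate (TB/s)."""
    return volume_pb_per_year * 1e15 / (rate_tb_per_s * 1e12)


seconds = implied_live_seconds(20000, 5)
print(f"{seconds:.1e} s per year, i.e. about {seconds / 86400:.0f} days of continuous running")
```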


# A pinch of salt is needed but...





[Figure labels: "today" and "Triggers in the future"]

While I am going to mention menus, there are enormous "parasitic" opportunities for physics beyond the core programmes at the HL-LHC, and we should expect these to evolve and compete for output bandwidth with the "core" physics for both ATLAS/CMS and LHCb as we approach the HL-LHC era. Remember : ALICE keeps all interactions, hence no HLT "menu" as such.

# LHCb HLT menus

Because of the offline-like reconstruction, can in principle select any Beauty/Charm decay to charged tracks (and some with neutrals) at HLT level.

Several output rate scenarios are being considered; the main driver is what we want to do with charm physics. An output rate of 2-10 GB/s is foreseen (20-100 kHz at a mean event size of about 100 kB).





## Exclusive selections



- Main trigger for B decays is based on a Boosted Decision Tree
- Inclusive trigger for 2, 3, 4-body detached vertices
- Preselect tracks based on distance to PV, scalar and vector sum of  $p_T$
- BDT inputs:  $p_T$ ,  $IP_{\chi^2}$ , flight distance  $\chi^2$ , mass and corrected mass:

$$m_{corr} = \sqrt{m^2 + \left| p_{T_{miss}} \right|^2} + \left| p_{T_{miss}} \right|$$
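A minimal sketch of the corrected-mass formula above. The slide does not define $p_{T_{miss}}$; the code assumes the usual convention that it is the momentum missing transverse to the candidate's flight direction, which by momentum balance has the same magnitude as the transverse component of the visible momentum. The function name and argument layout are illustrative, not LHCb software.

```python
import math


def corrected_mass(mass, momentum, flight_direction):
    """Corrected mass m_corr = sqrt(m^2 + |pT_miss|^2) + |pT_miss|.

    mass             -- invariant mass of the visible decay products
    momentum         -- (px, py, pz) of the visible decay products
    flight_direction -- vector from the primary to the decay vertex
    Assumption: |pT_miss| equals the visible momentum component
    perpendicular to the flight direction.
    """
    norm = math.sqrt(sum(c * c for c in flight_direction))
    if norm == 0:
        raise ValueError("flight_direction must be non-zero")
    unit = [c / norm for c in flight_direction]
    p_long = sum(p * u for p, u in zip(momentum, unit))
    p_perp = [p - p_long * u for p, u in zip(momentum, unit)]
    pt_miss = math.sqrt(sum(c * c for c in p_perp))
    return math.sqrt(mass ** 2 + pt_miss ** 2) + pt_miss


# Fully reconstructed decay: momentum along the flight direction, no correction.
print(corrected_mass(5.0, (0.0, 0.0, 10.0), (0.0, 0.0, 2.0)))
# Partially reconstructed decay: visible momentum is tilted, mass is corrected upwards.
print(corrected_mass(4.0, (3.0, 0.0, 50.0), (0.0, 0.0, 1.0)))
```

When the visible momentum points along the flight direction the corrected mass reduces to the visible mass, so the variable only shifts candidates with missing particles.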

Tim Head (EPFL) 7 September 2014

Key challenges: combinatorics and output rate

- $B^0$, $D^0 \rightarrow h^+ h^-$
  - Timing: 0.13 ms
  - $B^{0} \rightarrow h^{+}h^{-}$ : ~ 1 kHz
  - $D^{0} \rightarrow K^{-}\pi^{+}$ : ~ 20 kHz
  - $D^{0} \rightarrow K^{+}\pi^{-}, \pi\pi$ : ~ 40 kHz
  - $D^0 \rightarrow KK$ : ~ 2 kHz
- $B_s \rightarrow \phi (\rightarrow KK) \phi (\rightarrow KK)$
  - Timing: 0.1 ms, Rate: ~ 12 Hz


# ATLAS/CMS menus

| CMS Category              | L1 Triggers                | L1 rate<br>(w/ overlaps) | Required reduction | HLT rate |
|---------------------------|----------------------------|--------------------------|--------------------|----------|
| Muons                     | Eloup                      | 21 kHz                   | ~ 21               | 1 kHz    |
| E/Gamma                   | e, ee, iso-e,<br>γ, γγ     | 102 kHz                  | ~ 102              | 1 kHz    |
| Taus                      | τ, ττ,<br>e+τ, μ+τ         | 75 kHz                   | ~ 75               | 1 kHz    |
| Hadronic                  | jets, e+MHT,<br>μ+MHT, HTT | 138 kHz                  | ~ 138              | 1 kHz    |
| Others                    | MET,<br>others             | 160 kHz                  | ~ 160              | 1 kHz    |
| Total rate (w/o overlaps) |                            | 500 kHz                  | 100                | 5 kHz    |

Somewhat different foreseen HLT rejection rates : 100:1 for CMS and 40:1 for ATLAS.
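These rejection factors follow directly from the input and output rates in the DAQ overview table. The sketch below just takes the ratios; the dictionary keys and the choice of which end of each quoted range to pair up are illustrative.

```python
# (input rate, output rate) in kHz, taken from the DAQ overview table.
SOFTWARE_TRIGGER_RATES_KHZ = {
    "ALICE Pb-Pb":        (50.0, 50.0),
    "LHCb (100 kHz out)": (30_000.0, 100.0),
    "LHCb (20 kHz out)":  (30_000.0, 20.0),
    "CMS PU 140":         (500.0, 5.0),
    "CMS PU 200":         (750.0, 7.5),
    "ATLAS (10 kHz out)": (400.0, 10.0),
    "ATLAS (5 kHz out)":  (400.0, 5.0),
}


def rejection_factor(input_khz, output_khz):
    """Ratio of software-trigger input rate to output rate."""
    return input_khz / output_khz


REJECTION = {
    name: rejection_factor(*rates)
    for name, rates in SOFTWARE_TRIGGER_RATES_KHZ.items()
}

for name, factor in REJECTION.items():
    print(f"{name:20s} {factor:7.0f}:1")
```

CMS comes out at 100:1 for both pileup scenarios, ATLAS at 40:1 for the 10 kHz end of its output range (80:1 at 5 kHz), and ALICE at 1:1 since it keeps every interaction.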

Menus very sketchy at present, which is understandable because really the reconstruction questions are more pressing.

# **Real time detector calibration**



Both LHCb and ALICE plan a real-time detector alignment and calibration. In the LHCb case this is absolutely critical because it enables hadronic particle identification to be used in the trigger. Not clear whether CMS/ATLAS need or want to go down this road.



# Real time multivariate analyses







Well known that multivariate analyses perform better than so-called "cut-based" approaches. They are now making their way into HLT algorithms, e.g. LHCb's inclusive b-physics trigger in Run I. Real-time data analysis is an area where the private sector invests a lot; expect significant improvements as a result of collaborations over the coming years.

## MDDAG, Benbouzid, Kegl et al.

# Ceterum censeo...



The basic approach of all four collaborations can be summarized as follows : put as much as DAQ will allow into software triggers.

Nevertheless "physics" and hardware constraints are leading to implementation differences.

Will be critical to fully exploit multi-core architectures and opportunities for parallelism in algorithms if software triggers are to reach their full potential!

Another big thank you to all the working group members whose slides/results I have stolen!

Backups

# ALICE's GPU tracking

## Why GPUs

- GPUs use their silicon for ALUs
- CPUs use their silicon mainly for caches, branch prediction, etc.



Naively, GPUs gain as long as the cores don't have to talk to each other.

# LHCb DAQ

|                                             | LHCb Run1 & 2          | LHCb Run 3         |
|---------------------------------------------|------------------------|--------------------|
| Max. inst. luminosity [cm^-2 s^-1]          | $4 \times 10^{32}$     | $2 \times 10^{33}$ |
| Event-size (mean, zero-suppressed) [kB]     | ~ 60 (L0 accepted)     | ~ 100              |
| Event-building rate [MHz]                   | 1                      | 40                 |
| # read-out boards                           | ~ 330                  | 400 - 500          |
| link speed from detector [Gbit/s]           | 1.6                    | 4.5                |
| output data-rate / read-out board [Gbit/s]  | 4                      | 100                |
| # detector-links / readout-board            | up to 24               | up to 48           |
| # farm-nodes                                | ~ 1000 (+ 500 in 2015) | 1000 - 4000        |
| # links 100 Gbit/s (from event-builder PCs) | n/a                    | 400 - 500          |
| final output rate to tape [kHz]             | 5                      | 20 - 100           |