# ICEPP QC and HPC Research Activity



3rd Nov, 2023

The 7th Asian Tier Center Forum



Thanks to: Y. Iiyama (computer cluster for QC simulation), S. Chen (QC hardware research)

# HPC

## **HPC in ATLAS experiments**



May 2021

- EuroHPC Vega (#166 in Top500) has been in production for ATLAS since May 2021.
- ATLAS uses a lot of HPC resources. The use of HPC is very promising.

## HPC top500

Top500 (June 2023)

|    | System                                                              | Cores<br>(k) | Rmax<br>(PFlops) | Rpeak<br>(PFlops) | Power<br>(MW) |
|----|---------------------------------------------------------------------|--------------|------------------|-------------------|---------------|
| 1  | Frontier, DOE/SC/Oak Ridge National Laboratory, United States       | 8,700        | 1,194            | 1,680             | 22.7          |
| 2  | Supercomputer Fugaku, RIKEN Center for Computational Science, Japan | 7,631        | 442              | 537               | 29.9          |
| 3  | LUMI, EuroHPC/CSC, Finland                                          | 2,220        | 309              | 429               | 6.02          |
| 4  | Leonardo, EuroHPC/CINECA, Italy                                     | 1,825        | 239              | 304               | 7.40          |
| 5  | Summit, DOE/SC/Oak Ridge National Laboratory, United States         | 2,415        | 149              | 201               | 10.1          |
| 6  | Sierra, DOE/NNSA/LLNL, United States                                | 1,572        | 95               | 126               | 7.44          |
| 7  | Sunway TaihuLight, National Supercomputing Center in Wuxi, China    | 10,650       | 93               | 125               | 15.4          |
| 8  | Perlmutter, DOE/SC/LBNL/NERSC, United States                        | 762          | 71               | 94                | 2.59          |
| 9  | Selene, NVIDIA Corporation, United States                           | 556          | 63               | 79                | 2.65          |
| 10 | Tianhe-2A, National Super Computer Center in Guangzhou, China       | 4,982        | 61               | 101               | 18.5          |

- "Fugaku" is in second place.
- The total number of CPU cores in the WLCG is ~1M cores. If HPC can be used, it will be a very promising computing resource.

## HPC top500 (in Japan)

|     | System                                                                                   |
|-----|------------------------------------------------------------------------------------------|
| 2   | Supercomputer Fugaku, RIKEN Center for Computational Science                             |
| 24  | ABCI 2.0, National Institute of Advanced Industrial Science and Technology (AIST)        |
| 25  | Wisteria/BDEC-01 (Odyssey), Information Technology Center, The University of Tokyo       |
| 41  | TOKI-SORA, Japan Aerospace eXploration Agency                                            |
| 50  | ???, Japan Meteorological Agency                                                         |
| 63  | Earth Simulator -SX-Aurora TSUBASA, Japan Agency for Marine-Earth Science and Technology |
| 80  | TSUBAME3.0, GSIC Center, Tokyo Institute of Technology                                   |
| 84  | Plasma Simulator, National Institute for Fusion Science (NIFS)                           |
| 97  | Flow, Information Technology Center, Nagoya University                                   |
|     | •••                                                                                      |
| 136 | Wisteria/BDEC-01 (Aquarius), Information Technology Center, The University of Tokyo      |
| 140 | Oakbridge-CX, Information Technology Center, The University of Tokyo                     |

- There are several high-performance HPC systems in Japan.
- The Information Technology Center (ITC) of the University of Tokyo manages some of them.
  - We have advanced our R&D on running grid jobs on the ITC HPCs.

### **History of HPC utilization in ICEPP**

- We started R&D on the ITC/UTokyo HPC in 2019 using the **Reedbush** system (2016-2020).
- In 2020, we moved to the next-generation system: Oakbridge-CX (2019-2023/09).
  - $\rightarrow$ We report a **summary** of the integration of HPCs into the Tier2 grid.
- The next-generation system is Wisteria/BDEC-01 (2021-).
  - $\rightarrow$ We report an **overview** of the system and the **difficulties** in using it.

|     | System                                                                                    | Cores (k) | Rmax<br>(PFlops) | Rpeak<br>(PFlops) | Power<br>(kW) | Year      |
|-----|-------------------------------------------------------------------------------------------|-----------|------------------|-------------------|---------------|-----------|
| 25  | Wisteria/BDEC-01 (Odyssey), Information<br>Technology Center, The University of<br>Tokyo  | 369       | 22.1             | 26.0              | 1,468         | 2021-     |
| 136 | Wisteria/BDEC-01 (Aquarius),<br>Information Technology Center, The<br>University of Tokyo | 42        | 4.4              | 5.8               | 184           | 2021-     |
| 140 | Oakbridge-CX, Information Technology<br>Center, The University of Tokyo                   | 77        | 4.3              | 6.6               | 845           | 2019-2023 |
| —   | TOKYO Tier2                                                                               | 11        | 1.2              | -                 | 120           | 2022-     |

## Oakbridge-CX

- Compute nodes (only CPU, no GPU)
  - 1,368 compute nodes, 6.61 PFlops
  - 56 cores / node, 1148 HS06 / node
- File system
  - Lustre, 12.4 PB
- Batch system
  - FUJITSU Software Technical Computing Suite (TCS)
- Network connectivity
  - ssh to login nodes, where we can submit jobs and read/write to shared FS.
  - No connections to computing nodes.
  - $\rightarrow$  Grid jobs cannot access storage element, external DB, etc.
- No root privilege → We cannot use CVMFS



## HPC (Oakbridge-CX)



- Singularity container image is used.
  - contains all necessary files
  - processes simulation jobs only
- Input/output files are transferred by ARC.

- All necessary files on cvmfs are predownloaded to the shared FS on HPC, which can be accessed by compute nodes.
- No negligible overhead.

Before using Singularity container, we used parrot\_run + cvmfs\_preload.

## Jobs accounting (History)

<u>Grafana</u>

### Wall clock time (successful jobs) (HS23 sec)



- 120,000 jobs processed
- 330 G HS06 seconds  $\rightarrow$  ~15 days of current Tokyo Tier2 full power
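The "~15 days" equivalence can be checked with a few lines of arithmetic. The sketch below only uses numbers quoted on these slides (330 G HS06 seconds, 15 days, 1148 HS06 per Oakbridge-CX node); the variable names are illustrative, and the implied Tier2 capacity is derived from those numbers, not an official figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60

delivered_hs06_seconds = 330e9   # integrated work of successful jobs (from the slide)
equivalent_days = 15             # quoted equivalent running time of Tokyo Tier2

# Capacity that delivers the same work when running flat out for 15 days.
implied_tier2_capacity_hs06 = delivered_hs06_seconds / (equivalent_days * SECONDS_PER_DAY)

# The same work expressed as Oakbridge-CX nodes busy for the same 15 days.
hs06_per_oakbridge_node = 1148   # from the Oakbridge-CX slide
equivalent_oakbridge_nodes = implied_tier2_capacity_hs06 / hs06_per_oakbridge_node

print(f"Implied Tier2 capacity: {implied_tier2_capacity_hs06:,.0f} HS06")
print(f"Equivalent Oakbridge-CX nodes for 15 days: {equivalent_oakbridge_nodes:.0f}")
```

Under these inputs the implied capacity is roughly 2.5e5 HS06, which corresponds to a little over 220 Oakbridge-CX nodes running continuously for the same period.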

## Job status on Oakbridge-CX

Figure: WallClock consumption of successful and failed jobs (time-stacked bar graph), Aug 2022 to July 2023, broken down by job status (finished, failed, cancelled, closed).

## **CPU/Wall efficiency on Oakbridge-CX**



Successful

### Next HPC systems of ITC/UTokyo: Wisteria/BDEC-01

- Consists of two systems
  - Simulation node cluster (25.9 PFlops) → ARM CPU
    - FUJITSU Processor A64FX used at Fugaku (the top HPC in Japan)
  - Data/training node cluster (7.2 PFlops) → GPU
    - Nvidia A100 x8
  - → Suitable for large-scale parallel computing and machine learning
- This HPC does not provide the HEP-standard processing unit, i.e. Intel x86\_64 CPUs.
  - A lot of R&D is needed to use these architectures with high efficiency.
  - ARM is already supported in ATLAS  $\rightarrow$  Benchmarking (see next page)
  - GPU production jobs are not yet supported in ATLAS.



## Performance of A64FX: basic benchmark



Code optimization is needed for Geant4 jobs on the A64FX.

## Performance of A64FX: HEPSCORE23



- The A64FX has a small amount of memory (32GiB) compared to the number of cores (48).
  - Multithreading code is required to use the memory efficiently.
- To maximise the performance of the A64FX, we may need to fully utilize SVE.
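The memory constraint above can be made concrete with a short calculation. This is only a sketch: the 32 GiB and 48 cores come from the slide, while the 2 GiB footprint per single-core process is an illustrative assumption, not a measured ATLAS number.

```python
memory_gib = 32   # A64FX memory per node (from the slide)
cores = 48        # A64FX cores per node (from the slide)

memory_per_core_gib = memory_gib / cores

# ASSUMPTION for illustration only: each independent single-core process
# needs about 2 GiB of memory.
assumed_process_memory_gib = 2.0

# Independent processes are limited by memory, not by the core count.
max_single_core_processes = int(memory_gib // assumed_process_memory_gib)
idle_cores = cores - max_single_core_processes

print(f"Memory per core: {memory_per_core_gib:.2f} GiB")
print(f"Single-core processes that fit: {max_single_core_processes}")
print(f"Cores left idle: {idle_cores}")
```

With that assumed footprint only a third of the cores could be kept busy by independent processes, whereas a multithreaded job shares one memory image across all 48 cores.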

# Quantum Computer





**System One** installed in Kawasaki Japan

Quantum computer test bed, in Quantum Hardware Test Center of UTokyo.



CERN

Slide from Alberto Di Meglio et al., CERN QTI 2020

HEP laboratories, e.g. Fermilab, TRIUMF, DESY and ICEPP, started research on QC around the beginning of the 2020s.

# Classical computer vs. QC

- 1 qubit = two floating-point numbers.
- The number of quantum states increases exponentially with the number of qubits.
  - Simulating a 27-qubit system on a classical computer requires 1 GB of memory.
  - For a **37-qubit system**, **O(1 TB)** is needed.
  - At around 49 qubits or more, simulation becomes impractical even on an HPC.
  - The number of quantum states possible with only 256 qubits exceeds the number of atoms in the solar system.
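The memory figures above follow from storing one complex amplitude per basis state, i.e. $2^n$ amplitudes for $n$ qubits. The sketch below assumes 8 bytes per amplitude (single-precision complex), which reproduces the 1 GB and 1 TB figures; with 16-byte (double-precision) amplitudes every number doubles. The function name is illustrative.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory needed to store all 2**n_qubits complex amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude

GIB = 2 ** 30
TIB = 2 ** 40

mem_27 = statevector_bytes(27)   # 2**30 bytes = 1 GiB
mem_37 = statevector_bytes(37)   # 2**40 bytes = 1 TiB
mem_49 = statevector_bytes(49)   # 2**52 bytes = 4096 TiB

# Double-precision amplitudes (16 bytes each) for a 50-qubit system.
mem_50_double = statevector_bytes(50, bytes_per_amplitude=16)   # 2**54 bytes

# Number of basis states for 256 qubits (an exact Python integer).
n_states_256 = 2 ** 256

for n in (27, 37, 49):
    print(f"{n} qubits: {statevector_bytes(n) / GIB:,.0f} GiB")
```

Each additional qubit doubles the requirement, so ten more qubits cost a factor of 1024 in memory.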



Figure: number of qubits (values shown: 433, ~1000, 4000).

## QC research at ICEPP, UTokyo



Machine Learning and Quantum Computing for High-Energy Physics

## **ICEPP computer cluster for QC and ML**

Cluster shared by QC and machine learning researchers

### Resource:

- Main investment in GPU
  - 1 DGX A100, 1 custom node with 10 A100s, 3 various GPU nodes
- Storage 320TB (mostly for ML workloads)
- 2TB & 1.5TB RAM on the two A100 machines

### QC usage:

- Qiskit and qulacs heavily used
- Qiskit and related libraries packaged into singularity containers and delivered to users over NFS
  - Spares installation troubles & improves research reproducibility
- GPU utilized extensively in pulse-level simulation of qudits using qutip and JAX

Y. Iiyama

## Example : Parton shower simulation



## Parton shower simulation

### Berkeley group

#### arXiv:1904.03196



## Parton shower simulation : Result (Berkeley group)



The effect of interference is observed as the difference between the blue and red histograms.

## Quantum Dynamics Simulation

$$i\frac{d}{dt} |\psi(t)\rangle = H |\psi(t)\rangle$$
  

$$\Rightarrow |\psi(t)\rangle = \lim_{\substack{N \to \infty \\ N\Delta t = t}} \prod_{k=1}^{N} e^{-iH(k\Delta t)\Delta t} |\psi(0)\rangle$$

- Initial state of a many-body system:  $|\psi(0)\rangle \rightarrow$  initial QC state
- Time evolution in unit time:  $e^{-iH(k\Delta t)\Delta t} \rightarrow$  gate operation

$|\psi(t)\rangle$  can be obtained approximately.

Note: The same calculation can be done classically.

- Initial state → State vector
- Time evolution  $\rightarrow$  Tensor calculation

But the size of the vector and tensor can be too large to calculate in practice:  $2^{50}$  complex (128-bit, i.e. 16-byte) numbers  $\rightarrow 2^{54}$  bytes for a 50-qubit system.
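For a system small enough to fit in memory, the classical version of the product formula above takes only a few lines. The sketch below is a toy illustration, not code from this work: it assumes a single qubit driven by the hypothetical Hamiltonian $H(t) = t\,X$, chosen because its exact solution $|\psi(t)\rangle = e^{-i(t^2/2)X}|\psi(0)\rangle$ is known, and all function names are illustrative.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)   # Pauli-X matrix

def step_unitary(h, dt):
    """Return exp(-i h dt) for a Hermitian matrix h via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(h)
    return (eigvecs * np.exp(-1j * eigvals * dt)) @ eigvecs.conj().T

def evolve(hamiltonian, psi0, t, n_steps):
    """Apply the product of exp(-i H(k dt) dt) for k = 1..n_steps to psi0."""
    dt = t / n_steps
    psi = np.asarray(psi0, dtype=complex)
    for k in range(1, n_steps + 1):
        psi = step_unitary(hamiltonian(k * dt), dt) @ psi
    return psi

def drive(t):
    """Toy time-dependent Hamiltonian H(t) = t * X (an assumption)."""
    return t * X

psi0 = np.array([1.0, 0.0])          # state vector of |0>
t_final = 1.0

# Exact result for this toy model: exp(-i theta X)|0> with theta = t^2 / 2.
theta = t_final ** 2 / 2
exact = np.array([np.cos(theta), -1j * np.sin(theta)])

coarse = evolve(drive, psi0, t_final, n_steps=10)
fine = evolve(drive, psi0, t_final, n_steps=1000)

err_coarse = np.linalg.norm(coarse - exact)
err_fine = np.linalg.norm(fine - exact)
print(f"error with 10 steps:   {err_coarse:.2e}")
print(f"error with 1000 steps: {err_fine:.2e}")
```

The error shrinks as the number of steps grows, as the limit in the formula suggests. On a quantum computer each `step_unitary` would be realised as gate operations instead of a matrix product.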



## Quantum neural network









## **SUSY Classification**



### Compared with BDT and DNN :

- BDT and DNN models are optimized for each training set to avoid over-training.
- Classical algorithms outperform the quantum algorithm with large training sets.

The performance of the quantum algorithm is comparable to BDT/DNN for small training sets with a small number of variables.

## Quantum circuit optimization : AQCEL

*W. Jang et al. Quantum 6, 798 (2022). Github: UTokyo-ICEPP/aqcel* 

In physics simulation, many events are generated using a single program with a fixed initial state.

Circuits can be made shorter, i.e. the number of gate operations can be reduced, by optimising them for the given initial state.

qc A for any initial states



- General qc optimizers (qiskit, tket, etc.): <u>preserve circuit equivalence</u>.
- <u>AQCEL (Advancing Quantum Circuit by ICEPP and LBNL)</u>: optimizes the qc depending on the initial state. Circuit equivalence is not always preserved, so strong reduction can be applied.

### AQCEL result for parton shower simulation

Number of gates (percentages are relative to the original circuit):

| Gate type  | Original | tket       | AQCEL (CC) | AQCEL (QC, 25%) |
|------------|----------|------------|------------|-----------------|
| CNOT       | 527      | 616 (117%) | 178 (34%)  | 64 (12%)        |
| U1, U2, U3 | 362      | 331 (91%)  | 102 (28%)  | 24 (6.7%)       |
| Total      | 889      | 947 (107%) | 280 (31%)  | 88 (9.9%)       |



F<sub>sim</sub> (fidelity in the absence of QC noise) is not decreased by the approximation in AQCEL.

F<sub>meas</sub> improves substantially, from ~0.4 to 0.9, because the AQCEL-optimised circuit has fewer operations.

# Qutrit : Quantum trit

Energy level of transmon



By programming a customised pulse, we can use the  $|1\rangle \rightarrow |2\rangle$  and  $|2\rangle \rightarrow |1\rangle$  transitions in IBM Quantum devices.



Multi controlled bit (Toffoli) gate using Qutrit



# Qutrit : gate fidelity

The gate fidelity of the qutrit Toffoli gate is measured on ibmq\_kolkata.



- Gate time
  - Qutrit Toffoli: 2.5 µs
  - Qubit Toffoli: 3.1 µs
- Fidelity after calibration
  - 0.928±0.007 (1 hour later)
  - 0.896±0.036 (1 day later)

Qutrit Toffoli fidelity is 5-7% higher than qubit Toffoli fidelity.

# Ising Machines : Annealer

|                              | Vendor           | Product                | Number of bits                                |
|------------------------------|------------------|------------------------|-----------------------------------------------|
| Quantum                      | D-Wave Systems   | D-Wave Advantage       | 5760                                          |
| Quantum-inspired (classical) | Hitachi          | CMOS Annealing         | 147k (ASIC), 256k (GPU)                       |
|                              | Fujitsu          | Digital Annealer       | 8192                                          |
|                              | Toshiba          | SQBM+                  | 10M                                           |
|                              | Fixstars Amplify | Fixstars Amplify AE    | 131k (full connect), < 4.3B (partial connect) |
|                              | NTT              | Coherent Ising machine | 100k                                          |


#### Developed by Berkeley group

# Annealing tracking with ATLAS Data

Waseda university group

- minBias trigger
- Relative efficiency to offline tracking
- ~90% efficiency for pT > 1 GeV
- Annealing time is similar for real data and MC simulation



Average pre-processing time for data is ~0.6 sec. (single core, 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz)

# Hardware development

S. Chen

Qubit as a sensor : Direct detection of light dark matter (axion, dark photon)

Full-stack development capability established in the first year



#### **Packaging & Measurement**





#### R&D of new exotic quantum devices



# Conclusion

- In addition to traditional computing services, ICEPP is carrying out research on new technologies
  - HPC, Cloud (not in this talk), ML (not in this talk), QC.
- HPCs in Japan may increase the resources available for HEP, but more research is needed to use their full computing power.
- QC may outperform classical computers in the future. We are trying to use QC for different types of problems and seeing how well it works.



## Quantum circuit optimization : AQCEL

*W. Jang et al. Quantum* 6, 798 (2022). Github: UTokyo-ICEPP/aqcel

In physics simulation, many events are generated using a single program with a fixed initial state.

Circuits can be made shorter, i.e. the number of gate operations can be reduced, by optimising them for the given initial state.

Example:



**CX deleted** 

**Bit-control deleted** 

**Bit-control deleted** 

|                    | 1st CX                 | 2nd CX                 | CCX                                                                         |
|--------------------|------------------------|------------------------|-----------------------------------------------------------------------------|
| Quantum state      | $\lvert 000\rangle$    | $\lvert 010\rangle$    | $\frac{1}{\sqrt{2}}\lvert 011\rangle+\frac{1}{\sqrt{2}}\lvert 111\rangle$ |
| Control bit states | '0'                    | '1'                    | '01', '11'                                                                  |
| Deletion           | CX                     | Bit-control            | Bit-control                                                                 |
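The decision rule illustrated by the table can be sketched as a small function. This is a toy reading of the example, not the API of the aqcel package: the function name, the return format and the assumption that the set of occurring control bit strings is already known are all hypothetical.

```python
def simplify_controlled_gate(control_states):
    """Decide how a controlled gate can be simplified.

    control_states lists every bit string (one character per control
    qubit) that the control register can take just before the gate.
    Returns (action, kept_controls), where kept_controls holds the
    indices of the control qubits that must remain.
    """
    states = set(control_states)
    n_controls = len(next(iter(states)))

    # The gate only acts when all controls are 1; if that never happens,
    # the whole gate can be removed.
    if "1" * n_controls not in states:
        return "delete gate", []

    # A control that is 1 in every occurring state is redundant.
    kept = [i for i in range(n_controls)
            if any(state[i] == "0" for state in states)]
    if len(kept) == n_controls:
        return "keep gate", kept
    return "delete bit-control", kept

first_cx = simplify_controlled_gate(["0"])       # control is always 0
second_cx = simplify_controlled_gate(["1"])      # control is always 1
ccx = simplify_controlled_gate(["01", "11"])     # second control is always 1

print(first_cx, second_cx, ccx)
```

For the three gates in the table this gives "delete gate" for the 1st CX and "delete bit-control" for the 2nd CX and the CCX, matching the Deletion row. For the CCX only the control at index 0 is kept, so it reduces to a singly controlled gate.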