The 7<sup>th</sup> Asian Tier Center Forum, 2023

# Driving massively scalable simulations of quantum circuits in supercomputers

Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)

Principal Researcher, Lead of Quantum Information R&D Group, Korea Institute of Science and Technology Information



# **Existing Platforms of Physical Qubits**

#### **Current status of "circuit-based" quantum computers**



#### Up-to-date Status of "Universal" Quantum Computers & Technical Roadmap



- Superconductors & trapped ions lead the industry
- Cloud-based service is available for processors with more than 400 physical qubits
- Near-future processors will have over 1,000 physical qubits

# In Terms of Utilization

#### **Quantum volume & Algorithmic qubits**

#### # of Physical Qubits ≠ the Real Capability

- Quantum volume & algorithmic qubits: $\log_2 V_Q = \underset{n \leq N}{\arg\max}\,\{\min [n, d(n)]\}$
  - → Indicators of the largest complexity of algorithm circuits that a QPU device can run

| Date     | $V_Q$                  | Notes                                         |
|----------|------------------------|-----------------------------------------------|
| 2020 Aug | $64 = 2^6$ (6 qubits)  | Falcon R4 "Montreal" (27 physical qubits) [1] |
| 2020 Dec | $128 = 2^7$ (7 qubits) | Falcon R4 "Montreal" (27 physical qubits) [2] |
| 2022 Apr | $256 = 2^8$ (8 qubits) | Falcon R10 "Prague" (32 physical qubits) [3]  |
| 2022 May | $512 = 2^9$ (9 qubits) | Falcon R10 "Prague" (32 physical qubits) [4]  |

- [1] https://www.zdnet.com/article/ibm-hits-new-quantum-computing-milestone/
- [2] https://twitter.com/jaygambetta/status/1334526177642491904
- [3] https://research.ibm.com/blog/quantum-volume-256
- [4] https://twitter.com/jaygambetta/status/1529489786242744320
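As a rough illustration of the quantum-volume indicator, the sketch below evaluates the largest value of min[n, d(n)] over the available qubit counts. The function name, the dictionary input format, and the depth values are all hypothetical; they are not measured data for any device in the table.

```python
def log2_quantum_volume(achievable_depth):
    """Return the largest min(n, d(n)) over all tested qubit counts n.

    `achievable_depth` maps a qubit count n to d(n), the deepest circuit
    the device is assumed to run reliably on n qubits (hypothetical input).
    """
    return max(min(n, d) for n, d in achievable_depth.items())


# Hypothetical depths d(n); chosen only to illustrate the formula.
depths = {2: 40, 3: 25, 4: 16, 5: 10, 6: 7, 7: 5, 8: 3}
log2_vq = log2_quantum_volume(depths)
quantum_volume = 2 ** log2_vq
print(log2_vq, quantum_volume)
```

With these made-up depths the limiting case is n = 6, where the device can still run circuits at least as deep as they are wide, so adding physical qubits beyond that point does not raise the indicator.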

### Large-scale Logic, e.g., Circuits in the NISQ Regime?

- Simulating such quantum circuits requires a very large computing environment → a.k.a. a supercomputer





# **Classical Treatment of Quantum Logic Operations**

**Classical representation of gate-based quantum circuits** 





- All complex-valued
- Huge memory consumption for representation of the unitary
  - → N = 20 → 16 TB (2<sup>20</sup> × 2<sup>20</sup> elements × 16 Bytes)
  - → Reduction can be done for specific cases (e.g., when the indices of nonzeros are known in advance)
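The memory figure and the remark about known nonzero positions can be checked with a small sketch. It assumes 16-byte (double-precision complex) elements and binary units; `dense_unitary_bytes` and `full_operator` are illustrative helper names, and the explicit Kronecker product is only feasible for small N.

```python
import numpy as np

BYTES_PER_ELEMENT = 16  # one double-precision complex number


def dense_unitary_bytes(num_qubits):
    """Memory needed for a dense 2^N x 2^N complex matrix."""
    dim = 2 ** num_qubits
    return dim * dim * BYTES_PER_ELEMENT


def full_operator(gate, target, num_qubits):
    """Build the N-qubit operator that applies `gate` to one qubit.

    The factor at position `target` in the Kronecker product is the gate;
    all other factors are 2x2 identities.
    """
    op = np.array([[1.0 + 0j]])
    for q in range(num_qubits):
        op = np.kron(op, gate if q == target else np.eye(2))
    return op


print(dense_unitary_bytes(20) / 2 ** 40)  # size in TiB: 16.0

hadamard = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
op = full_operator(hadamard, target=1, num_qubits=4)
nonzeros_per_row = np.count_nonzero(op, axis=1)
print(op.shape, nonzeros_per_row)
```

For a single-qubit gate every row of the full operator has only two nonzeros at positions fixed by the target qubit, which is why the dense matrix never has to be stored.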

# **Classical Treatment of Quantum Logic Operations**

#### **Classical representation of gate-based quantum circuits**



### **Circuit-based Quantum Computing**

- Unitary: only those of the universal gates need to be stored
- State vectors (in principle) must be fully stored
- Care must be taken when performing the matrix-vector multiplication
  - → The size of a gate unitary is not equal to that of the state vector
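One common way to multiply a small gate unitary into a much larger state vector is to view the vector as a tensor with one axis per qubit. The sketch below is a generic NumPy illustration of that idea, not the method of the package presented in this talk; the function name and the convention that qubit 0 is the least significant bit are assumptions.

```python
import numpy as np


def apply_single_qubit_gate(state, gate, target, num_qubits):
    """Apply a 2x2 `gate` to qubit `target` of a 2^N state vector.

    Assumed convention: qubit 0 is the least significant bit of the index.
    """
    # One tensor axis per qubit; axis 0 is the most significant bit.
    psi = state.reshape([2] * num_qubits)
    axis = num_qubits - 1 - target
    # Contract the gate's column index with the target qubit's axis.
    psi = np.tensordot(gate, psi, axes=([1], [axis]))
    # tensordot places the gate's row index first; move it back into place.
    psi = np.moveaxis(psi, 0, axis)
    return psi.reshape(-1)


hadamard = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = np.zeros(8, dtype=complex)
state[0] = 1.0                       # |000>
new_state = apply_single_qubit_gate(state, hadamard, target=0, num_qubits=3)
print(new_state)
```

Only the 2×2 gate and the 2<sup>N</sup> state are ever held in memory; the 2<sup>N</sup> × 2<sup>N</sup> operator is never formed.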






# **Classical Treatment of Quantum Logic Operations**

**Classical representation of gate-based quantum circuits** 



#### Matrix-vector Multiplier: Circuit-based Quantum Computing

- A simple example: applying the X (Pauli-X) gate



- Mapping of state indices whose elements need to be updated according to the logic operation
  - → Details depend on the type (category) of universal gate
- Size of circuits to be simulated
  - → <u>Memory consumption</u> required by a quantum state

#### X gating against an N-qubit state
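The index mapping for the X gate can be sketched as follows: every amplitude is exchanged with the one whose index differs only in the target bit. The function name and the convention that qubit 0 is the least significant bit are assumptions made for this example.

```python
import numpy as np


def apply_x_inplace(state, target):
    """Apply Pauli-X to qubit `target` by swapping paired amplitudes.

    Assumed convention: qubit 0 is the least significant bit of the index.
    """
    stride = 1 << target
    for i in range(state.size):
        if (i & stride) == 0:        # visit each pair once, from its 0-bit member
            j = i | stride           # partner index: same bits, target bit set
            state[i], state[j] = state[j], state[i]
    return state


# Amplitude k is stored at index k purely to make the permutation visible;
# this is not a normalized quantum state.
state = np.arange(8, dtype=complex)
apply_x_inplace(state, target=1)
print(state.real)
```

No matrix is involved at all: the gate reduces to a permutation of the state vector, and only the pairing rule changes with the target index.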





# **Large-scale Circuit Simulations**

#### **Objective of this talk**



#### Memory Consumption of State Representation: **BIG DEAL!**

- A single 30-qubit state: 2<sup>30</sup> elements (amplitudes) × 16 Bytes = 16 GB
  - → A 40-qubit state: 16 TB; a 50-qubit state: 16 PB
  - → Total memory (8,305 nodes) of the National Supercomputer of Korea: ~778.6 TB (~0.76 PB)
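The figures above can be reproduced with a few lines of arithmetic. The sketch assumes 16 bytes per amplitude, binary units (1 GB = 2<sup>30</sup> bytes, which is what makes 8,305 × 96 GB come out as ~778.6 TB), and that all node memory is available for the state, which a real run would not have.

```python
BYTES_PER_AMPLITUDE = 16                 # double-precision complex number
GIB, TIB, PIB = 2 ** 30, 2 ** 40, 2 ** 50


def state_vector_bytes(num_qubits):
    """Memory needed to store all 2^N amplitudes of an N-qubit state."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


total_memory = 8305 * 96 * GIB           # 8,305 nodes with 96 GB each

for n in (30, 40, 44, 50):
    size = state_vector_bytes(n)
    verdict = "fits" if size <= total_memory else "does not fit"
    print(f"{n} qubits: {size / TIB:.3f} TiB, {verdict} in total memory")
```

Each additional qubit doubles the footprint, so the jump from 40 to 50 qubits is a factor of 1,024 in memory.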

#### Large-scale Circuit Simulations in Classical Computers?

- A distributed computing system: physically separated nodes connected by a network
  - → Can use the whole memory through communication (distributed computing)
- Can use storage & partially load the state vector as needed

#### What we cover in this talk...

- A software package for classical simulations using distributed computing
- A brief overview of the cloud-based service framework currently under development: the gateway for public service of the code package

# **Workload Parallelization**

#### **Distributed computing with Message Passing Interface (MPI)**

#### **Decomposition of State Vectors**

- Decomposed blocks are stored in different memory locations → local to each MPI process
- (e.g.) Say that the 2<sup>N</sup> amplitudes of an **N**-qubit state are distributed over 2<sup>M</sup> MPI processes
  - → Each MPI process holds a local vector of 2<sup>L</sup> = 2<sup>(N-M)</sup> amplitudes, where L is the number of qubits in a local state
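A minimal sketch of this decomposition, assuming the M high-order bits of a global amplitude index select the MPI process and the L low-order bits select the position inside its local vector. The function name and the bit layout are assumptions for illustration; the slides do not spell out the actual layout.

```python
def split_global_index(global_index, num_local_qubits):
    """Map a global amplitude index to (MPI rank, local index).

    Assumed layout: high-order bits give the rank, low-order bits the
    offset inside that rank's local vector.
    """
    rank = global_index >> num_local_qubits
    local_index = global_index & ((1 << num_local_qubits) - 1)
    return rank, local_index


N, M = 4, 1                  # a 4-qubit state over 2^1 = 2 processes
L = N - M                    # each process holds 2^3 = 8 amplitudes
layout = [split_global_index(g, L) for g in range(2 ** N)]
print(layout)
```

Under this layout, amplitudes 0 to 7 live on rank 0 and amplitudes 8 to 15 on rank 1.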

#### **Index-dependent Parallel Operations of Universal Gates**

- SU(4): U(**a**,**b**), where **a** & **b** are the indices of the qubits on which the gate operates
  - → If both **a** & **b** <= **L**, no communication is needed among MPI processes: Embarrassingly Parallel (EP)
  - → Otherwise MPI communication must happen: an operation on one local vector may update local vectors allocated in other MPI processes, (e.g.) SWAP(2,4)
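The locality rule can be illustrated by listing which amplitude pairs a SWAP moves across process boundaries. The sketch assumes 1-based qubit indices with qubit 1 as the least significant bit, and that qubits above L are encoded in the MPI rank; both helper names are hypothetical.

```python
def needs_communication(qubits, num_local_qubits):
    """True if any of the 1-based qubit indices lies outside the local range."""
    return any(q > num_local_qubits for q in qubits)


def swap_partner(index, a, b):
    """Index whose amplitude is exchanged with `index` under SWAP(a, b)."""
    bit_a = (index >> (a - 1)) & 1
    bit_b = (index >> (b - 1)) & 1
    if bit_a == bit_b:
        return index                           # |..0..0..> and |..1..1..> stay put
    return index ^ ((1 << (a - 1)) | (1 << (b - 1)))


def crossing_pairs(a, b, num_qubits, num_local_qubits):
    """(index, partner) pairs that sit on different ranks."""
    pairs = []
    for g in range(2 ** num_qubits):
        p = swap_partner(g, a, b)
        if (g >> num_local_qubits) != (p >> num_local_qubits):
            pairs.append((g, p))
    return pairs


N, L = 4, 3
print(needs_communication((1, 2), L), crossing_pairs(1, 2, N, L))
print(needs_communication((2, 4), L), crossing_pairs(2, 4, N, L))
```

With N = 4 and L = 3, SWAP(1,2) touches only local amplitudes, while SWAP(2,4) moves half of the amplitudes between the two ranks.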



(Figure: example of a decomposed state vector with **M** = 1 and L = 3.)

# **Workload Parallelization**

#### **Distributed computing with Message Passing Interface (MPI)**



#### Index-dependent Parallel Operations of Gates (Cont.)



#### **Operations supporting parallel computing (so far)**

| Operation        | Description                                                               |
|------------------|---------------------------------------------------------------------------|
| …State           | Prepares a single computational basis state.                              |
|                  | The controlled-NOT operator                                               |
|                  | The controlled-Rot operator                                               |
|                  | The controlled-RX operator                                                |
|                  | The controlled-RY operator                                                |
|                  | The controlled-RZ operator                                                |
| …mard            | The Hadamard operator                                                     |
|                  | The Pauli X operator                                                      |
|                  | The Pauli Y operator                                                      |
|                  | The Pauli Z operator                                                      |
| …Shift           | Arbitrary single qubit local phase shift                                  |
| …olledPhaseShift | The controlled phase shift.                                               |
| …StateVector     | Prepare subsystems using the given ket vector in the computational basis. |
|                  | Arbitrary single qubit rotation                                           |
|                  | The single qubit X rotation                                               |
|                  | The single qubit Y rotation                                               |
|                  | The single qubit Z rotation                                               |
|                  | The single-qubit phase gate                                               |
|                  | The single-qubit T gate                                                   |

Hoon Ryu / Driving massively scalable simulations of quantum circuits in HPC

# **Workload Parallelization**

#### **Communication between MPI processes: State vectors**





# **Scalability: Element-gate Operation**

#### **Index-dependent performance**






#### Nat'l Supercomputer of ROK (The NURION System)

#### **Node Spec.**

- Intel® Xeon Phi KNL 7250
- Single processor / 68 cores
- 96GB DDR4

#### 8,305 Computing Nodes

- Total DRAM ~ 0.76PB
- Up to 44-qubit circuits

#### **Compiler & Setup**

- MVAPICH2 2.3.6
- GNU 10.2
- MPI-only (64 procs/node)

# **Scalability: Element-gate Operation**







#### Messages

- T (index of target qubit) <= L (number of local qubits)
  - → No data transfer via MPI communication
- T > L
  - → Data transfer via MPI communication
  - → Communication overhead increases as T >> L
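The two cases can be mimicked in a single Python process, with a list of arrays standing in for the local vectors of the MPI ranks. This is only a sketch for the X gate under assumed conventions (1-based qubit indices, qubit 1 as the least significant bit, rank taken from the high-order bits); a real code would exchange buffers through MPI calls rather than swapping list entries, and `distributed_x` is a hypothetical name.

```python
import numpy as np


def distributed_x(local_vectors, target, num_local_qubits):
    """Apply Pauli-X on 1-based qubit `target` to a state split across ranks."""
    if target <= num_local_qubits:
        # T <= L: each rank permutes its own amplitudes, no data transfer.
        stride = 1 << (target - 1)
        for vec in local_vectors:
            idx = np.arange(vec.size)
            vec[:] = vec[idx ^ stride]
    else:
        # T > L: the target bit is part of the rank, so each rank trades
        # its whole local vector with one partner rank.
        rank_bit = 1 << (target - num_local_qubits - 1)
        for rank in range(len(local_vectors)):
            partner = rank ^ rank_bit
            if rank < partner:
                local_vectors[rank], local_vectors[partner] = (
                    local_vectors[partner], local_vectors[rank])
    return local_vectors


N, L = 4, 3
state = np.arange(2 ** N, dtype=complex)   # index-valued entries, for visibility
parts = [chunk.copy() for chunk in np.split(state, 2 ** (N - L))]
result = np.concatenate(distributed_x(parts, target=4, num_local_qubits=L))
print(result.real)
```

For the X gate with T > L the exchange involves the entire local vector of every rank, which gives a feel for why non-local targets are the expensive case.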




## **Scalability: A Realistic Case**

Universal quantum circuit for N-qubit quantum gate

#### Nat'l Supercomputer of ROK (The NURION System)

#### Node Spec.

- Intel® Xeon Phi KNL 7250
- Single processor / 68 cores
- 96GB DDR4

#### 8,305 Computing Nodes

- Total DRAM ~ 0.76PB
- Up to 44-qubit circuits

#### **Compiler & Setup**

- MVAPICH2 2.3.6

- GNU 10.2
- MPI-only (64 procs/node)



(Figure: universal quantum circuit on n qubits, built from single-qubit rotations $R_i^j$ with $i = 1, \dots, n$ and $j = 1, \dots, n+1$, together with CNOT gates.)

- 3\*N\*(N+1) + N\*(N-1) parameters, where N = qubit size
- Single case: all the R's = X & all the CNOT's are employed
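The parameter count quoted for the universal circuit can be evaluated directly. In the sketch below, reading the first term as three parameters for each of the N × (N+1) rotations and the second term as the CNOT-related count is an interpretation of the formula, not something the slide states; the function name is hypothetical.

```python
def universal_circuit_parameters(num_qubits):
    """Evaluate 3*N*(N+1) + N*(N-1) for an N-qubit circuit."""
    rotation_term = 3 * num_qubits * (num_qubits + 1)   # assumed: 3 per rotation
    cnot_term = num_qubits * (num_qubits - 1)           # assumed: CNOT-related
    return rotation_term + cnot_term


for n in (2, 41, 44):
    print(n, universal_circuit_parameters(n))
```

The count grows quadratically with the number of qubits, so the circuit size stays modest compared with the exponentially growing state vector.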



**Scalability: A Realistic Case** 

# **Strategies for Service**

National flagship project ongoing in ROK

#### **Project Overview**

- A full-stack & superconductor-based 50-qubit quantum computer (circuit-based)
  - → Project launched in June 2022 with support from NRF & MSIT of ROK
- Research consortium and KISTI R&R:





# **Strategies for Service**

**Cloud-based service framework** 



#### **Quantum Computing Service Framework**

(Figure: architecture of the service framework, with sections labeled RESOURCE, SERVICE FRAMEWORK, and WEB INTERFACE. Labeled components include the Web Portal & User Interface, QC Cloud Web Service, Q-programming web service (JupyterLab), API Gateway, a service mesh of micro services (Account, Resource, Jobs, Data, Pulse, Notification, Document, Storage, Authentication, JupyterLab), Saga Orchestrator, Registry, Router, Database, User Storage, a message broker with event publish/subscribe, Q-Resource API Servers, the Q-Device Controller with the KRISS-powered Quantum Computer, and the Q-Emulator Controller with the Quantum Emulator.)

#### Overview: Technical Components & Flow of KISTI-powered Cloud Service Framework

- The parallelized classical simulator (emulator?) will be served as one of the resources
  - → Beta-version service in early 2025

## **Summary & Remarks**



#### A Massively-scalable Classical Quantum Circuit Simulator

- Message Passing Interface (MPI) to support distributed computing in HPCs
  - → A brief discussion on state-mapping & parallelization scheme
- Demonstration: simulations of up to 41-qubit circuits
  - → The universal quantum circuit for N-qubit quantum gates
  - → Possible to handle up to 44-qubit circuits in the 5th national HPC of ROK

#### **Overview: KISTI-powered Cloud-based Service Framework**

- Target resources: Classical simulator & KRISS-powered quantum computer
- Beta-version service of the simulator through our framework: Early 2025

# **Thank You for Your Attention**