#### 2EO1-01

# Execution of Stored Program in a Single-Flux-Quantum Microprocessor with Embedded Memories at 50 GHz

<u>Akira Fujimaki</u>, Ryo Sato<sup>1</sup>, Yuki Hatanaka<sup>1</sup>, Yuichi Matsui, Kazuyoshi Takagi<sup>2</sup>, Naofumi Takagi<sup>2</sup>, Masamitsu Tanaka<sup>1</sup>

<sup>1</sup>Nagoya University, <sup>2</sup>Kyoto University

#### Acknowledgment

This work was supported by JST ALCA and JSPS KAKENHI (Grant Numbers 16H02340, 26220904, and 16H02796), and by the VLSI Design and Education Center of the University of Tokyo, in collaboration with Cadence Design Systems, Inc. The circuits were fabricated in CRAVITY of AIST.

### **Outline**

- History of development of SFQ microprocessors
- Execution of stored programs
- First demonstration outside Si technology
- Bit-parallel ALU for lowered latency
- Other issues for practical SFQ microprocessors
- Summary

## "Power-Wall Problem" in Large-Scale Computing

- Supercomputers
  - 10–20 MW
  - Use many MPU cores operating at 1–2 GHz
  - Requirements for 10<sup>18</sup> FLOPS: 1 GW power & 1 km<sup>2</sup> area
- Data centers
  - 1.3% of all electricity use in the world (2010) <sup>[1]</sup>

A new technology for energy-efficient high-performance computing is desired.

### **Special Features of SFQ Circuits**

- Signal propagation at the speed of light with small distortion in interconnects based on waveguides.
- No recharge process in either logic operations or interconnects.
- Scaling law

### **Appealing Feature of SFQ Circuits**

### **Development of RSFQ Microprocessors**

- CORE: simple, bit-serial processing
  - Ease circuit complexity. Use hardware (& energy) efficiently.
  - High-frequency clock operations (15–100 GHz)

#### **CORE1α v5 (2003)**

4999 JJs, 15 GHz, 167 MIPS, 1.6 mW, $J_c$ = 2.5 kA/cm<sup>2</sup>. Prototype.

#### **CORE1β v9e (2006)**

10955 JJs, 25 GHz, 1400 MOPS, 3.3 mW, $J_c$ = 2.5 kA/cm<sup>2</sup>. Pipeline processing.

#### **CORE1α LV (2013)**

3869 JJs, 35 GHz, 400 MIPS, 0.23 mW, $J_c$ = 10 kA/cm<sup>2</sup>. Energy efficiency.

#### **CORE100 (2015)**

3073 JJs, 100 GHz, 800 MIPS, 1.0 mW, $J_c$ = 20 kA/cm<sup>2</sup>. Ultrahigh frequency.

### Main Issues Left for Practical Microprocessors

- Execution of meaningful programs stored in cryogenic memory with energy-efficient SFQ circuits
- High-frequency operation of bit-parallel processing for small latency
- Energy-efficient power supply for dc-powered SFQ circuits
- Large capacity memory and voltage amplifiers with small footprints

### **Comparison of Energy-Efficient Circuits**

- Typical case in the 10 kA/cm<sup>2</sup> process; values may vary with the target.
- AC-powered circuits require smaller supply currents, but these must be supplied at a high frequency equal to the operating speed.

Easy to be replaced

| | RSFQ<br>(Base) | LR-bias<br>LV-RSFQ | ERSFQ<br>eSFQ | RQL | AQFP |
|---|---|---|---|---|---|
| Power supply | DC | DC | DC | AC | AC |
| Static power | 1x | 0.1x | 0 | 0 | 0 |
| Dynamic power | 1x | ~1x | 1.1–2x | 2x | << 0.1x |
| Clock frequency | 50 GHz | 20–50 GHz | 20–50 GHz | 10–20 GHz | 5 GHz |
| Power<br>(per 1k JJs) | 2.5 mW<br>(2.5 mV, 1 A) | 0.2 mW<br>(0.2 mV, 1 A) | 0.1 mW<br>(0.1 mV, 1 A) | 20 μW<br>(0.6 mA, 50 Ω) | 25 nW<br>(20 μA, 50 Ω) |
| Gate energy | 100 aJ | 5–10 aJ | 2–4 aJ | 0.5–1 aJ | 0.01 aJ |

[LR-bias] T. Nishidai et al. *Physica C* **445-448** (2006) 1029; [LV-RSFQ] M. Tanaka et al. *Jpn. J. Appl. Phys.* **51** (2012) 053102; [ERSFQ] D. E. Kirichenko et al. *IEEE Trans. Appl. Supercond.* **21** (2011) 776; [eSFQ] M. Volkmann et al. *Supercond. Sci.
Technol.* **26** (2013) 015002; [RQL] Q. P. Herr et al. *J. Appl. Phys.* **109** (2011) 103903; [AQFP] Takeuchi et al. *J. Appl. Phys.* **115** (2014) 103910.

### **Next Big Milestone**

Bit-serial microprocessors integrated with sufficient RAM to demonstrate "stored-program computing".

We wanted the stored programs to include the program for finding the greatest divisor, which was demonstrated on the first stored-program computer.

### Manchester Small-Scale Experimental Machine (1948)

(First stored-program computer)

- ✓ 5.2 × 2.2 m, 1000 kg
- ✓ 550 vacuum tubes
- ✓ 3.5 kW

F. C. Williams and T. Kilburn, "Electronic Digital Computers", Nature **162** (1948) 487

### **CORE e Series**

| | CORE e2 (v5h) | CORE e4 (v5) |
|---|---|---|
| # of Registers | 2 | 4 |
| # of Instructions | 13 | 20 |
| Memories | 2 x 128 bits | 2 x 256 bits |
| Performance | 333 MIPS<br>(50 GHz bit-serial operation; 2 GHz clock / 6 cycles per instruction) | |
| Power | 2.52 mW | 4.57 mW |
| JJ Count | 10603 | 20330 |
| Feature | Minimal microarchitecture design with reduced RAMs | Full-featured CORE e MPU with a rich instruction set |

### Microarchitecture (e2)

Challenges in design:

- ✓ High-frequency operation and signal transmission
- ✓ Adjustment of timings between the two different clocks
- ✓ Compact layouts in limited space
- ✓ Saving bias currents

### **Controller Unit**

- We carefully assigned the opcode bits.
  - op7 and op6: the primary opcode
  - op5: a flag for ALU operation
  - op4: corresponds to reg1\_read or a flag for conditional branch
  - op3: a flag for indirect-addressing memory access
  - op2, op1, op0: correspond to reg0\_read, alu\_inv, and alu\_inc
- BDD node count: 19
- 763 JJs, 213 μW
- 0.300 mm<sup>2</sup>, 275k JJ/cm<sup>2</sup>

### **CORE e2 ALU**

- Supported operations
  - ✓ ADD, SUB, MV, INC, DEC
  - ✓ Comparison (NE, LT)
- 348 JJs, 98.3 μW
- 0.13 mm<sup>2</sup>, 274k JJ/cm<sup>2</sup>

### **256-bit Shift-Register Memory**

- 6597 JJs, 1.41 mW (Min. $I_c$ = 70 μA)
- 1.38 mm × 1.95 mm
- 300k JJs/cm<sup>2</sup>
- Access time (at 50 GHz)
  - Decode: 60 ps
  - Readout: 285 ps
  - Erase: 222 ps

### CORE e2 Chip (128-bit memories)

- Bias current: 1007 mA
- Area: 5.64 mm²

### **Test Programs**

Constraint: small-scale programs written within 16 lines

- sum: Calculate 1 + 2 + ... + N
- total: Calculate the sum of an array
- div: Integer division
- aliq: Compute the greatest divisor (the program demonstrated on the SSEM)
- gcd: Euclidean algorithm to find the greatest common divisor (GCD) (algorithm described in 300 BC)

| Inst. | Name | Instruction Meaning |
|-------|-------------------|-----------------------------------------|
| HLT | Halt | Stop. |
| ADD | Add | Reg0 ← Reg0 + Reg1 |
| SUB | Subtract | Reg0 ← Reg0 - Reg1 |
| MV | Move | Reg0 ← Reg1 |
| INC | Increment | Reg0 ← Reg0 + 1 |
| DEC | Decrement | Reg0 ← Reg0 - 1 |
| SKNE | Skip if not equal | if (Reg0 != Reg1) PC ← PC + 2 |
| SKLT | Skip if less than | if (Reg0 < Reg1) PC ← PC + 2 |
| LDR | Load register | Reg1 ← DM[Reg1] |
| STR | Store register | DM[Reg1] ← Reg0 |
| JMP | Jump | PC ← addr |
| LD | Load | Reg1 ← DM[addr] |
| ST | Store | DM[addr] ← Reg0 |

All the programs have been demonstrated successfully.

### **Program Demonstration (Aliquot X=21)**

This is the third demonstration of the Aliquot program: vacuum tube (1948), Si, and SFQ.
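As an illustration of what the aliq and gcd test programs compute, the following is a minimal Python sketch that restricts itself to the kinds of operations the CORE e2 instruction set offers (subtract, decrement, and NE/LT comparisons). The function names and loop structure are assumptions made for this sketch; it is not the program that was run on the chip.

```python
def divides(divisor: int, n: int) -> bool:
    """Return True if divisor divides n, using repeated subtraction only."""
    remainder = n
    while remainder >= divisor:  # compare, then subtract (SKLT/SUB style)
        remainder -= divisor
    return remainder == 0


def greatest_proper_divisor(n: int) -> int:
    """Largest divisor of n that is smaller than n (requires n >= 2)."""
    candidate = n - 1            # start just below n and count down (DEC style)
    while not divides(candidate, n):
        candidate -= 1
    return candidate             # reaches 1 at the latest, which always divides n


def gcd_by_subtraction(a: int, b: int) -> int:
    """Euclid's algorithm in its subtractive form (requires a, b >= 1)."""
    while a != b:                # SKNE-style comparison
        if a < b:                # SKLT-style comparison
            b -= a
        else:
            a -= b
    return a


if __name__ == "__main__":
    # X = 21 is the input named in the demonstration slide title.
    print(greatest_proper_divisor(21))
    print(gcd_by_subtraction(21, 14))
```

Trial subtraction is used instead of division or modulo because the instruction table lists no multiply or divide instruction; any mapping of these loops onto the 13 instructions is left open here.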
### **Frequency Dependence**

- Successfully executed programs composed of up to 200 instructions.
- Obtained comparable bias margins for five different test programs.
  - ✓ Overlap margin was 5% at 50 GHz
  - ✓ Maximum frequency of bit-serial operation was 61 GHz

### **Next Phase**

#### Bit-Serial (Complexity-Reduced)

- **CORE1β MPU (2006)**: 25 GHz, 10955 JJs
- **CORE1α MPU (2003)**: 15 GHz, 4999 JJs
- **CORE e2 MPU (2014–2016)**: 50 GHz, 10000–20000+ JJs
- **CORE100 MPU (2015)**: 100 GHz, 3073 JJs

#### **High energy-efficiency**

Manycore (Many-CORE e)

#### **Low-Latency**

Bit-Parallel, Ultrapipeline

### **Prospect of Performance and Efficiency**

(Figure: 32-bit parallel or compatible; axis: Power Efficiency (w/ cooling penalty) (MIPS/W))

### **Bit-Parallel Gate-Level Pipelined ALU**

4868 JJs, 50 GIPS, 1.4 mW, 36000 GIPS/W

Collaboration with Prof. Inoue, Kyushu Univ.

#### **Specification**

- ✓ Target frequency: 50 GHz
- ✓ Datapath: Ultra pipelining (gate-level pipelining)
- ✓ Bitwidth: 8 bits
- ✓ Functions: ADD, SUB, AND, OR, XOR, NOR, etc.

#### Based on Brent-Kung adder

- Minimum number of logic gates (w/o D flip-flops)
- Sparse wiring tracks
- Small fanouts (Max. 3)
- Maximum logic depth

R. Brent and H. Kung, IEEE Trans. Comput.
C-31 (1982) 260

### Addition/subtraction in Parallel ALU

### **Main Issues Left for Practical Microprocessors**

- Execution of meaningful programs stored in cryogenic memory with energy-efficient SFQ circuits
  - ✓ resolved
- High-frequency operation of bit-parallel processing for small latency
  - ✓ resolved
- Energy-efficient power supply for dc-powered SFQ circuits
  - ✓ resolved by using superconducting diodes
- Large capacity memory and voltage amplifiers with small footprints
  - ✓ resolved by nano-cryotrons and CMOS memory

### **Summary**

- Classical RSFQ circuits have matured over the decades.
- Programs stored in embedded memories have been executed at 50 GHz, which is the first such demonstration outside Si technology.
- Eight-bit-parallel processing has been executed at 50 GHz with the gate-level-pipelining technique.
- By introducing new technologies such as superconducting diodes or nano-cryotrons (nTrons), the issues for practical applications are resolved.
- SFQ technologies should be combined with quantum information processing technologies for next-generation supercomputers.

### Thank you for your attention

### AC/DC Converter for DC-Powered SFQ Circuits

An AC/DC converter is essential for DC-powered SFQ circuits.

(Figure label: After introduction of superconducting diodes)

### Superconducting Diode Based on Residual Magnetization

(Figure: In-line-type JJ (20 μm x 1 μm); axis: External Field (Arb. Unit))

- A diode with $V_{th}$ = 0 is obtained.
- Critical currents can be controlled.

### **Rectification with Superconducting Diodes**

We can control DC output voltages by changing the phase of the switching.

This might open the way to superconducting power electronics.

### **Issue for Energy-Efficiency**

R<sub>b</sub> is used for providing a constant current to each Josephson junction.
Power consumption at $R_b$ (static power consumption):

$$P_{\text{bias}} = \frac{V_b^2}{R_b} \approx 0.7 I_c V_b$$

Example (DFF): $P_{\text{bias}} = 1.8\ \mu\text{W}$

Power consumption at $R_s$ (dynamic power consumption):

$$P_{\text{shunt}} = f I_c \Phi_0$$

where $f$ is the operating frequency. Typically, $I_c \Phi_0 \approx 2 \times 10^{-19}\ \text{J}$.

Example (DFF): $P_{\text{shunt}} = 36\ \text{nW}$

Static power consumption must be eliminated.

### **DC-Powered Energy-Efficient SFQ Circuits**

Bias resistors are replaced with inductors and junctions.

#### ERSFQ circuit (Hypres)

D. E. Kirichenko, et al., IEEE Trans. Appl. Supercond., **21**, 776 (2011).

#### Advantage

- The design base is already established because design resources from RSFQ circuits can be reused.
- PTLs can be used as interconnects.
- Possibly suitable for higher density because no mutual coupling is used.

#### Disadvantage

- It is difficult to build an energy-efficient voltage supply around 0.1 mV.

### **AC-Powered Energy-Efficient SFQ Circuits**

Circuits are driven by AC currents provided via transformers.

#### Example: Reciprocal Quantum Logic (Northrop Grumman)

Q. P. Herr, et al., J. Appl. Phys., **109**, 103903 (2011).

#### Advantage

- The supplied AC currents are also used as clock signals.
- NOT logic is easy to implement.
- As a result, RQL circuits can be built with a smaller number of junctions.

#### Disadvantage

- Transformers are needed for all gates, which makes downsizing to the sub-micron scale difficult.
- High-frequency design techniques are essential for operation.

### **AC-Powered Energy-Efficient SFQ Circuits**

Circuits are driven by AC currents provided via transformers.

#### Example: Adiabatic Quantum Flux Parametron (Yokohama Nat'l Univ.)

#### Advantage

- Very small energy consumption because there is no phase jump in switching.
- All logic operations are built on a single 'majority' gate, which makes the circuits robust to process variation.
#### Disadvantage

- Operating frequency is relatively low.
- Difficult to make long interconnects.
- DC offset currents are needed for

### **Energy-Efficiency in Integrated Circuits**

$$\text{Energy consumption} = \frac{\text{Total power} \times \text{Clock cycle}}{\text{Number of devices}}$$

STP: AIST 2.5-kA/cm<sup>2</sup> Nb/AlO<sub>x</sub>/Nb Standard Integrated Circuit Process.

ADP: AIST 10-kA/cm<sup>2</sup> Nb/AlO<sub>x</sub>/Nb Advanced Integrated Circuit Process.

### Nanowire Cryotron (nTron)

A. N. McCaughan and K. K. Berggren, Nano Lett. 14 (2014)

### nTron Family (MIT)

- Single nanowire
- SNSPD
- Gate isolated cryotron (hTron)
- Current crowding cryotron (yTron)

Courtesy of Dr. Zhao (MIT)

### Demo. of nTron for Driving Semicon. Tr

The nTron can serve as the voltage amplifier needed between SFQ circuits and semiconductor circuits.

Q.-Y. Zhao et al, Supercond. Sci. Technol. 30 (2017)

### NbTiN nTron + CMOS memory cell

### **Issues for Larger-Scale Integration**

(Figure: D flip-flop footprint, assuming Min. $I_c$ = 50 μA. Labels recovered from the figure:)

- Cell elements: bias feed resistor, storage loop (~20 pH), shunt-resistor-free JJs
- D flip-flop sizes: 40 × 80 μm, 20 × 40 μm, 8 × 16 μm, 2 × 4 μm
- ERSFQ/eSFQ (elimination of resistors)
- High sheet inductance (NbN, etc.)
- JJ stack: vertically stacked structure of two JJs and the bias resistor
- **Elimination of shunt resistors**
- Equipment update
- Line and space: 1 μm
- 0.25 μm