Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Nobuyuki Yoshikawa

Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, Japan

21st International Symposium on Superconductivity, Tsukuba, Japan, October 27-29, 2008

Coworkers:

H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)

I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)

H. Honda, K. Inoue, K. Murakami (Kyushu University)

S. Nagasawa, M. Hidaka (SRL/ISTEC)


Page 1: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Nobuyuki Yoshikawa

Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, Japan

21st International Symposium on Superconductivity

Tsukuba, Japan, October 27-29, 2008

Coworkers

H. Park, H. Hara, K. Taketomi, T. Kainuma, Y. Yamanashi (Yokohama National University)

I. Kataeva, R. Kasagi, S. Iwasaki, H. Akaike, A. Fujimaki, M. Tanaka, K. Obata, Y. Ito, K. Takagi, N. Takagi (Nagoya University)

H. Honda, K. Inoue, K. Murakami (Kyushu University)

S. Nagasawa, M. Hidaka (SRL/ISTEC)

Page 2: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Outline of This Talk

- Background
- Architecture
- Target system
- Component developments
  - Floating-point adders/multipliers (FPA/FPM)
  - 2 x 2 RDP
- New process and cell library
- Road map
- Summary

Page 3: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Demand on High-Performance Computer

Computational cost of electronic-structure calculations of molecules using the molecular orbital method: $O(N^4)$

A molecule with 1000 atoms: 600 TB of ERI calculations, composed of a large number of product-sum operations
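The fourth-power growth can be made concrete by counting two-electron integrals (ij|kl). The following is a minimal sketch, assuming N denotes the number of basis functions and that the integrals have the usual 8-fold permutational symmetry of real basis functions; neither assumption is stated on the slide, and the sizes printed are illustrative and not tied to the 600 TB figure.

```python
def eri_counts(n_basis: int) -> tuple[int, int]:
    """Return (all index quadruples, symmetry-unique quadruples) for (ij|kl)."""
    total = n_basis ** 4
    pairs = n_basis * (n_basis + 1) // 2   # unique (i, j) with i >= j
    unique = pairs * (pairs + 1) // 2      # unique pairs of index pairs
    return total, unique


for n in (10, 100, 1000):
    total, unique = eri_counts(n)
    # The unique count approaches N^4 / 8 for large N, so it is still O(N^4).
    print(f"N={n:5d}  N^4={total:.3e}  unique={unique:.3e}")
```

Even with symmetry exploited, a tenfold increase in N raises the integral count by roughly four orders of magnitude.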

Page 4: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Breakdown of Moore’s Law

Trends of the clock frequency of recent microprocessors

[Figure: clock frequency (GHz, axis from 0.2 to 5) versus year (1998 to 2004) for Pentium III, Celeron, Xeon and Pentium 4; annotated growth rates: 1.6x / year and 1.1x / year. Source: http://www.intel.com/]

Page 5: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Problems in High-Performance Computers and Our Approach

- Large power consumption
- Memory wall problem

(Single Flux Quantum circuits + new architecture) solves these problems

Josephson junction

$\Phi_0 = h/2e = 2.07\ \mathrm{mV \cdot ps}$
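The value of the flux quantum quoted above can be checked directly from the SI defining constants. This is only a small numerical sketch; the variable names are illustrative.

```python
PLANCK_H = 6.62607015e-34             # J*s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19   # C (exact SI value)

phi0_wb = PLANCK_H / (2 * ELEMENTARY_CHARGE)   # Wb, i.e. V*s
phi0_mv_ps = phi0_wb * 1e3 * 1e12              # V*s -> mV*ps

print(f"Phi0 = {phi0_wb:.4e} Wb = {phi0_mv_ps:.2f} mV*ps")
```

Because the time integral of an SFQ voltage pulse equals one flux quantum, a pulse of roughly 1 mV height lasts on the order of 2 ps.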

Page 6: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Large-Scale Reconfigurable Data-Path ( LSRDP ) using RSFQ Circuits

A lot of FPUs + reconfigurable network

The data are directly transferred between FPUs.

Reduction of memory wall problem

N. Takagi et al. IEICE Technical Report, SCE2006-36, January 2007.

Page 7: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Example of Application of LSRDP

tei(4,4,4,4)=(((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(0,t))/(p**2*q**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))(4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(1,t))/(p*q*(p+q))(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q*(p+q)**2)+(2*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)+(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)\+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)(4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)+((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q**2*(p+q)**4)(8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)(8*(PAx+PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)+(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)+((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)(4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(q*(p+q)**5)+(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))
*PQx*(QCx+QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(4*PQx*q*(QCx+QDx)*(3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)(8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5+(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*PQx**6*q**3*f(6,t)))/(p+q)**6+(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(q*(p+q)**6)+(2*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(p*(p+q)**6)

787 MUL, 261 ADD, 69 FUNC

Data-flow graph mapped to the LSRDP

Electron repulsion integral calculations in the molecular orbital method

while (I < 1000):

I = I+1:

Page 8: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

LSRDP Architecture: Suitable for RSFQ Circuits

Data flow in one direction.

No loop structure. Need high throughput.

Latency is not so important.

Suitable for bit-serial processing.

Reduced requirement on memory bandwidth.

High switching activity. Heating would be serious in semiconductor circuits.

Page 9: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Application Fields of LSRDP Processors

Molecular orbital calculation

Diffusion equation

Wave equation

Poisson equation

etc.

Page 10: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Target System: 10-TFLOPS RSFQ-LSRDP Computer

[Block diagram: rows of FPUs interleaved with ORNs, connected to SB and SMAC blocks; SFQ part operated at 4.2 K]

- SFQ RDP: 32 FPUs × 32 chips (4 GFLOPS / FPU)
- SFQ Streaming Buffer: 64 Kb × 2 chips
- CMOS CPU: 1 chip
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
- 1024 FPUs per MCM (34 chips) × 4 MCMs
- 2 TB memory module (FB-DIMM [DDR3@1333MHz, 128 GB] × 16 modules)
- SFQ 0.5 um process
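The figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers from the slide; reading the 34 chips per MCM as 32 RDP chips plus 2 streaming-buffer chips is an interpretation, and the slide does not say how the resulting peak rate relates to the 10-TFLOPS target.

```python
FPUS_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
SB_CHIPS_PER_MCM = 2          # streaming-buffer chips (assumed to sit on the MCM)
MCMS = 4
GFLOPS_PER_FPU = 4
CHANNELS = 16
GB_PER_S_PER_CHANNEL = 16
MEMORY_MODULES = 16
GB_PER_MODULE = 128

fpus_per_mcm = FPUS_PER_CHIP * RDP_CHIPS_PER_MCM            # 1024
chips_per_mcm = RDP_CHIPS_PER_MCM + SB_CHIPS_PER_MCM        # 34
total_fpus = fpus_per_mcm * MCMS                            # 4096
peak_tflops = total_fpus * GFLOPS_PER_FPU / 1000            # arithmetic peak
bandwidth_gb_s = CHANNELS * GB_PER_S_PER_CHANNEL            # per MCM
memory_tb = MEMORY_MODULES * GB_PER_MODULE / 1024

print(f"{fpus_per_mcm} FPUs and {chips_per_mcm} chips per MCM")
print(f"{total_fpus} FPUs in total, peak {peak_tflops:.3f} TFLOPS")
print(f"{bandwidth_gb_s} GB/s per MCM, {memory_tb:.0f} TB memory")
```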

Page 11: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Organization of the Project

Profs. K. Murakami, H. Honda (Kyushu Univ.) LSRDP architecture, compiler, algorithm

Profs. N. Takagi, K. Takagi (Nagoya Univ.) CAD for logic design, arithmetic circuits

Prof. N. Yoshikawa (Yokohama National Univ.) RSFQ-FPU chip, cell library

Profs. A. Fujimaki, H. Akaike (Nagoya Univ.) Network, RSFQ-LSRDP chip, cell library

Dr. S. Nagasawa (SRL) Advanced process

Page 12: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Component Development

- Floating-point adder (FPA)
- Floating-point multiplier (FPM)
- Operand routing network (ORN)
- 2 x 2 LSRDP prototype

Page 13: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Floating-Point Numbers

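Data format in IEEE754 standard (field widths in bits)

                    Sign   Exponent   Fraction
Half-precision       1        5         11
Single-precision     1        8         24
Double-precision     1       11         53

The Fraction column counts the significand including the implicit leading 1; the stored single-precision field is 23 bits.

Single-precision layout: S (1 bit) | E (8 bit) | F (23 bit)

S: Sign

E: Exponent

F: Significand or Fraction

Value: $(-1)^S \times F \times 2^E$

Example (single precision, 32 bit): $1.101_2 \times 2^4$

0 10000011 10100000000000000000000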
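The single-precision example can be reproduced with Python's standard `struct` module. This is a small illustrative sketch (the helper name `float32_fields` is made up here): the value 1.101 (binary) × 2^4 equals 26.0, and in IEEE 754 single precision the exponent is stored with a bias of 127, so the exponent field holds 4 + 127 = 131.

```python
import struct


def float32_fields(x: float) -> tuple[str, str, str]:
    """Split the IEEE 754 single-precision encoding of x into S, E, F bit strings."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]


value = 0b1101 / 8 * 2 ** 4   # 1.101 (binary) x 2^4 = 26.0
sign, exponent, fraction = float32_fields(value)
print(sign, exponent, fraction)
print("stored exponent:", int(exponent, 2), "-> unbiased:", int(exponent, 2) - 127)
```

The leading 1 of the significand is implicit, so only the bits after the binary point (101 followed by zeros) appear in the fraction field.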

Page 14: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Bit-Serial Floating-Point Calculation

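[Figure: bit-serial data format along the time axis t; a significand stream of nf bits and an exponent/sign stream of ne bits, each with its MSB and LSB marked]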

Two bit-serial data-paths are used for the calculation of significand and exponent.
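Bit-serial arithmetic processes one bit per clock cycle, so a single full-adder stage plus a stored carry is enough to add words of any length. The sketch below is a software model of that idea, assuming the stream arrives LSB first (the natural order for carry propagation); it is not a description of the actual RSFQ circuit, and the helper names are illustrative.

```python
def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit per clock."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry     # full adder on the current bit
        out.append(total & 1)     # sum bit leaves the adder this clock
        carry = total >> 1        # carry is held for the next clock
    return out, carry


def to_lsb_first(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_lsb_first(bits):
    return sum(bit << i for i, bit in enumerate(bits))


sum_bits, carry_out = bit_serial_add(to_lsb_first(25, 5), to_lsb_first(13, 5))
print(sum_bits, carry_out, from_lsb_first(sum_bits) + (carry_out << 5))
```

The hardware cost stays constant with word length, while the time per operation grows linearly with it, which is why throughput rather than latency is the figure of merit for this architecture.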

Page 15: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Timing Parameters in Bit-Serial Calculation

[Figure: timing chart of an operation unit; clock and data streams for Input 1, Input 2 and Input 3 and for Output 1 along the time axis, each word with its LSB and MSB marked]

Page 16: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Floating-Point Addition: Example

Page 17: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Block Diagram of Bit-Serial FPA

[Block diagram of the bit-serial FPA]

Inputs: significand of A (Fa), significand of B (Fb), exponent & sign of A (Ea, Sa), exponent & sign of B (Eb, Sb)

Outputs: significand of result, exponent & sign of result

Blocks: separator circuit, comparator of magnitude (A > B), shifter of A, shifter of B, buffers, normalizer, normalizer controller, sign and exponent combine circuit

Internal signals: result of "A - B", shift value, effective operation, amount of correction, sign of result, result of operation

Legend: data signals, control signals

Processing steps:

(1) Align significand & rounding

(2) Addition (or subtraction)

(3) Normalization
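The three processing steps of floating-point addition (align, add or subtract, normalize) can be sketched in software as follows. This is a simplified integer model under stated assumptions, not the circuit on the slide: operands are (sign, exponent, significand) triples with an unbiased exponent, the significand is an nf-bit integer whose top bit is 1, alignment truncates instead of rounding, and zero is returned as (0, 0, 0).

```python
NF = 11  # significand bits including the leading 1 (half precision)


def fp_add(a, b, nf=NF):
    """Add two (sign, exponent, significand) triples.

    Value = (-1)**sign * significand * 2**(exponent - (nf - 1)).
    """
    # Comparator of magnitude: make `a` the operand with the larger magnitude.
    if (b[1], b[2]) > (a[1], a[2]):
        a, b = b, a
    sa, ea, fa = a
    sb, eb, fb = b
    # (1) Align: shift the smaller significand right (truncation, no rounding).
    fb >>= ea - eb
    # (2) Effective operation: add for equal signs, subtract otherwise.
    f = fa + fb if sa == sb else fa - fb
    e = ea
    if f == 0:
        return (0, 0, 0)
    # (3) Normalize: bring the leading 1 back to bit position nf - 1.
    while f >= 1 << nf:
        f >>= 1
        e += 1
    while f < 1 << (nf - 1):
        f <<= 1
        e -= 1
    return (sa, e, f)


def to_float(x, nf=NF):
    sign, e, f = x
    return (-1) ** sign * f * 2.0 ** (e - (nf - 1))


one_and_half = (0, 0, 0b11000000000)       # 1.1 (binary) x 2^0 = 1.5
three = fp_add(one_and_half, one_and_half)
print(three, to_float(three))
```

The sign of the result follows the larger operand, which is why the magnitude comparison comes first in the data flow.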

Page 18: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Chip Photograph of Half-Precision FPA

CONNECT: cooperated with SRL, NiCT, NU & YNU

*nf: bit length of significand

Page 19: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

DC Bias Margin of Each Component Circuits @20GHz

Page 20: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Floating-Point Multiplier

Significand part is calculated by a systolic-array multiplier: $Z_f = X_f \times Y_f$

Exponent part is calculated by a bit-serial adder: $Z_e = X_e + Y_e$

Data format: S (1 bit) | E (8 bit) | F (23 bit)

S: Sign, E: Exponent, F: Fraction

$(-1)^S \times F \times 2^E$

Page 21: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Systolic-Array Multiplier

- Composed of a 1D array of 1-bit processing elements (PEs).
- Small hardware cost: ∝ (bit length)
- High throughput: ~ 1/(bit length)

[Figure: array of PEs with bit-serial input and output streams, MSB and LSB marked]

Page 22: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Chip Photograph of Half-Precision FPM

CONNECT: cooperated with SRL, NiCT, NU & YNU

*nf: bit length of significand

Page 23: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Test Result of FPM@25GHz

[Calculation of exponent part]

(10) + (-2) + 1 = 9 (EX + EY + carry from fraction part)

FX: 11010110111   EX: 11001 (10)

FY: 11001010011   EY: 01101 (-2)

FXY: 10101001110   EXY: 11000

(waveform axis labels: LSB, MSB)

Correct operation was confirmed at high speed.

Maximum operating frequency: 31.5 GHz
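The test vector above can be checked with a short software model of the multiplier. The sketch assumes that the printed bit strings are read MSB first, that the exponent uses the IEEE 754 half-precision bias of 15, and that the product is truncated to nf bits; these are assumptions made for this check, and with them the model reproduces the FXY and EXY values on the slide.

```python
NF = 11     # significand bits, including the leading 1
BIAS = 15   # assumed exponent bias (IEEE 754 half precision)


def fp_mul(fx, ex, fy, ey, nf=NF, bias=BIAS):
    """Multiply significands, add biased exponents, normalize by truncation."""
    product = fx * fy                    # up to 2 * nf bits
    carry = product >> (2 * nf - 1)      # 1 if the significand product is >= 2.0
    fz = product >> (nf - 1 + carry)     # keep the top nf bits
    ez = ex + ey - bias + carry          # biased exponent of the result
    return fz, ez, carry


fz, ez, carry = fp_mul(0b11010110111, 0b11001, 0b11001010011, 0b01101)
print(f"FXY = {fz:011b}, EXY = {ez:05b}, carry = {carry}")
print("unbiased exponents:", 0b11001 - BIAS, 0b01101 - BIAS, ez - BIAS)
```

The carry out of the significand product is what adds the extra 1 in (10) + (-2) + 1 = 9.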

Page 24: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Summary of Half-Precision FPUs

                             Floating Point Adder   Floating Point Multiplier
# of JJs                     11700                  11044
Size (mm2)                   6.76 x 4.96            6.22 x 3.78
Minimum interval (clocks)    12 (nf + 1)
Latency (clocks)             23 (2 nf + 1)

CONNECT: cooperated with SRL, NiCT, NU & YNU

nf: bit length of fraction part
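The timing figures translate into an operation rate once a clock frequency is chosen. The sketch below assumes that one floating-point operation can be issued every minimum interval of nf + 1 clocks and completes after 2 nf + 1 clocks; the clock frequencies are example inputs, and the function name is illustrative.

```python
def fpu_timing(clock_ghz: float, nf: int = 11) -> tuple[float, float]:
    """Return (operations per ns, i.e. GFLOPS; latency in ns) for a bit-serial FPU."""
    min_interval_clocks = nf + 1
    latency_clocks = 2 * nf + 1
    return clock_ghz / min_interval_clocks, latency_clocks / clock_ghz


for clock_ghz in (25.0, 60.0, 100.0):
    rate, latency_ns = fpu_timing(clock_ghz)
    print(f"{clock_ghz:5.1f} GHz: {rate:.2f} GFLOPS per FPU, latency {latency_ns:.2f} ns")
```

With nf = 11 this gives the 12-clock interval and 23-clock latency listed in the table.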

Page 25: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

NDRO-based and crossbar-based architectures of ORN

NDRO-based ORN

[Figure: rows of FPUs interconnected through NDRO switches]

“+”: small number of Josephson junctions required

“–”: irregular, non-pipelined structure; becomes cumbersome as the complexity increases

Crossbar-based ORN

[Figure: rows of M FPUs interconnected through crossbar tiles (CBT and ½CBT)]

“+”: scalable, pipelined, easily re-designed for any number of N and M

“–”: large number of Josephson junctions required

ORN requirements:

- 1-to-N connections, where N is an odd number
- connections to either input of the FPU

Page 26: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Comparison of the ORN architectures

NDRO-based ORN

ORN complexity   latency, ps   skew, ps   minimum interval   number of control lines   bias current, A   power, mW   number of JJs
N=3, M=8         ~60           ~60        nf+60ps            96                        0.6               1.5         ~5500
N=5, M=10        ~80           ~80        nf+80ps            200                       0.9               2.25        ~8000
N=9, M=32        ~100          ~100       nf+100ps           1152                      5.5               13.75       ~50500

Crossbar-based ORN

ORN complexity   latency, clocks   skew, ps   minimum interval   number of control lines   bias current, A   power, mW   number of JJs
N=3, M=8         6                 ~300       nf                 100                       0.63              1.575       6230
N=5, M=10        10                ~500       nf                 208                       1.41              3.525       13930
N=9, M=32        18                ~900       nf                 1168                      8.28              20.7        77440

The number of JJs of the NDRO-based ORN in the table is an estimate based on the design of the switch for the RDP prototype (N=3, M=4), which consists of 2750 JJs and requires 300 mA bias current (Iwasaki, not published yet).

A crossbar switch with broadcasting function: 296 JJs

Note that almost the same number of JJs is required for both ORNs if an isometric (equal-length wiring) network is employed in the NDRO-based ORN.
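The power column in both comparison tables equals the bias current multiplied by a constant of 2.5 mV. The sketch below verifies this for every row; the 2.5 mV bias voltage is inferred from the ratios in the tables and is not stated explicitly on the slide, and the row labels simply follow the order in which the two tables appear.

```python
import math

BIAS_VOLTAGE_MV = 2.5  # assumed DC bias voltage, inferred from power / current


def static_power_mw(bias_current_a: float, bias_voltage_mv: float = BIAS_VOLTAGE_MV) -> float:
    """Static power of a DC-biased circuit: amperes x millivolts = milliwatts."""
    return bias_current_a * bias_voltage_mv


# (bias current in A, power in mW) as listed in the two tables
rows = {
    "first table, N=3, M=8": (0.6, 1.5),
    "first table, N=5, M=10": (0.9, 2.25),
    "first table, N=9, M=32": (5.5, 13.75),
    "second table, N=3, M=8": (0.63, 1.575),
    "second table, N=5, M=10": (1.41, 3.525),
    "second table, N=9, M=32": (8.28, 20.7),
}

for name, (current_a, listed_mw) in rows.items():
    computed = static_power_mw(current_a)
    status = "ok" if math.isclose(computed, listed_mw, rel_tol=1e-9) else "mismatch"
    print(f"{name}: {computed:.3f} mW vs listed {listed_mw} mW ({status})")
```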

Page 27: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

1-to-2 ORN test

[Figure: test circuit with three crossbar tiles (CBT0, CBT1, CBT2) and a ladder; data inputs din0, din1, din2; data outputs dout01, dout02, dout11, dout12; clock lines clkin_lfin, clkin_lfout1, clkin_lfout2, clkin_hf; bias lines bias_kern0, bias_kern1, bias_kern2]

1-to-2 ORN: 2043 JJs, bias current 226 mA

Total test circuit: 3098 JJs, total bias current: 359 mA

Example: open466, no. 4 chip F2

Completely functional, exhaustive test:

- bias_kern0 = -14.6 / 5.3 %, does not depend on the pattern
- bias_kern1 = -16.1 / 18.3 % for din0 -> dout11, dout12
- bias_kern2 = -20.7 / 12.6 % for din0 -> dout11, dout12 (minimum!)
- bias_kern1 = -40.3 / 17.2 % for din1 -> dout01
- bias_kern2 = -38 / 12.6 % for din2 -> dout02, dout12 (maximum!)

[Figure: example of the low-frequency test, din0 -> dout01, dout02, dout12; waveforms din0, cross10, bar00, clkin_lfin, dout01, dout11, clkout, clkout1, clkout2, dout02, dout12, cross11, cross01, bar02, bar12]

[Figure: frequency dependence of the bias margins, bias_kern1 margins for din0 -> dout11 routing; upper and lower margins (axis from -30 to 20) at 10.842, 12.679, 14.324, 15.858, 17.241, 18.818, 20.345, 21.854 and 23.480]

Page 28: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Design of 2x2 SFQ-RDP

[Figure: chip layout with Input SR, ALUs, ALU Controller, ORN, Buffers and Output SR; scale bar 1 mm]

• 11 pipeline stages
• Designed frequency: 25 GHz
• InSR & OutSR length: 16 bits
• Data length: 7 bits
• Bias current: 1.27 A
• Circuit area: 5.90 x 3.68 mm2
• 10839 JJs

Page 29: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Demonstration of 2x2 SFQ-RDP

[Figures: input patterns, output patterns, and frequency characteristic of the RDP]

Maximum operating frequency: 23 GHz

The function for each ALU is chosen as shown above.

Page 30: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Device Structure of Nb 10-layer Fabrication Process

Page 31: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Layout

■ DCP (M1)

■ Bias pillar (C1, 2, 3, 4, 5, 6, GC): 5 x 5 μm2

■ 6-layer moat (M2, 3, 4, 5, 6, 7)

□ PTL (M3, 5): width 4.8 - 5.5 μm

□ Via of PTLs: less than 12 x 12 μm2

Scale: 30 μm

Maximum current value: 2.4 mA (limited by size of contacts)

Page 32: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Cell library

[Figure: layouts of the cells, scale 30 μm]

Cells: DC/SFQ, SFQ/DC, CBE, D2FF, DFF, JAND, JANDF, JNOR, JNOT, JOR, RTFFB, SPL3, SPLL, T1

Jc: 10 kA/cm2, βc = 2

Page 33: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Design of Bit-Serial Half Adder using a New Cell Library

Jc: 10 kA/cm2, βc = 2

Logic simulation results of bit-serial half adder

Page 34: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

On-Chip High-Speed Test Results of Bit-Serial Half Adder

[Figure: test circuit with clock generator, shift register for input, bit-serial adder and shift register for output]

Jc: 10 kA/cm2, βc = 2

Page 35: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Road Map of RSFQ LSRDP Processor

Timeline: 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 -

Process: 2.5 kA/cm2 process -> 10 kA/cm2 process -> 40 kA/cm2 process

Milestones:

- 25 GHz FPU/RDP
- 60 GHz FPU & LSRDP prototype
- 100 GHz FPU & LSRDP prototype
- 10 TFLOPS LSRDP system development

Page 36: Recent development of large-scale reconfigurable data-paths using RSFQ circuits

Summary

Our target is to establish a fundamental technology for high-end supercomputers based on the large-scale reconfigurable data-path (LSRDP) architecture.

Some key components were designed and implemented using the standard Nb process, and their correct operation was demonstrated:

- Half-precision RSFQ FPA and FPM
- Operand routing network (ORN)
- 2 x 2 RDP

The structure of the SRL advanced II process was determined, and a new cell library is under development. 85 GHz operation of a bit-serial half adder was demonstrated.