TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon,...
Transcript of TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon,...
![Page 1: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/1.jpg)
1
TensorDIMM: A Practical Near-Memory Processing Architecture
for Embeddings and Tensor Operations in Deep Learning
KAIST
Youngeun Kwon, Yunjae Lee, and Minsoo Rhu
![Page 2: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/2.jpg)
2
Research Scope
![Page 3: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/3.jpg)
3
Primarily focused on “dense” DNN layers (e.g., CNNs, RNNs, …)
DL architecture research so far
* Chen et al., “DaDianNao: A Machine-Learning Supercomputer”, ISCA-2014* Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks”, ISCA-2016* Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA-2016* Parashar et al., “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks”, ISCA-2017* Yazdanbakhsh et al., “GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks”, ISCA-2018
![Page 4: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/4.jpg)
4
“Non” conventional DNN layers are causing a bottleneck
Emerging DL applications?
* Devlin et al., “Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arxiv.org, 2018* Graves et al., “Neural Turing Machines”, arxiv.org, 2014* Naumov et al., “Deep Learning Recommendation Model for Personalization and Recommendation Systems”, arxiv.org, 2019
BERT Neural Turing Machine Recommendation
![Page 5: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/5.jpg)
5
“Sparse” embedding layers (rather than dense DNNs) are the bottleneck
Personalized recommendation models
![Page 6: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/6.jpg)
6
Projection of sparse features into dense vector dimension (e.g., word2vec)
What is an embedding?
* Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, NIPS-2013
KING
QUEEN
UNCLE
AUNT
MAN
WOMAN
KING
QUEEN
KINGS
QUEENS
![Page 7: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/7.jpg)
7
Stored as a large look-up table containing millions-to-billions of entries
What is an embedding?
User ID Embedding (vector)
0: Sam [0.49, 0.52, 0.23, 0.69, 0.32, …]
1: Harry [0.24, 0.27, 0.13, 0.09, 0.79, …]
2: Matt [0.31, 0.71, 0.46, 0.91, 0.07, …]
3: John [0.83, 0.43, 0.81, 0.57, 0.09, …]
4: Elicia [0.31, 0.83, 0.23, 0.69, 0.86, …]
… …
N: Danny [0.77, 0.18, 0.71, 0.59, 0.46, …]
N: can be millions
![Page 8: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/8.jpg)
8
Goal: predict a preference of user-item pair
Recommendation model 101
Sam
![Page 9: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/9.jpg)
9
Goal: predict a preference of user-item pair
Recommendation model 101
[Embedding table]
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
Sam
![Page 10: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/10.jpg)
10
Goal: predict a preference of user-item pair
Recommendation model 101
[Embedding table]
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
Sam
![Page 11: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/11.jpg)
11
Goal: predict a preference of user-item pair
Recommendation model 101
[Embedding table]
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
Sam
DNNs
![Page 12: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/12.jpg)
12
Goal: predict a preference of user-item pair
Recommendation model 101
[Embedding table]
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
Sam
DNNs
🙂?🙁
Prediction
![Page 13: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/13.jpg)
13
Key Primitives in Embedding Layers
![Page 14: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/14.jpg)
14
Copying target embeddings into contiguous address space
#1: Embedding lookup (gather)
[Embedding table]
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
![Page 15: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/15.jpg)
15
Copying target embeddings into contiguous address space
#1: Embedding lookup (gather)
[Embedding table]
Movie 7
[Output tensor]
Movie 4
Movie 2
Movie 6
Movie 1
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
“Contiguous address space”
![Page 16: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/16.jpg)
16
e.g., Averaging multiple embeddings, element-wise addition/multiplication
#2: Tensor operation (reduction)
Movie 7
Movie 4
Movie 2
Movie 6
Movie 1
![Page 17: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/17.jpg)
17
e.g., Averaging multiple embeddings, element-wise addition/multiplication
#2: Tensor operation (reduction)
Movie 7
Movie 4
Movie 2
Movie 6
Movie 1+
![Page 18: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/18.jpg)
18
e.g., Averaging multiple embeddings, element-wise addition/multiplication
#2: Tensor operation (reduction)
+
Movie 7
Movie 4
Movie 2
Movie 6
Movie 1
<Average movie embedding>
![Page 19: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/19.jpg)
19
Embedding gathers/reductions are extremely memory-bandwidth sensitive
Key challenges of embedding layers
[Embedding lookup]
Movie 7
Movie 4
Movie 2
Movie 6
Movie 1
Movie 0
Movie 1
Movie 2
Movie 3
Movie 4
Movie 5
Movie 6
Movie 7
…
Movie N
Averaged embedding
[Tensor operation]
+
![Page 20: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/20.jpg)
20
Current Solutions forRecommendation Systems
![Page 21: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/21.jpg)
21
The memory wall for “Inference”
+
+
DNNsReductionEmbedding lookup
Size of embedding tables can reach hundreds of GBs
![Page 22: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/22.jpg)
22
Size of embedding tables can reach hundreds of GBs
The memory wall for “Inference”
+
+
DNNsReductionEmbedding lookup
CPU
Capacity limited
![Page 23: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/23.jpg)
23
Size of embedding tables can reach hundreds of GBs
The memory wall for “Inference”
+
+
DNNsReductionEmbedding lookup
CPU
Capacity limited
![Page 24: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/24.jpg)
24
CPU stores entire embedding tables, but DNNs executed using GPUs
Design#1: Hybrid CPU-GPU approach
+
+
DNNsReductionEmbedding lookup
CPU GPU
Capacity limited Compute limited
![Page 25: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/25.jpg)
25
Challenges: need to copy multiple embeddings via narrow PCIe channel
Design#1: Hybrid CPU-GPU approach
+
+
DNNsReductionEmbedding lookup
CPU GPU
Capacity limited Compute limited
Communication Bottleneck
PCIe
![Page 26: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/26.jpg)
26
The CPU handles the entire steps of inference
Design#2: CPU-only approach
+
+
DNNsReductionEmbedding lookup
CPU
![Page 27: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/27.jpg)
27
Challenges: low throughput of CPUs slows down DNN computation
Design#2: CPU-only approach
+
+
DNNsReductionEmbedding lookup
CPU
Computation Bottleneck
![Page 28: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/28.jpg)
28
Unbuildable, oracular solution assuming infinite GPU memory capacity
Design#3: GPU-only approach?
+
+
DNNsReductionEmbedding lookup
GPU
![Page 29: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/29.jpg)
29
Unbuildable, oracular solution assuming infinite GPU memory capacity
Design#3: GPU-only approach?
+
+
DNNsReductionEmbedding lookup
GPU
No Communications
![Page 30: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/30.jpg)
30
Unbuildable, oracular solution assuming infinite GPU memory capacity
Design#3: GPU-only approach?
+
+
DNNsReductionEmbedding lookup
GPU
No Communications
Fast DNN execution
![Page 31: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/31.jpg)
31
Unbuildable, oracular solution assuming infinite GPU memory capacity
Design#3: GPU-only approach?
+
+
DNNsReductionEmbedding lookup
GPU
No Communications
Higher Bandwidth
Fast DNN execution
![Page 32: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/32.jpg)
32
CPU-only and hybrid CPU-GPU (vs. GPU-only)
Key challenges of existing solutions?
7.3x slowdown
10x slowdown
(Oracular) GPU-only
CPU-only
Hybrid CPU-GPU
Communication Bottleneck
Computation Bottleneck
![Page 33: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/33.jpg)
33
Our Approach: “Near”-Memory Acceleration(so Near-Memory Processing, NMP)
![Page 34: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/34.jpg)
34
Augment buffer device to add NMP cores for embedding gathers/reductions
TensorDIMM: a NMP for embeddings
DRAM
Buffer device
DRAM DRAM DRAM
Vector ALU
DD
R
PH
Y NMP-local Memory Controller
Input Q(A)
Input Q(B)
Output Q(C)
Pro
toco
l E
ng
ine
LocalDRAM
DDRxMemoryChannel
(a) NMP core (b) TensorDIMM
![Page 35: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/35.jpg)
35
Augment buffer device to add NMP cores for embedding gathers/reductions
TensorDIMM: a NMP for embeddings
DRAM
Buffer device
DRAM DRAM DRAM
Vector ALU
DD
R
PH
Y NMP-local Memory Controller
Input Q(A)
Input Q(B)
Output Q(C)
Pro
toco
l E
ng
ine
LocalDRAM
DDRxMemoryChannel
(a) NMP core (b) TensorDIMM
![Page 36: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/36.jpg)
36
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Memory Controller(Memory Channel)
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
ProcessorProcessor
![Page 37: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/37.jpg)
37
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Processor
Memory Controller(Memory Channel)
Processor
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
Embedding gathers/reductionsare done “locally” within a DIMM
![Page 38: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/38.jpg)
38
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Memory Controller(Memory Channel)
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
Memory bandwidth: 100
Memory bandwidth: 100
ProcessorProcessor
![Page 39: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/39.jpg)
39
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Memory Controller(Memory Channel)
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
Memory bandwidth: 100
DRAM
Buffer device
DRAM DRAM DRAM
Memory bandwidth: 200
ProcessorProcessor
![Page 40: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/40.jpg)
40
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Memory Controller(Memory Channel)
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
Memory bandwidth: 100
DRAM
Buffer device
DRAM DRAM DRAM
DRAM
Buffer device
DRAM DRAM DRAM
Memory bandwidth: 300
ProcessorProcessor
![Page 41: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/41.jpg)
41
“Effective” memory bandwidth scales proportional to the # of DIMMs/ranks
Key advantage of TensorDIMM
Memory Controller(Memory Channel)
Memory Controller(Memory Channel)
DRAM
Buffer device
DRAM DRAM DRAM
Current system TensorDIMM approach
Memory bandwidth: 100
DRAM
Buffer device
DRAM DRAM DRAM
DRAM
Buffer device
DRAM DRAM DRAM
DRAM
Buffer device
DRAM DRAM DRAM
Memory bandwidth: 400
ProcessorProcessor
![Page 42: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/42.jpg)
42
Leverage rank-level parallelism for maximal bandwidth utilization
Mapping embedding tables in DRAMs
InputEmbedding0
InputEmbedding1
OutputEmbedding
InputEmbedding2
InputEmbeddingn
Subje
ct for
ele
ment-
wis
e o
pera
tions
Rank 0 Rank 1 … Rank 15
64B 64B … 64B
…… ……
NMP Core
(DIMM0)
NMP Core
(DIMM1)
…NMP Core
(DIMM15)
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
![Page 43: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/43.jpg)
43
Leverage rank-level parallelism for maximal bandwidth utilization
Mapping embedding tables in DRAMs
InputEmbedding0
InputEmbedding1
OutputEmbedding
InputEmbedding2
InputEmbeddingn
Subje
ct for
ele
ment-
wis
e o
pera
tions
Rank 0 Rank 1 … Rank 15
64B 64B … 64B
…… ……
NMP Core
(DIMM0)
NMP Core
(DIMM1)
…NMP Core
(DIMM15)
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
![Page 44: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/44.jpg)
44
Leverage rank-level parallelism for maximal bandwidth utilization
Mapping embedding tables in DRAMs
InputEmbedding0
InputEmbedding1
OutputEmbedding
InputEmbedding2
InputEmbeddingn
Subje
ct for
ele
ment-
wis
e o
pera
tions
Rank 0 Rank 1 … Rank 15
64B 64B … 64B
…… ……
NMP Core
(DIMM0)
NMP Core
(DIMM1)
…NMP Core
(DIMM15)
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
64B 64B … 64B
![Page 45: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/45.jpg)
45
A pooled memory architecture aggregated with multiple TensorDIMMs
Tensor“Node” using TensorDIMMs
TensorNode
TensorDIMM
TensorDIMM
TensorDIMM
TensorDIMM
![Page 46: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/46.jpg)
46
Utilize high-speed links (e.g. NVLINK) for inter-device communication
TensorNode as “remote” memory pools
TensorNode
TensorDIMM
TensorDIMM
TensorDIMM
TensorDIMMNVLINK
(150 GB/sec)
GPUs
![Page 47: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/47.jpg)
47
A platform for scalable expansion of both memory bandwidth and capacity
Putting everything together
Addresses thememory capacity
challenge
Addresses thememory bandwidth
challenge
Addresses thecompute & communication
challenges
![Page 48: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/48.jpg)
48
Evaluation
![Page 49: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/49.jpg)
49
Cycle-level DRAM simulator (Ramulator*)
Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)
Combination of cycle-level simulation and emulation on real ML systems
Evaluation methodology
* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015
![Page 50: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/50.jpg)
50
Cycle-level DRAM simulator (Ramulator*)
Memory bandwidth for embedding gathers/reductions under our address mapping
Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)
Combination of cycle-level simulation and emulation on real ML systems
Evaluation methodology
* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015
![Page 51: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/51.jpg)
51
Effective bandwidth scales proportional to number of ranks (avg 4x↑)
Memory bandwidth utilization
![Page 52: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/52.jpg)
52
Cycle-level DRAM simulator (Ramulator*)
Memory bandwidth for embedding gathers/reductions under our address mapping
Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)
Intel’s Math Kernel Library (MKL)
NVIDIA cuDNN / cuBLAS
In-house CUDA implementation of other layers
NVIDIA DGX-1V
• Eight NVIDIA V100 GPUs
• Two Intel Xeon E5-2698 v4
Combination of cycle-level simulation and emulation on real ML systems
Evaluation methodology
* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015
![Page 53: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/53.jpg)
53
A proof-of-concept software prototype to emulate TensorDIMM
TensorNode system modeling
SM SM SM SM
Crossbar
MC
L2
MC
L2
MC
L2
MC
L2
DRAM DRAM DRAM DRAM
Modern GPU architecture
NMP cores
TensorNode’s NMP local memory channels and the associated memory bandwidth
GPU-sideInterconnect
GPU
Used as anormal GPU
GPU
“Emulated” as a TensorNode
NVLINK(150 GB/sec)
![Page 54: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/54.jpg)
54
Four system design points
CPU-only
Hybrid CPU-GPU
TensorDIMM (ours)
GPU-only (oracle)
A proof-of-concept software prototype to emulate TensorDIMM
Evaluation methodology
![Page 55: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/55.jpg)
55
TensorDIMM helps reduce both embedding/MLP latency
Latency breakdown
![Page 56: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/56.jpg)
56
TensorDIMM helps reduce both embedding/MLP latency
Latency breakdown
![Page 57: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/57.jpg)
57
TensorDIMM helps reduce both embedding/MLP latency
Latency breakdown
![Page 58: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/58.jpg)
58
TensorDIMM achieves overall 6-9x speedup against the baselines
Latency breakdown
![Page 59: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/59.jpg)
59
TensorDIMM:A Near-Memory Processing Architecture for Sparse Embedding Layers
The “first” architectural solution tackling sparse embedding layers
A “practical” near-memory processing solution
for an important AI workload
Average “6~9x” performance improvement on state-of-the-art DNN-based recommendation models
![Page 60: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/60.jpg)
60
Questions?
![Page 61: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/61.jpg)
61
Backup Slides
![Page 62: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/62.jpg)
62
TensorDIMM design overheadsIt’s not free, but adding custom logics within DIMM has been done before
IBM centaur DIMM
![Page 63: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/63.jpg)
63
![Page 64: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/64.jpg)
64
![Page 65: TensorDIMM: A Practical Near-Memory Processing Architecture … · 2019. 11. 11. · Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2 Research Scope. 3 Primarily focused on “dense”](https://reader033.fdocuments.us/reader033/viewer/2022061001/60b049256ba03c77d458a58f/html5/thumbnails/65.jpg)
65