Transcript of "Why DNN Works for Speech and How to Make it More Efficient?"

Page 1

Why DNN Works for Speech and How to Make it More Efficient?

Hui Jiang, Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Canada

Joint work with Y. Bao*, J. Pan*, P. Zhou*, S. Zhang*, O. Abdel-Hamid (* University of Science and Technology of China, Hefei, China)

Page 2

Outline
• Introduction: NN for ASR
• PART I: Why DNN works for ASR
  - DNNs for Bottleneck Features
  - Incoherent Training for DNNs
• PART II: Towards more efficient DNN training
  - DNN with shrinking hidden layers
  - Data-partitioned multi-DNNs
• Summary

Page 3

Neural Network for ASR
• 1990s: MLP for ASR (Bourlard and Morgan, 1994)
  o NN/HMM hybrid model (worse than GMM/HMM)
• 2000s: TANDEM (Hermansky, Ellis, et al., 2000)
  o Use MLP for feature extraction (5-10% rel. gain)
• 2006: DNN for small tasks (Hinton et al., 2006)
  o RBM-based pre-training for DNN
• 2010: DNN for small-scale ASR (Mohamed et al., 2010)
• 2011-now: DNN for large-scale ASR
  o 20-30% rel. gain on Switchboard (Seide et al., 2011)
  o 10-20% rel. gain with sequence training (Kingsbury et al., 2012; Su et al., 2013)
  o 10% rel. gain with CNNs (Abdel-Hamid et al., 2012; Sainath et al., 2013)

Page 4

NN for ASR: old and new
• Deeper networks: more hidden layers (1 → 6-7 layers)
• Wider networks: more hidden nodes; more output nodes (100 → 5-10K)
• More data: 10-20 hours → 2-10K hours of training data

Page 5

GMMs/HMM vs. DNN/HMM
• Different acoustic models
  o GMMs vs. DNN
• Different feature vectors
  o 1 frame vs. concatenated frames (11-15 frames)
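To make the "concatenated frames" idea concrete, here is a minimal numpy sketch of frame splicing with a ±5-frame context window (11 frames in total). The function name and the 39-dimensional toy features are illustrative choices, not from the slides; the 429-dimensional result simply matches 39 x 11.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Concatenate each frame with +/- `context` neighbours (11 frames for context=5).

    feats: (num_frames, feat_dim) array of per-frame features (e.g. PLP/MFCC).
    Returns: (num_frames, feat_dim * (2*context + 1)) array of spliced features.
    Edge frames are handled by repeating the first/last frame.
    """
    num_frames, dim = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)], axis=0)
    spliced = [padded[t:t + num_frames] for t in range(2 * context + 1)]
    return np.concatenate(spliced, axis=1)

# Example: 100 frames of 39-dim features -> 100 x 429 spliced features
# (429 = 39 x 11, the DNN input dimension quoted later in the talk).
feats = np.random.randn(100, 39).astype(np.float32)
print(splice_frames(feats, context=5).shape)  # (100, 429)
```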

Page 6

Experiment (I): GMMs/HMM vs. DNN/HMM
• In-house 70-hour Mandarin ASR task
• GMM: 4000 tied HMM states, 30 Gaussians per state
• DNN: pre-trained; 1024 nodes per layer; 1-6 hidden layers

Numbers are word error rates (%). NN-1: 1 hidden layer; DNN-6: 6 hidden layers; MPE-GMM: discriminatively trained GMM/HMM.

Context window    NN-1    DNN-6    MPE-GMM
1                 18.0    17.0     16.7

Context window    3       5       7       9       11      13
DNN-6             14.2    13.7    13.5    13.5    13.4    13.6

Page 7

Experiment (II): GMMs/HMM vs. DNN/HMM
• 300-hour Switchboard task, Hub5e01 test set
• GMM: 8991 tied HMM states, 40 Gaussians per state
• DNN: pre-trained; 2048 nodes per layer; 1-5 hidden layers

Word error rates (%) on the Hub5e01 test set.
NN-1: 1 hidden layer; DNN-3/5: 3/5 hidden layers; MPE-GMM: discriminatively trained GMM/HMM.

Context window    MPE-GMM    NN-1     DNN-3    DNN-5
1                 32.8%      35.4%    34.8%    33.1%
11                n/a        31.4%    25.6%    23.7%

Page 8

Brief Summary (I)
• The gain of the DNN/HMM hybrid is almost entirely attributable to the concatenated frames.
  o The concatenated features contain almost all of the additional information behind the large gain.
  o But they are highly correlated.
• DNN is powerful at leveraging highly correlated features.

Page 9

What's next
• How about GMM/HMM?
• It is hard to exploit highly correlated features in GMMs.
  o This requires dimensionality reduction for de-correlation.
• Linear dimensionality reduction (PCA, LDA, KLT, ...)
  o Fails to compete with DNN.
• Nonlinear dimensionality reduction
  o Using NN/DNN (Hinton et al.), a.k.a. bottleneck features
  o Manifold learning, LLE, MDS, SNE, ...?

Page 10

Bottleneck (BN) Feature
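A minimal numpy sketch of the bottleneck-feature idea, assuming sigmoid hidden layers: forward the input up to the narrow (bottleneck) layer and use that layer's activations as the feature vector for a GMM/HMM. The layer sizes and the 42-unit bottleneck below are illustrative choices only, and implementations differ on whether the pre-activations or the sigmoid outputs are taken.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_bn_features(x, weights, biases, bn_layer):
    """Forward an input up to the narrow (bottleneck) layer and return its
    activations as the BN feature vector; the layers above the bottleneck
    are only needed while training the network.

    weights/biases: per-layer parameters, ordered input -> hidden_1 -> ...
    bn_layer: index of the bottleneck layer (the narrow one).
    """
    h = x
    for i in range(bn_layer + 1):
        h = sigmoid(weights[i].T @ h + biases[i])
    return h  # low-dimensional, non-linearly de-correlated representation of x

# Toy example: 429-dim spliced input, hidden sizes 1024 - 1024 - 42 (bottleneck).
sizes = [429, 1024, 1024, 42]
rng = np.random.default_rng(0)
W = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]
bn = extract_bn_features(rng.standard_normal(429), W, b, bn_layer=2)
print(bn.shape)  # (42,)
```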

Page 11

Experiments: BN vs. DNN
300-hour English Switchboard Task (WER in %)
MLE: maximum likelihood estimation; MPE: discriminative training
ReFA: realigned class labels; DT: sequence training of DNNs
DT+BN: BN features extracted from sequence-trained BN networks

                              Hub5e01           Hub5e00
Acoustic models (features)    MLE      MPE      MLE      MPE
GMM-HMMs (PLP)                35.4%    32.8%    28.7%    24.7%
GMM-HMMs (BN)                 26.0%    23.2%    17.9%    15.9%
DNN (11 PLPs)                 23.7%             16.7%
DNN (ReFA+DT)                 --                14.0%
GMM-HMMs (DT+BN)              --                16.6%    14.6%

Page 12

Incoherent Training
• Bottleneck (BN) features work, but:
  o BN hurts DNN performance a little
  o Correlation among the BN features goes up
• A "better" idea: embed de-correlation into the back-propagation of DNN training.
  o De-correlate by constraining the column vectors of the weight matrix W
  o How to constrain them?

Page 13

Incoherent Training
• Define the coherence of a DNN weight matrix W as:

    G_W = \max_{i \ne j} g_{ij} = \max_{i \ne j} \frac{|w_i \cdot w_j|}{\|w_i\| \, \|w_j\|}

• Intuition: a smaller coherence value G_W indicates that the column vectors of W are more dissimilar (less correlated).
• Approximate the coherence G_W using soft-max, so that it becomes differentiable.
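As a concrete reading of this definition, the sketch below computes the pairwise cosine similarities between the columns of W, the exact coherence G_W, and a log-sum-exp ("soft-max") relaxation of the max so the quantity becomes differentiable. The beta parameter and the exact form of the relaxation are illustrative assumptions, since the slides do not spell them out.

```python
import numpy as np

def coherence(W, beta=None):
    """Coherence of a weight matrix: the largest absolute cosine similarity
    between any two distinct column vectors of W.

    If beta is given, also return a smooth soft-max approximation
    (1/beta) * log(sum exp(beta * g_ij)) over i != j, which upper-bounds the
    true max and is differentiable, so it can serve as a training penalty.
    """
    cols = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-normalize columns
    G = np.abs(cols.T @ cols)                             # pairwise |cosine| matrix
    off_diag = G[~np.eye(G.shape[0], dtype=bool)]         # drop the trivial g_ii = 1
    g_max = off_diag.max()
    if beta is None:
        return g_max
    soft = np.log(np.exp(beta * off_diag).sum()) / beta   # soft-max relaxation
    return g_max, soft

W = np.random.randn(429, 40)
print(coherence(W, beta=50.0))
```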

Page 14

Incoherent Training
• All DNN weight matrices are optimized by minimizing a regularized objective function:

    F^{(new)} = F^{(old)} + \alpha \cdot \max_W G_W

• The derivatives of the coherence, \partial G_W / \partial w_k, can be derived in closed form.
• Back-propagation is still applicable...
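One way to realize such a regularized objective in a modern autodiff framework is sketched below in PyTorch: a smooth coherence penalty is simply added to the cross-entropy loss, and back-propagation supplies the derivatives with respect to every weight automatically (the talk derives them analytically). The alpha and beta values are illustrative only, not taken from the slides.

```python
import torch

def soft_coherence(W, beta=50.0):
    """Differentiable soft-max approximation of the coherence of weight matrix W."""
    cols = W / W.norm(dim=0, keepdim=True)                 # unit-normalize columns
    G = (cols.t() @ cols).abs()                            # pairwise |cosine| matrix
    mask = ~torch.eye(G.shape[0], dtype=torch.bool, device=G.device)
    return torch.logsumexp(beta * G[mask], dim=0) / beta   # smooth max over i != j

def incoherent_loss(logits, targets, weight_matrices, alpha=0.01):
    """Cross-entropy plus alpha times the largest per-matrix coherence,
    mirroring F_new = F_old + alpha * max_W G_W."""
    ce = torch.nn.functional.cross_entropy(logits, targets)
    penalty = torch.stack([soft_coherence(W) for W in weight_matrices]).max()
    return ce + alpha * penalty
```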

Page 15

Incoherent Training: De-correlation

Applying incoherent training to one weight matrix in BN

Page 16

Incoherent Training: Data-driven
• If incoherent training is applied to only one weight matrix W, consider the linear outputs of that layer and their covariance:

    Y = W^T X + b,    C_Y = W^T C_X W

• Directly measure the correlation coefficients based on the above covariance matrix:

    G_W = \max_{i \ne j} g_{ij},    with g_{ij} = \frac{|[C_Y]_{ij}|}{\sqrt{[C_Y]_{ii}\,[C_Y]_{jj}}}

  where C_X is estimated from each mini-batch of data.
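A minimal numpy sketch of this data-driven variant, assuming C_X is estimated directly from the current mini-batch; the correlation-coefficient formula is the standard one implied above.

```python
import numpy as np

def batch_coherence(W, X_batch):
    """Data-driven coherence: correlation coefficients of the linear outputs
    Y = W^T X + b, computed via the mini-batch covariance C_Y = W^T C_X W.
    (The bias b does not affect the covariance, so it is omitted.)
    """
    Cx = np.cov(X_batch, rowvar=False)        # C_X estimated from the mini-batch
    Cy = W.T @ Cx @ W                         # covariance of the layer outputs
    d = np.sqrt(np.diag(Cy))
    corr = np.abs(Cy) / np.outer(d, d)        # correlation coefficients g_ij
    np.fill_diagonal(corr, 0.0)               # ignore the trivial g_ii
    return corr.max()                         # G_W = max_{i != j} g_ij

W = np.random.randn(429, 40)
X_batch = np.random.randn(1024, 429)          # one mini-batch of spliced features
print(batch_coherence(W, X_batch))
```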

Page 17

Incoherent Training: Data-driven
• After the soft-max approximation, the derivatives \partial G_W / \partial w_k can again be computed in closed form.
• Back-propagation still applies, except that C_X is re-computed from each mini-batch.

Page 18

Incoherent Training: Data-driven De-correlation

When applying Incoherent Training to one weight matrix

Page 19

Experiment: Incoherent Training
300-hour English Switchboard Task, WER (%) on Hub5e01
DNN: 6 hidden layers of 2048 nodes
BN: extracted from a bottleneck DNN with 5 hidden layers

BN feature / model               MLE      MPE
DNN                              23.7%
Baseline BN                      26.0%    23.2%
Weight-matrix incoherent BN      25.7%    --
Mini-batch-data incoherent BN    25.6%    22.8%

Page 20

Brief Summary (II)
• It is promising to use a DNN as a feature extractor for the traditional GMM/HMM framework.
• It is beneficial to de-correlate BN features using the proposed incoherent training.
• Benefits over the hybrid DNN/HMM:
  o Slightly better or similar performance
  o Works with other ASR techniques (adaptation, ...)
  o Faster training process
  o Faster decoding process

Page 21

Outline
• Introduction: NN for ASR
• PART I: Why DNN works for ASR
  - DNNs for Bottleneck Features
  - Incoherent Training for DNNs
• PART II: Towards more efficient DNN training
  - DNN with shrinking hidden layers
  - Data-partitioned multi-DNNs
• Summary

Page 22

Towards Faster DNN Training
• DNN training is extremely slow...
  o It takes weeks to months to train large DNNs for ASR
• How to make it more efficient?
  o Simplify the DNN model structure
  o Use training algorithms with faster convergence (than SGD)
  o Use parallel training methods with many CPUs/GPUs

Page 23

Simplify DNN: Exploring Sparseness
• Sparse DNNs (Yu et al., 2012): zeroing 80% of DNN weights leads to no performance loss.
  o Smaller model footprint but no gain in speed
• How to exploit sparseness for speed:
  o DNN weight-matrix low-rank factorization (Sainath et al., 2013; Xue et al., 2013)
  o DNN with shrinking hidden layers (to be submitted to ICASSP'14)

Page 24

Weight Matrix Factorization
• Factor each large weight matrix W into a product of two low-rank matrices, W ≈ A·B.
• IBM (Sainath et al., 2013): 30-50% smaller model size, 30-50% speedup.
• Microsoft (Xue et al., 2013): 80% smaller model size, >50% speedup.
• Our investigation shows a 50% speedup in training and 40% in testing.
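A hedged sketch of the low-rank idea using a truncated SVD. The rank of 512 is an illustrative choice, and the published systems differ in how they obtain the factors (training the factorized layers directly vs. taking an SVD of a trained matrix and then fine-tuning).

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate a weight matrix W (m x n) by the product A @ B of an
    (m x rank) and a (rank x n) matrix via a truncated SVD, so the layer
    costs (m + n) * rank multiply-adds instead of m * n.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (m, rank)
    B = Vt[:rank, :]                    # (rank, n)
    return A, B

# For a 2048 x 8991 output layer, a rank-512 factorization keeps
# (2048 + 8991) * 512 ≈ 5.7M weights instead of 2048 * 8991 ≈ 18.4M (about 31%).
W = np.random.randn(512, 2048).astype(np.float32)   # small toy matrix for the demo
A, B = low_rank_factorize(W, rank=128)
print(A.shape, B.shape, np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```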

Page 25

DNN: Shrinking Hidden Layers
• Significantly reduces the number of weights, particularly in the output layer.
• Leads to faster matrix multiplication.

(Figure: input x feeds a stack of sigmoid hidden layers of decreasing width, topped by the softmax output layer y.)

Page 26

Experiments: Shrinking DNNs
• Switchboard task: 300-hour training data, Hub5e00 test set
• Cross-entropy training (10 epochs of BP; mini-batch size 1024)

Baseline DNN (6 hidden layers): 429-2048*6-8991
Shrinking DNN I (sDNN-I): 429-2048*2+1024*2+512*2-8991
Shrinking DNN II (sDNN-II): 429-2048-1792-1536-1280-1024-768-8991

Model               WER      Size          Training time* (speedup)
DNN                 16.3%    41M           10h
DNN (+pre-train)    16.1%    41M           ≈20h (x0.5)
sDNN-I              16.7%    13M (32%)     4.5h (x2.2)
sDNN-II             16.5%    19M (48%)     5.5h (x1.8)

* Average training time (plus pre-training) per epoch using one GTX 670
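To see where the size reductions in the table come from, the short sketch below counts the weights implied by the layer-size strings above (biases ignored); the totals line up with the 41M / 13M / 19M figures.

```python
# Rough weight counts (biases ignored) for the layer-size strings above.
def num_weights(sizes):
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

baseline = [429] + [2048] * 6 + [8991]
sdnn_1   = [429] + [2048] * 2 + [1024] * 2 + [512] * 2 + [8991]
sdnn_2   = [429, 2048, 1792, 1536, 1280, 1024, 768, 8991]

for name, sizes in [("DNN", baseline), ("sDNN-I", sdnn_1), ("sDNN-II", sdnn_2)]:
    print(f"{name:8s} {num_weights(sizes) / 1e6:5.1f}M weights")
# DNN 40.3M, sDNN-I 13.6M, sDNN-II 18.3M -- consistent with the table.
```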

Page 27

Parallel Training of DNNs
• If we can't make training itself faster, why not parallelize it?
• Stochastic gradient descent (SGD) is hard to parallelize, while second-order optimization (e.g., Hessian-free) has much higher complexity.
• GPUs help but are still not enough: it takes weeks to run SGD to train large DNNs for ASR.
• GOAL: parallel training of DNNs using multiple GPUs
  o Parallelized (or asynchronous) SGD is not optimal for GPUs.
  o Pipelined BP (Chen et al., 2012): requires data traffic among GPUs.

Page 28

Data-Partitioned Multi-DNNs

Traditional DNN

Page 29

Data-Partitioned Multi-DNNs

(Figure: a traditional DNN estimates the state posteriors Pr(s_j | X) directly from the input X, while the multi-DNNs split this computation across several smaller networks.)

Page 30

Data Partition
• Unsupervisedly cluster the training data into several subsets that have disjoint class labels
• Train a DNN on each subset to distinguish the labels within that cluster
• Train another top-level DNN to distinguish the different clusters

Page 31

Multi-DNNs: Merging Probabilities
• The state posteriors Pr(s_j | X) of a traditional DNN are recovered from the multi-DNN outputs as:

    Pr(s_j | X) = Pr(c_i | X) \cdot Pr(s_j | c_i, X),    with s_j \in c_i

  where Pr(c_i | X) comes from the top-level DNN and Pr(s_j | c_i, X) from the DNN trained on cluster c_i.
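A minimal numpy sketch of this merging step; the state_to_cluster mapping used for bookkeeping is an illustrative structure of my own, not something specified in the talk.

```python
import numpy as np

def merge_posteriors(cluster_post, state_post_per_cluster, state_to_cluster):
    """Combine the top-level DNN output Pr(c_i | X) with the per-cluster DNN
    outputs Pr(s_j | c_i, X) into full state posteriors Pr(s_j | X).

    cluster_post: (num_clusters,) posteriors from the top-level DNN.
    state_post_per_cluster: list of arrays, one per cluster, over its own states.
    state_to_cluster: (cluster_id, index_within_cluster) for each global state s_j.
    """
    post = np.empty(len(state_to_cluster))
    for j, (c, k) in enumerate(state_to_cluster):
        post[j] = cluster_post[c] * state_post_per_cluster[c][k]   # Pr(c|X) * Pr(s|c,X)
    return post

# Toy example with 2 clusters covering 3 + 2 states; the result sums to 1.
cluster_post = np.array([0.7, 0.3])
state_post_per_cluster = [np.array([0.5, 0.3, 0.2]), np.array([0.9, 0.1])]
state_to_cluster = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
print(merge_posteriors(cluster_post, state_post_per_cluster, state_to_cluster).sum())  # 1.0
```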

Page 32

Data-driven Clustering
• Iterative GMM-based bottom-up clustering of the data from all classes (similar to conventional speaker clustering)

Page 33

Model Parallelization
• Advantages of Multi-DNNs:
  o Easy to parallelize
  o Each DNN is learned from only one subset of the data
• For the DNN of cluster C_m, the output-layer error signal involves only the classes inside C_m:

    e_k = \frac{\partial E}{\partial a_k} =
      \begin{cases} 0, & k \notin C_m \\ y_k - t_k, & k \in C_m \end{cases}
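A tiny numpy sketch of this masked output-layer error signal, assuming a softmax output with cross-entropy so that the per-class error is y_k − t_k; the masking makes explicit that the gradient never touches classes belonging to other clusters.

```python
import numpy as np

def output_error(y, t, cluster_mask):
    """Output-layer error for the DNN of cluster C_m: e_k = y_k - t_k for
    classes inside the cluster, and 0 elsewhere."""
    e = np.zeros_like(y)
    e[cluster_mask] = y[cluster_mask] - t[cluster_mask]
    return e

y = np.array([0.1, 0.2, 0.3, 0.4])           # softmax outputs over all states
t = np.array([0.0, 1.0, 0.0, 0.0])           # one-hot target
mask = np.array([True, True, False, False])  # classes belonging to cluster C_m
print(output_error(y, t, mask))              # [ 0.1 -0.8  0.   0. ]
```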

Page 34

Parallel Training Scheme

(Figure: data clusters C_1, C_2, ..., C_n feed their own networks DNN_1, DNN_2, ..., DNN_n on separate GPUs, alongside the top-level DNN_0.)

• ZERO communication of data across GPUs

Page 35

Experiment: Multi-DNNs
• Switchboard training data (300 hours, 8991 classes)
• Test sets: Hub5e01 and Hub5e00
• GMM-based bottom-up clustering groups the training data into 4 subsets to train 4 DNNs in parallel

                  C1       C2       C3      C4
Num. of states    2450     3661     20      2860
Data %            45.0%    25.3%    9.3%    20.4%

Page 36

Experiments: Multi-DNNs
Baseline DNN: 4-6 hidden layers of 2048 nodes
Multi-DNNs: DNN1-4 have 6 hidden layers of 2048 nodes; DNN0 has 3 hidden layers of 2048 nodes
Switchboard ASR: WER (%) on the Hub5e01 set and training time per epoch using 3 GPUs

Hidden layers                 4        5        6
Baseline DNN     WER          24.4%    23.7%    23.6%
                 Time (hr)    13.0h    13.5h    15.1h
Multi-DNNs       WER          24.5%    24.2%    23.8%
(3 GPUs)         Time (hr)    3.7h     4.3h     4.9h
                 Speed-up     x3.5     x3.1     x3.1

Page 37

Experiments: Multi-DNNs
Baseline DNN: 4-6 hidden layers of 2048 nodes
Multi-DNNs: DNN1-4 have 6 hidden layers of 2048 nodes; DNN0 has 3 hidden layers of 2048 nodes
Switchboard ASR: WER (%) on the Hub5e00 set and training time per epoch using 3 GPUs

Hidden layers                 4        5        6
Baseline DNN     WER          17.0%    16.7%    16.2%
                 Time (hr)    13.0h    13.5h    15.1h
Multi-DNNs       WER          17.0%    16.9%    16.7%
(3 GPUs)         Time (hr)    3.7h     4.3h     4.9h
                 Speed-up     x3.5     x3.1     x3.1

Page 38

Experiments: Smaller Multi-DNNs
Baseline DNN: 4-6 hidden layers of 2048 nodes
Multi-DNNs: DNN1-4 have 6 hidden layers of 1024 nodes; DNN0 has 3 hidden layers of 1024 nodes
Switchboard ASR: WER (%) on the Hub5e00 set and training time per epoch using 3 GPUs

Hidden layers                 4        5        6
Baseline DNN     WER          17.0%    16.7%    16.2%
                 Time (hr)    13.0h    13.5h    15.1h
Multi-DNNs       WER          17.8%    17.6%    17.4%
(3 GPUs)         Time (hr)    1.8h     2.1h     2.3h
                 Speed-up     x7.0     x6.6     x6.5

With realigned labels (ReFA), 6 hidden layers: baseline DNN 15.9%, smaller Multi-DNNs 16.4%

Page 39

More Clusters for Better Speedup
• Clustering the SWB training data (300 hours, 8991 classes) into 10 clusters

            NN0     DNN1    DNN2    DNN3   DNN4    DNN5    DNN6    DNN7    DNN8    DNN9   DNN10
Data %      100%    9.7%    7.6%    5.9%   9.3%    8.9%    6.8%    16.8%   13.6%   8.5%   12.9%
Classes     10      1258    1162    993    1232    1042    1127    40      357     926    854

Page 40

More Clusters for Faster Speed
• Baseline: single DNN with 2048 nodes per hidden layer
• Multi-DNNs: 1200 hidden nodes per layer for DNN1-10; NN0 is 4 x 2048
Switchboard ASR: WER (%) on the Hub5e00 set and training time per epoch using 10 GPUs

Hidden layers                 3        4        5        6
Baseline         WER          17.8%    17.0%    16.9%    16.2%
                 Time (h)     11.0h    13.0h    13.5h    15.0h
Multi-DNNs       WER          17.7%    17.7%    17.4%    17.4%
(10 GPUs)        Time (h)     0.6h     0.7h     0.8h     0.9h
                 Speedup      x18.5    x18.9    x16.6    x16.3

Page 41

Sequence Training of Multi-DNNs
• SGD-based sequence training of DNNs using GPUs
  o H-criterion for smoothing MMI (Su et al., 2013)
  o Implementing BP/SGD and lattice computation on GPU(s)
• For each mini-batch (utterances and word graphs):
  ① DNN forward pass (parallel on 1 vs. N GPUs)
  ② State occupancies for all arcs (parallel on 1 vs. N GPUs)
  ③ Process lattices for arc posterior probabilities (CPU vs. 1 GPU)
  ④ Sum state statistics over all arcs (parallel on 1 vs. N GPUs)
  ⑤ DNN back-propagation pass (parallel on 1 vs. N GPUs)

Page 42

Process word graphs with GPU

• Sort all arcs based on starting time

Page 43

Process word graphs with GPU

•  Find splitting nodes in the sorted list based on max starting time and min ending time of arcs.

Page 44

Process word graphs with GPU

• Split all arcs into subsets for different CUDA launches.
• In each launch, arcs run the forward-backward computation in parallel.
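A rough CPU-side sketch of one plausible reading of this arc-batching step, under the assumption that an arc can only depend on arcs that end no later than it starts; arcs placed in the same batch are then mutually independent and could be handled by a single parallel CUDA launch.

```python
def batch_arcs(arcs):
    """arcs: list of (start_time, end_time) tuples. Returns batches of arcs that
    can be processed in parallel, in dependency order."""
    arcs = sorted(arcs, key=lambda a: a[0])        # sort all arcs by starting time
    batches, current, min_end = [], [], float("inf")
    for start, end in arcs:
        if start >= min_end:                       # this arc may depend on an arc
            batches.append(current)                #   already in the batch -> split
            current, min_end = [], float("inf")
        current.append((start, end))
        min_end = min(min_end, end)
    if current:
        batches.append(current)
    return batches

print(batch_arcs([(0, 3), (0, 2), (2, 5), (3, 6), (5, 8)]))
# -> [[(0, 3), (0, 2)], [(2, 5), (3, 6)], [(5, 8)]]
```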

Page 45

Experiment: Sequence Training of 4-cluster Multi-DNNs
Baseline DNN: 6 hidden layers of 2048 nodes
Multi-DNNs: DNN1-4 have 6 hidden layers of 1024 or 1200 nodes; DNN0 has 3 hidden layers of 1200 nodes

                        CE (FA)    DT        Speedup (3 GPUs)
Baseline DNN            15.9%      14.2%     --
Multi-DNNs 6x1024       16.4%      15.4%     x4.1**
Multi-DNNs 6x1200       16.1%      15.2%*    x3.6**

CE (FA): 10 epochs of CE training using realigned labels
DT: CE (FA) plus one iteration of sequence training
* Mismatched lattices; ** based on simulation estimates

Page 46

Final Remarks
• DNN pain: it is extremely time-consuming to train DNNs.
• It is critical to expedite DNN training for big data sets.
• DNN training can be largely accelerated by:
  - Simplifying the model structure by exploiting sparseness
  - Employing parallel training with multiple GPUs