
Transcript of EN. 601.467/667 Introduction to Human Language Technology: Deep Learning II

Page 1

EN. 601.467/667 Introduction to Human Language Technology: Deep Learning II

Shinji Watanabe

Page 2

Today’s agenda

• Basics of (deep) neural network
• How to integrate DNN with HMM
• Recurrent neural network

Page 3

Speech recognition pipeline

[Figure: the speech recognition pipeline. Speech → feature extraction → acoustic modeling (HMM), $p(O|L)$ → lexicon, $p(L|W)$ (e.g., G OW T UW, G OW Z T UW) → language modeling, $p(W)$, which ranks hypotheses such as "go to", "go two", "go too", "goes to", "goes two", "goes too" → "I want to go to Johns Hopkins campus".]

Page 4

Feed-forward neural network for acoustic model

• Basic problem: $p(s_t|\mathbf{o}_t)$
  • $s_t$: HMM state or phoneme
  • $\mathbf{o}_t$: speech feature vector
  • $t$: data sample (frame) index
• Configurations
  • Input features (context expansion)
  • Output classes (softmax function)
  • Training criterion
  • Number of layers
  • Number of hidden units
  • Type of non-linear activations

[Figure: feed-forward DNN acoustic model. Input speech features: log mel filterbank + 11 context frames; ~7 hidden layers, 2048 units each; output HMM states (a, i, u, w, N, …), 30 ~ 10,000 units.]

Page 5: EN. 601.467/667 Introductionto Human Language Technology ... · •In DNN, we don’t have to care J •We can simply concatenate the left and right contexts, and just throw it! 5.

Input feature

• GMM/HMM formulation
  • Makes many conditional independence and Markov assumptions
  • Much of our work consists of trying to break these assumptions
• In GMMs, we always have to care about feature correlations
  • Delta features, linear discriminant analysis, semi-tied covariance

• In DNNs, we don't have to care :)
  • We can simply concatenate the left and right contexts and just throw it all in! (see the splicing sketch below)
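A minimal numpy sketch of this context expansion (frame splicing); the function name, the 40-dimensional filterbank, and the ±5-frame window are illustrative assumptions:

```python
import numpy as np

def splice(feats, left=5, right=5):
    """Concatenate each frame with its left/right context frames.

    feats: (T, D) per-frame features (e.g., log mel filterbank).
    Returns (T, D * (left + 1 + right)); edges are padded by
    repeating the first/last frame.
    """
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)], axis=0)
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)], axis=0)

o = np.random.randn(100, 40)   # 100 frames of 40-dim features
x = splice(o)                  # shape (100, 440): frames t-5 ... t+5 concatenated
```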

Page 6

Output

• Phoneme or HMM state IDs are used
• We need paired output/input data at each frame $t$
  • First use the Viterbi alignment to obtain the state sequence
  • Then we get the input/output pairs
• Cast acoustic modeling as a multiclass classification problem: predict the HMM state ID given the observation
• No constraints are considered at this stage (e.g., the left-to-right topology, which is handled by the HMM during recognition)

Page 7

Feed-forward neural networks

• Affine transformation and non-linear activation function (sigmoid function)

• Apply the above transformation $L$ times

• Softmax operation to get the probability distribution (see the sketch below)
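A minimal numpy sketch of this stack; the layer sizes (440-dimensional spliced input, two 2048-unit hidden layers, 3,000 output states) are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max()              # numerical stability
    e = np.exp(a)
    return e / e.sum()

def forward(x, weights, biases):
    """Affine + sigmoid applied (L-1) times, then affine + softmax."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    return softmax(weights[-1] @ h + biases[-1])

sizes = [440, 2048, 2048, 3000]
Ws = [np.random.randn(m, n) * 0.01 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
p = forward(np.random.randn(440), Ws, bs)   # p(s | o_t); sums to 1
```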

Page 8

Linear operation

• Transforms a $D^{(l-1)}$-dimensional input into a $D^{(l)}$-dimensional output:
  $f(\mathbf{h}^{(l-1)}) = \mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}$
  • $\mathbf{W}^{(l)} \in \mathbb{R}^{D^{(l)} \times D^{(l-1)}}$: linear transformation matrix
  • $\mathbf{b}^{(l)} \in \mathbb{R}^{D^{(l)}}$: bias vector
• Derivatives
  • $\frac{\partial(\sum_j w_{ij} h_j + b_i)}{\partial b_{i'}} = \delta(i, i')$
  • $\frac{\partial(\sum_j w_{ij} h_j + b_i)}{\partial w_{i'j'}} = \delta(i, i')\, h_{j'}$
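A sketch of how these elementwise derivatives are assembled into the usual vectorized backward rules for the affine layer (the helper name is illustrative; `g` is the gradient arriving from the layer above):

```python
import numpy as np

# Backward pass of f(h) = W @ h + b. Summing the delta(i, i') and
# delta(i, i') h_j' terms above against the upstream gradient g gives:
def linear_backward(W, h, g):
    dW = np.outer(g, h)   # dL/dW[i, j] = g[i] * h[j]
    db = g.copy()         # dL/db[i]    = g[i]
    dh = W.T @ g          # gradient passed down to the layer below
    return dW, db, dh
```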

Page 9

Sigmoid function

• Sigmoid function: $\sigma(x) = 1/(1 + e^{-x})$
  • Converts the domain from $\mathbb{R}$ to $[0, 1]$
  • Applied elementwise
  • No trainable parameters in general
• Derivative
  • $\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))$

Page 10

Softmax function

• Softmax function
  • Converts the domain from $\mathbb{R}^J$ to $[0, 1]^J$ (makes a multinomial distribution → classification)
  • Satisfies the sum-to-one condition, i.e., $\sum_{j=1}^{J} p(j|\mathbf{h}) = 1$
  • $J = 2$: reduces to the sigmoid function
• Derivative (w.r.t. the pre-softmax activation $a_i$)
  • For $i = j$: $\frac{\partial p(j|\mathbf{h})}{\partial a_i} = p(j|\mathbf{h})(1 - p(j|\mathbf{h}))$
  • For $i \neq j$: $\frac{\partial p(j|\mathbf{h})}{\partial a_i} = -p(i|\mathbf{h})\,p(j|\mathbf{h})$
  • Or, compactly: $\frac{\partial p(j|\mathbf{h})}{\partial a_i} = p(j|\mathbf{h})(\delta(i, j) - p(i|\mathbf{h}))$, where $\delta(i, j)$ is the Kronecker delta
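A quick finite-difference check of the compact form (a sketch, not from the slides):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.random.randn(5)
p = softmax(a)
# Analytical Jacobian: dp_j/da_i = p_j (delta(i, j) - p_i)
J_analytic = np.diag(p) - np.outer(p, p)
# Numerical Jacobian by central differences; column i is dp/da_i
eps = 1e-6
J_numeric = np.stack(
    [(softmax(a + eps * np.eye(5)[i]) - softmax(a - eps * np.eye(5)[i])) / (2 * eps)
     for i in range(5)], axis=1)
assert np.allclose(J_analytic, J_numeric, atol=1e-6)
```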

Page 11

Which functions/operations can we use, and which can we not?

• Most elementary functions
  • $+, -, \times, \div$, $\log()$, $\exp()$, $\sin()$, $\cos()$, $\tan()$
• Functions/operations whose derivative we cannot take, including some discrete operations
  • $\operatorname{argmax}_W p(W|O)$: the basic ASR decoding operation, but we cannot take its derivative…
  • Discretization

Page 12

Objective function design

• We usually use the cross entropy as the objective function

• Since the Viterbi alignment is a hard assignment, the summation over states simplifies to $-\log p(s_t|\mathbf{o}_t)$ at the aligned state
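A one-line numpy version of this objective, a sketch where `probs` are the softmax outputs and `targets` the Viterbi state IDs:

```python
import numpy as np

# With hard (one-hot) Viterbi targets, the cross entropy collapses to
# -log p(s_t | o_t) at the aligned state, averaged over frames.
def cross_entropy(probs, targets):
    """probs: (T, J) softmax outputs; targets: (T,) aligned state IDs."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))
```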

Page 13

Other objective functions

• Squared error: $\|\mathbf{h}^{\text{ref}} - \mathbf{h}\|^2$
  • We could also use other $p$-norms, e.g., the L1 norm
• Binary cross entropy
  • Again, this is a special case of the cross entropy when the number of classes is two

Page 14

Building blocks

[Figure: the network as building blocks. Input $\mathbf{o}_t \in \mathbb{R}^D$ → linear transformation → sigmoid activation → … → softmax activation → output $s_t \in \{1, \ldots, J\}$, plus elementary operations ($+$, $-$, $\exp()$, $\log()$, etc.).]

Page 15

Building blocks

[Figure: the simplest variant: input $\mathbf{o}_t \in \mathbb{R}^D$ → linear transformation → softmax activation → output $s_t \in \{1, \ldots, J\}$.]

Page 16

Building blocks

[Figure: a deeper variant: input $\mathbf{o}_t \in \mathbb{R}^D$ → linear transformation → sigmoid activation → linear transformation → softmax activation → output $s_t \in \{1, \ldots, J\}$.]

Page 17

Building blocks

[Figure: repeats the block diagram of slide 14.]

Page 18

Building blocks

[Figure: repeats the block diagram of slide 16.]

Page 19

How to optimize? Gradient descent and its variants

• Take the derivative and update the parameters with it (see the sketch below)

• Chain rule

• Learning rate $\rho$
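A minimal sketch of one gradient-descent update, assuming parameters and gradients stored as lists of numpy arrays:

```python
# theta <- theta - rho * dL/dtheta for every parameter in the model.
def sgd_step(params, grads, rho=0.1):
    for theta, g in zip(params, grads):
        theta -= rho * g    # in-place numpy update
```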

Page 20

Deep neural network: nested function

• Chain rule to get the derivative recursively
• Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives
• We can obtain the derivative with the back-propagation algorithm

[Figure: the nested function linear → sigmoid → linear → softmax, with derivatives combined backward through each block.]
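A minimal numpy sketch of back propagation through this nested function for one frame, with a single hidden layer and the cross-entropy objective of slide 12 (all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_backward(x, target, W1, b1, W2, b2):
    """softmax(W2 @ sigmoid(W1 @ x + b1) + b2) with loss -log p(target)."""
    h = sigmoid(W1 @ x + b1)
    a = W2 @ h + b2
    p = np.exp(a - a.max())
    p /= p.sum()
    loss = -np.log(p[target])
    da = p.copy()
    da[target] -= 1.0                  # softmax + cross entropy: p - onehot
    dW2, db2 = np.outer(da, h), da     # affine-layer derivatives (slide 8)
    dh = W2.T @ da                     # chain rule into the hidden layer
    dz = dh * h * (1.0 - h)            # sigmoid derivative (slide 9)
    dW1, db1 = np.outer(dz, x), dz
    return loss, (dW1, db1, dW2, db2)
```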

Page 21

Minibatch processing

• Batch processing
  • Slow convergence
  • Efficient computation
• Online processing
  • Fast convergence
  • Very inefficient computation
• Minibatch processing
  • Something between batch and online processing (see the sketch below)

[Figure: the whole data (batch) is split into minibatches; the parameters are updated after each minibatch, $\Theta^{\text{new}} \leftarrow \Theta^{\text{old}}$, repeatedly over minibatches 1, 2, 3, ….]
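A minimal minibatch iterator, a sketch assuming in-memory numpy arrays `X` (features) and `y` (targets):

```python
import numpy as np

def minibatches(X, y, batch_size=256, rng=np.random):
    """Shuffle once per epoch, then yield fixed-size slices."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]
```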

Page 22

How to set 𝜌?

$\Theta^{(\tau+1)} = \Theta^{(\tau)} - \rho \cdot \Delta_{\text{grad}}^{(\tau)}$

• Stochastic Gradient Descent (SGD)
  • Use a constant value (a hyperparameter)
  • Can apply some heuristic tuning (e.g., $\rho \leftarrow 0.5 \times \rho$ when the validation loss starts to degrade; the decay factor then becomes another hyperparameter)
• Adam, AdaDelta, RMSProp, etc.
  • Use current and previous gradient information $(\Delta_{\text{grad}}^{(\tau)}, \Delta_{\text{grad}}^{(\tau-1)}, \ldots)$ to adaptively update $\rho$
  • Still have hyperparameters that balance current and previous gradient information
• The choice of an appropriate optimizer and its hyperparameters is critical
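A sketch of the halving heuristic mentioned above; the function and variable names are illustrative:

```python
# Halve the learning rate when the validation loss starts to degrade
# (the decay factor 0.5 is itself a hyperparameter, as noted above).
def update_rho(rho, val_loss, best_loss, decay=0.5):
    if val_loss >= best_loss:     # validation started to degrade
        rho *= decay
    return rho, min(val_loss, best_loss)
```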

Page 23

Today’s agenda

• Basics of (deep) neural network
• How to integrate DNN with HMM
• Recurrent neural network

Page 24

How to integrate DNN with HMM

• Bottleneck feature
• DNN/HMM hybrid

Page 25

Bottleneck feature

• Train a DNN, but with one narrow (bottleneck) layer
• Use that hidden-layer vector as the feature for a GMM/HMM
• Nonlinear feature extraction with discriminative ability
• Can be combined with an existing GMM/HMM system

[Figure: DNN with a narrow bottleneck layer. Input speech features: log mel filterbank + 11 context frames; output HMM states (a, i, u, w, N, …), 1,000 ~ 10,000 units; the bottleneck layer's activations are extracted as features.]

Page 26

DNN/HMM hybrid

• How do we make it fit the HMM framework?
  • Use Bayes' rule to convert the posterior into a (scaled) likelihood: $p(\mathbf{o}_t|s_t) \propto p(s_t|\mathbf{o}_t)\,/\,p(s_t)$
  • The state prior $p(s_t)$ is obtained by maximum likelihood (unigram counts over the alignment)
• Needs a modification of the Viterbi algorithm during recognition
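A sketch of the resulting "scaled likelihood" computation; the function name and floor value are illustrative:

```python
import numpy as np

def scaled_log_likelihood(posteriors, alignments, num_states, floor=1e-8):
    """Divide DNN posteriors by the state prior (Bayes' rule, dropping p(o_t)).

    posteriors: (T, J) DNN outputs p(s_t | o_t);
    alignments: training-set state IDs used to count the unigram prior.
    Returns log[p(s_t | o_t) / p(s_t)], usable in place of GMM
    log-likelihoods inside Viterbi decoding.
    """
    counts = np.bincount(alignments, minlength=num_states) + floor
    prior = counts / counts.sum()
    return np.log(posteriors + floor) - np.log(prior)
```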

Page 27

Today’s agenda

• Basics of (deep) neural network
• How to integrate DNN with HMM
• Recurrent neural network

Page 28

Recurrent neural network

• Basic problem
  • HMM states (or phonemes) and speech features form sequences:
    $s_1, s_2, \ldots, s_T$ and $\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T$
  • It is better to consider context (e.g., previous inputs) to predict the probability of $s_t$:
    $p(s_t|\mathbf{o}_t) \rightarrow p(s_t|\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t)$
• A recurrent neural network (RNN) can handle such problems

Page 29

Recurrent neural network (Elman type)

• Vanilla RNN: We ignore the bias term for simplicity

[Figure: recurrent cell with input $\mathbf{x}_t$, output $\mathbf{y}_t$, and feedback connection $\mathbf{y}_{t-1}$.]

$\mathbf{y}_t = \sigma\!\left(\mathbf{W}\begin{bmatrix}\mathbf{y}_{t-1}\\ \mathbf{x}_t\end{bmatrix}\right)$
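A minimal numpy sketch of this recurrence and its unrolling (dimensions illustrative):

```python
import numpy as np

# Elman step from the slide, bias omitted for simplicity:
# y_t = sigmoid(W @ concat(y_{t-1}, x_t))
def rnn_step(W, y_prev, x):
    z = W @ np.concatenate([y_prev, x])
    return 1.0 / (1.0 + np.exp(-z))

D_x, D_y, T = 40, 128, 6
W = np.random.randn(D_y, D_y + D_x) * 0.1
y = np.zeros(D_y)
for x in np.random.randn(T, D_x):
    y = rnn_step(W, y, x)   # y now depends on x_1 ... x_t
```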

Page 30

Recurrent neural network (Elman type)

• Vanilla RNN

[Figure: the RNN unrolled over time; each step takes $\mathbf{x}_t$ and the previous output $\mathbf{y}_{t-1}$ and produces $\mathbf{y}_t$.]

Page 31

Recurrent neural network (Elman type)

• Vanilla RNN

[Figure: the unrolled RNN, highlighting the recurrence chain.]

• Can possibly capture long-range effects (but the longer the range, the weaker the effect); there is no future context

Page 32

Recurrent neural network (Elman type)

• Vanilla RNN

[Figure: the unrolled RNN with a softmax activation on top of each output.]

• We can compute the posterior distribution $p(s_t|\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t)$, e.g., $p(s_T|x_1, x_2, \ldots, x_T)$ at the last step

Page 33

Bidirectional RNN

[Figure: an RNN running backward (right to left) over the sequence.]

• We can compute the posterior distribution $p(s_t|\mathbf{o}_t, \mathbf{o}_{t+1}, \ldots)$

Page 34

Bidirectional RNN

[Figure: forward and backward RNNs over the same inputs, combined per frame.]

• We can compute the posterior distribution $p(s_t|\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t, \mathbf{o}_{t+1}, \ldots)$

$\overrightarrow{\mathbf{y}}_t = \sigma\!\left(\mathbf{W}^{(\text{fwd})}\begin{bmatrix}\overrightarrow{\mathbf{y}}_{t-1}\\ \mathbf{x}_t\end{bmatrix}\right), \qquad \overleftarrow{\mathbf{y}}_t = \sigma\!\left(\mathbf{W}^{(\text{bwd})}\begin{bmatrix}\overleftarrow{\mathbf{y}}_{t+1}\\ \mathbf{x}_t\end{bmatrix}\right)$
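A sketch of the bidirectional combination: run one RNN left-to-right and one right-to-left, then concatenate the state sequences per frame (weights and sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(W, xs, D_y):
    """Unroll y_t = sigmoid(W @ concat(y_{t-1}, x_t)) over a sequence."""
    y, ys = np.zeros(D_y), []
    for x in xs:
        y = sigmoid(W @ np.concatenate([y, x]))
        ys.append(y)
    return np.stack(ys)

D_x, D_y = 40, 64
W_fwd = np.random.randn(D_y, D_y + D_x) * 0.1
W_bwd = np.random.randn(D_y, D_y + D_x) * 0.1
xs = np.random.randn(10, D_x)
h = np.concatenate([run_rnn(W_fwd, xs, D_y),
                    run_rnn(W_bwd, xs[::-1], D_y)[::-1]], axis=1)  # (10, 128)
```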

Page 35

Long short-term memory RNN

• Keeps two states
  • Normal recurrent state: $\mathbf{y}_t$
  • Memory cell: $\mathbf{c}_t$

[Figure: LSTM block taking $\mathbf{x}_t$ and $(\mathbf{c}_{t-1}, \mathbf{y}_{t-1})$, producing $(\mathbf{c}_t, \mathbf{y}_t)$.]

Page 36

LSTM block

[Figure: inside the LSTM block. $\mathbf{x}_t$ and $\mathbf{y}_{t-1}$ are concatenated and passed through four linear layers: three feed sigmoid ($\sigma$) gates and one feeds a tanh; together with $\mathbf{c}_{t-1}$ these produce $\mathbf{c}_t$ and $\mathbf{y}_t$. The legend distinguishes operations with and without trainable parameters.]

Page 37

LSTM block

• The cell state keeps the history information
  1. It will be (partially) forgotten
  2. New information from $\mathbf{x}_t$ will be added
  3. The cell information will be output as $\mathbf{y}_t$
• Each "will be" is implemented as a gate function in $[0, 1]$ through the sigmoid activation

[Figure: the LSTM block diagram again.]

Page 38

tanh and sigmoid activations

• sigmoid
  • Converts the domain from $\mathbb{R}$ to $[0, 1]$
  • Used as a gate (weights the state vector's information)
• tanh
  • Converts the domain from $\mathbb{R}$ to $[-1, 1]$
  • Allows negative and positive values

[Figure: a sigmoid gate $\sigma$ weighting a state vector $\mathbf{h}_t$ to produce $\mathbf{z}_t$.]

Pages 39-43

LSTM block

[Slides 41-45 repeat the "LSTM block" bullets and diagram above, stepping through stages 1 (forget), 2 (add), and 3 (output) in the figure.]

Page 44

LSTM block summary

• 3 gating functions (forget, input/add, output)

• Cell update

• Hidden state update (see the standard equations below)

[Figure: the LSTM block diagram.]
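The slide's equations did not survive extraction; as a reference, the standard LSTM updates matching the diagram are (gate names assumed):

```latex
\begin{align*}
\mathbf{g}^{\mathrm{fgt}}_t &= \sigma\big(\mathbf{W}^{\mathrm{fgt}}[\mathbf{y}_{t-1}; \mathbf{x}_t]\big) && \text{forget gate} \\
\mathbf{g}^{\mathrm{inp}}_t &= \sigma\big(\mathbf{W}^{\mathrm{inp}}[\mathbf{y}_{t-1}; \mathbf{x}_t]\big) && \text{input (add) gate} \\
\mathbf{g}^{\mathrm{out}}_t &= \sigma\big(\mathbf{W}^{\mathrm{out}}[\mathbf{y}_{t-1}; \mathbf{x}_t]\big) && \text{output gate} \\
\mathbf{c}_t &= \mathbf{g}^{\mathrm{fgt}}_t \odot \mathbf{c}_{t-1} + \mathbf{g}^{\mathrm{inp}}_t \odot \tanh\big(\mathbf{W}^{\mathrm{cell}}[\mathbf{y}_{t-1}; \mathbf{x}_t]\big) && \text{cell update} \\
\mathbf{y}_t &= \mathbf{g}^{\mathrm{out}}_t \odot \tanh(\mathbf{c}_t) && \text{hidden state update}
\end{align*}
```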

Page 45

LSTM RNN

• LSTM

[Figure: the LSTM RNN unrolled over time.]

• Can possibly keep the dependency on the initial information by setting $\mathbf{g}^{\text{fgt}}_t = \mathbf{1}$ and $\mathbf{g}^{\text{inp}}_t = \mathbf{0}$ at $t > 1$

Page 46

RNNs can be used for both acoustic and language models

• HMM/DNN acoustic model: $p(s_t|\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_t)$
• Language model: $p(w_n|w_1, w_2, \ldots, w_{n-1})$

[Figure: top, an RNN language model predicting the next word: inputs "<s>" "I" "want" "to" "go" "to" → outputs "I" "want" "to" "go" "to" "my" …; bottom, an RNN acoustic model mapping observations $\mathbf{o}_1, \ldots, \mathbf{o}_6$ to states $s_1, \ldots, s_6$.]
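A minimal RNN language-model sketch along the lines of the top figure; the embedding matrix `E`, output projection `V`, and toy word IDs are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

vocab, D_e, D_y = 1000, 32, 64
E = np.random.randn(vocab, D_e) * 0.1      # word embeddings
W = np.random.randn(D_y, D_y + D_e) * 0.1  # recurrent weights
V = np.random.randn(vocab, D_y) * 0.1      # output projection

y = np.zeros(D_y)
for w in [0, 17, 42]:                      # "<s>", "I", "want" as toy IDs
    y = sigmoid(W @ np.concatenate([y, E[w]]))
p_next = softmax(V @ y)                    # p(w_n | w_1 ... w_{n-1})
```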

Page 47

Summary of today’s talk

• Basics of deep neural networks
  • Input, output, function blocks, back propagation, optimization
• Recurrent neural networks
  • Now we can handle sequences: $(s_1, s_2, \ldots)$, $(w_1, w_2, \ldots)$, $(\mathbf{o}_1, \mathbf{o}_2, \ldots)$
• Integrating deep neural networks into speech recognition
  • GMM/HMM → DNN/HMM or RNN/HMM
  • RNN language model
• The next lecture extends this sequence-handling capability to model the whole speech recognition process directly, in an end-to-end manner
