CS517: Language Models
emory-university
Probability
Probability of tomorrow being cloudy?
Observed weather: sunny 5 days, cloudy 3 days, snowy 2 days.
$P(\text{cloudy}) = \frac{C(\text{cloudy})}{C(\text{sunny}) + C(\text{cloudy}) + C(\text{snowy})} = \frac{3}{10}$
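The relative-frequency estimate above can be sketched in a few lines of Python. This assumes the three day counts on the slide belong to sunny, cloudy, and snowy in that order, which is what the 3/10 result and the later C(snowy) = 2 imply; the names `weather_counts` and `mle` are illustrative only.

```python
from collections import Counter

# Day counts as read from the slide (assumed order: sunny, cloudy, snowy).
weather_counts = Counter({"sunny": 5, "cloudy": 3, "snowy": 2})

def mle(event, counts):
    """Relative frequency: C(event) divided by the total number of observations."""
    return counts[event] / sum(counts.values())

p_cloudy = mle("cloudy", weather_counts)
print(p_cloudy)
```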
Conditional Probability
Probability of tomorrow being cloudy if today is snowy?
$P(\text{cloudy} \mid \text{snowy}) = \frac{C(\text{snowy}, \text{cloudy})}{C(\text{snowy})} = \frac{1}{2}$
Conditional Probability
Probability of tomorrow being cloudy if today and yesterday are snowy?
$P(\text{cloudy} \mid \text{snowy}, \text{snowy}) = \frac{C(\text{snowy}, \text{snowy}, \text{cloudy})}{C(\text{snowy}, \text{snowy})} = \frac{1}{1} = 1$
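A minimal sketch of how these conditional probabilities could be counted from an ordered record of days. The slides give only the totals, so the ten-day sequence `days` below is hypothetical; it was chosen to reproduce the counts used above (5 sunny, 3 cloudy, 2 snowy, with the two snowy days adjacent and followed by a cloudy day).

```python
from collections import Counter

# Hypothetical ten-day sequence, consistent with the counts on the slides.
days = ["sunny", "sunny", "cloudy", "sunny", "snowy",
        "snowy", "cloudy", "cloudy", "sunny", "sunny"]

def ngram_counts(seq, n):
    """Count every run of n consecutive items in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def cond_prob(seq, context, nxt):
    """P(nxt | context) = C(context, nxt) / C(context)."""
    context = tuple(context)
    numerator = ngram_counts(seq, len(context) + 1)[context + (nxt,)]
    denominator = ngram_counts(seq, len(context))[context]
    return numerator / denominator if denominator else 0.0

p_cloudy_given_snowy = cond_prob(days, ["snowy"], "cloudy")
p_cloudy_given_two_snowy = cond_prob(days, ["snowy", "snowy"], "cloudy")
```

With a longer context the counts shrink quickly: here only one "snowy, snowy" pair exists, so the estimate rests on a single observation.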
Joint Probability
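Probability of the next 2 days being cloudy, sunny?
$P(\text{cloudy}, \text{sunny}) = P(\text{cloudy}) \cdot P(\text{sunny} \mid \text{cloudy})$
Joint Probability
Probability of the next 3 days being snowy, cloudy, sunny?
$P(\text{snowy}, \text{cloudy}, \text{sunny}) = P(\text{snowy}) \cdot P(\text{cloudy} \mid \text{snowy}) \cdot P(\text{sunny} \mid \text{snowy}, \text{cloudy})$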
N-gram Models
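1-gram (Unigram)
$P(w_i) = \frac{C(w_i)}{\sum_{\forall k} C(w_k)} = \frac{C(w_i)}{N}$, where $N$ is the number of tokens (token vs. type?)
2-gram (Bigram)
$P(w_{i+1} \mid w_i) = \frac{C(w_i, w_{i+1})}{\sum_{\forall k} C(w_i, w_k)} = \frac{C(w_i, w_{i+1})}{C(w_i)}$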
N-gram Models
• Unigram model
- Given any word w, it shows how likely w is to appear in context.
- This is known as the likelihood (probability) of w, written as P(w).
- How likely is the word “Emory” to appear in context?
Emory University was founded as Emory College by John Emory. Emory University is 16th among the colleges and universities in US.
$P(\text{Emory}) = \frac{4}{23} \approx 0.1739$
- Does this mean “Emory” appears 17.39% of the time in any context?
- How can we measure likelihoods more accurately?
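The 4/23 figure can be reproduced with a short sketch. It assumes a simple tokenization in which each punctuation mark counts as its own token, which is what makes the total 23; the regular expression is an illustrative choice, not something the slides specify.

```python
import re
from collections import Counter

text = ("Emory University was founded as Emory College by John Emory. "
        "Emory University is 16th among the colleges and universities in US.")

# Assumed tokenization: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
counts = Counter(tokens)

# Unigram likelihood: C(w) / N, with N the number of tokens (not types).
p_emory = counts["Emory"] / len(tokens)
print(len(tokens), counts["Emory"], round(p_emory, 4))
```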
N-gram Models
• Bigram model
- Given any words $w_i$ and $w_j$ in sequence, it shows the likelihood of $w_j$ following $w_i$ in context.
- This can be represented as the conditional probability $P(w_j \mid w_i)$.
- What is the most likely word following “Emory”?
Emory University was founded as Emory College by John Emory. Emory University is the 20th among the national universities in US.
$P(\text{University} \mid \text{Emory}) = \frac{2}{4} = 0.5$
$P(\text{College} \mid \text{Emory}) = \frac{1}{4} = 0.25$
$P(\text{.} \mid \text{Emory}) = \frac{1}{4} = 0.25$
$\arg\max_k P(w_k \mid \text{Emory})$
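A sketch of the bigram estimate and the argmax for the example text, under the same assumed tokenization (punctuation marks are separate tokens). The helper `p_next` is a hypothetical name used only for illustration.

```python
import re
from collections import Counter

text = ("Emory University was founded as Emory College by John Emory. "
        "Emory University is the 20th among the national universities in US.")
tokens = re.findall(r"\w+|[^\w\s]", text)

bigrams = Counter(zip(tokens, tokens[1:]))   # C(w_i, w_j)
contexts = Counter(tokens[:-1])              # C(w_i), counted as a left context

def p_next(nxt, prev):
    """P(nxt | prev) = C(prev, nxt) / C(prev)."""
    return bigrams[(prev, nxt)] / contexts[prev]

# Distribution over the words observed after "Emory", then its argmax.
followers = {nxt: p_next(nxt, prev) for (prev, nxt) in bigrams if prev == "Emory"}
most_likely = max(followers, key=followers.get)
```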
Maximum Likelihood
$x_1^n = x_1, \ldots, x_n$
Chain rule
$P(x_1^n) = P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_1^2) \cdots P(x_n \mid x_1^{n-1})$
Any practical issue?
$(x_1, \ldots, x_k)$ can be very sparse.
Markov assumption
$P(x_k \mid x_1^{k-1}) \approx P(x_k \mid x_{k-1})$
$P(x_1^n) \approx P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_2) \cdots P(x_n \mid x_{n-1})$
Maximum Likelihood
• Maximum likelihood
- Given any word sequence $w_i, \ldots, w_n$, how likely is this sequence to appear in context?
- This can be represented as the joint probability $P(w_i, \ldots, w_n)$.
- How likely is the sequence “you know” to appear in context?
you know , I know you know that you do .
$P(\text{you}, \text{know}) = \frac{2}{11}$
Chain rule
$P(\text{you}) \cdot P(\text{know} \mid \text{you}) = \frac{3}{11} \cdot \frac{2}{3} = \frac{2}{11}$
Why is the denominator 11 and not 10?
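The chain-rule computation for “you know” can be checked with the sketch below, which also counts how many bigrams the sentence actually contains. Tokens are obtained by splitting on spaces, as the sentence is already written with spaces around punctuation.

```python
from collections import Counter

tokens = "you know , I know you know that you do .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)                                   # 11 tokens

# Chain rule: P(you, know) = P(you) * P(know | you)
p_you = unigrams["you"] / n
p_know_given_you = bigrams[("you", "know")] / unigrams["you"]
p_chain = p_you * p_know_given_you

# Direct relative frequency over the bigrams that actually occur.
num_bigrams = sum(bigrams.values())               # 10 adjacent pairs
p_direct = bigrams[("you", "know")] / num_bigrams
```

The chain rule gives 2/11, while dividing by the number of adjacent pairs gives 2/10. The two differ because a sequence of 11 tokens contains only 10 bigrams.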
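Word Segmentation
• Word segmentation
- Segment a chunk of string into a sequence of words.
- Is there more than one possible sequence?
- Choose the sequence that most likely appears in context.
youknow
$P(\text{you}) \cdot P(\text{know} \mid \text{you}) > P(\text{yo}) \cdot P(\text{uk} \mid \text{yo}) \cdot P(\text{now} \mid \text{uk})$
$\log(P(\text{you}) \cdot P(\text{know} \mid \text{you})) > \log(P(\text{yo}) \cdot P(\text{uk} \mid \text{yo}) \cdot P(\text{now} \mid \text{uk}))$
$\log P(\text{you}) + \log P(\text{know} \mid \text{you}) > \log P(\text{yo}) + \log P(\text{uk} \mid \text{yo}) + \log P(\text{now} \mid \text{uk})$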
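One way to sketch the selection step: enumerate every split of the string, score each split as a sum of log probabilities, and keep the best. The probability tables `UNIGRAM` and `BIGRAM` and the `FLOOR` value for unseen items are hypothetical numbers chosen for illustration; a real model would estimate them from a corpus.

```python
import math

# Hypothetical toy probabilities (not from the slides).
UNIGRAM = {"you": 0.03, "know": 0.02, "yo": 0.001, "uk": 0.0005, "now": 0.02}
BIGRAM = {("you", "know"): 0.2, ("yo", "uk"): 0.001, ("uk", "now"): 0.001}
FLOOR = 1e-8  # assumed stand-in probability for anything unseen

def segmentations(s):
    """Yield every way to split s into non-empty pieces."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in segmentations(s[i:]):
            yield [s[:i]] + rest

def log_score(words):
    """log P(w1) + sum of log P(w_k | w_{k-1}) under the toy bigram model."""
    score = math.log(UNIGRAM.get(words[0], FLOOR))
    for prev, cur in zip(words, words[1:]):
        score += math.log(BIGRAM.get((prev, cur), FLOOR))
    return score

best = max(segmentations("youknow"), key=log_score)
print(best)
```

Summing logs instead of multiplying probabilities avoids numerical underflow on long sequences and does not change which segmentation wins, since the logarithm is monotonic. Exhaustive enumeration grows exponentially with string length, so this is only practical for short inputs.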
Laplace Smoothing
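$P(x_1^n) \approx P(x_1) \cdot P(x_2 \mid x_1) \cdot P(x_3 \mid x_2) \cdots P(x_n \mid x_{n-1})$
What if $P(x_1) = 0$? Then $P(x_1^n) \approx 0$ ← BAD!!
Laplace Smoothing
$P(x_i) = \frac{C(x_i)}{\sum_k C(x_k)}$
$P_l(x_i) = \frac{C(x_i) + \alpha}{\sum_k (C(x_k) + \alpha)} = \frac{C(x_i) + \alpha}{\sum_k C(x_k) + \alpha|X|} = \frac{C(x_i) + \alpha}{N + \alpha|X|}$
$P_l(x_?) = \frac{C(x_?) + \alpha}{\sum_k C(x_k) + \alpha|X|} = \frac{\alpha}{N + \alpha|X|}$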
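A minimal sketch of the unigram formula, using the sentence from the earlier “you know” slide as data. It assumes that $|X|$ counts the observed word types plus one slot for an unseen word, so that the smoothed probabilities sum to 1; the slides do not say how $|X|$ is chosen, and `laplace_unigram` is an illustrative name.

```python
from collections import Counter

def laplace_unigram(tokens, alpha=1.0, vocab_size=None):
    """Return a function computing (C(x) + alpha) / (N + alpha * |X|)."""
    counts = Counter(tokens)
    n = len(tokens)
    size = vocab_size if vocab_size is not None else len(counts)
    denominator = n + alpha * size

    def prob(word):
        # Counter returns 0 for unseen words, so they receive alpha / denominator.
        return (counts[word] + alpha) / denominator

    return prob

tokens = "you know , I know you know that you do .".split()
# 7 observed types + 1 assumed slot for an unseen word.
prob = laplace_unigram(tokens, alpha=1.0, vocab_size=8)
p_you = prob("you")        # (3 + 1) / (11 + 8)
p_unseen = prob("zebra")   # (0 + 1) / (11 + 8)
```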
Laplace Smoothing
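$P(x_j \mid x_i) = \frac{C(x_i, x_j)}{\sum_k C(x_i, x_k)} = \frac{C(x_i, x_j)}{C(x_i)}$
$P_l(x_j \mid x_i) = \frac{C(x_i, x_j) + \alpha}{\sum_k (C(x_i, x_k) + \alpha)} = \frac{C(x_i, x_j) + \alpha}{\sum_k C(x_i, x_k) + \alpha|X_{i,*}|} = \frac{C(x_i, x_j) + \alpha}{C(x_i) + \alpha|X_{i,*}|}$
$P_l(x_? \mid x_i) = \frac{\alpha}{C(x_i) + \alpha|X_{i,*}|}$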
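The bigram case can be sketched the same way. The slides do not define $|X_{i,*}|$ precisely; the code below takes it to be the number of word types that may follow $x_i$ and, as a simplifying assumption, uses the full observed vocabulary for every context.

```python
from collections import Counter

def laplace_bigram(tokens, alpha=1.0, vocab_size=None):
    """Return a function computing (C(x_i, x_j) + alpha) / (C(x_i) + alpha * |X_i,*|)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])   # C(x_i), counted as a left context
    # Assumption: every context shares the same successor set, the observed vocabulary.
    size = vocab_size if vocab_size is not None else len(set(tokens))

    def prob(nxt, prev):
        return (bigrams[(prev, nxt)] + alpha) / (contexts[prev] + alpha * size)

    return prob

tokens = "you know , I know you know that you do .".split()
prob = laplace_bigram(tokens, alpha=1.0)
p_seen = prob("know", "you")   # (2 + 1) / (3 + 7)
p_unseen = prob("I", "you")    # (0 + 1) / (3 + 7)
```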
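Discount Smoothing
• Issues with Laplace smoothing
- Unfair discounts.
- Unseen likelihood may get penalized too harshly when the minimum count is much greater than α.
- How to reduce the gap between the minimum count and unseen count?
MLE → Laplace ($\alpha = 1$, $|X| = 10$), with the resulting change:
$\frac{10}{100} = 0.1 \rightarrow \frac{10 + 1}{100 + 10} = 0.1$ (change: 0)
$\frac{50}{100} = 0.5 \rightarrow \frac{50 + 1}{100 + 10} \approx 0.46$ (change: -0.04)
$\frac{1}{100} = 0.01 \rightarrow \frac{1 + 1}{100 + 10} \approx 0.018$ (change: +0.008)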
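The three rows can be recomputed with a short sketch. The values $N = 100$, $\alpha = 1$, and $|X| = 10$ are read off the fractions in the example; the function name `laplace` is illustrative.

```python
def laplace(count, total, alpha, vocab_size):
    """Laplace-smoothed estimate: (C + alpha) / (N + alpha * |X|)."""
    return (count + alpha) / (total + alpha * vocab_size)

total, alpha, vocab_size = 100, 1, 10

# Change in probability caused by smoothing, for each count in the example.
changes = {
    count: laplace(count, total, alpha, vocab_size) - count / total
    for count in (10, 50, 1)
}
for count, delta in changes.items():
    print(count, round(delta, 3))
```

Frequent items lose mass and rare items gain it, while an item at the average frequency (10 out of 100 with 10 types) is unchanged. That uneven redistribution is the "unfair discount" the slide refers to.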
Discount Smoothing
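Laplace
$P_l(x_i) = \frac{C(x_i) + \alpha}{N + \alpha|X|}$
$P_l(x_j \mid x_i) = \frac{C(x_i, x_j) + \alpha}{C(x_i) + \alpha|X_{i,*}|}$
$P_l(x_? \mid x_i) = \frac{\alpha}{C(x_i) + \alpha|X_{i,*}|}$
$P_l(x_?) = \frac{\alpha}{N + \alpha|X|}$
Discount
$P_d(x_i) = \frac{C(x_i) - P_d(x_?)}{N}$
$P_d(x_j \mid x_i) = \frac{C(x_i, x_j) - P_d(x_? \mid x_i)}{C(x_i)}$
$P_d(x_? \mid x_i) = \alpha \cdot \min_k P(x_k \mid x_i)$
$P_d(x_?) = \alpha \cdot \min_k P(x_k)$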
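A sketch of the unigram discount formulas, following them literally: the unseen probability is a fraction α of the smallest observed probability, and that value is subtracted from each count before dividing by N. The value `alpha=0.5` is a hypothetical choice (the slides do not give one), and the result is not renormalized here.

```python
from collections import Counter

def discount_unigram(tokens, alpha=0.5):
    """Return (prob, p_unseen) following P_d(x) = (C(x) - P_d(x?)) / N."""
    counts = Counter(tokens)
    n = len(tokens)
    # P_d(x?) = alpha * min_k P(x_k), with P the unsmoothed relative frequency.
    p_unseen = alpha * min(c / n for c in counts.values())

    def prob(word):
        if counts[word] == 0:
            return p_unseen
        return (counts[word] - p_unseen) / n

    return prob, p_unseen

tokens = "you know , I know you know that you do .".split()
prob, p_unseen = discount_unigram(tokens, alpha=0.5)
p_you = prob("you")      # (3 - 1/22) / 11
p_rare = prob("that")    # (1 - 1/22) / 11
```

Because the unseen probability is tied to the smallest observed probability rather than to a fixed α added to every count, an unseen word stays close to the rarest seen word, which addresses the gap raised in the Laplace discussion.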
Good-Turing Smoothing
$N_c$ = the count of n-grams that appear $c$ times

Type        C     MLE    C*
carp        4     0.33   5.00
perch       3     0.25   4.00
whitefish   2     0.17   3.00
trout       1     0.08   0.67
salmon      1     0.08   0.67
eel         1     0.08   0.67
?           0     0.00   0.17
Total       12    1.00

$N_1 = 3, N_2 = 1, N_3 = 1, N_4 = 1$
$C(x_i)^* = \frac{(C(x_i) + 1) \cdot N_{c+1}}{N_c}$
$P(x_?) = \frac{N_1}{N}$
$C(\text{eel})^* = \frac{(C(\text{eel}) + 1) \cdot N_2}{N_1} = \frac{2}{3}$
$P(\text{eel}) = \frac{2/3}{12}$
$P(\text{carp}) = ?$
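The adjusted counts can be recomputed from the fish counts with the sketch below, which applies the $C^*$ formula as written. Two things are worth noting: for these counts $N_1 / N$ is 3/12 = 0.25, which differs from the 0.17 in the last row of the table, and for carp $N_5 = 0$, so the raw formula returns 0, which is presumably what the “P(carp) = ?” question points at. The names used are illustrative.

```python
from collections import Counter

counts = {"carp": 4, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
n_total = sum(counts.values())            # N = 12
freq_of_freq = Counter(counts.values())   # N_c: how many types occur exactly c times

def adjusted_count(c):
    """Good-Turing adjusted count: (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

p_unseen = freq_of_freq[1] / n_total      # N_1 / N
c_star_eel = adjusted_count(counts["eel"])
p_eel = c_star_eel / n_total
c_star_carp = adjusted_count(counts["carp"])   # N_5 = 0, so this is 0
```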
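Backoff
• Backoff
- Bigrams are more accurate than unigrams.
- Bigrams are sparser than unigrams.
- Use bigrams in general, and use unigrams only when bigrams don’t exist.
$P_b(x_j \mid x_i) = \begin{cases} P_{l|d}(x_j \mid x_i) & P(x_j \mid x_i) > 0 \\ \lambda \cdot P_{l|d}(x_j) & \text{otherwise} \end{cases}$
How to measure $\lambda$?
$\lambda = \alpha \cdot \frac{\langle P(x_j \mid x_i) \rangle_{i,j}}{\langle P(x_j) \rangle_j}$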
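A minimal sketch of the backoff rule, using Laplace smoothing for both the bigram and unigram estimates. The backoff weight is passed in as `lam` with a hypothetical value of 0.4 rather than computed from the averaged probabilities, because the slides do not specify which items the averages run over. `make_backoff` is an illustrative name.

```python
from collections import Counter

def make_backoff(tokens, alpha=1.0, lam=0.4):
    """Return P_b(nxt | prev): smoothed bigram if observed, else lam * smoothed unigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n, size = len(tokens), len(unigrams)

    def p_unigram(word):
        return (unigrams[word] + alpha) / (n + alpha * size)

    def p_bigram(nxt, prev):
        return (bigrams[(prev, nxt)] + alpha) / (contexts[prev] + alpha * size)

    def p_backoff(nxt, prev):
        if bigrams[(prev, nxt)] > 0:      # the bigram was observed: P(x_j | x_i) > 0
            return p_bigram(nxt, prev)
        return lam * p_unigram(nxt)       # otherwise back off to the unigram

    return p_backoff

tokens = "you know , I know you know that you do .".split()
p_backoff = make_backoff(tokens)
p_seen = p_backoff("know", "you")     # observed bigram: (2 + 1) / (3 + 7)
p_backed_off = p_backoff("do", "I")   # unseen bigram: 0.4 * (1 + 1) / (11 + 7)
```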