Post on 15-Feb-2018
Final Project in the Course Introduction to Artificial Intelligence:
Natural Language Identification
Lior Resisi, lresisi, 201564895
Mika Barkan, mikab4, 200790111
23 March 2014
Table of Contents
I Introduction
II The Language Identification Task and Related Work
III Collecting the Data for the Training and Test Material
  1 Choice of Languages
  2 Source of the Texts and Handling of the Training Material
  3 Handling Diacritics
IV Algorithms
  4 The Naive Approach: Zipf's Law
    4.1 Plain Counting
    4.2 The "Borda Count" Method
  5 Statistical Approach
    5.1 By n-grams
      5.1.1 Unigram
      5.1.2 Frequency of a Letter at the Start of a Word
      5.1.3 Frequency of a Letter at the End of a Word
      5.1.4 Bigram (Letter Pairs)
    5.2 How the Vectors of Each Language Are Merged
      5.2.1 Uniform Weight
      5.2.2 Choosing the Maximum Value
    5.3 How the Distance Between the Language Vector and the Tested Vector Is Measured
      5.3.1 "Simple" Distance
      5.3.2 Euclidean Distance
      5.3.3 Similarity by the Cosine of the Angle Between the Vectors
      5.3.4 Kullback-Leibler Distance (Symmetric and Asymmetric Versions)
      5.3.5 Sum of Rank Differences (Ranks)
      5.3.6 Infinity Norm
  6 Decision Trees
    6.1 By n-grams
      6.1.1 Unigram
      6.1.2 Frequency of a Letter at the Start of a Word
      6.1.3 Frequency of a Letter at the End of a Word
      6.1.4 Bigram (Letter Pairs)
    6.2 How We Partitioned into Groups
    6.3 The Fitness (Gain) Functions
      6.3.1 By Entropy
      6.3.2 By Information Gain
      6.3.3 By Information Gain Ratio
      6.3.4 By Gini Gain
      6.3.5 By Train Error
    6.4 The Algorithm for Building the Tree
    6.5 The Algorithm for Classifying the Language (classification)
V Results
  7 Naive Approach (Zipf's Law)
  8 Statistical Approach
    8.1 By the Way the Distance Between the Vectors Is Measured
    8.2 By n-gram
    8.3 Learning Rate (Computation Time)
    8.4 Consistency and Overfitting
    8.5 Recall, Precision and F1 Measures
  9 Decision Trees
    9.1 By n-gram, in Relation to the Fitness (Gain) Functions
    9.2 Learning Rate (Computation Time)
VI Discussion, Conclusions and Thoughts for the Future
VII References
VIII Appendix A: Main Results of the Statistical Approach in Detail
IX Appendix B: Main Results of the Decision Trees in Detail
Part I
Introduction
The Latin alphabet is one of the most widespread writing systems in the world, and it is used to write more than 120 languages from various language families (Romance, Germanic, Slavic and others). It consists of 26 basic letters, to which some languages have added special marks ("diacritics") that adapt the alphabet to the language. The wide use of this alphabet raises an interesting and challenging computational problem: given a text written in Latin letters, can we identify the natural language it is written in, and if so, how?
This problem spans many fields (such as linguistics, computational linguistics, cognitive science and many branches of computer science), and it raises many other questions. For example, we could discuss Chomsky's linguistic views on universal laws of natural languages, or cognitive questions such as how we acquire a language or learn to recognize words. At the same time, it is important to remember that the problem is not purely theoretical: it has many practical uses, such as speeding up text search and document handling, or serving as the first step of an automatic translation pipeline. Statistical analysis of the differences between languages, and in particular analysis of letter frequencies in different languages, can also help in cryptanalysis and information security (for example, with substitution ciphers and classical ciphers, as part of encrypting and decrypting texts, or in identifying patterns in the original document compared with the decrypted one), in encoding and compressing information (for example, Huffman coding), and so on.
Our work deals with a simplified version of the natural language identification problem (a sub-topic of NLP, which is itself a very interesting field within AI). In it we describe ways to learn to identify automatically the language in which a given text is written, out of a predefined set of languages that are written in Latin letters:
Italian      German      Spanish      Romanian
Indonesian   Hungarian   Polish       Swedish
English      Turkish     Portuguese
Afrikaans    Latin       French
The basic idea that guided us throughout the project is that every language has a characteristic pattern in the distribution of letters across its words. For example, we expect the relative frequency of single letters (unigrams) or of pairs of letters (bigrams) to differ from language to language, and we expect the frequency of a given letter at the start or at the end of a word to depend on the language, so that it can indicate the language and help identify it. In light of this idea, we decided to try to learn these patterns for each of the 14 languages above, and to check whether we can predict the language of a given text from the data and patterns we learned.
Our work is in dialogue with the project "Natural Language Detection", written by Moran and Tamar in the 2010-2011 academic year, which dealt with a similar task. We attack the problem differently and try to challenge the results the original authors achieved, using many additional tools from artificial intelligence that should help us reach even better results. Like the original authors, we approach the problem with supervised learning methods, namely statistical tools and decision trees, and we try to build language classification functions from the training material. The essential difference between the two projects lies in the choice of features by which the language is identified. This choice can lead to large differences in the decision trees and statistical tools used, as well as in the optimal way of computing and in the algorithms we chose. In addition, we apply further computational and mathematical ideas beyond those used by the original authors, and we extend the classification task: instead of choosing between "only" six languages, the classifier tries to identify a language out of a wider range of languages written in the Latin alphabet (which of course complicates the task considerably and requires many more parameters to be taken into account).
The code for the project was written in Python 3.0. The vast majority of it was written by us, because we wanted full control over the algorithms and the materials we worked with, and we did not want to rely on predefined tools (such as NLTK) that could have limited our flexibility. All of the code (including running instructions in the file README.txt) is available at the following address:
http://tinyurl.com/qda6f5h
Part II
The Language Identification Task and Related Work
Feature selection is an essential and decisive part of identifying the language of a given text. For our work we rely on linguistic features, which let us attack the problem with several supervised learning methods built on those features. As part of the process we examine the frequencies of letter unigrams and bigrams in each of the tested languages. In other words, we find the probability that a given letter or pair of letters appears in a given language, and check whether we can infer (or learn) from it the language in which the text is written. Additional features that we examine, which the original work did not address, include the frequency of a given letter at the start or at the end of a word (for example, the letter "a" is common at the end of a word in Italian and rare in English), and so on.
Unlike in linguistics proper, we try to learn these features (some of which are known and have been studied by linguists and NLP researchers) automatically. We use them both for a "simple" statistical test (comparing vectors) and for building decision trees with various gain functions (such as entropy, information gain and so on). In this work we try to challenge the methods used in the original work, and we use many additional methods that the original work did not examine. For example, we check whether the results previously obtained by using the infinity norm to compare languages can be improved, and we use additional tools such as the Kullback-Leibler distance. Ultimately, we want to reach better results by applying tools from artificial intelligence and producing a set of rules and patterns learned from the training material.
The naive way to identify the language of a given text is to look up the words of the text in a dedicated database that contains all the words of each supported language (that is, to maintain a dictionary for each supported language, look up every word of the given text in it, and return the "closest" language). However, such a naive solution is very expensive in space, and searching such a huge database (for each of the languages) may take a very long time, which hurts efficiency dramatically. A possible improvement uses an observation of the information theorist George Zipf, who showed [1] that the distribution of word frequencies is similar in all languages, with a short "body" and a long "tail". For example, the 10 most common words in English cover about 25% of the occurrences in a typical text, and the 100 most common words already cover about 45% of the occurrences. The database can therefore be reduced to a "compact" one that holds a list of a few dozen (or a few hundred) of the most common words in each language, or distinctive connectives that characterize the language. Yet such an approach, although significantly simpler and far less space-consuming, is not really "interesting" from an artificial intelligence point of view. Moreover, it is not useful when we want to identify single words or a sequence of a few words that contains no prepositions or very common words. We therefore want a different way, one that relies on given training material and builds from it a classifier that identifies the language of the given text with as high a probability as possible. Nevertheless, we also briefly examined the naive approach in our work.
Learning is the most natural and interesting way to solve the language identification task, and we also discussed it in the introductory computational linguistics courses. Many automatic NLP tools, such as Google Translate, are based on learning and statistics. Likewise, much research in computational linguistics is "heading in this direction", and we want to continue and check, on a very small scale of course, what results we can reach.
1 George K. Zipf (1949), Human Behavior and the Principle of Least Effort, Addison-Wesley.
Part III
Collecting the Data for the Training and Test Material
In tackling the task, we had to make several very significant decisions:
1 Choice of Languages
One of the original goals we set for ourselves was to increase considerably the number of languages we can identify (compared with six languages in the original work). The choice of how many languages, and which, is very significant, since it strongly affects the chosen strategy as well as the volume of training material. Moreover, every language we choose may "add" new letters that are unique to it (for example, å in Swedish). On the one hand this can make identification easier. On the other hand it can considerably increase the number of unigrams and (worse) bigrams, and thereby significantly increase the size of the tree, that is, dramatically lengthen the running time and enlarge the files we produce. Another possible problem is mainly technical: obtaining training material in more "exotic" languages may be harder. A further problem concerns similarity between languages: Afrikaans and Dutch, for example, are very similar languages, which could have made identification harder. After deliberation we decided on the 14 languages listed above. They include the 6 languages of the original work, to which we added interesting languages: languages that are "similar" and share a historical origin (for example, Italian and Latin), and languages from different language families (for instance, Romanian is a Romance language, Swedish is a Germanic language, Polish is a Slavic language, and so on).
2 Source of the Texts and Handling of the Training Material
One of the important goals we set at the start of the project was to be able to identify the language of a text regardless of its source, whether it is taken from a newspaper, a poetry book, a Wikipedia article or an internet blog. Because the source and type of a text can strongly affect the vocabulary, the letter distributions, and the sentence and word lengths used, it was important for us to obtain training material from different sources that cover as many language registers as possible. In addition, we did not want to settle for random texts that could bias the results. We wanted training material that is as reliable and high in quality as possible, at a uniform level and of similar size across the languages: both in accessible languages, for which material is easy to obtain (such as English, German and French), and in languages for which material is less accessible (such as Indonesian, Turkish and Afrikaans). It was also hard for us to assess the quality of the material, since we do not speak these languages. After a thorough examination and a search through a very wide variety of texts, we found that the best source for this purpose is downloading electronic books in a variety of languages from the web. We made sure to choose books from different categories (fantasy, history, poetry and so on), and thereby ensured that learning takes place over different language registers.
We split the texts into sentences and kept only sentences with at least five words, because we feared that shorter sentences do not contain enough letters to form a reliable profile that resembles any language. We made sure that the training material in each language contains at least 2000 different sentences. In addition, we converted all letters to lower case.
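So that the test material would be as "independent" and varied as possible, we took it from an entirely different source: Wikipedia. For each language, the test material consisted of sentences from the Wikipedia article (in the desired language) about the country in which the tested language is the official language. The reasoning was that the information there would be reliable and would contain very little foreign-language content (for instance, it would not include many names or scientific terms in English, but rather terms and concepts in the language itself). For example, for Spanish we used the article "Spain", and for Afrikaans we used the article "South Africa".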
3 Handling Diacritics
Many of the languages written in Latin letters have additional letters beyond the 26 letters familiar to us from English. These marks, called diacritics, are orthographic marks that affect how a word is pronounced, and they are formed by adding some mark above or below the "regular" letter. For example, the letter a can appear in accented variants such as à, á, â and ä. A diacritic can change not only the pronunciation of a word but also its meaning (for instance, in German the word "schön" means "beautiful", while the word "schon" means "already").
The decision about diacritics can significantly affect the project, and can make language identification easier (or harder). For example, the letter ß clearly indicates that the tested language is German, while the absence of diacritics hints at English. However, even if we wanted to remove the diacritics, we would still have to decide how. We could use a conversion table to a single letter (for instance, turn ä into a plain a), a conversion to two letters (for instance, turn ä into ae or ß into ss, as is customary in German), or we could even ignore these marks altogether. As noted, any such correction can considerably affect the distributions and all the vectors we use, and change the results of the project. Beyond that, removing the diacritics has no linguistic merit, since we completely change the language and invent words.
After much deliberation, which included consulting several linguists and computational linguistics researchers, we decided to examine the results in two ways: with diacritics and without diacritics. The diacritics were removed using Python's unicodedata library and the NFD format (Normalization Form Canonical Decomposition). This lets us test our hypothesis (that diacritics make language identification much easier), and lets us compare the results of our work with the original work, which removed the diacritics.
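The removal step can be sketched as follows. This is a minimal illustration of the NFD-based approach, not the project's actual code; the function name `strip_diacritics` is hypothetical, and we assume that "removal" means dropping the combining marks produced by the decomposition.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop the combining marks (assumed removal rule)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("schön"))
```

Note that letters without a canonical decomposition, such as the German ß or the Polish ł, pass through this function unchanged, so some language-specific letters survive even in the "without diacritics" setting.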
Part IV
Algorithms
As noted, our work emphasizes the statistical approach and decision trees, but we also chose to examine briefly the naive approach, which is based on Zipf's law.
4 The Naive Approach: Zipf's Law
Zipf's law is an empirical law stating that if we rank the words of any natural language by how frequently they are used (in descending order of frequency), the $i$-th word has a frequency proportional to $1/i$:
\[ \mathrm{occurrences}(w_i) = \frac{K}{i} \]
where $\mathrm{occurrences}(w_i)$ is the number of occurrences of the word ranked $i$-th by frequency, and $K$ is some constant.
As stated in the introduction, this naive approach is not really "interesting" from an artificial intelligence point of view, since there is no learning here, only a simple lookup in a word list. Nevertheless, we saw fit to include it as a reference point, and we felt it would be wrong to discuss automatic language identification without mentioning this method.
In the first stage we gathered the list of the most common words in each of the tested languages (except Latin), and then extracted only the $x$ most common words in each language, where $x \in \{10, 20, 50, 100, 500, 1000\}$. We then went over every word in the tested text and looked it up in the list of the $x$ common words. Depending on the method used (plain counting or a Borda Count-like method), we gave the word a score: a positive score when the word appeared in the list of common words, and a negative score when it did not. The idea is that the more words from a language's common-word list appear in the test file, the greater the chance that the text is written in that language. We therefore returned the language that obtained the highest score.
As noted, we worked with two scoring methods:
4.1 Plain Counting
In this case, we looked up every word of the tested text in the list of the $x$ common words. If the word appeared there, it received a score of 1, and otherwise a score of $-1$. At the end of the pass, we summed the scores of all the words that appeared in the test file, and returned the language with the highest score.
4.2 The "Borda Count" Method
In this case, we looked up every word of the tested text in the list of the $x$ common words. If the word appeared at frequency rank $i$ (among the $x$ common words), it received a score of $x - i$ (that is, the more frequent the word, the higher its score), and otherwise a score of $-1$. At the end of the pass, we summed the scores of all the words that appeared in the test file, and returned the language with the highest score. The rationale for this scoring method was to favor the words that are more common in the language: by Zipf's observation they are indeed more frequent, and they therefore indicate membership in the language more reliably. In addition, a given word may appear in several languages, and so we prefer the language in which the word is more common.
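The two scoring methods can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy word lists are hypothetical, the common-word lists are assumed to be ordered from most to least frequent, and the rank $i$ is assumed to be 1-based (the report does not say whether ranks start at 0 or 1).

```python
def plain_count_score(words, common):
    """+1 for every test word found in the common-word list, -1 otherwise."""
    common_set = set(common)
    return sum(1 if w in common_set else -1 for w in words)


def borda_score(words, common):
    """x - i for a word at (assumed 1-based) rank i among the x common words, else -1."""
    x = len(common)
    rank = {w: i for i, w in enumerate(common, start=1)}
    return sum(x - rank[w] if w in rank else -1 for w in words)


def identify(words, common_by_language, score=plain_count_score):
    """Return the language whose common-word list gives the highest total score."""
    return max(common_by_language, key=lambda lang: score(words, common_by_language[lang]))


# Toy lists for illustration only, ordered from most to least frequent.
common_words = {"en": ["the", "of", "and"], "de": ["der", "die", "und"]}
print(identify("the cat and the dog".split(), common_words, score=borda_score))
```

Passing the scoring function as a parameter keeps the lookup loop identical for both methods, which matches the description above: only the score given to a matched word differs.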
5 Statistical Approach
In this approach we learned recurring patterns in the different languages by analyzing the training material and turning it into vectors according to various parameters. We merged the vectors obtained from the different examples in each language into a single typical vector that represents the language. We then performed a similar analysis of the test material and produced a vector from it as well. The language identification task was thus "translated" into a comparison between the resulting vector (which represents the test material) and the typical vector of each tested language. Indeed, in the classification stage we measured (with different methods, elaborated below) the distance between the tested vector and each of the representative language vectors, and returned the language whose pattern, as learned from the training material, was the most similar.
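Each stage of the process described above was carried out in several different ways. The Cartesian product of the measure, the merging method and the distance measure adds up to a very large number of different tests for each requested text. At the end of the process, we compared the result produced by each combination with the true language, and grouped the results by measure, so that we could determine which measure gives the most accurate approximation.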
n-gram                            Diacritics   Merging of each language's vectors   Distance between language vector and tested vector
Unigram (letter frequency)        With         Uniform weight                       Simple distance
Letter frequency at word start    Without      Choosing the maximum value           Euclidean distance
Letter frequency at word end                                                        Similarity by cosine of the angle
Bigram (letter pairs)                                                               Asymmetric Kullback-Leibler distance
                                                                                    Symmetric Kullback-Leibler distance
                                                                                    Sum of rank differences
                                                                                    Infinity norm
Table 1: The breakdowns in the statistical approach
5.1 By n-grams
The measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:
5.1.1 Unigram
We carried out the process detailed above, where the measure examined was the relative frequency of each single letter of the language in each of the training materials.
5.1.2 Frequency of a Letter at the Start of a Word
We carried out the process detailed above, where the measure examined was the frequency of each letter as the first letter of words in each of the training materials.
5.1.3 Frequency of a Letter at the End of a Word
We carried out the process detailed above, where the measure examined was the frequency of each letter as the last letter of words in each of the training materials.
5.1.4 Bigram (Letter Pairs)
We carried out the process detailed above, where the measure examined was the frequencies of pairs of letters within words in each of the training materials.
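The four measures can be sketched as follows. This is a minimal illustration, not the project's code: all function names are hypothetical, and we assume that words are separated by whitespace, that non-alphabetic characters are discarded, and that bigrams are taken only within a word.

```python
from collections import Counter


def words_of(sentence):
    """Lower-case the sentence and keep only the alphabetic characters of each word."""
    cleaned = ("".join(ch for ch in w if ch.isalpha()) for w in sentence.lower().split())
    return [w for w in cleaned if w]


def relative(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


def unigram_freq(sentence):
    return relative(Counter(ch for w in words_of(sentence) for ch in w))


def first_letter_freq(sentence):
    return relative(Counter(w[0] for w in words_of(sentence)))


def last_letter_freq(sentence):
    return relative(Counter(w[-1] for w in words_of(sentence)))


def bigram_freq(sentence):
    """Pairs of adjacent letters inside each word (assumed not to cross word boundaries)."""
    return relative(Counter(w[i:i + 2] for w in words_of(sentence) for i in range(len(w) - 1)))
```

Each function returns a sparse vector as a dictionary from feature to relative frequency, so features that never occur simply have no entry.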
5.2 How the Vectors of Each Language Are Merged
To form a typical vector for each of the tested languages, we had to combine all the vectors obtained for the different training materials in that language into one vector that includes all the features from all the vectors of the language. The way of combining can certainly affect the results, so we chose to do it in two ways:
5.2.1 Uniform Weight
In this way, every vector received the same relative weight: if a given language had $x$ vectors, the relative weight of each was $1/x$.
For example, if for a given language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the $i$-th coordinate represents the relative frequency of letter $i$), the combined vector was $v = (0.4, 0.1, 0.2, 0.3)$.
5.2.2 Choosing the Maximum Value
In this way, for every feature we chose its maximal frequency over all the vectors, and then normalized the frequencies.
For example, if for a given language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the $i$-th coordinate represents the relative frequency of letter $i$), the combined vector before normalization was $v = (0.5, 0.2, 0.3, 0.4)$, and after normalization it was
\[ v = \frac{1}{1.4} \cdot (0.5, 0.2, 0.3, 0.4) = \left( \frac{5}{14}, \frac{2}{14}, \frac{3}{14}, \frac{4}{14} \right). \]
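Both merging methods can be sketched in a few lines. This is an illustration only; the function names are hypothetical, and we assume that vectors are dictionaries from feature to relative frequency, with a missing feature counting as 0.

```python
def merge_uniform(vectors):
    """Average the vectors, giving each of the x vectors weight 1/x."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}


def merge_max(vectors):
    """Take the maximal frequency of each feature, then normalize to sum 1."""
    keys = set().union(*vectors)
    peak = {k: max(v.get(k, 0.0) for v in vectors) for k in keys}
    total = sum(peak.values())
    return {k: p / total for k, p in peak.items()}


# The example from the text, with hypothetical letter names as keys.
v1 = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.4}
v2 = {"a": 0.5, "b": 0.0, "c": 0.3, "d": 0.2}
print(merge_uniform([v1, v2]))
print(merge_max([v1, v2]))
```

Up to floating-point rounding, the two calls reproduce the vectors $(0.4, 0.1, 0.2, 0.3)$ and $(5/14, 2/14, 3/14, 4/14)$ from the worked examples.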
5.3 How the Distance Between the Language Vector and the Tested Vector Is Measured
We now have a typical vector that represents the distributions in each language, as well as a vector that represents the test material. We want to compare the test vector with the representative vector of each language, find the vector that is most similar to the test material, and predict that the language of the test material is the language represented by that vector. The language vector "most similar" to the test vector was found by finding the vector whose distance from the test vector is minimal among all the vectors. The essential question was how to estimate the distance between two vectors. The choice of computation can strongly affect the final results, so we used more than the one method that the original work used.
For convenience, in all the following sections we denote the test vector by $P = (P_1, ..., P_n)$ and the learned language vector by $Q = (Q_1, ..., Q_m)$. Note that the number of attributes in the two vectors may differ (even when they represent the same language), since the vectors depend on the training material and on the test material. To cope with this difficulty, we added to $P$ and to $Q$ the attributes that do not appear in them but do appear in the other vector, and gave them weight 0. Thus both $P$ and $Q$ are vectors of size $x$, where $x \ge \max(m, n)$.
5.3.1 "Simple" Distance
In this method, the distance between the vectors $P$ and $Q$ is given by the formula
\[ \sum_{i=1}^{x} |P_i - Q_i|. \]
Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector ($Q$) minimizes this expression.
5.3.2 Euclidean Distance
In this method, the distance between the vectors $P$ and $Q$ is given by the formula
\[ \sqrt{\sum_{i=1}^{x} (P_i - Q_i)^2}. \]
Since we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector ($Q$) minimizes this expression.
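5.3.3 Similarity by the Cosine of the Angle Between the Vectors
Given two-dimensional vectors in the plane, forming angles $\alpha$ and $\beta$ with the axis, their similarity can be measured by the cosine of the angle between them:
\[
\cos(\alpha - \beta) = \cos(\alpha) \cdot \cos(\beta) + \sin(\alpha) \cdot \sin(\beta)
= \frac{P_1}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_1}{\sqrt{Q_1^2 + Q_2^2}} + \frac{P_2}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_2}{\sqrt{Q_1^2 + Q_2^2}}
= \frac{P \cdot Q}{|P| \cdot |Q|}
\]
That is, the general formula for $x$-dimensional vectors is:
\[ \frac{\sum_{i=1}^{x} P_i \cdot Q_i}{\sqrt{\sum_{i=1}^{x} P_i^2} \cdot \sqrt{\sum_{i=1}^{x} Q_i^2}} \]
We want the angle between the two vectors to be as small as possible (the larger the angle between the two vectors, the larger the difference between them). Since the cosine grows as the angle shrinks, the predicted language is the one whose representative vector ($Q$) maximizes this expression, which is equivalent to minimizing the angle.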
5.3.4 Kullback-Leibler Distance (Symmetric and Asymmetric Versions)
This measure finds the difference between two vectors according to the formula
\[ D_{KL}(P, Q) = \sum_{i=1}^{x} P_i \cdot \log\left( \frac{P_i}{Q_i} \right). \]
It is not a "classical" distance measure, because in its "simple" version it is not symmetric (in general $D_{KL}(P, Q) \ne D_{KL}(Q, P)$). To cope with this possible difficulty, we did not settle for computing the KL distance in its classical version, and also used a symmetric KL measure, which treats the two vectors equally:
\[ D_{Symmetric-KL} = \frac{1}{2} \left( D_{KL}(P, Q) + D_{KL}(Q, P) \right) \]
In both methods we want the distance between the two vectors to be as small as possible, so the predicted language is the one whose representative vector ($Q$) minimizes the expression above.
5.3.5 Sum of Rank Differences (Ranks)
In this method (which is our own development), we sorted the features of each vector in descending order of frequency, and assigned every feature an integer score between 1 and $x$ according to its position in the frequency ordering. We then summed the absolute value of the difference in score for each of the letters, that is,
\[ \sum_{i=1}^{x} |Rank(P_i) - Rank(Q_i)|. \]
The language that yields the lowest sum of differences relative to the tested vector is the most similar language.
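The measures of sections 5.3.1 to 5.3.5 can be sketched as follows. This is a hedged illustration, not the project's implementation: all names are hypothetical. The report does not say how zero frequencies were handled in the KL distance, so the `eps` floor is our assumption, as is breaking rank ties by feature position.

```python
import math


def align(p, q):
    """Pad both sparse vectors with weight 0 for features found only in the other."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]


def simple_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-9):
    """Sum of P_i * log(P_i / Q_i); zero P_i terms are skipped and Q_i is floored at eps (assumption)."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q, eps=1e-9):
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))


def ranks(v):
    """Rank 1 goes to the most frequent feature; ties are broken by position (assumption)."""
    order = sorted(range(len(v)), key=lambda i: (-v[i], i))
    result = [0] * len(v)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


def rank_distance(p, q):
    return sum(abs(a - b) for a, b in zip(ranks(p), ranks(q)))
```

A caller would first run `align` on the test vector and a language vector, and then pass the two resulting lists to any of the measures.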
n-gram                            Diacritics   Number of records   Fitness Functions
Unigram (letter frequency)        With         500                 Gini Gain
Letter frequency at word start    Without      1000                Entropy
Letter frequency at word end                   1500                Information Gain
Bigram (letter pairs)                          2000                Information Gain Ratio
                                                                   Train Error
Table 2: The breakdowns in the decision trees
5.3.6 Infinity Norm
As in the original work, this method was applied to bigrams only.
The infinity norm is defined as the maximum, over the rows, of the sum of the (absolute) differences in the row, and it was computed as follows:
1. For each first letter in the list of bigrams, compute the sum of the (absolute) differences over the bigrams that begin with that letter.
2. Choose the first letter for which this sum yields the largest difference between the tested vector and the language vector:
\[ \| A \|_\infty = \max_{1 \le i \le x} \sum_{j=1}^{x} |P_{ij} - Q_{ij}| \]
3. Return the language for which the value found in the previous step is minimal.
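The three steps above can be sketched as follows. This is an illustration under the assumption that bigram vectors are dictionaries from a two-letter string to its relative frequency; the function names are hypothetical.

```python
from collections import defaultdict


def infinity_norm_distance(p, q):
    """Largest per-first-letter sum of absolute bigram frequency differences."""
    row_sums = defaultdict(float)
    for bigram in set(p) | set(q):
        # Row i of the matrix collects all bigrams that start with letter i.
        row_sums[bigram[0]] += abs(p.get(bigram, 0.0) - q.get(bigram, 0.0))
    return max(row_sums.values(), default=0.0)


def closest_language(test_vector, language_vectors):
    """Step 3: the language whose infinity-norm distance to the test vector is minimal."""
    return min(
        language_vectors,
        key=lambda lang: infinity_norm_distance(test_vector, language_vectors[lang]),
    )
```

Because only the single worst row contributes to the result, this measure ignores most of the information in the two vectors.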
6 Decision Trees
In this approach we learned recurring patterns in the different languages by analyzing the training material and building decision trees according to various parameters, relying on the different examples.
As in the training stage of the statistical method, we measured the frequencies of each of our n-grams in each of the training materials, and built from them a typical vector that models the relative frequency of every n-gram in that training material. Unlike in the statistical method, here we did not create a single vector per language, and instead kept the vectors as they are, one per example. We randomly chose a number of vectors out of all the training vectors, and they became the records of our table, whose columns (that is, the attributes of the table) were the n-grams we found (for example, different pairs of letters). In the next stage we built the decision tree itself (using the ID3 algorithm, described below), so that every attribute is a node in the tree, every edge represents a possible value of that attribute (after the values were discretized according to Table 3), and every leaf is a final label (that is, one of the 14 languages). We then classified the test material, following the algorithm described below and relying on the decision tree we built.
Each stage of the process described above was carried out in several different ways, with the aim of checking which method returns the best prediction:
6.1 By n-grams
The measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:
6.1.1 Unigram
We carried out the process detailed above, where the measure examined was the relative frequency of each single letter of the language in each of the training materials.
6.1.2 Frequency of a Letter at the Start of a Word
We carried out the process detailed above, where the measure examined was the frequency of each letter as the first letter of words in each of the training materials.
6.1.3 Frequency of a Letter at the End of a Word
We carried out the process detailed above, where the measure examined was the frequency of each letter as the last letter of words in each of the training materials.
6.1.4 Bigram (Letter Pairs)
We carried out the process detailed above, where the measure examined was the frequencies of pairs of letters within words in each of the training materials.
Unlike the work with unigrams and single letters, in the work with bigrams the number of attributes is very large (since these are permutations of two letters), and we therefore reached a recursion depth that was too large. To cope with this difficulty, in many cases we chose "only" the 400-800 most common attributes and built the tree from them.
6.2 How We Partitioned into Groups
The attribute values are the frequencies of unigrams or bigrams, that is, numbers in the range 0 to 1. Every sentence contains a different number of words of different lengths, composed of a different number of letters. We therefore expect the frequencies to differ greatly from one another, and to be able to compare different attributes we have to discretize the frequencies.
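The discretization can be sketched as a lookup of the segment into which a frequency falls. This is a minimal illustration: the function name is hypothetical, the default boundaries are the ones listed in Table 3 for our unigram, first-letter and last-letter attributes, and the treatment of a value that falls exactly on a boundary (here it goes to the upper segment) is our assumption.

```python
from bisect import bisect_right

# Upper boundaries of segments 0..10 from Table 3 (our work, unigram column);
# everything from 0.1 up to 1 falls into segment 11.
UNIGRAM_EDGES = [0.0015, 0.003, 0.005, 0.01, 0.015, 0.02, 0.025, 0.035, 0.05, 0.07, 0.1]


def segment(freq, edges=UNIGRAM_EDGES):
    """Return the index of the segment that contains the given relative frequency."""
    return bisect_right(edges, freq)


print(segment(0.081))
```

Passing a different list of boundaries, such as the bigram column of Table 3, reuses the same function for the other attribute types.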
6.3 The Fitness (Gain) Functions
The attributes are the different features from which we build the tree (the different single letters, bigrams and so on), taken together with their values, which are their frequencies in the language.
# Example a b ... Language
1 0.081 0.014 ... English
2 0.12 0.022 ... Spanish
3 0.068 0.017 ... English
... ... ... ... ...
6.3.1 By Entropy
Entropy is a measure from information theory that represents the degree of uncertainty about the distribution of a given attribute, given some data.
          The original work             Our work
Segment   Unigrams    Bigrams           Unigrams, first letter, last letter   Bigrams
0         0-0.001     0-0.00001         0-0.0015                              0-0.00015
1         0.001-0.03  0.00001-0.0003    0.0015-0.003                          0.00015-0.0003
2         0.03-0.06   0.0003-0.0006     0.003-0.005                           0.0003-0.0005
3         0.06-0.09   0.0006-0.0009     0.005-0.01                            0.0005-0.001
4         0.09-0.12   0.0009-0.012      0.01-0.015                            0.001-0.0015
5         0.12-0.15   0.012-0.015       0.015-0.02                            0.0015-0.002
6         0.15-0.18   0.015-0.018       0.02-0.025                            0.002-0.0025
7         0.18-1      0.018-1           0.025-0.035                           0.0025-0.0035
8                                       0.035-0.05                            0.0035-0.005
9                                       0.05-0.07                             0.005-0.007
10                                      0.07-0.1                              0.007-0.01
11                                      0.1-1                                 0.01-0.013
12                                                                            0.013-1
Table 3: The partition into groups
For a given table $A$ (when we compute the entropy for the Language attribute), we define the entropy to be
\[ H(A) = -\sum_{v \in LanguageList} p_v \log(p_v) \]
where $v$ represents a language from the list of languages, and $p_v = \frac{|A_v|}{|A|}$ is the probability of the corresponding label (we use a uniform probability; $A_v$ is the restriction of the table to the records with value $v$ in attribute $a$, and $|A|$ is the number of records in the table).
To determine the degree of uncertainty for a given attribute $a$, we define its entropy to be
\[ H(A, a) = \sum_{v \in Values(a)} H(A_v). \]
Note that, unlike the following functions, where we want the attribute with the highest value, here we want to find the attribute with the lowest entropy value.
6.3.2 By Information Gain
IG measures the expected reduction in entropy given a certain attribute as a node in the tree. For every attribute $a$ in table $A$ we define:
\[ IG(A, a) = H(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot H(A_v) \]
6.3.3 By Information Gain Ratio
For every attribute $a$ we define the Information Gain Ratio to be:
\[ IGR(A, a) = \frac{IG(A, a)}{H(A, a)} \]
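The three entropy-based functions defined so far can be sketched as follows. This is an illustration only: the names are hypothetical, a table is assumed to be a list of dictionaries with discretized attribute values and a "Language" label, the logarithm base (2 here) is our assumption, and so is the convention for a zero denominator in the ratio.

```python
import math
from collections import Counter, defaultdict


def entropy(rows, target="Language"):
    """H(A): entropy of the label distribution in the table."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def split(rows, attr):
    """Group the records into sub-tables A_v, one per value v of the attribute."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr]].append(r)
    return parts


def attribute_entropy(rows, attr):
    """H(A, a): the (unweighted) sum of the sub-table entropies, as defined in the text."""
    return sum(entropy(part) for part in split(rows, attr).values())


def information_gain(rows, attr):
    n = len(rows)
    return entropy(rows) - sum(len(p) / n * entropy(p) for p in split(rows, attr).values())


def gain_ratio(rows, attr):
    """IG(A, a) / H(A, a); a zero denominator maps to inf or 0 (our convention)."""
    h = attribute_entropy(rows, attr)
    ig = information_gain(rows, attr)
    if h == 0:
        return math.inf if ig > 0 else 0.0
    return ig / h
```

An attribute that separates the languages perfectly has sub-tables of entropy 0, which is why the ratio needs an explicit convention for that case.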
6.3.4 By Gini Gain
We did not study this measure in the course this year, but from searches we conducted we learned that the Gini index is a measure of the deviation between the probabilities of the different values of the target attribute (which in our case is Language).
As with entropy, we first define the Gini Index deviation measure of the given table $A$ over the values of Language:
\[ GI(A) = 1 - \sum_{v \in LanguageList} \left( \frac{|A_v|}{|A|} \right)^2 \]
We now define the gain function for every attribute to be:
\[ GG(A, a) = GI(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot GI(A_v) \]
We want to find the maximum of the difference between the overall deviation and the deviation of every sub-table multiplied by the probability of the value that produces it.
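The Gini-based gain can be sketched in the same style. This is an illustration with hypothetical names, assuming a table is a list of dictionaries with discretized attribute values and a "Language" label.

```python
from collections import Counter, defaultdict


def gini_index(rows, target="Language"):
    """GI(A) = 1 minus the sum of squared label probabilities."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(rows, attr, target="Language"):
    """GG(A, a): overall Gini index minus the weighted Gini index of the sub-tables."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr]].append(r)
    n = len(rows)
    return gini_index(rows, target) - sum(
        len(part) / n * gini_index(part, target) for part in parts.values()
    )
```

As with information gain, the attribute chosen for a node would be the one with the largest gain.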
6.3.5 By Train Error
This function measures the decrease in the expected training error for a given attribute, and we want to choose the attribute $a$ that maximizes the expected decrease:
\[ TE(A, a) = \min_{v \in LanguageList} (p_A) - \sum_{v \in value(a)} \left( \frac{|A_v|}{|A|} \right) \min_{Language \in LanguageList} (p_{A_v}) \]
where $p_A$ is defined with respect to the table $A$.
6.4 The Algorithm for Building the Tree
The algorithm we used to build the decision trees is ID3, the recursive algorithm we saw in the course [2], using the fitness (gain) functions described above.
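A generic ID3 construction can be sketched as follows. This is the textbook recursion, not necessarily identical to the version taught in the course or implemented in the project. All names and the tree representation (a leaf is a language string, an inner node is a pair of an attribute and a dictionary of sub-trees) are hypothetical, and the gain function is passed as a parameter so that any of the functions above could be plugged in.

```python
import math
from collections import Counter, defaultdict


def entropy(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in Counter(r["Language"] for r in rows).values())


def split(rows, attr):
    parts = defaultdict(list)
    for r in rows:
        parts[r[attr]].append(r)
    return parts


def information_gain(rows, attr):
    n = len(rows)
    return entropy(rows) - sum(len(p) / n * entropy(p) for p in split(rows, attr).values())


def build_tree(rows, attributes, gain=information_gain):
    """Return a leaf label, or (attribute, {value: subtree}) for an inner node."""
    labels = Counter(r["Language"] for r in rows)
    if len(labels) == 1 or not attributes:
        # Pure table, or nothing left to split on: label with the majority language.
        return labels.most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return best, {
        value: build_tree(part, remaining, gain) for value, part in split(rows, best).items()
    }
```

Only the attribute values that actually occur in the training records get a sub-tree, which is exactly why a test vector may later hit a value with no matching branch.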
6.5 The Algorithm for Classifying the Language (classification)
The initial classification algorithm was a "simple" recursive algorithm that "walks" down the decision tree. In every iteration it chooses the sub-tree that matches the value the test vector has for the attribute of the current node, and makes a recursive call on that sub-tree. The algorithm stops when it reaches a leaf, and returns the label (that is, the language) of that leaf. However, not all the possible values of an attribute are always represented, because we choose the training material randomly out of all the records, and because in many cases the trees were too large and we had to restrict ourselves to the leading 400-800 attributes. We may therefore try to look for a sub-tree that does not exist in the learned tree, and get stuck in the classification process because no sub-tree matches.
To solve this problem, we decided that once we reach a node that has no sub-tree matching the value of the attribute in the classified vector, we go over all the sub-trees of that attribute. We define every such new walk as a "guess", and keep walking down the tree until we reach a leaf. Every time we reach a "problematic" state while walking down a sub-tree we count it as another "guess", and go over all the sub-trees of the node's attribute. At the end of the process we obtain the label from each such sub-tree together with the number of "guesses" made on the way, and we return the label that was reached with the smallest number of "guesses" (since every "guess" indicates a gap between the test material and the language we are approaching, and we want to make this gap as small as possible).
In addition, since the trees can be large and contain many such nodes, and in order to avoid overfitting, we decided to make the process more efficient and perform pruning during the "walk" down the tree. In every recursive call we also passed the lowest number of guesses that had yielded a label so far. If during the walk down the tree we exceeded this number of guesses, we stopped branching further and abandoned that path, since it would not produce the optimal result.
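The walk with "guesses" and pruning can be sketched as follows. This is an illustration under stated assumptions: the names and the tree representation (a leaf is a language string, an inner node is a pair of an attribute and a dictionary of sub-trees) are hypothetical, and a path is abandoned as soon as its guess count reaches the best count found so far, so ties keep the first label found.

```python
import math


def classify(tree, sample, guesses=0, best=math.inf):
    """Return (label, guesses), or (None, inf) if the path was pruned."""
    if guesses >= best:
        # Pruning: this path can no longer beat the best labelled path so far.
        return None, math.inf
    if not isinstance(tree, tuple):
        return tree, guesses  # reached a leaf
    attr, children = tree
    value = sample.get(attr)
    if value in children:
        return classify(children[value], sample, guesses, best)
    # No matching sub-tree: try every sub-tree, each counted as one more guess.
    best_label = None
    for subtree in children.values():
        label, used = classify(subtree, sample, guesses + 1, best)
        if label is not None and used < best:
            best_label, best = label, used
    return (best_label, best) if best_label is not None else (None, math.inf)


# A toy tree for illustration only.
toy_tree = ("a", {0: "English", 1: ("b", {0: "Spanish", 1: "German"})})
print(classify(toy_tree, {"a": 1, "b": 1}))
print(classify(toy_tree, {"a": 2, "b": 1}))
```

In the second call the value 2 has no branch under attribute a, so every sub-tree is tried with one guess. The first leaf found sets the bound to 1, and the remaining sub-tree is pruned immediately.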
Footnote 2: Recitation 10, slide 24.
Part V
Results
In this part we present the results we obtained with each of the approaches we took. The analysis of the specific results is given in this chapter, while a broader discussion appears in the next chapter. The full results, both for the statistical approach and for the decision trees, are given in Appendices A and B.
7 Naive approach (Zipf's law)
As expected, the results achieved by the naive approach were excellent, as can be seen in the graph (footnote 3):
It is interesting to note that the identification rate of the Borda Count method was exactly the same for 500 and for 1000 words; that is, there is (apparently) a kind of "glass ceiling" on the predictive power that common words can provide.
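It is important to stress again that this approach is very different from the learning methods we focused on in the project, and we therefore did not compare the results we obtained with those methods to the results above.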
8 Statistical approach
As explained above, in this method the learning process comprised a very large number of runs, across the many configurations listed in Table 1. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.
Comparing the prediction we obtained with the true result was quite simple, since we worked with labelled files (that is, we knew which language we were supposed to get).
8.1 By the method of measuring the distance between the vectors
As mentioned, the part of the original work that dealt with the statistical approach used the infinity norm to estimate the distances between the vector of each language and the vector representing the sentence under test, whereas in this project we applied additional tools.
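To make the comparison between a language vector and a sentence vector concrete, here is a minimal Python sketch of several of the measures named in the results tables (Infinity, Euclidean, Angle, Kullback and Symmetric Kullback). It assumes the standard textbook definitions; the exact formulas, the smoothing constant `eps`, the direction of the KL divergence and the helper `closest_language` are assumptions made for illustration and may differ from what the project used.

```python
import math

def infinity_norm(p, q):
    """Largest absolute difference between corresponding entries."""
    return max(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle(p, q):
    """Angle (in radians) between the two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def kullback_leibler(p, q, eps=1e-9):
    """KL(p || q); eps is an ASSUMED smoothing term that avoids log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def symmetric_kl(p, q):
    return kullback_leibler(p, q) + kullback_leibler(q, p)

def closest_language(profiles, sentence_vector, distance):
    """Return the language whose profile vector is nearest to the sentence."""
    return min(profiles, key=lambda lang: distance(profiles[lang], sentence_vector))

# Toy frequency vectors over three hypothetical features.
profiles = {"English": [0.5, 0.3, 0.2], "German": [0.2, 0.3, 0.5]}
sentence = [0.45, 0.35, 0.2]
predictions = {
    f.__name__: closest_language(profiles, sentence, f)
    for f in (infinity_norm, euclidean, angle, kullback_leibler, symmetric_kl)
}
```

The infinity norm looks only at the single largest difference, while the other measures aggregate over all entries, which is one plausible reason for their different behaviour on sparse sentence vectors.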
Footnote 3: General note on the graphs: because of space constraints and the wish to present the resulting functions (which gave results close to one another) as clearly as possible, the y axis does not always start at 0 and end at 100 (on such a scale the graphs might have been less readable).
When we ran only on the languages from the original work, we obtained the following results:
As can be seen, in almost all runs and configurations the infinity norm yielded the worst results, and in most cases by a significant margin from the other measures. In the original work the data were tested with training material of 600 records, and the success rate was about 46%. We reached results similar to those of the original work: slightly worse when we ignored the diacritics and slightly better when we took them into account. Overall, we reached between 40% and 50% identification success with this method, with the diacritics helping slightly.
Unlike the infinity norm, the additional mathematical methods we used performed very well: up to 70% identification without diacritics, and up to 80% identification when they were taken into account. The leading method, both with and without diacritics, was KL, which yielded similar results in its symmetric and non-symmetric versions. Most of the methods yielded similar results regardless of the diacritics (with a tendency to improve when they were taken into account), the exceptions being KL and Ranks. It is interesting to note that precisely when we took the diacritics into account (which supposedly provide more information), the Ranks method worked less well. The most likely reason is that converting the diacritic characters to regular letters (for example, an accented a to a plain a) changed the ordering of the common letters in the training material differently than in the test material, which is smaller and therefore more sensitive to changes.
When we ran on all 14 languages, we obtained the following results:
The quality of the results dropped here: when the classifier was required to predict a language among the 6 original languages only, it did so better than when it was required to predict a language among the 14 languages (70% compared with 80% for the 6 original languages). This is still excellent identification, especially considering the number of languages the classifier has to separate. It is interesting to see that even when classifying a language among 14 languages (and not only among 6), in most cases we obtained better results than those of the infinity norm method used in the original work. Here too the relationship between the different distance functions is preserved, with KL continuing to lead. Except for Angle and Ranks, all the functions yielded similar or better results when we took the diacritics into account, which apparently provide the information essential for better distinguishing between the languages, as detailed at the beginning of this work.
8.2 By n-gram
In the original work, learning was performed on bigrams and unigrams. As mentioned, in this project we decided to add further measures that rely on the frequency of letters at the beginning of a word and at the end of a word.
When we ran only on the languages from the original work, we obtained the following results:
When we ran on all 14 languages, we obtained the following results:
As we expected, the most significant information came from the bigrams: the number of possible bigrams is very large. Among the single letters, we were surprised to discover that it was the letter at the end of the word that provided the most information in the learning process. This finding is nevertheless consistent with the fact that in many cases the last letter has a grammatical role (for example, marking the plural, verb inflection in the different tenses, and so on), so it makes sense that it helps to identify the language (for instance, the plural in English is marked by the letter s, in German by n, and in Italian by i or e). It is interesting to note that testing by the first letter yielded the (relatively) weakest results, yet these were not bad either: above 45% identification (similar to the identification rate of the infinity norm).
8.3 Learning rate (computation time)
In the first stage we created a data structure holding all the training sentences from all the languages. Given this data structure, we ran algorithms that learned the frequencies of the bigrams, the unigrams, the word-initial letters and the word-final letters in each of the 14 languages.
Learning was very fast: about 25 seconds to learn all of these statistics for all the languages.
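The four statistics mentioned above can be gathered in a single pass over the training sentences. The following Python sketch is an illustration only: the function name `learn_statistics`, the whitespace tokenisation and the choice to count bigrams inside words only are assumptions, not details taken from the project.

```python
from collections import Counter

def learn_statistics(sentences):
    """Relative frequencies of unigrams, bigrams, word-initial letters and
    word-final letters for the sentences of one language."""
    counts = {"unigram": Counter(), "bigram": Counter(),
              "first": Counter(), "last": Counter()}
    for sentence in sentences:
        for word in sentence.lower().split():
            letters = [ch for ch in word if ch.isalpha()]
            if not letters:
                continue
            counts["unigram"].update(letters)
            # Bigrams are taken inside a word only (an assumption).
            counts["bigram"].update(a + b for a, b in zip(letters, letters[1:]))
            counts["first"][letters[0]] += 1
            counts["last"][letters[-1]] += 1
    # Normalise each counter into relative frequencies.
    return {name: {key: n / sum(c.values()) for key, n in c.items()}
            for name, c in counts.items()}

# Hypothetical miniature training set: language -> list of sentences.
training = {"English": ["the cats sleep"], "Italian": ["i gatti dormono"]}
stats = {lang: learn_statistics(sents) for lang, sents in training.items()}
```

Because `str.isalpha` accepts accented letters, the same code covers the runs that keep diacritics; the runs without them would first map accented characters to their plain forms.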
8.4 Consistency and overfitting
Overfitting may occur when the classifier starts to "memorize" the training material instead of learning generalizations from it, and as a result gives greater weight to "noise". In such a case the classifier returns better and more accurate results on the familiar training material, but the results on new, unfamiliar test material are worse (that is, the classifier predicts less well on new information). To verify that no overfitting occurred on the one hand, and to see that the training material is indeed classified properly on the other, we ran the algorithm on part of the training material after learning was complete. As the table below shows, the results of running the algorithm on the training material are indeed better in most cases, but not by a margin large enough to indicate memorization. We can therefore conclude that there is no overfitting. In addition, the results are relatively high, which indicates that the classifier did identify the language in most cases.
                                     Kullback  Symmetric Kullback  Angle  Euclidean  Infinity  Ranks  Simple Difference
Original results                     69.34     68.87               46.93  59.19      44.94     42.56  53.07
Small part of the training material  71.41     69.3                50.95  61.12      40.07     51.91  57.14
8.5 The recall, precision and F1 measures
As part of the project, we ran an algorithm on test material and ultimately obtained a prediction of the language in which the test material is written. Note that the predictions can be viewed in the following ways:
1. True Positive = we identified the sentence as belonging to some language, and it does belong to it.
2. False Positive = we identified the sentence as belonging to some language, but it does not belong to it.
3. True Negative = we identified the sentence as not belonging to some language, and it indeed does not belong to it.
4. False Negative = we identified the sentence as not belonging to some language, but it does belong to it.
Another way to evaluate the results is through the recall and precision measures.
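Recall measures the rate of correct assignments to a language out of all the assignments we should have obtained under perfect identification. In other words, it captures the miss rate (the higher the recall, the fewer the misses).
Precision measures the rate of correct assignments to a language out of all the assignments we actually obtained for that language. In other words, it captures the noise rate (the higher the precision, the less noise).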
\[
recall = \frac{True\ Positive}{True\ Positive + False\ Negative}
\]
\[
precision = \frac{True\ Positive}{True\ Positive + False\ Positive}
\]
Both measures are important, but neither stands on its own (good recall can come together with poor precision, so recall alone is not enough to establish that the prediction is good). It is therefore customary to look at combined measures as well, the most prominent of which is F1, the harmonic mean of the two, given by the following formula:
\[
F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}
\]
The possible values of all these measures lie between 0 and 1, and the closer they are to 1, the better the prediction.
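The three measures can be computed per language directly from the lists of true and predicted labels. The sketch below is a minimal illustration; the function name `precision_recall_f1` and the toy labels are hypothetical, and how the project aggregated the per-language values into a single figure is not stated in this section.

```python
def precision_recall_f1(true_labels, predicted_labels, language):
    """Precision, recall and F1 for one language, treating that language
    as the positive class and every other language as negative."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == language and p == language)
    fp = sum(1 for t, p in pairs if t != language and p == language)
    fn = sum(1 for t, p in pairs if t == language and p != language)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy run: six test sentences with their true and predicted languages.
true_labels = ["en", "en", "en", "de", "de", "it"]
predicted = ["en", "en", "de", "de", "en", "it"]
english = precision_recall_f1(true_labels, predicted, "en")
italian = precision_recall_f1(true_labels, predicted, "it")
```

For English there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3; Italian is identified perfectly.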
We measured these values in our runs:
               with diacritics               without diacritics
               500    1000   1500   2000     500    1000   1500   2000
original       0.648  0.651  0.654  0.653    0.65   0.646  0.642  0.637
all languages  0.618  0.606  0.599  0.593    0.59   0.586  0.583  0.581
9 Decision trees
As explained above, in this method we performed a very large number of runs, across the many configurations mentioned in Table 2. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.
Here too, comparing the prediction we obtained with the true result was quite simple, since we worked with labelled files (that is, we knew which language we were supposed to get).
9.1 By n-gram, in relation to the fitness (gain) functions
When we ran only on the languages from the original work, we obtained the following results:
When we ran on all 14 languages, we obtained the following results:
To our surprise, in all runs the algorithms that work on the first letter and on the last letter yielded relatively low prediction rates (below 30%). In our opinion, the reason is that we used for them the same partition into groups as for unigrams in general (as detailed in Table 3), rather than a dedicated partition fitted to the distributions we found.
With regard to the Gini function, our results are very similar to those of the original work, both for bigrams and for unigrams. However, unlike the original work, where this function yielded the best results, we managed to reach even better results using the IGR function, for both unigrams and bigrams. This result is consistent with what we learned in the course, namely that IGR was developed to reduce the overfitting that may arise when using the Information Gain function. Indeed, there is a gap of between 5% and 15% in the performance of these two functions, for both bigrams and unigrams.
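The relation between Information Gain and IGR can be illustrated with a short Python sketch. It uses the standard textbook definitions (IG is the entropy of the table minus the size-weighted entropy of its sub-tables; IGR divides IG by the entropy of the split itself); the function names and the toy rows are hypothetical, and the project's own definitions appear earlier in the report.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split(rows, attribute):
    """Group the labels of the rows by the value of the given attribute."""
    groups = defaultdict(list)
    for features, language in rows:
        groups[features[attribute]].append(language)
    return list(groups.values())

def information_gain(rows, attribute):
    labels = [language for _, language in rows]
    groups = split(rows, attribute)
    return entropy(labels) - sum(len(g) / len(rows) * entropy(g) for g in groups)

def gain_ratio(rows, attribute):
    """IG divided by the split information, which penalises attributes
    that shatter the table into many small sub-tables."""
    groups = split(rows, attribute)
    split_info = -sum(len(g) / len(rows) * math.log2(len(g) / len(rows))
                      for g in groups)
    return information_gain(rows, attribute) / split_info if split_info else 0.0

# "th" splits the table into two clean halves; "id" is unique per row.
rows = [
    ({"th": "high", "id": "a"}, "en"),
    ({"th": "high", "id": "b"}, "en"),
    ({"th": "low", "id": "c"}, "de"),
    ({"th": "low", "id": "d"}, "de"),
]
ig = {a: information_gain(rows, a) for a in ("th", "id")}
igr = {a: gain_ratio(rows, a) for a in ("th", "id")}
```

Both attributes reach the same Information Gain on this toy table, but the gain ratio of the row-unique "id" attribute is halved, which is the kind of memorizing split that IGR discourages.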
As we expected, and unlike the original work, in most cases the functions that worked with bigrams gave better results, even when we reduced the number of attributes we used.
It should be noted that building the trees with the entropy function took a long time, but the results they yielded were good, both in absolute terms and relative to the other functions. In other words, entropy is inefficient and each split made with it provides less information than with the other functions, but in the end the sum of all the splits provides no less information (and sometimes even more) than with most of the other functions.
9.2 Learning rate (computation time)
The average construction time of each tree was about half an hour. The time required to build a decision tree was directly proportional to the number of attributes and to the number of rows in the training material: the more of them there were, the longer tree construction took. In many cases we had to limit the number of attributes to 800 or even fewer. Using diacritics lengthened the computation considerably, since the number of attributes grew dramatically.
The trees for the unigrams, the first letter and the last letter were built relatively quickly (a few minutes), whereas the bigram trees required a long computation time (for example, the IG unigram tree on 2000 records was built in less than a minute, whereas the bigram tree using Gini Gain on 2000 records was built in 56 minutes). This can be explained by the relatively small number of attributes in trees of the first kind (the attributes were single letters, so there were only a few dozen of them: the "regular" letters plus the diacritic characters), whereas the number of attributes in the bigram trees was significantly larger (in fact, on the order of the square of the number of attributes of the first kind, as explained above).
Among the measures, entropy required the longest computation time. The reason is that entropy is too general a measure: it does not take the different options and features sufficiently into account, so it does not split the data clearly enough, which leads to a much greater recursion depth and a longer computation. This is exactly why better measures such as IGR are usually used; they compute the entropy but weight it with additional calculations.
Part VI
Discussion, conclusions and thoughts for the future
In the summary of their work, Moran and Tamar wrote that "we believe that on the basis of our work even more accurate results can be reached". Indeed, the results we reached in our work are significantly better: we achieved over 70% identification across a range of methods and configurations (for bigrams, for unigrams, and with various statistical functions and fitness (gain) functions), and as much as 79% identification with the KL method and 73% with the IGR method. In the original work, by contrast, the best result was about 60%, and it was obtained for large training material of 2000 records per language and with a single method only; for all the other methods the identification rate was 45% at most.
It is interesting to note that Moran and Tamar's best results were obtained for unigrams, although in the vast majority of cases the success rates were below 50%. When they worked with bigrams, they reached about 25% success. They hypothesized that the reason was that the partition they used was probably not fine enough, and that a partition into different segments might yield better results. Our work, however, suggests otherwise. As part of the work on the trees, we experimented with the partition into clusters (as can be seen in Table 3): we ran the trees many times on many different partitions, the idea being that the finer and more balanced the partition across the clusters, the shallower the tree and the better the split. In the end, however, we did not find notable differences in the results between the partition we tried and the original partition; sometimes the new partition achieved better results and sometimes the original one did. In our opinion, the relatively low results in the original work may be due to training or test material that was biased or incorrect, or to test material that was not cleaned (unlike the training material), so that the two did not match. Another possibility is that learning the bigrams caused overfitting, so that their classifier could not cope well with new material.
Moran and Tamar removed the diacritics, assuming that keeping them raises the complexity of the task and enlarges the data structures and the learning time. In our project, by contrast, we chose to support both options: keeping the diacritics and removing them. As we expected and detailed in the introduction, the diacritics added significant information, and therefore most of the algorithms, both in the statistical method and in the decision trees, showed better results when we took the diacritics into account (an average improvement of about 10% with the statistical tools and of up to 20% with the decision trees). It is nevertheless important to note that even without the diacritics the various algorithms proved themselves and reached very good results. It is worth noting that keeping the diacritics had a fairly negligible effect on the running times of the statistical algorithms, but a considerable effect on the running times of tree construction (though not on classification once the trees had been built). This is not surprising, since keeping the diacritics adds at most a small, constant number of extra calculations in the statistical algorithms, but may considerably increase the number of attributes in each table (as described above). In many cases this increases the recursion depth, and with it the computation time of tree construction.
With regard to the statistical approach, Moran and Tamar wrote that they think "the results can be improved... by means of inference methods more advanced... [than] the infinity norm". Indeed, all the new tools we used showed very high identification ability, whereas the measure based on the infinity norm reached weaker results. In addition, it was interesting to discover that the KL measure yielded almost identical results in its symmetric and non-symmetric versions.
With regard to the decision trees, we found that their main drawback is that they are very large and take a long time to compute. In many cases we reached the maximum recursion depth and therefore had to reduce the number of attributes we used. As a result, classification was less good than in the statistical cases. As mentioned, we experimented with the partitions in an attempt to improve the results, but these changes did not show a notable improvement. Nevertheless, most of the decision-tree results came out the same whether the tree was built on "only" the 6 original languages or on all 14 languages, which supports our view that decision trees are indeed good tools for language identification and that they are almost unaffected by the number of languages examined.
It can be seen that the first-letter and last-letter measures did not work well with the decision trees, whereas they reached significantly better results with the statistical tools. As mentioned, we believe this is related to the partition into groups. Since developing, running and analysing all the algorithms in the project took many weeks, we did not manage to investigate the matter in depth. In our opinion, a future project could examine the data and try different runs with new partitions, finer and coarser, until an ideal partition is found that best expresses the profiles of the languages.
The results of our work, like the results of many other works in the field, demonstrate once again a remarkable claim: natural languages can be identified even without prior linguistic knowledge. There is no need to know grammar rules, different registers or the semantic meaning of words in order to identify the language; a "cold", dry analysis of letters and words suffices, and these can be taken from a huge variety of accessible sources such as newspapers, websites and literature. In addition, the results of the work and the tools we used can serve to enrich our linguistic knowledge about different languages, such as how close they are to one another and various rules (for example, discovering a rule whereby the letter a is more frequent at the end of a word in Italian than in English), and to examine the cognitive implications that follow.
Our project can be continued and extended by adding further measures, such as a combination of bigrams and unigrams (for example, one can build a tree whose nodes are the x leading bigrams and the y leading unigrams). In our opinion this could lead to even better results than those we reached. Unfortunately, we did not manage to implement this within the project because of the shortage of time and the large number of measures we used. Further topics that can be investigated concern measures that were not examined, such as the average number of words in a language or the average word length in a language (for instance, the average word in German is longer than the average word in English). In addition, one can try to improve the partition into clusters and reach even better results.
Part VII
Sources
• The original work by Moran and Tamar: http://www.cs.huji.ac.il/~ai/projects/NLP.pdf
• Lecture and recitation slides of course 67842, "Introduction to Artificial Intelligence"
• Lecture slides of course 36622, "Computational Learning and Cognitive Aspects of Language"
• Gutenberg Project - http://www.gutenberg.org/
• http://www.bookrix.com/
• http://www.e-book.com.au/morefreebooks/freemultilingualbooks.htm
• http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/
• http://en.wikipedia.org/wiki/List_of_languages_by_writing_system#Latin_script
• http://en.wikipedia.org/wiki/Letter_frequency
• http://stackoverflow.com/questions/3194516/replace-national-characters-with-ascii-equivalent
• http://staff.science.uva.nl/~tsagias/?p=185
• http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf
• http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=4
• http://www.101languages.net/common-words/
Part VIII
Appendix A - Main statistical results in detail
Original languages   w/ diacritics                 w/o diacritics
                     500    1000   1500   2000     500    1000   1500   2000
Kullback             79.43  78.07  78.07  79.07    67.72  68.36  66.92  66.86
Symmetric Kullback   77.85  75.71  75.5   76.71    67.28  67.28  65.86  64.92
Angle                59.5   58.28  60.57  59.43    57     56.5   60.22  58.78
Euclidean            70.21  66.57  68.71  67.5     67.07  66.72  66.78  66.78
Infinity             48.85  43.14  47.71  46.29    41.14  42.57  45.42  42
Ranks                58.07  60     58.57  60.71    69.28  67.07  65.22  68.14
Simple Difference    62.85  65.14  64.14  64.79    58.36  61.07  60.5   61.78
All languages        w/ diacritics                 w/o diacritics
                     500    1000   1500   2000     500    1000   1500   2000
Kullback             69.34  69.22  69.59  69.88    62.73  63.45  62.53  61.9
Symmetric Kullback   68.87  68.33  69.19  69.4     59.09  60.43  59.02  57.75
Angle                46.93  46.17  45     44.75    49.48  49.25  49.25  48.96
Euclidean            59.19  58.77  57.38  57.12    56.66  57.98  57.56  56.59
Infinity             44.94  41.49  43.33  40.8     39.78  39.78  41.03  38.39
Ranks                42.56  43.85  46.35  45.60    55     57.53  58.45  57.75
Simple Difference    53.07  52.25  51.78  50.67    53.17  53.21  52.98  51.5
Original languages   w/ diacritics                 w/o diacritics
                     500    1000   1500   2000     500    1000   1500   2000
Unigrams             62.57  62.05  61.91  60.1     56.96  55.61  57.81  58.57
Bigrams              69.55  66.89  68.48  69.39    69.6   65.31  68.86  68.81
First                58.42  60.86  61     61.1     55.1   54.23  52.91  55.71
Last                 77.96  75.42  75.52  77.72    71.42  79.05  73.52  70.61
All languages        w/ diacritics                 w/o diacritics
                     500    1000   1500   2000     500    1000   1500   2000
Unigrams             53.23  52.04  54.54  53.14    52.1   52.25  54.06  53.23
Bigrams              68.01  65.59  65.53  65.65    68.88  67.42  67.07  66.84
First                47.6   48.52  47.85  46.53    45.09  48.32  46.55  45.03
Last                 53.95  55.54  54.57  55.47    53.14  55.32  54.5   53.1
Part IX
Appendix B - Main decision-tree results in detail
              First Letter                  Last Letter
              500    1000   1500   2000     500    1000   1500   2000
Gini          20     21.15  21.84  18.39    23.45  21.38  20.92  22.76
Entropy       20.68  20.68  22.06  22       25.74  26.43  23.9   29.86
IG            18.85  19.54  20.23  21.61    20.92  20.46  20     20.1
IGR           22.53  27.36  29.66  29.89    21.38  26.9   28.28  26.67
Train Error   16.09  17.93  18.62  18.16    15.86  20.69  20.69  20.69
              Unigrams                      Bigrams
              500    1000   1500   2000     500    1000   1500   2000
Gini          51.03  49.2   52.41  54.71    30.11  30.8   28.9   31.03
Entropy       57.24  62.29  70.11  68.28    61.38  64.83  67.13  62.56
IG            42.53  46.67  53.79  56.55    56.32  61.84  61.61  63.51
IGR           61.38  62.07  72.64  71.49    69.65  71.3   73.33  72.64
Train Error   39.77  42.53  44.83  46.44    27.58  28.05  33.1   31.72