
Final project for the course Introduction to Artificial Intelligence:

Natural Language Identification

Lior Resisi, lresisi, 201564895

Mika Barkan, mikab4, 200790111

March 23, 2014

Table of Contents

I Introduction

II The language identification task and related work

III Collecting data for the training and test material
1 Choosing the languages
2 Source of the texts and processing of the training material
3 Handling diacritics

IV Algorithms
4 The naive approach: Zipf's law
4.1 Plain counting
4.2 The "Borda Count" method
5 Statistical approach
5.1 By n-grams
5.1.1 Unigram
5.1.2 Letter frequency at the start of a word
5.1.3 Letter frequency at the end of a word
5.1.4 Bigram (letter pairs)
5.2 How the vectors of each language are merged
5.2.1 Uniform weight
5.2.2 Choosing the maximum value
5.3 How the distance between the language vector and the tested vector is measured
5.3.1 "Simple" distance
5.3.2 Euclidean distance
5.3.3 Similarity by the cosine of the angle between the vectors
5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)
5.3.5 Sum of rank differences (Ranks)
5.3.6 Infinity norm
6 Decision trees
6.1 By n-grams
6.1.1 Unigram
6.1.2 Letter frequency at the start of a word
6.1.3 Letter frequency at the end of a word
6.1.4 Bigram (letter pairs)
6.2 How we divided the values into bins
6.3 The fitness (gain) functions
6.3.1 By Entropy
6.3.2 By Information Gain
6.3.3 By Information Gain Ratio
6.3.4 By Gini Gain
6.3.5 By Train Error
6.4 The tree-building algorithm
6.5 The language classification algorithm

V Results
7 Naive approach (Zipf's law)
8 Statistical approach
8.1 By the distance measure between the vectors
8.2 By n-gram
8.3 Learning rate (computation time)
8.4 Consistency and overfitting
8.5 Recall, precision and F1 measures
9 Decision trees
9.1 By n-gram, relative to the fitness (gain) functions
9.2 Learning rate (computation time)

VI Discussion, conclusions and thoughts for the future

VII References

VIII Appendix A: details of the main statistical results

IX Appendix B: details of the main decision-tree results

Part I

Introduction

The Latin alphabet is one of the most widespread writing systems in the world, used to write more than 120 languages from different language families (Romance, Germanic, Slavic, etc.). It comprises 26 basic letters, to which some languages have added special marks ("diacritics") that adapt the alphabet to the language. The wide use of this alphabet raises an interesting and challenging computational problem: given a text written in Latin letters, can we identify which natural language it is written in, and how?

This problem touches many fields (such as linguistics, computational linguistics, cognitive science and many branches of computer science) and raises many further questions. For example, one can discuss Chomsky's linguistic views on universal rules of natural languages, or cognitive questions such as how we acquire a language or learn to recognize words. At the same time, it is important to remember that the problem is not purely theoretical. It has many practical uses, such as making text search and document handling more efficient, or serving as a first step in automatic translation. Statistical analysis of the differences between languages, and in particular analysis of letter frequencies in different languages, can also help in cryptanalysis and information security (for example, with substitution ciphers and classical ciphers, as part of encrypting and decrypting texts, or in identifying patterns in the original document versus the decrypted one), in encoding and compressing information (such as Huffman coding), and so on.

Our work deals with a simplified version of the natural language identification problem (a sub-topic of NLP, which is a very interesting field within AI). We describe ways in which one can learn to identify automatically the language of a given text, out of a predefined set of languages written in Latin letters:

Italian, German, Spanish, Romanian
Indonesian, Hungarian, Polish, Swedish
English, Turkish, Portuguese
Afrikaans, Latin, French

The basic idea that guided us throughout the project was that every language has a characteristic pattern in the distribution of letters across its words. For example, we expect the relative frequency of single letters (unigrams) or of letter pairs (bigrams) to differ from language to language. Likewise, the frequency of a given letter at the start or at the end of a word depends on the language, and can therefore indicate the language and help identify it. In light of this idea, we decided to try to learn these patterns for each of the 14 languages above, and to check whether we can predict the language of a given text based on the data and patterns we learned.

Our work is in dialogue with the project "Natural Language Detection", written by Moran and Tamar in the 2010-2011 academic year, which tackled a similar task. We try to attack the problem differently and to challenge the results the original authors achieved, using many additional tools from the field of artificial intelligence that should help us reach even better results. Like the original authors, we approach the problem with supervised learning methods, namely statistical tools and decision trees, and try to build language classification functions from the training material. The essential difference between the two projects lies in the choice of features by which the language is identified. This choice can lead to large differences in the decision trees and statistical tools used, as well as in the best way to compute and in the algorithms we chose. In addition, we apply computational and mathematical ideas beyond those used by the original authors, and we also widen the classification task: instead of a classifier that chooses among "only" six languages, we try to identify a language out of a wider range of languages written in the Latin alphabet (which of course complicates the task considerably and requires taking many more parameters into account).

The code for the project was written in Python 3.0. Almost all of it was written by us, because we wanted full control over the algorithms and the materials we worked with, and did not want to rely on ready-made tools (such as NLTK) that could have limited our flexibility. All of the code (including run instructions in the file README.txt) is available at the following address:

http://tinyurl.com/qda6f5h

Part II

The language identification task and related work

Feature selection is an essential and decisive part of identifying the language of a given text. For our work we rely on linguistic features, which let us attack the problem with different supervised learning methods built on those features. As part of the process, we examine the frequencies of letter unigrams and bigrams in each of the tested languages. That is, we find the probabilities that a given letter or letter pair appears in a given language, and check whether we can infer (or learn) from this the language in which the text is written. Additional features we examine, which the original work did not consider, include the frequency with which a given letter appears at the start or at the end of a word (for example, the letter "a" at the end of a word is common in Italian and rare in English), and so on.

Unlike the discipline of linguistics, we try to learn these features automatically (some of them are known and have been studied by linguists and NLP researchers). We use them both for a "simple" statistical test (comparing vectors) and for building decision trees, using various gain functions (such as entropy, information gain, etc.). In this work we try to challenge the methods used in the original work, and we use many additional methods that it did not examine. For example, we check whether the results previously obtained with the infinity norm for comparing languages can be improved, and we also use additional tools such as the Kullback-Leibler distance. Ultimately, we want to reach better results by applying tools from the field of artificial intelligence and by creating a set of rules and patterns learned from the training material.

The naive way to identify the language of a given text is to look up the words of the tested text in a dedicated database containing all the words of each supported language (that is, to maintain a dictionary for each supported language, look up each word of the given text in it, and return the "closest" language). However, such a naive solution is very expensive in space, and searching such a huge database (for each of the languages) may take a very long time, which hurts efficiency dramatically. A possible improvement can use the observation of the information theorist George Zipf, who showed [1] that the word frequency distribution is similar in all languages, with a short "body" and a long "tail". For example, the 10 most common words in English cover about 25% of the occurrences in a typical text, and the 100 most common words already cover about 45% of the occurrences. In other words, the database can be reduced to a "compact" one containing a list of a few dozen (or a few hundred) of the most common words in each language, or unique function words that characterize the language. Yet such learning, although it is significantly less complicated and uses significantly less space, is not really "interesting" from an artificial intelligence point of view. Moreover, it will not be useful when we want to identify single words, or a sequence of a few words that contains no prepositions or very common words. We therefore want a different way, one that relies on given training material and builds a classifier from it that identifies the language of the given text with as high a probability as possible. Nevertheless, in our work we also briefly examined this approach.

Learning is the most natural and interesting way to solve the language identification task, and we also discussed it in the introductory courses on computational linguistics. Many automatic NLP tools, such as Google Translate, are based on learning and statistics. In addition, much research in computational linguistics is "heading in this direction", and we want to continue and check, on a small scale of course, what results we can reach.

[1] George K. Zipf (1949), Human Behavior and the Principle of Least Effort, Addison-Wesley.

Part III

Collecting data for the training and test material

When we set out to tackle the task, we had to make several very significant decisions:

1 Choosing the languages

One of the original goals we set for ourselves was to increase considerably the number of languages we can identify (compared with six languages in the original work). The choice of how many languages, and which ones, is very significant, because it strongly affects the chosen strategy as well as the volume of the training material. Moreover, every language we choose may "add" new letters that are unique to it (for example, an accented form of a in Swedish). On the one hand this can make identification easier. On the other hand it can considerably increase the number of unigrams and (worse) bigrams, and thereby significantly increase the size of the tree, that is, dramatically lengthen the running time and enlarge the files we create. Another possible problem is mainly technical: obtaining training material in more "exotic" languages may be more difficult. A further problem is similarity between languages. For example, Afrikaans and Dutch are very similar, which could have made identification harder. After deliberation, we decided to choose the 14 languages listed above. They include the 6 languages from the original work, to which we added interesting languages: languages that are "similar" and share a historical origin (for example, Italian and Latin), and languages from different language families (for example, Romanian is a Romance language, Swedish is a Germanic language, Polish is a Slavic language, etc.).

2 Source of the texts and processing of the training material

One of the important goals we set ourselves at the start of the project was to be able to identify the language of a text regardless of its source, whether it is taken from a newspaper, a poetry book, a Wikipedia article or a blog. Because the source and genre of a text can strongly affect the vocabulary and the letter distributions, as well as the length of the sentences and words used, it was important for us to obtain training material from different sources, covering as many language registers as possible. In addition, we did not want to settle for random texts that could bias the results. We wanted training material that is as reliable and high-quality as possible, at a uniform level and of similar size across the languages, both for accessible languages in which material is easy to obtain (such as English, German and French) and for languages in which material is less accessible (such as Indonesian, Turkish and Afrikaans). It was also hard for us to assess the quality of the material, since we do not speak these languages. After a thorough check and a search through a very wide range of texts, we found that the best source for this purpose is downloading electronic books in a variety of languages from the web. We took care to choose books from different categories (fantasy, history, poetry, etc.), and thus made sure that learning is performed over different language registers.

We split the texts into sentences and kept only sentences with at least five words, because we were concerned that shorter sentences do not contain enough letters to form a reliable profile that resembles some language. We made sure that the training material in every language contains at least 2000 different sentences. In addition, we converted all letters to lower case.
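The preparation step described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function name `prepare_sentences` and the sentence-splitting rule (break after `.`, `!` or `?` followed by whitespace) are assumptions made for illustration. Only the five-word threshold and the lower-casing come from the text.

```python
import re

MIN_WORDS = 5  # threshold stated in the text


def prepare_sentences(text, min_words=MIN_WORDS):
    """Split raw text into lower-cased sentences with at least min_words words."""
    # Assumed splitting rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for sentence in raw_sentences:
        sentence = sentence.strip().lower()
        # Keep only sentences that are long enough to give a usable letter profile.
        if len(sentence.split()) >= min_words:
            sentences.append(sentence)
    return sentences
```

A corpus for one language would then be accepted only if the returned list has at least 2000 distinct sentences, as stated above.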

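So that the test material would be as "independent" and varied as possible, we deliberately took it from an entirely different source: Wikipedia. For each language, the test material consisted of sentences from the Wikipedia article (in the desired language) about the country in which the tested language is the official language. The thinking was that the information there would be reliable and would contain very little material in foreign languages (for example, it would not contain many names or scientific terms in English, but rather terms and concepts in that same language). For example, for Spanish we used the article "Spain", and for Afrikaans we used the article "South Africa".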

3 Handling diacritics

Many of the languages written in Latin letters have additional letters beyond the 26 letters familiar to us from English. These marks, called diacritics, are orthographic marks that affect how a word is pronounced, and they are formed by adding some mark above or below the "regular" letter. For example, the letter a can take several accented forms. The use of a diacritic can change not only the pronunciation of a word but also its meaning (for example, in German the word "schön" means "beautiful", whereas the word "schon" means "already").

The decision about diacritics can significantly affect the project, and can make language identification easier (or harder). For example, the letter ß clearly indicates that the tested language is German, whereas the absence of diacritics hints at English. Still, assuming we wanted to remove the diacritics, we would have to decide how to do so. One could use a conversion table to a single letter (for example, turning an accented a into a plain a), to two letters (for example, turning ä into ae or ß into ss, as is customary in German), or even ignore these marks altogether. As noted, any such correction can considerably affect the distributions and all the vectors we use, and change the results of the project. Beyond that, removing the diacritics is worthless from a linguistic point of view, because we change the language completely and invent words.

After much deliberation, which included consulting several linguists and researchers in computational linguistics, we decided to examine the results in two ways: with diacritics and without diacritics. The diacritics were removed using Python's unicodedata library and the NFD format (Normalization Form Canonical Decomposition). This lets us test our hypothesis (that diacritics make language identification much easier), and also lets us compare our results with the original work, which removed the diacritics.
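One common way to remove diacritics with `unicodedata` and NFD is sketched below. The text names the library and the normalization form but not the exact procedure, so the helper name `strip_diacritics` and the step of dropping combining marks after decomposition are assumptions.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed after NFD decomposition."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters, non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that letters without a canonical decomposition, such as ß, pass through this procedure unchanged, so they would still need a conversion table if a pure 26-letter alphabet were required.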

Part IV

Algorithms

As noted, our work emphasizes the statistical approach and decision trees, but we also chose to examine briefly the naive approach, which is based on Zipf's law.

4 The naive approach: Zipf's law

Zipf's law is an empirical law stating that if we rank the words of some natural language by their frequency of use (in descending order of frequency), we find that the i-th word has a frequency proportional to $1/i$:

$$\mathrm{occurrences}(w_i) = \frac{K}{i}$$

where $\mathrm{occurrences}(w_i)$ is the number of occurrences of the word ranked i-th by frequency, and $K$ is some constant.

As stated in the introduction, this naive approach is not really "interesting" from an artificial intelligence point of view, because there is no learning here, only a simple lookup in a word list. Nevertheless, we saw fit to present it as a reference point, and felt it would be wrong to discuss automatic language identification without mentioning this method.

In the first stage we collected the list of the most common words in each of the tested languages (except Latin), and then extracted only the x most common words in each language, where x ∈ {10, 20, 50, 100, 500, 1000}. We then went over every word in the tested text and looked it up in the list of the x most common words. Depending on the method used (plain counting or a Borda Count-like method) we gave the word a score (a positive score when the word appeared in the list of common words, and a negative score when it did not). The idea is that the more words from a language's list of common words the test file contains, the greater the chance that the text is written in that language. Accordingly, we finally returned the language that received the highest score.

As noted, we worked with two scoring methods:

4.1 Plain counting

In this case, we looked up every word of the tested text in the list of the x most common words. If the word appeared there it received a score of 1, and otherwise it received a score of −1. At the end of the pass, we summed the scores of all the words that appeared in the test file, and returned the language with the highest score.

4.2 The "Borda Count" method

In this case, we looked up every word of the tested text in the list of the x most common words. If the word appeared in the list at frequency rank i (out of the x most common words), it received a score of x − i (that is, the more frequent the word, the higher the score it receives), and otherwise it received a score of −1. At the end of the pass, we summed the scores of all the words that appeared in the test file, and returned the language with the highest score. The rationale for this scoring method was to give priority to the more common words in the language, which by Zipf's observation are indeed more frequent and therefore indicate membership in the language more reliably. In addition, a given word may appear in several languages, in which case we prefer the language in which the word is more common.
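The two scoring methods can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, each language's list is assumed to be ordered from most to least common, and ranks are assumed to start at 1 (the text does not say whether i starts at 0 or 1).

```python
def plain_count_score(words, common_words):
    """+1 for every word found in the common-word list, -1 otherwise."""
    common = set(common_words)
    return sum(1 if word in common else -1 for word in words)


def borda_score(words, common_words):
    """x - i for a word at rank i in the list of x common words, -1 otherwise."""
    x = len(common_words)
    # Assumed convention: the most common word has rank 1.
    rank = {word: i for i, word in enumerate(common_words, start=1)}
    return sum(x - rank[word] if word in rank else -1 for word in words)


def identify(words, lists_by_language, score=borda_score):
    """Return the language whose common-word list gives the highest score."""
    return max(lists_by_language,
               key=lambda lang: score(words, lists_by_language[lang]))
```

For example, with the toy lists `{"english": ["the", "of", "and"], "german": ["der", "die", "und"]}`, the words `["the", "cat", "and", "the"]` score 2 for English under plain counting and 3 under the Borda-like scheme, and −4 for German under both.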

5 Statistical approach

In this approach we learned recurring patterns in the different languages by analyzing the training material and turning it into vectors according to different parameters. The vectors obtained from the different examples in each language were merged into a single typical vector that represents that language. We then performed a similar analysis of the test material and created a vector from it as well. The language identification task was thus "translated" into a comparison between the resulting vector (which represents the test material) and the typical vector of each of the tested languages. In the classification stage we measured (with different methods, detailed below) the distance between the tested vector and each of the representative language vectors, and returned the language whose pattern was the most similar to the tested vector.

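Each stage of the process described above was carried out in several different ways. The Cartesian product of the measure, the merging method and the distance measure amounts to a great many different checks for each requested text. At the end of the process, we compared the result produced by each combination with the true language, and grouped the results by measure, so that we could determine which measure gives the most accurate approximation.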

n-gram | Diacritics | How each language's vectors are merged | Distance measure between the language vector and the tested vector
Unigram (letter frequency) | With | Uniform weight | Simple distance
Letter frequency at the start of a word | Without | Choosing the maximum value | Euclidean distance
Letter frequency at the end of a word | | | Similarity by the cosine of the angle
Bigram (letter pairs) | | | Asymmetric Kullback-Leibler distance
 | | | Symmetric Kullback-Leibler distance
 | | | Sum of rank differences
 | | | Infinity norm

Table 1: The cuts in the statistical approach

5.1 By n-grams

The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:

5.1.1 Unigram

We carried out the process detailed above, where the tested measure was the relative frequency of every single letter in the language in each of the training materials.

5.1.2 Letter frequency at the start of a word

We carried out the process detailed above, where the tested measure was the frequency of each letter as the first letter of words in each of the training materials.

5.1.3 Letter frequency at the end of a word

We carried out the process detailed above, where the tested measure was the frequency of each letter as the last letter of words in each of the training materials.

5.1.4 Bigram (letter pairs)

We carried out the process detailed above, where the tested measure was the frequencies of letter pairs within words in each of the training materials.
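The four measures above can be illustrated with a short sketch. It is not taken from the project's code: the names `relative_frequencies` and `extract_features` are hypothetical, words are assumed to be separated by whitespace, and the input is assumed to be already lower-cased and free of punctuation.

```python
from collections import Counter


def relative_frequencies(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()} if total else {}


def extract_features(sentence):
    """Compute the four letter-distribution vectors for one sentence."""
    words = sentence.split()  # assumed tokenization: split on whitespace
    unigrams = Counter(ch for word in words for ch in word)
    first_letters = Counter(word[0] for word in words)
    last_letters = Counter(word[-1] for word in words)
    # Bigrams are pairs of adjacent letters inside a word (not across words).
    bigrams = Counter(word[i:i + 2] for word in words for i in range(len(word) - 1))
    return {
        "unigram": relative_frequencies(unigrams),
        "first_letter": relative_frequencies(first_letters),
        "last_letter": relative_frequencies(last_letters),
        "bigram": relative_frequencies(bigrams),
    }
```

Each returned dictionary plays the role of one sparse vector: a missing key simply means frequency 0.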

5.2 How the vectors of each language are merged

To form a typical vector for each of the tested languages, we had to combine all the vectors obtained from the different training materials in that language into one vector that includes all the features from all the vectors of that language. The way of combining may well affect the results, so we chose to do it in two ways:

5.2.1 Uniform weight

In this way, the relative weight given to each of the vectors was identical: if a given language had $x$ vectors, the relative weight of each of them was $1/x$.

For example, if for a given language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the weighted vector was $v = (0.4, 0.1, 0.2, 0.3)$.

5.2.2 Choosing the maximum value

In this way, for each feature we chose its maximal frequency over all the vectors, and then normalized the frequencies.

For example, if for a given language we obtained the vectors $v_1 = (0.3, 0.2, 0.1, 0.4)$ and $v_2 = (0.5, 0, 0.3, 0.2)$ (where the i-th coordinate represents the relative frequency of letter i), the combined vector before normalization was $v = (0.5, 0.2, 0.3, 0.4)$, and after normalization it was

$$v = \frac{1}{1.4} \cdot (0.5, 0.2, 0.3, 0.4) = \left(\frac{5}{14}, \frac{2}{14}, \frac{3}{14}, \frac{4}{14}\right)$$
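Both merging schemes can be written in a few lines. The sketch below is illustrative only: vectors are represented as dictionaries from feature to frequency (a missing key counts as 0), and the function names `merge_uniform` and `merge_max` are hypothetical.

```python
def merge_uniform(vectors):
    """Average the vectors, giving each of the x vectors weight 1/x."""
    keys = set().union(*vectors)
    n = len(vectors)
    return {k: sum(v.get(k, 0.0) for v in vectors) / n for k in keys}


def merge_max(vectors):
    """Take the maximal frequency of every feature, then normalize to sum 1."""
    keys = set().union(*vectors)
    peak = {k: max(v.get(k, 0.0) for v in vectors) for k in keys}
    total = sum(peak.values())
    return {k: p / total for k, p in peak.items()}
```

Applied to the two example vectors from the text (with the coordinates labeled a to d), `merge_uniform` gives (0.4, 0.1, 0.2, 0.3) and `merge_max` gives (5/14, 2/14, 3/14, 4/14), up to floating-point rounding.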

5.3 How the distance between the language vector and the tested vector is measured

We now have a typical vector representing the distributions in each of the languages, as well as a vector representing the test material. We want to compare the test vector with the representative vector of each language, find the vector most similar to the test material, and predict that the language of the test material is the one represented by that vector. The language vector "most similar" to the test vector was found by locating the vector whose distance from the test vector is minimal among all the vectors. The essential question was how to estimate the distance between two vectors. The choice of computation can strongly affect the final results, so we used more than the single method used in the original work.

For convenience, in all the following sections we denote the test vector by $P = (P_1, ..., P_n)$ and the language vector we learned by $Q = (Q_1, ..., Q_m)$. Note that the number of attributes in the two vectors may differ (even though they represent the same language), because the vectors depend on the training material and on the test material. To deal with this difficulty, we added to $P$ and to $Q$ the attributes that do not appear in them but do appear in the other vector, and gave them weight 0. As a result, both $P$ and $Q$ are vectors of size $x$, where $x \geq \max(m, n)$.

5.3.1 "Simple" distance

In this method, the distance between the vectors $P$ and $Q$ is given by the formula

$$\sum_{i=1}^{x} |P_i - Q_i|$$

Because we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector ($Q$) minimizes the expression above.

5.3.2 Euclidean distance

In this method, the distance between the vectors $P$ and $Q$ is given by the formula

$$\sum_{i=1}^{x} (P_i - Q_i)^2$$

Because we want the distance between the two vectors to be as small as possible, the predicted language is the one whose representative vector ($Q$) minimizes the expression above.

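5.3.3 Similarity by the cosine of the angle between the vectors

Given two-dimensional vectors in the plane, the similarity between them can be measured by the cosine of the angle between them. If $\alpha$ and $\beta$ are the angles that $P$ and $Q$ make with the first axis, then:

$$\cos(\alpha - \beta) = \cos(\alpha) \cdot \cos(\beta) + \sin(\alpha) \cdot \sin(\beta) = \frac{P_1}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_1}{\sqrt{Q_1^2 + Q_2^2}} + \frac{P_2}{\sqrt{P_1^2 + P_2^2}} \cdot \frac{Q_2}{\sqrt{Q_1^2 + Q_2^2}} = \frac{P \cdot Q}{|P| \cdot |Q|}$$

That is, the general formula for an x-dimensional vector is:

$$\frac{\sum_{i=1}^{x} P_i \cdot Q_i}{\sqrt{\sum_{i=1}^{x} P_i^2} \cdot \sqrt{\sum_{i=1}^{x} Q_i^2}}$$

We want the angle between the two vectors to be as small as possible (the larger the angle between the two vectors, the larger the difference between them). Since a smaller angle corresponds to a larger cosine, the predicted language is the one whose representative vector ($Q$) maximizes the expression above.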

5.3.4 Kullback-Leibler distance (symmetric and asymmetric versions)

This measure finds the difference between two vectors using the formula

$$D_{KL}(P, Q) = \sum_{i=1}^{x} P_i \cdot \log\left(\frac{P_i}{Q_i}\right)$$

This is not a "classic" distance measure, because in its "simple" version it is not symmetric ($D_{KL}(P, Q) \neq D_{KL}(Q, P)$). To deal with this possible difficulty, we did not settle for computing the KL distance in its classic version, but also used a symmetric KL measure that treats the two vectors equally:

$$D_{Symmetric-KL} = \frac{1}{2}\left(D_{KL}(P, Q) + D_{KL}(Q, P)\right)$$

In each of the two methods we want the distance between the two vectors to be as small as possible, so the predicted language is the one whose representative vector ($Q$) minimizes the expression above.

5.3.5 Sum of rank differences (Ranks)

In this method (which we developed ourselves), we sorted the features in each vector in descending order of frequency and gave each feature an integer score between 1 and $x$, according to its position in the frequency ordering. We then summed the absolute value of the difference in score for each of the letters. That is,

$$\sum_{i=1}^{x} |Rank(P_i) - Rank(Q_i)|$$

The language that yields the lowest sum of differences relative to the tested vector is the most similar language.
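The distance measures of sections 5.3.1 to 5.3.5 can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: all function names are hypothetical, vectors are first aligned to the same length by padding with zeros (as the text describes), the small constant `eps` that avoids division by zero in the KL distance is an assumption (the text does not say how zero frequencies were handled), and ties in the ranking are broken by position.

```python
import math


def align(p, q):
    """Pad two sparse vectors (dicts) with zeros so they share the same keys."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]


def simple_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    # Same formula as in the text (no square root); the ordering of languages
    # is the same as with the rooted Euclidean distance.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-9):
    # eps is an assumed smoothing constant for attributes missing from q.
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q, eps=1e-9):
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))


def ranks(v):
    """Rank 1 for the most frequent coordinate, len(v) for the least frequent."""
    order = sorted(range(len(v)), key=lambda i: -v[i])
    result = [0] * len(v)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_distance(p, q):
    return sum(abs(a - b) for a, b in zip(ranks(p), ranks(q)))
```

With every measure except the cosine, the predicted language is the one with the smallest value; with the cosine it is the one with the largest value.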

n-gram | Diacritics | Number of records | Fitness functions
Unigram (letter frequency) | With | 500 | Gini Gain
Letter frequency at the start of a word | Without | 1000 | Entropy
Letter frequency at the end of a word | | 1500 | Information Gain
Bigram (letter pairs) | | 2000 | Information Gain Ratio
 | | | Train Error

Table 2: The cuts in the decision trees

5.3.6 Infinity norm

As in the original work, this method was applied to bigrams only.

The infinity norm is defined as the maximum over the rows, that is, the row in which the sum of the differences (in absolute value) is the largest, and it was computed as follows:

1. For each of the first letters in the list of bigrams, compute the sum of the differences (in absolute value) of the bigrams that begin with that letter.

2. Choose the first letter for which the sum of the differences (in absolute value) of the bigrams beginning with that letter gives the largest difference between the tested vector and the language vector:

$$\|A\|_\infty = \max_{1 \leq i \leq x} \sum_{j=1}^{x} |P_{ij} - Q_{ij}|$$

3. Return the language for which the value found in the previous step was minimal.
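The three steps above can be sketched as follows. The sketch assumes that bigram vectors are stored as dictionaries from two-letter strings to relative frequencies; the names `infinity_norm_distance` and `closest_language` are hypothetical.

```python
from collections import defaultdict


def infinity_norm_distance(p, q):
    """Largest per-first-letter sum of absolute bigram frequency differences."""
    row_sums = defaultdict(float)
    # Step 1: group bigrams by their first letter and sum |P_ij - Q_ij| per row.
    for bigram in set(p) | set(q):
        row_sums[bigram[0]] += abs(p.get(bigram, 0.0) - q.get(bigram, 0.0))
    # Step 2: the norm is the largest row sum.
    return max(row_sums.values(), default=0.0)


def closest_language(test_vector, language_vectors):
    """Step 3: return the language whose norm against the test vector is minimal."""
    return min(language_vectors,
               key=lambda lang: infinity_norm_distance(test_vector,
                                                       language_vectors[lang]))
```

Here the matrix rows correspond to the first letter of a bigram and the columns to the second letter, which matches the indices i and j in the formula.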

6 Decision trees

In this approach we learned recurring patterns in the different languages by analyzing the training material and creating decision trees according to different parameters, relying on the different examples.

As in the statistical method, in the training stage we measured the frequencies of each of our n-grams in each of the training materials, and created from them a typical vector that reflects the relative frequency of every n-gram in that training material. Unlike the statistical method, here we did not create a single vector for each language, but left the vectors as they are, one per example. We randomly chose a number of vectors out of all the training vectors, and they became the records of our table, whose columns (that is, the attributes of the table) were the n-grams we found (for example, different letter pairs). In the next stage we built the decision tree itself (using the ID3 algorithm, described below), so that every attribute is a node in the tree, every edge represents a possible value of that attribute (where the values were discretized according to Table 3), and every leaf is a final label (that is, one of the 14 languages). We then classified the test material, according to the algorithm described below and based on the decision tree we built.

Each stage of the process described above was carried out in several different ways, with the aim of checking which method gives the best prediction:

6.1 By n-grams

The different measures we used rest on the observation that every language has a characteristic pattern in the distribution of letters across its words:

6.1.1 Unigram

We carried out the process detailed above, where the tested measure was the relative frequency of every single letter in the language in each of the training materials.

6.1.2 Letter frequency at the start of a word

We carried out the process detailed above, where the tested measure was the frequency of each letter as the first letter of words in each of the training materials.

6.1.3 Letter frequency at the end of a word

We carried out the process detailed above, where the tested measure was the frequency of each letter as the last letter of words in each of the training materials.

6.1.4 Bigram (letter pairs)

We carried out the process detailed above, where the tested measure was the frequencies of letter pairs within words in each of the training materials.

Unlike working with unigrams and single letters, when working with bigrams the number of attributes is very large (since these are permutations of two letters), and we therefore reached a recursion depth that was too large. To deal with this difficulty, in many cases we chose "only" the 400-800 most common attributes and built the tree from them.

6.2 How we divided the values into bins

The attribute values are the frequencies of unigrams or bigrams, that is, numbers in the range 0-1. Every sentence contains a different number of words of different lengths, made up of a different number of letters. We therefore expect the frequencies to differ greatly from one another, and to be able to compare different attributes we have to discretize the frequencies.
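A discretization step of this kind can be sketched with a sorted list of bin edges and a binary search. The edges below are taken from the "our work" columns of Table 3; the function name `discretize` is hypothetical, and the handling of a value that falls exactly on an edge (it goes to the upper bin here) is an assumption, since the table does not say which side is inclusive.

```python
from bisect import bisect_right

# Upper edges of bins 0..10 for unigrams, first letters and last letters
# (bin 11 covers everything from 0.1 up to 1).
UNIGRAM_EDGES = [0.0015, 0.003, 0.005, 0.01, 0.015, 0.02,
                 0.025, 0.035, 0.05, 0.07, 0.1]

# Upper edges of bins 0..11 for bigrams (bin 12 covers 0.013 up to 1).
BIGRAM_EDGES = [0.00015, 0.0003, 0.0005, 0.001, 0.0015, 0.002,
                0.0025, 0.0035, 0.005, 0.007, 0.01, 0.013]


def discretize(frequency, edges):
    """Map a frequency in [0, 1] to the index of the bin that contains it."""
    # bisect_right counts how many edges are <= frequency.
    return bisect_right(edges, frequency)
```

For instance, a unigram frequency of 0.081 falls in bin 10 (the range 0.07-0.1), and any bigram frequency above 0.013 falls in bin 12.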

6.3 The fitness (gain) functions

The attributes are the different features by which we build the tree (the different single letters, bigrams, etc.), considered together with their values, which are their frequencies in the language.

# Example a b ... Language

1 0.081 0.014 ... English

2 0.12 0.022 ... Spanish

3 0.068 0.017 ... English

... ... ... ... ...

6.3.1 By Entropy

Entropy is a measure from information theory that represents the degree of uncertainty about the distribution of a given attribute, given some data.

Segment | Our work: unigrams, first letter, last letter | Our work: bigrams | Original work: unigrams | Original work: bigrams
0 | 0-0.0015 | 0-0.00015 | 0-0.001 | 0-0.00001
1 | 0.0015-0.003 | 0.00015-0.0003 | 0.001-0.03 | 0.00001-0.0003
2 | 0.003-0.005 | 0.0003-0.0005 | 0.03-0.06 | 0.0003-0.0006
3 | 0.005-0.01 | 0.0005-0.001 | 0.06-0.09 | 0.0006-0.0009
4 | 0.01-0.015 | 0.001-0.0015 | 0.09-0.12 | 0.0009-0.012
5 | 0.015-0.02 | 0.0015-0.002 | 0.12-0.15 | 0.012-0.015
6 | 0.02-0.025 | 0.002-0.0025 | 0.15-0.18 | 0.015-0.018
7 | 0.025-0.035 | 0.0025-0.0035 | 0.18-1 | 0.018-1
8 | 0.035-0.05 | 0.0035-0.005 | |
9 | 0.05-0.07 | 0.005-0.007 | |
10 | 0.07-0.1 | 0.007-0.01 | |
11 | 0.1-1 | 0.01-0.013 | |
12 | | 0.013-1 | |

Table 3: The division into bins

For a given table $A$ (where the entropy is computed for the attribute Language) we define the entropy to be

$$H(A) = -\sum_{v \in LanguageList} p_v \log(p_v)$$

where $v$ represents a language from the list of languages, and $p_v = \frac{|A_v|}{|A|}$ is the probability of the corresponding label (we use a uniform probability; $A_v$ is the table restricted to the records that have value $v$ in attribute $a$, and $|A|$ is the number of records in the table).

To determine the degree of uncertainty for a given attribute $a$, we define its entropy to be

$$H(A, a) = \sum_{v \in Values(a)} H(A_v)$$

Note that, in contrast to the following functions, where we want the attribute with the highest value, here we want to find the attribute with the lowest entropy value.

6.3.2 By Information Gain

IG measures the expected reduction in entropy given a particular attribute as a node in the tree. For every attribute $a$ in table $A$ we define:

$$IG(A, a) = H(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot H(A_v)$$

6.3.3 By Information Gain Ratio

For every attribute $a$ we define the Information Gain Ratio to be:

$$IGR(A, a) = \frac{IG(A, a)}{H(A, a)}$$

6.3.4 By Gini Gain

We did not study this measure in the course this year, but from searches we carried out we learned that the Gini index is a measure of the deviation between the probabilities of the different values of the target attribute (which in our case is Language).

As with entropy, we first define the Gini Index deviation measure of the given table $A$ over the values of Language:

$$GI(A) = 1 - \sum_{v \in LanguageList} \left(\frac{|A_v|}{|A|}\right)^2$$

We now define the gain function for every attribute to be:

$$GG(A, a) = GI(A) - \sum_{v \in value(a)} \frac{|A_v|}{|A|} \cdot GI(A_v)$$

We want to find the attribute that maximizes this difference between the overall deviation and the deviation of each sub-table multiplied by the probability of the value that produces it.

6.3.5 By Train Error

This function measures the decrease in the expected training error for a given attribute, and we want to choose the attribute $a$ that maximizes the expected decrease:

$$TE(A, a) = \min_{v \in LanguageList}(p_A) - \sum_{v \in value(a)} \left(\frac{|A_v|}{|A|}\right) \min_{Language \in LanguageList}(p_{A_v})$$

where $p_A$ is defined with respect to the table $A$.
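Three of the gain functions above (entropy, Information Gain and Gini Gain) can be sketched in a uniform way. This is a sketch under assumptions, not the project's code: a table is represented as a list of `(attributes, label)` pairs, where `attributes` maps an attribute name to its discretized value; the function names are hypothetical; and the logarithm base 2 is an assumption, since the formulas only write "log".

```python
import math
from collections import Counter, defaultdict


def label_probabilities(records):
    """p_v = |A_v| / |A| for every language label v present in the table."""
    counts = Counter(label for _, label in records)
    return [count / len(records) for count in counts.values()]


def entropy(records):
    return -sum(p * math.log2(p) for p in label_probabilities(records))


def gini_index(records):
    return 1.0 - sum(p * p for p in label_probabilities(records))


def partition(records, attribute):
    """Split the table into sub-tables A_v, one per value v of the attribute."""
    parts = defaultdict(list)
    for attributes, label in records:
        parts[attributes[attribute]].append((attributes, label))
    return parts


def gain(records, attribute, impurity):
    """impurity(A) minus the weighted impurity of the sub-tables A_v."""
    parts = partition(records, attribute)
    weighted = sum(len(part) / len(records) * impurity(part)
                   for part in parts.values())
    return impurity(records) - weighted


def information_gain(records, attribute):
    return gain(records, attribute, entropy)


def gini_gain(records, attribute):
    return gain(records, attribute, gini_index)
```

For a table with two English records where attribute `a` is 0 and two Spanish records where it is 1, the entropy is 1 bit, the Gini index is 0.5, and splitting on `a` recovers all of it (Information Gain 1, Gini Gain 0.5).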

6.4 The tree-building algorithm

The algorithm we used to build the decision trees is ID3, the recursive algorithm we saw in the course [2], using the fitness (gain) functions described above.
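A textbook form of ID3 is sketched below to make the recursion concrete. It is not the project's implementation: the table representation (a list of `(attributes, label)` pairs), the tree representation (a leaf is a label string, an inner node is a dictionary with the keys `"attribute"` and `"children"`), the use of Information Gain with base-2 logarithms as the default gain, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(records):
    counts = Counter(label for _, label in records)
    return -sum(c / len(records) * math.log2(c / len(records))
                for c in counts.values())


def split(records, attribute):
    """Group the records by their value of the given attribute."""
    parts = defaultdict(list)
    for attributes, label in records:
        parts[attributes[attribute]].append((attributes, label))
    return parts


def information_gain(records, attribute):
    parts = split(records, attribute)
    return entropy(records) - sum(len(p) / len(records) * entropy(p)
                                  for p in parts.values())


def id3(records, attributes, gain=information_gain):
    """Build a decision tree recursively from a non-empty list of records."""
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the table is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain(records, a))
    remaining = [a for a in attributes if a != best]
    children = {value: id3(subset, remaining, gain)
                for value, subset in split(records, best).items()}
    return {"attribute": best, "children": children}
```

Any of the other gain functions can be passed through the `gain` parameter, provided that a larger value means a better split.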

6.5 The language classification algorithm

The initial classification algorithm was a "simple" recursive algorithm that "walks" down the decision tree. In every iteration it chooses the sub-tree that matches the value the test vector has for the attribute of the current node, and calls itself recursively on that sub-tree. The algorithm stops when it reaches a leaf, and returns the label (that is, the language) of that leaf. However, we choose the training material randomly out of all the records, and in many cases the trees were too large and we had to choose the leading 400-800 attributes. As a result, not all the possible values of a given attribute are always represented. We may therefore try to look for a sub-tree that does not exist in the tree we learned, that is, we get stuck in the classification process because there is no matching sub-tree.

To solve this problem, we decided that once we reach a node in the tree that has no sub-tree matching the value of the attribute in the classified vector, we go over all the sub-trees of that attribute. We define every such new walk as a "guess", and continue walking down the tree until we reach a leaf. Every time we reach a "problematic" state while walking down the sub-tree we count it as another "guess", and go over all the sub-trees of the node's attribute. At the end of the process we obtain the label from that sub-tree together with the number of "guesses" we made on the way, and we return the label reached with the smallest number of "guesses" (since every "guess" indicates a gap between the test material and the language we are approaching, and we want to make this gap as small as possible).

ephlgd ,overfitting-n rpnidl i ke ,dl`ky miax miznv lelkle mile b zeidl mileki mivrde xg`n ,ok enk

xtqn z` mb epxard ,ziaiqxewx d`ixw lka - urd xena "leih"d jldna pruning rvale jildzd z` lriil

,dfd miyegipd xtqn z` epxar urd xena leihd jldna m`e ,dk r beiz epl biydy jenp ikd miyegipd

.zilnihte`d d`vezd z` aipi `l `ed oky ,lelqnd eze` lr epxziee ura zelvtzdd jynd z` epwqtd
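The walk with "guesses" and pruning described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the tree representation (a leaf is a language string, an inner node is a dictionary with the hypothetical keys `attr` and `children`) and the function name are assumptions.

```python
def classify(tree, sample, guesses=0, best=None):
    """Return (label, guesses) for the path with the fewest guesses, or None if pruned.

    `best` is the lowest number of guesses that has produced a label so far.
    """
    if best is not None and guesses > best:
        return None  # prune: this path already needs more guesses than the best one
    if not isinstance(tree, dict):
        return tree, guesses  # reached a leaf, which holds the language label
    child = tree["children"].get(sample.get(tree["attr"]))
    if child is not None:
        return classify(child, sample, guesses, best)
    # No subtree matches the sample's value: try every subtree, each as one more guess.
    result = None
    for subtree in tree["children"].values():
        found = classify(subtree, sample, guesses + 1, best)
        if found is not None and (result is None or found[1] < result[1]):
            result = found
            best = found[1] if best is None else min(best, found[1])
    return result

# Hypothetical toy tree: split on the last letter, then on the first letter.
toy_tree = {
    "attr": "last",
    "children": {
        "s": "English",
        "n": {"attr": "first", "children": {"d": "German", "e": "Italian"}},
    },
}
print(classify(toy_tree, {"last": "n", "first": "d"}))  # exact match, no guesses
print(classify(toy_tree, {"last": "z", "first": "d"}))  # unknown value, one guess
```

Ties between paths with the same number of guesses are resolved here in favor of the first path found, which is a design choice of this sketch only.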

Footnote 2: Recitation 10, slide 24.

Part V

The Results

In this part we present the results we obtained with each of the approaches we took. The analysis of the specific results is given in this part, while a broader discussion appears in the next part. The full results, for both the statistical approach and the decision trees, are given in Appendices A and B.

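7 Naive approach (Zipf's law)

As expected, the results achieved by the naive approach were excellent, as can be seen in the graph (footnote 3):

It is interesting to note that the identification rate of the Borda Count method was exactly the same for 500 and for 1000 words; that is, there is (apparently) a kind of "glass ceiling" on the predictive ability that is based on common words.

It is important to stress again that this approach is very different from the learning methods on which the project focused, and we therefore did not compare the results we obtained with them to the results above.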

8 Statistical approach

As explained above, in this method the learning process comprised a very large number of runs, over the many cuts detailed in Table 1. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.

Comparing the prediction we obtained with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).

8.1 By the method of measuring the distance between the vectors

As mentioned, the part of the original work that dealt with the statistical approach used the infinity norm to estimate the distances between the vector of each language and the vector representing the examined sentence, whereas in this project we employed additional tools.

Footnote 3: General note regarding the graphs: because of space constraints and the wish to present the resulting functions (which gave results close to one another) as clearly as possible, the y axis does not always start at 0 and end at 100 (on such a scale the graphs might have been less readable).
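To make the comparison of distance measures concrete, here is a minimal Python sketch of several standard vector distances and of choosing the language whose vector is closest to the sentence vector. It is an illustration under assumptions, not the project's code: the vectors are plain lists of frequencies in the same order, the small constant `eps` used to avoid division by zero in KL is an assumption, the direction of the KL divergence is a choice of this sketch, and the Ranks and Simple Difference measures are omitted because their definitions are not given in this part.

```python
import math

def infinity_distance(p, q):
    """Infinity norm of the difference: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(p, q))

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def angle(p, q):
    """Angle (in radians) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

def kl_divergence(p, q, eps=1e-9):
    """Kullback-Leibler divergence KL(p || q), smoothed by eps (an assumption)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def symmetric_kl(p, q, eps=1e-9):
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)

def nearest_language(profiles, sentence_vector, distance):
    """Return the language whose profile vector is closest to the sentence vector."""
    return min(profiles, key=lambda lang: distance(profiles[lang], sentence_vector))

# Hypothetical three-letter frequency profiles, for illustration only.
profiles = {"English": [0.6, 0.3, 0.1], "German": [0.2, 0.3, 0.5]}
print(nearest_language(profiles, [0.5, 0.4, 0.1], kl_divergence))
```

Passing the distance as a parameter makes it easy to rerun the same classification with each measure and compare the identification rates.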


When we ran only on the languages from the original work, we obtained the following results:

As can be seen, in almost all runs and cuts the infinity norm yielded the worst results, and in most cases by a significant margin from the other measures. In the original work the data were tested with training material of 600 records, and the success rate was about 46%. Our results are similar to those of the original work: slightly worse when we ignored the diacritics and slightly better when we took them into account. Overall, this method reached between 40% and 50% identification success, with the diacritics helping slightly.

Unlike the infinity norm, the additional mathematical methods we used performed very well: up to 70% identification without diacritics and up to 80% identification when taking them into account. The leading method, both with and without diacritics, was KL, which yielded similar results in its symmetric and non-symmetric versions. Most methods yielded similar results regardless of the diacritics (with a tendency to improve when they were taken into account), the exceptions being KL and Ranks. It is interesting to note that precisely when we took the diacritics into account (which supposedly provide more information), the Ranks method worked less well. The reason is most likely that converting the diacritic characters to regular letters (for example, an accented a to a plain a) changed the ordering of the common letters in the training material differently from the way it changed in the test material, which is smaller and therefore more sensitive to changes.

When we ran on all 14 languages, we obtained the following results:

The quality of the results decreased here: when the classifier had to predict a language from among the 6 original languages only, it did so better than when it had to predict a language from among the 14 languages (70% versus 80% with the 6 original languages). Nevertheless, this is still excellent identification, especially considering the number of languages the classifier has to separate. It is interesting to see that even when classifying a language out of 14 (and not only out of 6), in most cases we obtained better results than those of the infinity norm method used in the original work. Here too the relation between the different distance functions is preserved, with KL continuing to lead. Except for Angle and Ranks, all functions yielded similar or better results when we took the diacritics into account; these apparently provide the information that is essential for separating the languages better, as detailed at the beginning of the work.

8.2 By n-gram

In the original work, learning was performed over bigrams and unigrams. As mentioned, in this project we decided to add further measures that rely on the frequency of letters at the start of a word and at the end of a word.
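The four kinds of statistics mentioned above (unigrams, bigrams, first letters and last letters) can be gathered with a short Python sketch. It is only an illustration: the function name is hypothetical, tokens are assumed to be whitespace-separated and are kept only if purely alphabetic, and bigrams are assumed to be counted inside words only.

```python
from collections import Counter

def letter_features(text):
    """Relative frequencies of unigrams, bigrams, first letters and last letters."""
    words = [w for w in text.lower().split() if w.isalpha()]
    counts = {
        "unigram": Counter(ch for w in words for ch in w),
        "bigram": Counter(w[i:i + 2] for w in words for i in range(len(w) - 1)),
        "first": Counter(w[0] for w in words),
        "last": Counter(w[-1] for w in words),
    }
    # Normalize each counter so that its frequencies sum to 1.
    return {
        name: {key: value / sum(counter.values()) for key, value in counter.items()}
        for name, counter in counts.items()
    }

features = letter_features("cats and dogs")
print(features["last"])
```

Python's `str.isalpha` accepts accented letters, so the same sketch works whether the diacritics are kept or stripped beforehand.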

When we ran only on the languages from the original work, we obtained the following results:

When we ran on all 14 languages, we obtained the following results:

As we expected, the most significant information came from the bigrams: the number of possible bigrams is very large. Among the single letters, we were surprised to discover that it was the letter at the end of the word that provided the most information in the learning process. This finding is nevertheless consistent with the fact that the last letter often has a grammatical role (for example, marking the plural form or the verb inflection in the different tenses), so it is entirely reasonable that it helps identify the language (for example, the plural in English is marked by the letter s, whereas in German it is marked by n and in Italian by i or e). It is interesting to note that the test by first letter yielded the least good results (relatively), yet even these were not bad: above 45% identification (similar to the identification rate of the infinity norm).

8.3 Learning rate (computation time)

In the first stage we created a data structure containing all the training sentences from all the languages. Given this data structure, we ran algorithms that learned the frequencies of the bigrams, the unigrams, the letters at the start of the word and the letters at the end of the word in each of the 14 languages.

The learning rate was very fast: about 25 seconds to learn all of these statistics for all the languages.

8.4 Consistency and overfitting

Overfitting may occur when the classifier starts to "memorize" the training material instead of learning generalizations from it, and as a result gives greater weight to "noise". In such a case the classifier returns better and more accurate results on the familiar training material, but the results on new, unfamiliar test material are worse (that is, the classifier predicts less well on new information). To verify that no overfitting occurred on the one hand, and that the training material is indeed classified properly on the other, we ran the algorithm on part of the training material after learning. As can be seen in the table below, the results of running the algorithm on the training material are indeed better in most cases, but not to a significant degree that would indicate memorization. We can therefore conclude that there is no overfitting. In addition, the results are relatively high, which indicates that the classifier did identify the language in most cases.

| | Kullback | Symmetric Kullback | Angle | Euclidean | Infinity | Ranks | Simple Difference |
|---|---|---|---|---|---|---|---|
| Original results | 69.34 | 68.87 | 46.93 | 59.19 | 44.94 | 42.56 | 53.07 |
| Results on a small part of the training material | 71.41 | 69.3 | 50.95 | 61.12 | 40.07 | 51.91 | 57.14 |

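8.5 The recall, precision and F1 measures

In the project we ran an algorithm on test material and eventually obtained a prediction of the language in which the test material is written. Note that the predictions can be viewed in the following ways:

1. True Positive = we identified the sentence as belonging to some language, and it does belong to it.

2. False Positive = we identified the sentence as belonging to some language, but it does not belong to it.

3. True Negative = we identified the sentence as not belonging to some language, and it indeed does not belong to it.

4. False Negative = we identified the sentence as not belonging to some language, but it does belong to it.

Another way to evaluate the results is through the recall and precision measures.

Recall measures the rate of correct assignments to a language out of all the assignments we should have obtained under perfect identification. In other words, it reflects the miss rate: the higher the recall, the fewer sentences of that language were missed.

Precision measures the rate of correct assignments to a language out of all the assignments we actually obtained for that language. In other words, it reflects the noise rate: the higher the precision, the fewer wrong sentences were assigned to that language.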

$$\text{recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

$$\text{precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

Both measures are important, but neither stands on its own (a good recall can come together with a poor precision, so recall alone is not enough to establish that the prediction is good). It is therefore customary to look at combined measures as well, the most prominent of which is F1, the harmonic mean of the two, obtained from the following formula:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

The possible values of all these measures range between 0 and 1, and the closer they are to 1, the better the prediction.
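The three measures can be computed per language from two parallel lists of true and predicted labels, as in the following minimal Python sketch. The function name and the convention of returning 0 when a denominator is zero are assumptions made for the illustration.

```python
def per_language_scores(true_labels, predicted_labels):
    """Precision, recall and F1 for every language that appears in either list."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for lang in set(true_labels) | set(predicted_labels):
        tp = sum(t == lang and p == lang for t, p in pairs)  # true positives
        fp = sum(t != lang and p == lang for t, p in pairs)  # false positives
        fn = sum(t == lang and p != lang for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[lang] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Hypothetical labels: one English sentence is mistaken for German.
scores = per_language_scores(
    ["English", "English", "German", "German"],
    ["English", "German", "German", "German"],
)
print(scores["English"], scores["German"])
```

In this toy example English has perfect precision but a recall of only 0.5, which shows why a single measure is not enough on its own.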

We measured these values in our runs:

| | with diacritics, 500 | 1000 | 1500 | 2000 | without diacritics, 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|---|---|
| original | 0.648 | 0.651 | 0.654 | 0.653 | 0.65 | 0.646 | 0.642 | 0.637 |
| all languages | 0.618 | 0.606 | 0.599 | 0.593 | 0.59 | 0.586 | 0.583 | 0.581 |

9 Decision trees

As explained above, in this method we performed a very large number of runs, over the many cuts mentioned in Table 2. In addition, we performed runs for the languages that appeared in the original work, and separate runs for all 14 languages we decided to examine.

Here too, comparing the prediction we obtained with the true result was quite simple, since we worked with labeled files (that is, we knew which language we were supposed to get).

9.1 By n-gram, with respect to the fitness (gain) functions

When we ran only on the languages from the original work, we obtained the following results:

When we ran on all 14 languages, we obtained the following results:

The following graph shows the patterns obtained for all the trees, filtered by the leading measures:

To our surprise, in all runs the algorithms that work on the first letter and on the last letter yielded relatively low prediction rates (less than 30%). In our opinion, the reason lies in the fact that for them we used the same division into groups as for unigrams in general (as detailed in Table 3), rather than a dedicated division that fits the distributions we found.

Regarding the Gini function, our results are very similar to the original work, both for bigrams and for unigrams. However, unlike the original work, where this function yielded the best results, we managed to reach even better results using the IGR function, both for unigrams and for bigrams. This result is consistent with what we learned in the course, namely that IGR was "developed" in order to reduce the overfitting that may arise when using the Information Gain function. Indeed, there is a gap of between 5% and 15% in the performance of these two functions, both for bigrams and for unigrams.

As we expected, and unlike the original work, in most cases the functions that worked with bigrams gave better results, even when we reduced the number of attributes we used.

It should be noted that building the trees created with the entropy function took a long time, but the results they yielded were good, both in absolute terms and relative to the other functions. That is, entropy is not efficient and each split it makes gives less information than the other functions, but in the end the sum of all the splits gives no less information (and sometimes even more) than most of the other functions.

9.2 Learning rate (computation time)

The average construction time of each tree was about half an hour. The time required to build a decision tree was directly proportional to the number of attributes and to the number of rows in the training material: the more there were, the longer the construction took. In many cases we had to limit the number of attributes to 800 or even fewer. Using diacritics lengthened the computation considerably, since the number of attributes grew dramatically.

The trees of the unigrams, the first letter and the last letter were built relatively quickly (a few minutes), whereas the bigram trees required a long computation time (for example, the IG unigram tree on 2000 records was built in less than a minute, whereas the bigram tree by Gini Gain on 2000 records was built in 56 minutes). This can be explained by the relatively small number of attributes in trees of the first kind (the attributes were single letters, so there were a few dozen of them: the "regular" letters plus the diacritic characters), whereas the number of attributes in the bigram trees was significantly larger (in fact, it is on the order of the square of the number of attributes of the first kind, as explained above).

Among the measures, entropy required the longest computation time. The reason is that entropy is too general a measure, one that does not take the different possibilities and features sufficiently into account, and therefore does not divide the data clearly enough; this leads to a much greater recursion depth and to a longer computation. This is exactly why better measures such as IGR are usually used, which compute the entropy but weight it with additional calculations.

Part VI

Discussion, conclusions and thoughts for the future

In the summary of their work, Moran and Tamar wrote: "We believe that on the basis of our work it is possible to reach even more accurate results". Indeed, the results we reached in our work are significantly better: we reached more than 70% identification in a variety of methods and cuts (for bigrams, for unigrams, and with different statistical functions and fitness (gain) functions), and even 79% identification with the KL method and 73% with the IGR method, whereas in the original work the best result was about 60%, obtained for large training material of 2000 records per language and in one method only; for all the other methods the identification rate was 45% at most.

It is interesting to note that Moran and Tamar's best results were obtained for unigrams, with success rates below 50% in the vast majority of cases. When they worked with bigrams, they reached about 25% success. They hypothesized that the reason was that the division they made was probably not fine enough, and that a division into different segments might yield better results. Our work, however, suggests otherwise. As part of the work on the trees, we tried to "play" with the division into clusters (as can be seen in Table 3): we ran the trees many times on different and varied divisions, on the assumption that the finer and more balanced the division between the clusters, the shallower the tree and the better the split. In the end, however, we found no noticeable differences between the results of the divisions we tried and those of the original division; sometimes the new division achieved better results and sometimes the original one. In our opinion, the relatively low results in the original work may be due to training or test materials that were biased and incorrect, or to test material that was not cleaned (unlike the training material), so that the two did not match. Another possibility is that the learning of the bigrams created overfitting, so that their classifier could not cope well with new material.

Moran and Tamar removed the diacritics, since they assumed that keeping them raises the complexity of the task and enlarges the data structures and the learning time. In our project, by contrast, we chose to support both options: keeping the diacritics and removing them. As we expected and detailed in the introduction, the diacritics added significant information, and therefore most of the algorithms, in both the statistical method and the decision trees, showed better results when the diacritics were taken into account (an average improvement of about 10% in the statistical tools and of up to 20% in the decision trees). Nevertheless, it is important to note that even without the diacritics the various algorithms proved themselves and reached very good results. It is worth noting that keeping the diacritics had a rather negligible effect on the running times of the statistical algorithms, but a considerable effect on the running times of building the trees (though not on classification once the trees were built). This is not surprising, since keeping the diacritics adds at most a small, fixed number of additional calculations in the statistical algorithms, but may considerably increase the number of attributes in each table (as described above). In many cases this raises the recursion depth and, accordingly, the computation time of building the trees.

Regarding the statistical approach, Moran and Tamar wrote that they think "that the results can be improved... using more advanced inference methods... [than] the infinity norm". Indeed, all the new tools we used showed very high identification ability, whereas the measure based on the infinity norm reached less good results. In addition, it was interesting to discover that the KL measure yielded almost identical results in its symmetric and non-symmetric versions.

Regarding the decision trees, we found that their main drawback is that they are very large and take a long time to build. In many cases we reached the maximum recursion depth and therefore had to reduce the number of attributes we used. As a result, classification was less good than in the statistical cases. As mentioned, we tried to "play" with the divisions in an attempt to improve the results, but these changes showed no noticeable improvement. Despite this, most of the decision tree results came out the same when the tree was built on "only" the 6 original languages and when it was built on all 14 languages, which strengthens our view that decision trees are indeed good tools for language identification and are almost unaffected by the number of languages examined.

It can be seen that the first-letter and last-letter measures did not work well with the decision trees, whereas they reached significantly better results with the statistical tools. As mentioned, we believe this is related to the division into groups. Since developing, running and analyzing all the algorithms in the project took many weeks, we did not manage to investigate the matter in depth. In our opinion, a future project could examine the data and try different runs with new divisions, finer and coarser, until an ideal division is obtained that best expresses the profiles of the languages.

The results of our work, like those of many other works in the field, prove once again the remarkable claim that natural languages can be identified even without prior linguistic knowledge: there is no need to know grammar rules, different registers or the semantic meaning of words in order to identify the language; a "cold", dry analysis of letters and words suffices, and these can be taken from a huge variety of accessible sources such as newspapers, websites and literature. In addition, the results of the work and the tools we used can serve to enrich our linguistic knowledge about different languages: the degree of closeness between them, various rules (for example, discovering a rule whereby the letter a is more frequent at the end of a word in Italian than in English), and the cognitive implications that follow.

Our project can be continued and extended by adding further measures, such as a combination of bigrams and unigrams (for example, one can build a tree whose nodes are the x leading bigrams and the y leading unigrams). We believe this could lead to even better results than those we reached. Unfortunately, we did not manage to implement this within the project because of the shortage of time and the large number of measures we used. Further topics for research involve measures that were not examined, such as the average number of words in the language or the average word length in the language (for example, the average word in German is longer than the average word in English). Likewise, one can try to improve the division into clusters and reach even better results.

Part VII

Sources

• The original work of Moran and Tamar: http://www.cs.huji.ac.il/~ai/projects/NLP.pdf

• Lecture and recitation slides of course 67842, "Introduction to Artificial Intelligence"

• Lecture slides of course 36622, "Computational Learning and Cognitive Aspects of Language"

• Gutenberg Project - http://www.gutenberg.org/

• http://www.bookrix.com/

• http://www.e-book.com.au/morefreebooks/freemultilingualbooks.htm

• http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/

• http://en.wikipedia.org/wiki/List_of_languages_by_writing_system#Latin_script

• http://en.wikipedia.org/wiki/Letter_frequency

• http://stackoverflow.com/questions/3194516/replace-national-characters-with-ascii-equivalent

• http://staff.science.uva.nl/~tsagias/?p=185

• http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf

• http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=4

• http://www.101languages.net/common-words/

Part VIII

Appendix A - Details of the main statistical results

Original languages, by distance measure (w/ = with diacritics, w/o = without diacritics):

| | w/ 500 | w/ 1000 | w/ 1500 | w/ 2000 | w/o 500 | w/o 1000 | w/o 1500 | w/o 2000 |
|---|---|---|---|---|---|---|---|---|
| Kullback | 79.43 | 78.07 | 78.07 | 79.07 | 67.72 | 68.36 | 66.92 | 66.86 |
| Symmetric Kullback | 77.85 | 75.71 | 75.5 | 76.71 | 67.28 | 67.28 | 65.86 | 64.92 |
| Angle | 59.5 | 58.28 | 60.57 | 59.43 | 57 | 56.5 | 60.22 | 58.78 |
| Euclidean | 70.21 | 66.57 | 68.71 | 67.5 | 67.07 | 66.72 | 66.78 | 66.78 |
| Infinity | 48.85 | 43.14 | 47.71 | 46.29 | 41.14 | 42.57 | 45.42 | 42 |
| Ranks | 58.07 | 60 | 58.57 | 60.71 | 69.28 | 67.07 | 65.22 | 68.14 |
| Simple Difference | 62.85 | 65.14 | 64.14 | 64.79 | 58.36 | 61.07 | 60.5 | 61.78 |

All languages, by distance measure:

| | w/ 500 | w/ 1000 | w/ 1500 | w/ 2000 | w/o 500 | w/o 1000 | w/o 1500 | w/o 2000 |
|---|---|---|---|---|---|---|---|---|
| Kullback | 69.34 | 69.22 | 69.59 | 69.88 | 62.73 | 63.45 | 62.53 | 61.9 |
| Symmetric Kullback | 68.87 | 68.33 | 69.19 | 69.4 | 59.09 | 60.43 | 59.02 | 57.75 |
| Angle | 46.93 | 46.17 | 45 | 44.75 | 49.48 | 49.25 | 49.25 | 48.96 |
| Euclidean | 59.19 | 58.77 | 57.38 | 57.12 | 56.66 | 57.98 | 57.56 | 56.59 |
| Infinity | 44.94 | 41.49 | 43.33 | 40.8 | 39.78 | 39.78 | 41.03 | 38.39 |
| Ranks | 42.56 | 43.85 | 46.35 | 45.60 | 55 | 57.53 | 58.45 | 57.75 |
| Simple Difference | 53.07 | 52.25 | 51.78 | 50.67 | 53.17 | 53.21 | 52.98 | 51.5 |

Original languages, by n-gram:

| | w/ 500 | w/ 1000 | w/ 1500 | w/ 2000 | w/o 500 | w/o 1000 | w/o 1500 | w/o 2000 |
|---|---|---|---|---|---|---|---|---|
| Unigrams | 62.57 | 62.05 | 61.91 | 60.1 | 56.96 | 55.61 | 57.81 | 58.57 |
| Bigrams | 69.55 | 66.89 | 68.48 | 69.39 | 69.6 | 65.31 | 68.86 | 68.81 |
| First | 58.42 | 60.86 | 61 | 61.1 | 55.1 | 54.23 | 52.91 | 55.71 |
| Last | 77.96 | 75.42 | 75.52 | 77.72 | 71.42 | 79.05 | 73.52 | 70.61 |

All languages, by n-gram:

| | w/ 500 | w/ 1000 | w/ 1500 | w/ 2000 | w/o 500 | w/o 1000 | w/o 1500 | w/o 2000 |
|---|---|---|---|---|---|---|---|---|
| Unigrams | 53.23 | 52.04 | 54.54 | 53.14 | 52.1 | 52.25 | 54.06 | 53.23 |
| Bigrams | 68.01 | 65.59 | 65.53 | 65.65 | 68.88 | 67.42 | 67.07 | 66.84 |
| First | 47.6 | 48.52 | 47.85 | 46.53 | 45.09 | 48.32 | 46.55 | 45.03 |
| Last | 53.95 | 55.54 | 54.57 | 55.47 | 53.14 | 55.32 | 54.5 | 53.1 |

Part IX

Appendix B - Details of the main decision tree results

| | First Letter 500 | 1000 | 1500 | 2000 | Last Letter 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|---|---|
| Gini | 20 | 21.15 | 21.84 | 18.39 | 23.45 | 21.38 | 20.92 | 22.76 |
| Entropy | 20.68 | 20.68 | 22.06 | 22 | 25.74 | 26.43 | 23.9 | 29.86 |
| IG | 18.85 | 19.54 | 20.23 | 21.61 | 20.92 | 20.46 | 20 | 20.1 |
| IGR | 22.53 | 27.36 | 29.66 | 29.89 | 21.38 | 26.9 | 28.28 | 26.67 |
| Train Error | 16.09 | 17.93 | 18.62 | 18.16 | 15.86 | 20.69 | 20.69 | 20.69 |

| | Unigrams 500 | 1000 | 1500 | 2000 | Bigrams 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|---|---|---|
| Gini | 51.03 | 49.2 | 52.41 | 54.71 | 30.11 | 30.8 | 28.9 | 31.03 |
| Entropy | 57.24 | 62.29 | 70.11 | 68.28 | 61.38 | 64.83 | 67.13 | 62.56 |
| IG | 42.53 | 46.67 | 53.79 | 56.55 | 56.32 | 61.84 | 61.61 | 63.51 |
| IGR | 61.38 | 62.07 | 72.64 | 71.49 | 69.65 | 71.3 | 73.33 | 72.64 |
| Train Error | 39.77 | 42.53 | 44.83 | 46.44 | 27.58 | 28.05 | 33.1 | 31.72 |