Problem 1: Word Segmentation
whatdoesthisreferto
what does this refer to
Application: Chinese Text
Application: Internet Domain Names
www.visitbritain.com
Visit Britain
Statistical Machine Learning
Best segmentation = one with highest probability
Probability of a segmentation = P(first word) × P(rest of segmentation)
P(word) = estimated by counting
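As a minimal sketch of "estimated by counting", the snippet below computes P(word) as a relative frequency and multiplies word probabilities to score a segmentation. The tiny corpus and the names `p_word` and `p_segmentation` are assumptions made for illustration; the talk's figures come from a far larger corpus.

```python
from collections import Counter

# Toy corpus standing in for real training text (an assumption for illustration).
corpus = "now is the time for all good people to choose the time".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_word(word):
    """Relative frequency of `word` in the toy corpus (0.0 if unseen)."""
    return counts[word] / total

def p_segmentation(words):
    """P(first word) x P(rest of segmentation), unrolled into a product."""
    prob = 1.0
    for w in words:
        prob *= p_word(w)
    return prob

print(p_word("the"))                    # 2 occurrences out of 12 tokens
print(p_segmentation(["the", "time"]))  # product of the two word probabilities
```

Unseen words get probability zero here, so a practical version would need some smoothing for unknown words.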
Statistical Machine Learning
choosespain
Choose Spain
Chooses Pain
P(“Choose Spain”) > P(“Chooses Pain”)
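The comparison can be sketched with made-up numbers. The counts and corpus size below are hypothetical, chosen only to show the mechanics, and the helper name `log_p` is an assumption. Summing log-probabilities instead of multiplying raw probabilities is a common way to avoid floating-point underflow on longer inputs.

```python
import math

# Hypothetical word counts and corpus size, invented for illustration only.
counts = {"choose": 900, "spain": 400, "chooses": 150, "pain": 600}
total = 1_000_000

def log_p(words):
    """Log-probability of a segmentation: sum of log P(word) over its words."""
    return sum(math.log(counts[w] / total) for w in words)

choose_spain = log_p(["choose", "spain"])
chooses_pain = log_p(["chooses", "pain"])

# With these invented counts, "choose spain" is the more probable reading.
print(choose_spain > chooses_pain)
```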
Example
segment(“nowisthetime…”)
Pf(“n”) × Pr(“owisthetime…”)
Pf(“no”) × Pr(“wisthetime…”)
Pf(“now”) × Pr(“isthetime…”)
Pf(“nowi”) × Pr(“sthetime…”)
…
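The enumeration above (try every first word, then recursively segment the rest) can be sketched as a memoized recursion. This is an illustrative sketch, not necessarily the program shown on the slides: the counts are hypothetical, and the penalty for unseen words in `p_word`, the `max_len` cap, and all function names are assumptions.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical counts; a real system would count words in a large corpus.
COUNTS = Counter({"now": 5, "is": 10, "the": 20, "time": 4, "no": 3, "a": 8, "i": 6})
TOTAL = sum(COUNTS.values())

def p_word(word):
    """Relative frequency for known words; for unseen words, a small
    probability that shrinks with length (an assumed smoothing rule)."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))

def p_words(words):
    """Probability of a segmentation: product of its word probabilities."""
    prob = 1.0
    for w in words:
        prob *= p_word(w)
    return prob

def splits(text, max_len=20):
    """All (first, rest) pairs where first is a non-empty prefix of text."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Most probable segmentation of text, as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=p_words)

print(segment("nowisthetime"))
```

Memoizing `segment` matters because the same suffix (for example "thetime") is reached through many different first-word choices; without the cache the recursion repeats that work exponentially often.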
Example
segment(“nowisthetime…”)
The Complete Program
Performance
Accuracy = 98%
Trained on 1.7B words (English)
Typical errors:
baseratesoughtto → base rate sought to
smallandinsignificant → small and in significant
ginormousego → g in or mouse go
Some Results
whorepresents.com: [“who”, “represents”]
therapistfinder.com: [“therapist”, “finder”]
expertsexchange.com: [“experts”, “exchange”]
speedofart.net: [“speed”, “of”, “art”]
penisland.com: error, expected [“pen”, “island”]
Problem 2: Spelling Correction
Mehran Sahami
Typical word processor: Tehran Salami
But Google can …
Statistical Machine Learning
Best correction = one with highest probability
Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)
P(c as a word) = estimated by counting
P(original is a typo for c) = estimated from the number of changes (fewer changes, higher probability)
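A minimal sketch of this model follows, under strong simplifying assumptions: the word counts are hypothetical, and the typo model is reduced to "any candidate needing fewer changes beats any candidate needing more", with P(c as a word) breaking ties. The names `edits1`, `known`, and `correct` are illustrative and not necessarily those of the program on the slides.

```python
from collections import Counter

# Hypothetical counts; a real system would count words in a large corpus.
WORDS = Counter({"spelling": 50, "spewing": 2, "the": 500, "they": 80, "then": 90})
TOTAL = sum(WORDS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one change away: a delete, transpose, replace, or insert."""
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in LETTERS]
    inserts = [a + c + b for a, b in pairs for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """The subset of words that appear in the counted vocabulary."""
    return {w for w in words if w in WORDS}

def correct(word):
    """Prefer 0 changes, then 1, then 2; among those, pick the highest P(c)."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    return max(candidates, key=lambda w: WORDS[w] / TOTAL)

print(correct("speling"))
print(correct("teh"))
```

For "speling", both "spelling" and "spewing" are one change away, so the word-frequency term decides between them.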
The Complete Program
Problem 3: Speech Recognition
An informal, incomplete grammar of the English language runs over 1,700 pages.
Invariably, simple models and a lot of data trump more elaborate models based on less data.
Problem 3: Speech Recognition
If you have a lot of data, memorisation is a good policy.
For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.
Problem 3: Speech Recognition
“Every time I fire a linguist, the performance of our speech recognition system goes up.”
--- Fred Jelinek
Problem 4: Machine Translation
Conclusion
(Statistical) [Machine] Learning Is
The Ultimate Agile Development Tool
Peter Norvig (Director of Research, Google)