TEXT STATISTICS 7 DAY 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane...
Text statistics 7
Day 30 - 11/05/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
03-Nov-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Final project
Open Spyder
Review
ConditionalFreqDist
>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)
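As a rough illustration of what the (category, word) pairs feed into, a conditional frequency distribution can be thought of as one frequency counter per condition. The sketch below mimics that idea with the standard library only; the toy `pairs` list and the helper name `build_cfd` are invented for illustration and are not part of NLTK.

```python
from collections import Counter, defaultdict

def build_cfd(pairs):
    """Group (condition, sample) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Toy (category, word) pairs standing in for the Brown corpus data.
pairs = [('news', 'the'), ('news', 'will'), ('news', 'the'),
         ('romance', 'could'), ('romance', 'the')]

toy_cfd = build_cfd(pairs)
print(sorted(toy_cfd))           # the conditions
print(toy_cfd['news']['the'])    # how often 'the' occurs under 'news'
```

Each condition gets its own counter, so the same word can have a different count under each category.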
Conditional frequency distribution
A more interesting example
           can  could  may  might  must  will
news        93     86   66     38    50   389
religion    82     59   78     12    54    71
hobbies    268     58  131     22    83   264
sci fi      16     49    4     12     8    16
romance     74    193   11     51    45    43
humor       16     30    8      8     9    13
Conditions = categories, sample = modal verbs
# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c, w)
...            for c in cat
...            for w in brown.words(categories=c)
...            if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
cfd.tabulate()
                  can  could  may  might  must  will
news               93     86   66     38    50   389
religion           82     59   78     12    54    71
hobbies           268     58  131     22    83   264
science_fiction    16     49    4     12     8    16
romance            74    193   11     51    45    43
humor              16     30    8      8     9    13
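The raw counts can be hard to compare across rows because the categories differ in size. One possible way to read them is as shares within each row. The sketch below copies the counts from the table into a plain dictionary and finds the most frequent modal per category; the names `counts`, `row_shares`, and `top_modal` are made up for this example, and Python 3 division is assumed.

```python
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# Counts copied from the cfd.tabulate() output.
counts = {
    'news':            [93, 86, 66, 38, 50, 389],
    'religion':        [82, 59, 78, 12, 54, 71],
    'hobbies':         [268, 58, 131, 22, 83, 264],
    'science_fiction': [16, 49, 4, 12, 8, 16],
    'romance':         [74, 193, 11, 51, 45, 43],
    'humor':           [16, 30, 8, 8, 9, 13],
}

def row_shares(row):
    """Turn one row of counts into proportions of that row's total."""
    total = sum(row)
    return [n / total for n in row]

top_modal = {}
for category, row in counts.items():
    shares = row_shares(row)
    best = max(range(len(modals)), key=lambda i: shares[i])
    top_modal[category] = modals[best]
    print(category, top_modal[category], round(shares[best], 2))
```

The shares only describe the six listed modals relative to each other; they say nothing about how frequent modals are in a category overall.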
cfd.plot()
Another example
The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']
cfd2.plot()
First try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if w.lower() in keys]
cfd2 = ConditionalFreqDist(keyYear)
cfd2.plot()
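To see what the first try actually collects, the sketch below runs the same comprehension over a tiny invented corpus instead of `inaugural`. The `toy_corpus` dictionary and its tokens are made up for illustration; only the file-name pattern follows the real fileids.

```python
keys = ['america', 'citizen']

# Hypothetical stand-in for inaugural.fileids() and inaugural.words(title).
toy_corpus = {
    '1789-Washington.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '2009-Obama.txt': ['America', 'and', 'every', 'American', 'citizen'],
}

keyYear = [(w, title[:4])              # title[:4] keeps the four-digit year
           for title in sorted(toy_corpus)
           for w in toy_corpus[title]
           if w.lower() in keys]       # exact match on the lowercased token

print(keyYear)
```

On this toy data, inflected forms such as 'Citizens' and 'American' are skipped, and the conditions are the word forms as they appear in the text rather than the lowercase keys.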
cfd2.plot()
Second try
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
# Pair the matching key k (not the raw word) with the year.
keyYear = [(k, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           for k in keys
           if w.lower().startswith(k)]
cfd3 = ConditionalFreqDist(keyYear)
cfd3.plot()
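The difference between exact matching and prefix matching can be checked on a few tokens without loading the corpus. The token list below is invented for illustration.

```python
keys = ['america', 'citizen']
tokens = ['America', 'Americans', 'citizen', 'Citizens', 'citizenship', 'nation']

# Exact matching, as in the first try.
exact = [w for w in tokens if w.lower() in keys]

# Prefix matching, as in the second try: pair each token with the key it starts with.
prefix = [(k, w) for w in tokens for k in keys if w.lower().startswith(k)]

print(exact)
print(prefix)
```

Prefix matching recovers the plural forms, but it also picks up derived words such as 'citizenship', which may or may not be wanted.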
cfd3.plot()
Stemming
Third try
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
from nltk.corpus import inaugural
from nltk.probability import ConditionalFreqDist
keys = ['america', 'citizen']
keyYear = [(w, title[:4])
           for title in inaugural.fileids()
           for w in inaugural.words(title)
           if stemmer.stem(w) in keys]
cfd4 = ConditionalFreqDist(keyYear)
cfd4.plot()
cfd4.plot()
Next time