NLTK: the Good, the Bad, and the Awesome
Jacob Perkins
• Python Text Processing with NLTK 2.0 Cookbook
• streamhacker.com
• weotta.com
• text-processing.com
• @japerk
The Good
• Makes NLProc easier and more accessible
• Python (great learning language)
• Lots of documentation (and 2 books!)
• Designed for training custom models
• Includes many training corpora
• Many algorithms to experiment with
The Bad
• NLProc is hard
• Few out-of-the-box solutions (see Pattern)
• Not designed for big data (see Mahout)
• Doesn’t have the latest algorithms (see scikit-learn)
• No online or active learning algorithms
More Bad
• Doesn’t play nice with pip or easy_install
• Python (Java: StanfordNLP, OpenNLP, GATE, Mahout)
• Models can use a lot of memory (& disk if pickled)
The Awesome
• Great for education and research
• Lots of users & active community
• Extensible interfaces
• Training algorithms span human languages
More Awesome
• Trained models can be very fast
• Well-known algorithms can be very accurate
• NLTK-Trainer (train models with 0 code)
• Corpus bootstrapping
Some Numbers
• 3 Classification Algorithms
• 9 Part-of-Speech Tagging Algorithms
• Stemming Algorithms for 15 Languages
• 5 Word Tokenization Algorithms
• Sentence Tokenizers for 16 Languages
• 60 included corpora
Text-Processing.com
• NLTK Demos & APIs
• Sentiment Analysis
• Part-of-Speech Tagging & Chunking / NER
• Stemming
• Tokenization
[Chart: Memory Usage, text-processing.com]
[Chart: CPU Usage, text-processing.com]
NLTK-Trainer
• https://github.com/japerk/nltk-trainer
• 3 Training Command Scripts
‣ train_classifier.py
‣ train_tagger.py
‣ train_chunker.py
• Easy to tweak training parameters
• Duck-typed corpus reading
Training Classifiers
• train_classifier.py movie_reviews --instances paras
• train_classifier.py movie_reviews --instances paras --min_score 2 --ngrams 1 --ngrams 2
• train_classifier.py movie_reviews --instances paras --classifier MEGAM
• train_classifier.py movie_reviews --instances paras --cross-fold 10
• Pickled models are saved in ~/nltk_data/classifiers/
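Since the trained classifiers are saved as pickles, using one later is mostly a matter of unpickling it and building features the same way they were built in training. The sketch below assumes bag-of-words features of the form `{word: True}`; `load_model` and `bag_of_words` are hypothetical helper names, and the pickle file name is left as a placeholder because it depends on the training options.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a trained model. Only unpickle files you created or trust."""
    with Path(path).expanduser().open("rb") as f:
        return pickle.load(f)


def bag_of_words(words):
    """Build a {word: True} feature dict (assumed to match the training features)."""
    return {word: True for word in words}


# Hypothetical usage; replace the placeholder with the file written by training:
# classifier = load_model("~/nltk_data/classifiers/<your_model>.pickle")
# label = classifier.classify(bag_of_words("a great movie".split()))
```

If training used n-grams or a minimum score, the feature function here would need to change to match, otherwise the classifier sees features it was never trained on.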
Training Taggers
• train_tagger.py treebank
• train_tagger.py treebank --sequential ubt --brill
• train_tagger.py treebank --sequential '' --classifier NaiveBayes
• train_tagger.py mac_morpho --simplify_tags
• Pickled models are saved in ~/nltk_data/taggers/
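To make the `--sequential ubt` option more concrete, here is a rough pure-Python illustration of sequential backoff tagging, assuming `ubt` means a unigram, bigram, and trigram tagger chained so that each one defers to the previous when it has no answer. This is not NLTK's implementation; `ContextTagger` and `build_ubt_chain` are hypothetical names used only for this sketch.

```python
from collections import Counter, defaultdict


class ContextTagger:
    """Looks up a tag by (previous tags, word); defers to `backoff` on a miss."""

    def __init__(self, train_sents, context_size, backoff=None):
        self.context_size = context_size
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in train_sents:
            history = []
            for word, tag in sent:
                counts[self._context(word, history)][tag] += 1
                history.append(tag)
        # Keep the most frequent tag seen for each context.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, word, history):
        n = self.context_size
        previous = tuple(history[-n:]) if n else ()
        return (previous, word)

    def choose_tag(self, word, history):
        tag = self.table.get(self._context(word, history))
        if tag is None and self.backoff is not None:
            tag = self.backoff.choose_tag(word, history)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.choose_tag(word, history))
        return list(zip(words, history))


def build_ubt_chain(train_sents):
    """Trigram tagger backing off to bigram, then unigram."""
    unigram = ContextTagger(train_sents, 0)
    bigram = ContextTagger(train_sents, 1, backoff=unigram)
    return ContextTagger(train_sents, 2, backoff=bigram)


train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
tagger = build_ubt_chain(train)
result = tagger.tag(["dog", "barks"])
```

Here "dog" at the start of a sentence was never seen by the trigram or bigram tagger, so the unigram tagger supplies its tag; a word that no tagger knows comes back tagged `None`.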
Training Chunkers
• train_chunker.py treebank_chunk
• train_chunker.py treebank_chunk --classifier NaiveBayes
• train_chunker.py conll2000 --fileids train.txt
• Pickled models are saved in ~/nltk_data/chunkers/
Corpus Bootstrapping
• Guess & correct is easier than starting from scratch
• Use an existing model for initial guesses
• emoticons
‣ :) = “pos”
‣ :( = “neg”
• ratings
‣ 5 stars = “pos”
‣ 1 star = “neg”
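The emoticon and rating heuristics above can be sketched as a small labeling pass that produces initial guesses for later correction. The function names are hypothetical, and returning `None` for mixed or missing signals (so those texts are left for manual labeling) is an assumption of this sketch rather than something stated in the slides.

```python
def label_from_emoticons(text):
    """Guess 'pos' or 'neg' from emoticons; None if absent or contradictory."""
    has_pos = ":)" in text
    has_neg = ":(" in text
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None


def label_from_rating(stars):
    """Guess a label from a star rating; only the extremes are treated as reliable."""
    if stars == 5:
        return "pos"
    if stars == 1:
        return "neg"
    return None


def bootstrap_corpus(texts):
    """Group texts by guessed label, skipping anything without a clear signal."""
    corpus = {"pos": [], "neg": []}
    for text in texts:
        label = label_from_emoticons(text)
        if label is not None:
            corpus[label].append(text)
    return corpus


corpus = bootstrap_corpus(["great show :)", "missed my bus :(", "meh", "up :) and down :("])
```

The resulting lists are only guesses; the point is that reviewing and correcting them is less work than annotating everything by hand.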
Portuguese Phrase Extraction & Classification
• similar to condensr.com
• Brazilian Portuguese
• aspect classification is easy with training corpus
• need chunked corpus for phrase extraction
• use mac_morpho & nltk-trainer to train initial tagger
• part-of-speech tag annotation is time-consuming
• simplified tags are much easier
• bracketed phrases without POS tags
treebank_chunk
[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, will/MD join/VB [ the/DT board/NN ] as/IN [ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./.
Just Brackets
[ Pierre Vinken ] , [ 61 years ] old , will join [ the board ] as [ a nonexecutive director Nov. 29 ] .
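As a rough illustration of how the two formats relate, the sketch below strips the `/TAG` suffixes from a tagged, bracketed sentence to get the brackets-only form, and then pulls out the bracketed phrases. It assumes tokens and brackets are separated by whitespace; `strip_tags` and `bracketed_phrases` are hypothetical helper names, not part of NLTK or NLTK-Trainer.

```python
def strip_tags(chunked):
    """Drop the /TAG suffix from every token, keeping the [ ] brackets."""
    out = []
    for token in chunked.split():
        if token in ("[", "]"):
            out.append(token)
        else:
            # Split on the last slash so words that contain "/" survive.
            out.append(token.rsplit("/", 1)[0])
    return " ".join(out)


def bracketed_phrases(text):
    """Return the phrases found between [ and ] in a brackets-only sentence."""
    phrases, current = [], None
    for token in text.split():
        if token == "[":
            current = []
        elif token == "]":
            if current is not None:
                phrases.append(" ".join(current))
            current = None
        elif current is not None:
            current.append(token)
    return phrases


tagged = (
    "[ Pierre/NNP Vinken/NNP ] ,/, [ 61/CD years/NNS ] old/JJ ,/, "
    "will/MD join/VB [ the/DT board/NN ] as/IN "
    "[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ./."
)
plain = strip_tags(tagged)
phrases = bracketed_phrases(plain)
```

Going the other way (from brackets alone to a tagged, chunked corpus) is where an initial tagger is needed to fill in the part-of-speech guesses.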
NLP at Weotta
• Parsing & information extraction
• Text cleaning & normalization (more parsing)
• Text & keyword classification
• De-duplication
• Search indexing / IR
• Sentiment analysis
• Human integration