Posted on 18-Feb-2019
Corpus Bootstrapping with NLTK
by Jacob Perkins
http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk
Problem
you want to do NLP (natural language processing)
many proven supervised training algorithms
but you don’t have a training corpus
Solution
make a custom training corpus
Problems with Manual Annotation
takes time
requires expertise
expert time costs $$$
Solution: Bootstrap
less time
less expertise
costs less
requires thinking & creativity
Corpus Bootstrapping at Weotta
review sentiment
keyword classification
phrase extraction & classification
Bootstrapping Examples
english -> spanish sentiment
phrase extraction
Translating Sentiment
start with english sentiment corpus & classifier
english -> spanish -> spanish
English -> Spanish -> Spanish
1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
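The six steps above can be sketched as a generic loop. The `train`, `classify`, and `correct` callables here are hypothetical stand-ins for the real tools on the following slides (train_classifier.py, classify_to_corpus.py, and manual review):

```python
# A minimal sketch of the bootstrap loop (steps 1-6 above).
# train/classify/correct are hypothetical stand-ins, not real tools.
def bootstrap(seed_corpus, unlabeled, train, classify, correct, rounds=2):
    corpus = list(seed_corpus)              # 1. translated seed examples
    classifier = train(corpus)              # 2. train the initial classifier
    for _ in range(rounds):
        # 3. classify new text into a candidate corpus
        candidates = [(text, classify(classifier, text)) for text in unlabeled]
        # 4. correct the candidate labels (a manual step in practice)
        corpus.extend(correct(candidates))  # 6. add to corpus
        classifier = train(corpus)          # 5. retrain
    return classifier, corpus
```

Each pass grows the corpus, so later classifiers see more (corrected) training data than earlier ones.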
Translate Corpus
$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier
$ train_classifier.py spanish_movie_reviews
Create New Corpus
$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction
1. scan each file
2. move incorrect examples to correct file
Train New Classifier
$ train_classifier.py spanish_sentiment
Adding to the Corpus
start with >90% probability
retrain
carefully decrease probability threshold
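The thresholding behind classify_to_corpus.py can be sketched like this; `prob_classify` is any function returning a label-to-probability dict (with NLTK's NaiveBayesClassifier you'd get these probabilities from its `prob_classify` method):

```python
def classify_above_threshold(texts, prob_classify, threshold=0.9):
    """Keep only examples whose most probable label meets the threshold.
    Start high (>0.9), retrain, then carefully lower the threshold."""
    accepted = []
    for text in texts:
        probs = prob_classify(text)  # e.g. {'pos': 0.95, 'neg': 0.05}
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            accepted.append((text, label))
    return accepted
```

Only high-confidence examples make it into the corpus at first; lowering the threshold trades precision for coverage, which is why each decrease should be followed by correction and retraining.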
Add more at a Lower Threshold
$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
When are you done?
what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
Tips
garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
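Custom features for NLTK classifiers are just dicts mapping feature names to values. A minimal sketch combining word presence with adjacent-word bigrams, a combination that often helps sentiment accuracy:

```python
def bag_of_words_and_bigrams(words):
    """Feature dict for an NLTK classifier: unigram presence
    plus adjacent-word bigrams (e.g. ('not', 'good'))."""
    feats = {word: True for word in words}
    for w1, w2 in zip(words, words[1:]):
        feats[(w1, w2)] = True
    return feats
```

Featuresets built this way can be passed straight to e.g. `nltk.NaiveBayesClassifier.train([(bag_of_words_and_bigrams(words), label), ...])`.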
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora
English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
Train Tagger
$ train_tagger.py treebank --simplify_tags
Phrase Annotation
Hello world, [this is an important phrase].
Tag Phrases
$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase
Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
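NLTK's ChunkedCorpusReader reads exactly this bracketed word/tag format. For illustration, a stand-alone parser of one such line (an assumption-free sketch, not the reader's actual implementation):

```python
def parse_chunked_line(line):
    """Parse space-separated 'word/TAG' tokens, with [ ] marking a
    chunked phrase. Returns (tagged_tokens, phrases), where each
    phrase is the list of (word, tag) pairs inside one chunk."""
    tagged, phrases, current = [], [], None
    for token in line.split():
        if token == '[':
            current = []            # entering a chunk
        elif token == ']':
            phrases.append(current)  # leaving a chunk
            current = None
        else:
            # split on the last '/' so tokens like ',/,' work
            word, _, tag = token.rpartition('/')
            tagged.append((word, tag))
            if current is not None:
                current.append((word, tag))
    return tagged, phrases
```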
Correct Unknown Words
1. find -NONE- tagged words
2. fix tags
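Step 1 can be automated: scan the tagged sentences and report every word the tagger couldn't handle (a sketch over lists of (word, tag) pairs, assuming -NONE- marks unknowns as above):

```python
def find_unknown_words(tagged_sents, unknown_tag='-NONE-'):
    """Collect words with the unknown tag, along with the index of
    the sentence each appears in, so they can be located and fixed."""
    unknowns = []
    for i, sent in enumerate(tagged_sents):
        for word, tag in sent:
            if tag == unknown_tag:
                unknowns.append((i, word))
    return unknowns
```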
Train New Tagger
$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker
$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # group chunked phrases by their chunk label, skipping the root
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):  # in NLTK 3+, use s.label()
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips
error correction is faster than manual annotation
find close enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement
Links
http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com