Posted on 18-Feb-2019
Corpus Bootstrapping with NLTK
by Jacob Perkins
http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk
Problem
you want to do NLP (natural language processing)
many proven supervised training algorithms
but you don’t have a training corpus
Solution
make a custom training corpus
Problems with Manual Annotation
takes time
requires expertise
expert time costs $$$
Solution: Bootstrap
less time
less expertise
costs less
requires thinking & creativity
Corpus Bootstrapping at Weotta
review sentiment
keyword classification
phrase extraction & classification
Bootstrapping Examples
english -> spanish sentiment
phrase extraction
Translating Sentiment
start with english sentiment corpus & classifier
english -> spanish -> spanish
English -> Spanish -> Spanish
1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
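The six steps above can be sketched as a generic loop. The `train`, `classify`, and `correct` callables here are hypothetical stand-ins for the real tools on the following slides (train_classifier.py, classify_to_corpus.py, and manual review):

```python
# A minimal sketch of the bootstrap loop (steps 1-6 above).
# train/classify/correct are hypothetical stand-ins, not real tools.
def bootstrap(seed_corpus, unlabeled, train, classify, correct, rounds=2):
    corpus = list(seed_corpus)              # 1. translated seed examples
    classifier = train(corpus)              # 2. train the initial classifier
    for _ in range(rounds):
        # 3. classify new text into a candidate corpus
        candidates = [(text, classify(classifier, text)) for text in unlabeled]
        # 4. correct the candidate labels (a manual step in practice)
        corpus.extend(correct(candidates))  # 6. add to corpus
        classifier = train(corpus)          # 5. retrain
    return classifier, corpus
```

Each pass grows the corpus, so later classifiers see more (corrected) training data than earlier ones.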
Translate Corpus
$ translate_corpus.py movie_reviews --source english --target spanish
Train Initial Classifier
$ train_classifier.py spanish_movie_reviews
Create New Corpus
$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction
1. scan each file
2. move incorrect examples to correct file
Train New Classifier
$ train_classifier.py spanish_sentiment
Adding to the Corpus
start with >90% probability
retrain
carefully decrease probability threshold
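The thresholding behind classify_to_corpus.py can be sketched like this; `prob_classify` is any function returning a label-to-probability dict (with NLTK's NaiveBayesClassifier you'd get these probabilities from its `prob_classify` method):

```python
def classify_above_threshold(texts, prob_classify, threshold=0.9):
    """Keep only examples whose most probable label meets the threshold.
    Start high (>0.9), retrain, then carefully lower the threshold."""
    accepted = []
    for text in texts:
        probs = prob_classify(text)  # e.g. {'pos': 0.95, 'neg': 0.05}
        label = max(probs, key=probs.get)
        if probs[label] >= threshold:
            accepted.append((text, label))
    return accepted
```

Only high-confidence examples make it into the corpus at first; lowering the threshold trades precision for coverage, which is why each decrease should be followed by correction and retraining.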
Add more at a Lower Threshold
$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
When are you done?
what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
Tips
garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
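Custom features for NLTK classifiers are just dicts mapping feature names to values. A minimal sketch combining word presence with adjacent-word bigrams, a combination that often helps sentiment accuracy:

```python
def bag_of_words_and_bigrams(words):
    """Feature dict for an NLTK classifier: unigram presence
    plus adjacent-word bigrams (e.g. ('not', 'good'))."""
    feats = {word: True for word in words}
    for w1, w2 in zip(words, words[1:]):
        feats[(w1, w2)] = True
    return feats
```

Featuresets built this way can be passed straight to e.g. `nltk.NaiveBayesClassifier.train([(bag_of_words_and_bigrams(words), label), ...])`.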
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora
English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/
Train Tagger
$ train_tagger.py treebank --simplify_tags
Phrase Annotation
Hello world, [this is an important phrase].
Tag Phrases
$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase
Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
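NLTK's ChunkedCorpusReader reads exactly this bracketed word/tag format. For illustration, a stand-alone parser of one such line (an assumption-free sketch, not the reader's actual implementation):

```python
def parse_chunked_line(line):
    """Parse space-separated 'word/TAG' tokens, with [ ] marking a
    chunked phrase. Returns (tagged_tokens, phrases), where each
    phrase is the list of (word, tag) pairs inside one chunk."""
    tagged, phrases, current = [], [], None
    for token in line.split():
        if token == '[':
            current = []            # entering a chunk
        elif token == ']':
            phrases.append(current)  # leaving a chunk
            current = None
        else:
            # split on the last '/' so tokens like ',/,' work
            word, _, tag = token.rpartition('/')
            tagged.append((word, tag))
            if current is not None:
                current.append((word, tag))
    return tagged, phrases
```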
Correct Unknown Words
1. find -NONE- tagged words
2. fix tags
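Step 1 can be automated: scan the tagged sentences and report every word the tagger couldn't handle (a sketch over lists of (word, tag) pairs, assuming -NONE- marks unknowns as above):

```python
def find_unknown_words(tagged_sents, unknown_tag='-NONE-'):
    """Collect words with the unknown tag, along with the index of
    the sentence each appears in, so they can be located and fixed."""
    unknowns = []
    for i, sent in enumerate(tagged_sents):
        for word, tag in sent:
            if tag == unknown_tag:
                unknowns.append((i, word))
    return unknowns
```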
Train New Tagger
$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker
$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases

import collections, nltk.data
from nltk import tokenize
from nltk.tag import untag

tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # group chunked phrases by their chunk label, skipping the root
    d = collections.defaultdict(list)
    for sub in t.subtrees(lambda s: s.node != 'S'):  # in NLTK 3+, use s.label()
        d[sub.node].append(' '.join(untag(sub.leaves())))
    return d

sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<type 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips
error correction is faster than manual annotation
find close enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement
Links
http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com