The Switchabalizer - our journey from spell checker to homophone corrector

description

Presentation given at Open Data Bay Area by Oskar Singer on using Common Crawl and NLP techniques to improve grammar and spelling correction, specifically homophones.

Transcript of The Switchabalizer - our journey from spell checker to homophone corrector

Introduction | The Problem | First Attempt | Second Attempt | Conclusion

The Switchabalizer: Our journey from spell checker to homophone corrector

Oskar Singer

July 23, 2014

Oskar Singer The Switchabalizer

How I got here

I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing.

Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics

I worked with Lexalytics’ head of software engineering on this project

Lexalytics often uses CommonCrawl, and it was a great option for a training data set

Motivation

Lexalytics provides sentiment analysis software

Sentiment analysis relies heavily on sentence parsing and part-of-speech tagging

Misspellings and misusage can do serious damage to accuracy for those two tasks

The Approach | The Weaknesses

Approach

We employed an open-source spell-checker called Hunspell

Hunspell gives an unranked list of correction suggestions

So we took the argmax of a home-baked scoring function that:

penalized string edit distance

penalized keyboard distance

rewarded high word frequencies, which were harvested from CommonCrawl data
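A scoring function of this shape could look roughly like the sketch below. It is not the original implementation: the weights, the toy QWERTY coordinates, the exact definition of keyboard distance, and the `word_counts` table are all assumptions made for illustration. In practice the candidate list would come from Hunspell and the counts from CommonCrawl.

```python
import math

# Toy QWERTY layout: character -> (row, column). Assumed for illustration.
_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(_ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of Euclidean key distances over aligned positions that differ."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(typo, candidate, word_counts, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Higher is better: reward log frequency, penalize both distances."""
    freq = math.log(word_counts.get(candidate, 0) + 1)
    return (w_freq * freq
            - w_edit * edit_distance(typo, candidate)
            - w_key * keyboard_distance(typo, candidate))


def best_correction(typo, candidates, word_counts):
    """Argmax of the score over an unranked candidate list."""
    return max(candidates, key=lambda c: score(typo, c, word_counts))


# Hypothetical counts, standing in for CommonCrawl frequencies.
counts = {"the": 100000, "tea": 50, "ten": 80}
print(best_correction("teh", ["the", "tea", "ten"], counts))
```

Taking the log of the frequency keeps very common words from swamping the two distance penalties; the relative weights would need tuning on real data.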

Failure

Hunspell had an error rate of 216%

What Happened?

How is this possible? Two reasons:

Hunspell missed all the mistakes

Hunspell made false corrections
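The slides do not say how the error rate was computed. One reading under which a rate above 100% makes sense is to count the mistakes left in the text after correction and divide by the mistakes that were there before. The counts below are invented purely to show the arithmetic; they are not the project's data.

```python
def relative_error_rate(original_errors, fixed, introduced):
    """Errors remaining after correction, relative to errors before it."""
    remaining = original_errors - fixed + introduced
    return remaining / original_errors


# Hypothetical counts: every real mistake is missed (fixed=0) and the
# checker "corrects" 12 words that were fine, adding 12 new mistakes.
rate = relative_error_rate(original_errors=10, fixed=0, introduced=12)
print(f"{rate:.0%}")
```

Under this reading, missing every mistake alone gives 100%, and each false correction pushes the rate higher.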

Hunspell was a poor choice for a couple of reasons:

Hunspell’s vocabulary is not appropriate or flexible enough for the Twitter domain

Hunspell can’t detect correctly spelled words that are out of context

Twitter’s vocabulary of abbreviations and acronyms is constantly growing

Hunspell’s internal dictionary is not prepared for this

Example: ur

What was Hunspell’s correction?

Ur (the ancient Sumerian city-state)

When the issue is misuse rather than misspelling, Hunspell completely ignores the problem

Specifically, commonly misused homophones were a huge problem in our data

Examples: two/too/2/to; their/there/they’re; your/you’re

Brainstorm | The Approach | Testing and Results

Addressing Misusage

How do we capture the idea of misuse?

Context

How can we capture context?

Rule set?

Probabilistic approach!

Probability Model

Bayes network

Conditioned on the preceding and succeeding words

Assumes these two words are independent given the word between them

Does not use a bag-of-words approach (word position is taken into account)

Conditional Probability of Preceding or Succeeding Words

$$P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i w_j)}{\#(w_j)},$$

where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$ is the number of occurrences of a sequence of words

$$P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j w_i)}{\#(w_j)},$$

where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word
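Both ratios reduce to unigram and bigram counts, so they can be estimated with two counters. The sketch below assumes pre-tokenized sentences and a three-sentence toy corpus in place of CommonCrawl text; the function names are made up for this example.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and adjacent-word bigrams in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0


# Toy corpus standing in for CommonCrawl text (assumption).
corpus = [
    "i went to the store".split(),
    "i want to go too".split(),
    "we went to the park".split(),
]
unigrams, bigrams = count_ngrams(corpus)
print(p_pre("went", "to", unigrams, bigrams))  # #(went to) / #(to)
print(p_suc("the", "to", unigrams, bigrams))   # #(to the) / #(to)
```

Returning 0.0 for an unseen middle word is a choice made here to avoid division by zero; the slides do not describe how unseen words were handled.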

Conditional Probability of Both Words

$$P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$$

$$\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$$

The first equation holds because of our assumption of independence between the preceding and succeeding words

There is a missing term in the scoring function that I will address in the Future Work section
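In code, the log form is usually the one to implement, since products of small probabilities underflow quickly. A minimal sketch follows; treating an unseen context as negative infinity is an assumption, as the slides do not mention smoothing.

```python
import math


def log_p_both(p_pre, p_suc):
    """log P(pre, suc | w_j) under the independence assumption."""
    if p_pre == 0.0 or p_suc == 0.0:
        return float("-inf")  # unseen context: log(0) is undefined
    return math.log(p_pre) + math.log(p_suc)


# Hypothetical probabilities, chosen only to check the identity.
a, b = 0.02, 0.005
assert math.isclose(math.exp(log_p_both(a, b)), a * b)
```

Because the logarithm is monotonic, ranking candidates by the log score gives the same order as ranking them by the product.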

Switchable Sets

Only certain groups should be compared, e.g. "too" should not be scored against "their"

Comparable switchables are grouped into switchable sets

Each switchable is mapped to its switchable set
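One way to represent this mapping is a dictionary from each word to the set it belongs to. The groups below are based on the homophone examples named earlier in the talk; the full list used in the project is not given in the slides, and the names `SWITCHABLE_GROUPS`, `SWITCHABLE_SET` and `candidates` are hypothetical.

```python
# Example groups only; the project's actual list is not in the slides.
SWITCHABLE_GROUPS = [
    {"to", "too", "two"},
    {"their", "there", "they're"},
    {"your", "you're"},
]

# Map each switchable to the (shared, immutable) set it belongs to.
SWITCHABLE_SET = {
    word: frozenset(group)
    for group in SWITCHABLE_GROUPS
    for word in group
}


def candidates(word):
    """Words that may replace `word`; non-switchables map to themselves."""
    return SWITCHABLE_SET.get(word, frozenset({word}))


print(sorted(candidates("too")))
```

With this lookup, "too" is only ever compared against "to" and "two", never against "their".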

Picking the Word

The Final Equation

$$S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$$

$$v^{*} = \operatorname*{argmax}_{v \in V_{w_j}} S(w_i, v, w_k)$$

where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i w_j w_k$, $V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^{*}$ is the ideal switchable
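Put together, the selection step is a small argmax over the switchable set. The sketch below uses hypothetical counts in place of CommonCrawl statistics, and scores unseen bigrams as negative infinity, which is an assumption rather than something stated in the slides.

```python
import math
from collections import Counter


def log_prob(count, total):
    """log(count / total), or -inf when the event was never observed."""
    return math.log(count / total) if count and total else float("-inf")


def score(w_i, w_j, w_k, unigrams, bigrams):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    return (log_prob(bigrams[(w_i, w_j)], unigrams[w_j])
            + log_prob(bigrams[(w_j, w_k)], unigrams[w_j]))


def pick_word(w_i, w_j, w_k, switchable_set, unigrams, bigrams):
    """v* = argmax over v in the switchable set of S(w_i, v, w_k)."""
    # w_j itself is not scored separately; it only selects the set.
    # sorted() makes tie-breaking deterministic.
    return max(sorted(switchable_set),
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))


# Toy counts (hypothetical).
unigrams = Counter({"their": 100, "there": 120, "they're": 60})
bigrams = Counter({
    ("over", "there"): 30, ("there", "now"): 10,
    ("over", "their"): 2, ("their", "now"): 1,
})
best = pick_word("over", "their", "now",
                 {"their", "there", "they're"}, unigrams, bigrams)
print(best)
```

Note that the result depends only on the neighbors and the set, not on which member of the set appeared in the input.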

First Pass

What about common misspellings that intersect with switchables?

Example: "ur"

Should we put them in the switchable sets?

My opinion: no!

Realistically, it’s probably okay. I opted for a more elegant solution

Replace all common misspellings with something from the appropriate switchable set

The model’s results are agnostic to the switchable that activates it
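This first pass can be sketched as a plain lookup that runs before scoring. The table below is hypothetical; the slides name only "ur" as an example.

```python
# Hypothetical lookup table: common misspelling -> any member of the
# switchable set it belongs to. Which member is chosen does not matter,
# because the model rescores every member of the set anyway.
MISSPELLING_TO_SWITCHABLE = {
    "ur": "your",
    "thier": "their",
    "youre": "you're",
}


def first_pass(tokens):
    """Replace known misspellings so the model sees a real switchable."""
    return [MISSPELLING_TO_SWITCHABLE.get(t.lower(), t) for t in tokens]


print(first_pass("is that ur dog".split()))
```

Keeping misspellings out of the switchable sets means the model never proposes "ur" as a correction; it can only move from a misspelling to a valid word.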

Testing

Assume Wikipedia has correct usage of all switchables

Replace target words in Wikipedia articles with words from their switchable set

Run the Switchabalizer on corrupted articles
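The test procedure can be sketched as a corruption step plus a comparison against the untouched text. The slides do not define the error metric, so the one below (fraction of switchable positions that end up different from the original) is an assumption, as are the function names.

```python
import random


def corrupt(tokens, switchable_set_of, rng):
    """Swap every switchable for a random member of its set."""
    corrupted = []
    for tok in tokens:
        group = switchable_set_of.get(tok)
        # sorted() makes the random choice reproducible for a given seed
        corrupted.append(rng.choice(sorted(group)) if group else tok)
    return corrupted


def error_rate(original, corrected, switchable_set_of):
    """Fraction of switchable positions that differ from the original."""
    positions = [i for i, tok in enumerate(original) if tok in switchable_set_of]
    if not positions:
        return 0.0
    wrong = sum(original[i] != corrected[i] for i in positions)
    return wrong / len(positions)


group = frozenset({"their", "there", "they're"})
sets = {w: group for w in group}
clean = "they left their coats over there".split()
noisy = corrupt(clean, sets, random.Random(0))
# A real run would pass `noisy` through the corrector first; scoring the
# corrupted text directly just shows how the metric is computed.
print(error_rate(clean, noisy, sets))
```

Because the clean article is the reference, no hand labeling is needed, which is what makes this setup cheap to run at scale.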

Results

How did we do?

20% error

Future Work | Call to Action

Future Work

Ideal Scoring Function

$$S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k))$$

$$= \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$$

Forgot the $P(w_j)$ term in the factorization of the joint distribution, which resulted in a slightly ill-fitting conditional distribution. Remember this for reimplementation!
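A sketch of the score with the prior term included follows. It assumes $P(w_j)$ is estimated as the word's count divided by the total number of tokens, which the slides do not specify, and the counts are hypothetical.

```python
import math
from collections import Counter


def ideal_score(w_i, w_j, w_k, unigrams, bigrams, total_words):
    """log P(w_j) + log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    n_j = unigrams[w_j]
    n_pre = bigrams[(w_i, w_j)]
    n_suc = bigrams[(w_j, w_k)]
    if not (n_j and n_pre and n_suc):
        return float("-inf")  # unseen word or context
    return (math.log(n_j / total_words)
            + math.log(n_pre / n_j)
            + math.log(n_suc / n_j))


# Hypothetical counts: "two" is rare, so its conditional ratios look
# large even though "want to go" is far more common in absolute terms.
unigrams = Counter({"to": 1000, "two": 10})
bigrams = Counter({
    ("want", "to"): 100, ("to", "go"): 50,
    ("want", "two"): 2, ("two", "go"): 1,
})
N = 10000
print(ideal_score("want", "to", "go", unigrams, bigrams, N))
print(ideal_score("want", "two", "go", unigrams, bigrams, N))
```

With these counts the conditional-only score would prefer "two" (0.2 × 0.1 against 0.1 × 0.05), while the prior term tips the decision back to "to".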

Testing conditions were not ideal because:

Test data is not target data

Mistakes are contrived

Somebody make a labeled test set, then tune the algorithm to it!

Here are some ideas I had for future experiments:

Use a discriminative model like maximum entropy

Consider higher-order neighbor words (more than one position away from the target)

Implement for other languages

Start Coding!

Anyone else can do this too!

Straightforward probability model

25-50 lines of Python

Freely accessible data from CommonCrawl!

Go learn about ML and NLP! Get your hands dirty and add your own mods! Find new problems and try new solutions!

Thank You, CommonCrawl!

Thanks so much to Lisa, Stephen, Grace and the rest of the team for providing such a fantastic resource and bringing me down to San Francisco to present!
