Transcript
Page 1: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

How to prepare data for NLP

Loryfel Nunez

@lorynyc

Page 2: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

California Gold Rush

Page 3: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

“Extracting actionable information from modern big data sets requires the equivalent processing infrastructure of extracting a nugget of GOLD from a mountain of DIRT.”

Nikolas Markou (via LinkedIn)

Page 4: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

1. Have an intuition on how things work

2. Breaking data down

3. Keep it simple .. if possible

Page 5: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

How does it work, anyway?

1

Page 6: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭
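
The counts on this slide ("dog: 3, 2, 1") can be read as how often a term occurs in each of three documents. A minimal sketch of that representation, using only the standard library, is below. The three example documents and the `term_counts` helper are made up for illustration and are not from the talk.

```python
import re

def term_counts(documents, terms):
    """Return {term: [occurrences in doc 1, doc 2, ...]} (case-insensitive)."""
    table = {}
    for term in terms:
        # Word boundaries keep "dog" from matching inside longer words
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        table[term] = [len(pattern.findall(doc)) for doc in documents]
    return table

# Hypothetical documents chosen to reproduce the numbers on the slide
DOCS = [
    "The dog chased a dog, then another dog.",
    "My dog met your dog.",
    "A dog in a red coat.",
]

COUNTS = term_counts(DOCS, ["dog", "red coat"])
for term, row in COUNTS.items():
    print(term + ":", ", ".join(str(n) for n in row))
```

Each row is the vector an algorithm actually sees for that term, which is why the choice of document unit and text representation matters so much.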

Page 7: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Controlling the input

Document Unit

Representation of text

Page 8: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Inside the Machine

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Smith acquires shares of Novak and Kline for $10.99 per share .

Page 9: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

BREAK IT DOWN

2

Page 10: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Let’s Break it Down

á

Novák

Novák and Kline

Smith acquires shares of Novak and Kline for $10.99 per share.

Smith acquires shares of

Novak and Kline for $10.99 per

share.

Smith Inc. acquires shares of

Novak and Kline for $10.99 per

share.

Smith acquires common

shares of N & K for

$10.99/share.

Page 11: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>
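
One way to start on markup like this is to separate the running text from the table before any NLP step. The sketch below uses only the standard-library `html.parser`. The `ArticleParser` class and `split_article` function are illustrative names, not something shown in the talk, and real pages will likely need more rules than this.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect running text and table cells separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into &
        self.in_table = False
        self.text_parts = []   # text found outside any <table>
        self.table_rows = []   # one list of cell strings per <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = True
        elif tag == "tr" and self.in_table:
            self.table_rows.append([])

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False

    def handle_data(self, data):
        data = " ".join(data.split())  # collapse whitespace and newlines
        if not data:
            return
        if self.in_table and self.table_rows:
            self.table_rows[-1].append(data)
        elif not self.in_table:
            self.text_parts.append(data)

def split_article(markup):
    """Return (plain text outside tables, list of table rows)."""
    parser = ArticleParser()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.text_parts), parser.table_rows

SAMPLE = """<p><b>Smith Buys Novak</b></p>
<p></p>
<p>by Anna Smith<p>
<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for
$10.99/share.</p>
<table style="width:100%">
<tr><th>Col1</th><th>Col2</th> </tr>
<tr><td>data1</td><td>data2</td></tr>
</table>"""

text, rows = split_article(SAMPLE)
print(text)
print(rows)
```

Keeping the table rows apart from the prose means the tokenizer never sees cell values as if they were part of a sentence.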

Page 12: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

… if possible

3

Page 13: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Character

á

&amp;

Do you know the encoding of your input data?

◉ User tells you

◉ Metadata

◉ Figure it out (using chardet, or similar)

◉ Have your own heuristics
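
As a rough illustration of the "have your own heuristics" option, the sketch below tries a short list of candidate encodings in order, then cleans up HTML entities such as `&amp;` and accented characters such as `á`. It uses only the standard library. The candidate list and the helper names are assumptions made for this example, and a detector such as chardet could replace the first step.

```python
import html
import unicodedata

# Assumed order: strict UTF-8 first, then common single-byte fallbacks.
# latin-1 accepts any byte sequence, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")

def decode_with_fallback(raw):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")

def strip_accents(text):
    """Map accented letters to their base letter, e.g. 'á' to 'a'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

raw_bytes = "Novák &amp; Kline".encode("cp1252")
decoded, used = decode_with_fallback(raw_bytes)
print(used, "|", html.unescape(decoded), "|", strip_accents(decoded))
```

A clean decode only shows that the bytes are valid in that encoding, not that the guess is right, so metadata or a statement from the data provider should still win when available.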

Page 14: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Tokens

Forty-two, 42

Post-colonial, postcolonial

eBay, Ebay, EBAY, ebay

Fed, FED, fed

C.A.T., CAT

Heuristics

Mappings

Transformations

numToWord, POS (from spaCy or NLTK)
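
The heuristics and mappings listed here can be combined in a small normalization function. The sketch below is one possible arrangement, not the speaker's code. The `CANONICAL` table and the `normalize_token` name are invented for the example, and the choice of canonical forms is arbitrary.

```python
import re

# Hypothetical mapping from lowercased variants to one canonical form
CANONICAL = {
    "ebay": "eBay",
    "forty-two": "42",
    "post-colonial": "postcolonial",
}

# Heuristic: two or more single letters, each followed by a dot (C.A.T.)
DOTTED_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}")

def normalize_token(token):
    """Apply the acronym heuristic, then the mapping table."""
    if DOTTED_ACRONYM.fullmatch(token):
        token = token.replace(".", "")
    # Only tokens listed in the mapping are changed. Everything else keeps
    # its case, so "Fed" (the central bank) and "fed" stay distinct.
    return CANONICAL.get(token.lower(), token)

for raw in ["Forty-two", "Post-colonial", "EBAY", "ebay", "Fed", "fed", "C.A.T."]:
    print(raw, "=>", normalize_token(raw))
```

Leaving unknown tokens untouched is a deliberate choice in this sketch: blanket lowercasing would merge "Fed" and "fed", which may or may not be what the task needs.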

Page 15: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Tokens

STEMMING vs LEMMATIZATION

import spacy
from nltk.stem.porter import PorterStemmer

# 'en' is the model shortcut used on the slide; newer spaCy releases
# expect a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
stemmer = PorterStemmer()

doc = nlp(u'She is an intelligence operative.')

# Print each token next to its lemma (spaCy) and its stem (NLTK Porter)
for word in doc:
    stemmed = stemmer.stem(word.text)
    print(word.text, "LEMMA =>", word.lemma_, "STEM =>", stemmed)

Output:

She LEMMA => -PRON- STEM => she
is LEMMA => be STEM => is
an LEMMA => an STEM => an
intelligence LEMMA => intelligence STEM => intellig
operative LEMMA => operative STEM => oper
. LEMMA => . STEM => .

SpaCy, NLTK

Page 16: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Entities

Novak and Kline, NK,

NYSE:NK, Test Company

June 30, 2017

06/30/2017

30/6/2017

Smith acquires shares of Novak and Kline for

$10.99 per share .

Smith acquires shares of NK for $10.99 per

share .

ORG acquires shares of ORG for $10.99 per share

.
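
The two ideas on this slide, mapping every alias of a company to one label and mapping every date spelling to one form, can be sketched with the standard library alone. The alias list, the list of date formats, and the function names below are assumptions for illustration. In practice the aliases would come from an entity recognizer or a reference table.

```python
import re
from datetime import datetime

# Hypothetical alias list; longer aliases are tried first so that
# "Novak and Kline" is not partially matched by a shorter alias
ORG_ALIASES = ["Novak and Kline", "NYSE:NK", "NK", "Test Company", "Smith"]
ORG_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ORG_ALIASES, key=len, reverse=True))
    + r")\b"
)

# Assumed priority: US month/day order is tried before day/month order,
# so an ambiguous date such as 05/06/2017 is read as May 6
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%d/%m/%Y")

def mask_orgs(text):
    """Replace every known organization alias with the label ORG."""
    return ORG_RE.sub("ORG", text)

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(mask_orgs("Smith acquires shares of NK for $10.99 per share ."))
print([normalize_date(d) for d in ["June 30, 2017", "06/30/2017", "30/6/2017"]])
```

The month-name format relies on the default English locale, and the ambiguity between the two numeric formats cannot be resolved from the string alone, so the priority order is a decision that has to be made per data source.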

Page 17: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Hot or Not

                  REMOVING                           HIGHLIGHTING
WORDS             Emails, dates, URLs, stop words    hotwords
More than WORDS   tables                             Hot patterns

textacy
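
A minimal sketch of the two actions on this slide, removing what is "not" and flagging what is "hot", is shown below using plain regular expressions. The patterns, the stop word list, and the hotword list are deliberately tiny and invented for the example. A library such as textacy offers more complete versions of the removal step.

```python
import re

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Hypothetical, deliberately short lists
STOP_WORDS = {"a", "an", "and", "for", "of", "the"}
HOTWORDS = {"acquire", "acquires", "acquired"}

def remove_noise(text):
    """Drop URLs, email addresses and stop words."""
    text = URL_RE.sub(" ", text)    # URLs first, they may contain '@'
    text = EMAIL_RE.sub(" ", text)
    kept = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(kept)

def has_hotword(text):
    """Return True if any hotword occurs as a whole token."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return any(tok in HOTWORDS for tok in tokens)

print(remove_noise("Contact anna@example.com or see http://example.com for the details"))
print(has_hotword("Smith acquires shares of Novak and Kline"))
```

Removal shrinks the vocabulary the model has to deal with, while a hotword flag can be stored as extra metadata next to the text instead of changing it.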

Page 18: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

In the real world

<p><b>Smith Buys Novak</b></p>

<p></p>

<p>by Anna Smith<p>

<p> LONDON --- Smith Inc. acquires shares for Novak &amp; Kline. for

$10.99/share.</p>

<table style="width:100%">

<tr><th>Col1</th><th>Col2</th> </tr>

<tr><td>data1</td><td>data2</td></tr>

</table>

Page 19: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

IRL

{'title': 'Smith Buys …',
 'original_text': 'LONDON --- Smith..',
 'transformed_text': {
     'text_with_entities': 'LOCATION – ORG acquired …. ',
     'lemmatized': 'Smith Inc acquire share..',
     'has_acquired': True
 },
 'table': '<table>….. </table>'
}
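
A record of this shape keeps the untouched original next to every transformed view of it. The sketch below builds such a record with the standard library only. The `ENTITY_LABELS` table and `build_record` function are hypothetical stand-ins for a real entity recognizer, and the `lemmatized` field is left out because it needs a lemmatizer such as spaCy.

```python
import re

# Hypothetical lookup standing in for the output of an entity recognizer
ENTITY_LABELS = {
    "LONDON": "LOCATION",
    "Smith Inc.": "ORG",
    "Novak & Kline": "ORG",
}

ACQUIRE_RE = re.compile(r"\bacquire[sd]?\b", re.IGNORECASE)

def build_record(title, text, table_html):
    """Bundle the original text with its transformed views."""
    tagged = text
    for surface, label in ENTITY_LABELS.items():
        tagged = tagged.replace(surface, label)
    return {
        "title": title,
        "original_text": text,           # never modified
        "transformed_text": {
            "text_with_entities": tagged,
            "has_acquired": bool(ACQUIRE_RE.search(text)),
        },
        "table": table_html,             # kept as markup, outside the text
    }

record = build_record(
    "Smith Buys Novak",
    "LONDON --- Smith Inc. acquires shares for Novak & Kline for $10.99/share.",
    "<table></table>",
)
print(record["transformed_text"])
```

Storing the original alongside the derived fields makes it possible to rerun or change any single transformation later without going back to the raw source.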

Page 20: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

The General NLP Problem

dog: 3, 2, 1

red coat: 0, 0, 1

😋

😭

Page 21: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

1. Have an intuition on how things work: how algorithms see text

2. Breaking data down: from bytes to documents

3. Keep it simple .. if possible: patterns, normalization, metadata, actions (replace, remove, highlight)

Page 22: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

References and other stuff

◉ Stanford NLP Group

◉ spaCy Documentation

◉ scikit-learn Documentation

◉ The hard knocks of NLP projects

Page 23: PyParis2017 / How to prepare data for NLP, by Loryfel Nunez.pptx

Any questions ?

You can find me at

◉ @lorynyc

◉ [email protected]

Thanks!