Talk To Me

Inna Starikova

Inna Starikova - Talk To Me - 16 March 2015 - Data Science - General Assembly


The Three Perceptual Channels

I SEE what you say.

SCREAMING color

HEATED argument

Visual

Auditory

Kinesthetic / Tactile

Originated from the idea that individuals differ in how they learn. Learning modalities adapted from Barbe, Swassing, and Milone, “Teaching through Modality Strengths: Concepts and Practices” (1979).


The Three Perceptual Channels

VKA AVK


The Six Types of People

Based on how people process information during communication

VAK  VKA  AVK  AKV  KVA  KAV
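The six types are simply every possible ordering of the three channels. A minimal sketch of that enumeration, assuming each type label is read as a ranking of V, A and K (the variable names are illustrative, not from the slides):

```python
from itertools import permutations

CHANNELS = ("V", "A", "K")  # Visual, Auditory, Kinesthetic

# Each ordering of the three channel letters is one perceptual type.
types = ["".join(p) for p in permutations(CHANNELS)]
print(types)  # ['VAK', 'VKA', 'AVK', 'AKV', 'KVA', 'KAV']
```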


Purpose

Identify a person's perceptual type from their written text


Usage

AVK KVA

Media

Family

Influence

Axx Kxx Kxx

A = K = V


Project Plan

1. Collect initial data for model training
2. Analyse the data to apply classes
3. Train the model
4. Test the model on classified data to see the error
5. If needed, update model or classification of initial classes.


NYT - Data Collection

Source: New York Times articles (www.nytimes.com)

Method: Screen scraping


NYT - Data Collection

Daily Front Page Link
http://www.nytimes.com/indexes/YYYY/MM/DD/todayspaper/

Front Page HTML Source
<h3><a href="http://.../article-title.html">Article Title</a></h3>
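The daily link pattern can be filled in for a whole date range. A minimal sketch, assuming the YYYY/MM/DD parts are zero-padded; the function name `front_page_urls` and the sample dates are hypothetical:

```python
from datetime import date, timedelta

# URL pattern taken from the slide; the date is formatted as YYYY/MM/DD.
FRONT_PAGE_URL = "http://www.nytimes.com/indexes/{:%Y/%m/%d}/todayspaper/"

def front_page_urls(start, end):
    """Yield one front-page URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield FRONT_PAGE_URL.format(day)
        day += timedelta(days=1)

urls = list(front_page_urls(date(2015, 3, 1), date(2015, 3, 3)))
for url in urls:
    print(url)
```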


NYT - Data Collection

Libraries:
pandas, requests, BeautifulSoup

Limitations:
~ 200 articles per request
~ 3 to 5 front pages


NYT - Data Collection

1. Use date range for front page link
2. Get daily front page HTML content
3. Extract a list of title href links
4. Use title links to get article HTML content

Total ~ 3500 articles
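Step 3 of the list above, pulling the title links out of the front page HTML, might look roughly like this. The slides used BeautifulSoup; this sketch swaps in the standard-library `html.parser` so it runs without extra packages. The class name, function name and sample HTML are hypothetical:

```python
from html.parser import HTMLParser

class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3> headings."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_depth += 1
        elif tag == "a" and self.h3_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_depth > 0:
            self.h3_depth -= 1

def extract_title_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links

# Made-up sample page: two article titles and one unrelated link.
sample = (
    '<h3><a href="http://example.com/a.html">Article A</a></h3>'
    '<p><a href="http://example.com/nav.html">Not a title</a></p>'
    '<h3><a href="http://example.com/b.html">Article B</a></h3>'
)
links = extract_title_links(sample)
print(links)
```

With BeautifulSoup, a CSS selector such as `soup.select('h3 a')` would be one way to reach the same tags.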


NYT - Data Cleaning

● BeautifulSoup (HTML handling)

    from bs4 import BeautifulSoup
    bs_text = BeautifulSoup(html_text, 'html.parser')
    text_parts = bs_text.find_all(string=True)  # every text node in the page
    text = ' '.join(text_parts)

● Manual HTML string manipulation

● Removing duplicate rows (pd.DataFrame)

    # keep the last row for each href
    df = df.sort_values('href').drop_duplicates(subset='href', keep='last')

● Cleaning empty rows (pd.DataFrame)

    df = df[df.col_name != '']


NYT - Data Cleaning - Stopwords

Natural Language Toolkit: Stopwords

Setup

    from nltk.corpus import stopwords

Cache stopwords (improve processing speed)

    # a set makes each membership test fast
    stopwords = set(stopwords.words('english'))

Normalization

    text = text.lower()

Process

    text_stwrd = ' '.join(w for w in text.split() if w not in stopwords)


NYT - Classification - Vocabulary

Internet sources (~10 words)
http://tribehr.com/blog/communication-styles-auditory-visual-kinesthetic

Use synonyms (~ 60 words)
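Once each channel has a word list, a text can be scored by counting vocabulary hits per channel. A minimal sketch under that assumption; the word lists below are a made-up mini-vocabulary for illustration, not the project's actual ~60 words, and `channel_counts` is a hypothetical name:

```python
from collections import Counter

# Hypothetical mini-vocabulary, for illustration only.
VOCABULARY = {
    "V": {"see", "look", "view", "bright", "picture"},
    "A": {"hear", "listen", "sound", "tell", "loud"},
    "K": {"feel", "touch", "grasp", "hold", "warm"},
}

def channel_counts(text):
    """Count how many words in text belong to each channel's vocabulary."""
    words = Counter(text.lower().split())
    return {
        channel: sum(words[w] for w in vocab)
        for channel, vocab in VOCABULARY.items()
    }

counts = channel_counts("I see the picture and I feel I can grasp it")
print(counts)  # {'V': 2, 'A': 0, 'K': 2}
```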


NYT - Classification - Filtering

mean (V, K, A) > 5
std (V, K, A) > 2

V    K    A    mean                  std
V1   K1   A1   mean1 (V1, K1, A1)    std1 (V1, K1, A1)
V2   K2   A2   mean2 (V2, K2, A2)    std2 (V2, K2, A2)
               mean (mean1, mean2)   std (std1, std2)

df['mean'].describe()
mean    6.359088
std     3.401020
25%     4.000000
50%     6.333333
75%     8.666667

df['std'].describe()
mean    2.015510
std     1.255688
25%     1.000000
50%     1.732051
75%     2.645751
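The filter keeps only texts with enough vocabulary hits (mean) and enough contrast between channels (std). A minimal sketch of the rule, assuming the sample standard deviation (the pandas default); `keep_record` and the example counts are hypothetical:

```python
from statistics import mean, stdev

MEAN_MIN = 5  # thresholds from the slide
STD_MIN = 2

def keep_record(v, k, a):
    """Return True when the three channel counts pass both thresholds."""
    counts = (v, k, a)
    # stdev() is the sample standard deviation (divides by n - 1)
    return mean(counts) > MEAN_MIN and stdev(counts) > STD_MIN

print(keep_record(12, 7, 3))  # many hits, clear contrast
print(keep_record(6, 6, 5))   # enough hits, but channels nearly equal
print(keep_record(3, 1, 0))   # too few hits
```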


NYT - Classification

df['senses'].value_counts()

VKA    332
KVA    308
VAK    243
AVK    211
KAV    118
AKV     99
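The slides do not show how the `senses` label is built. One plausible reading, stated here as an assumption, is that the channel letters are ordered from most to least frequent. A sketch of that idea, with `perceptual_type` as a hypothetical helper:

```python
def perceptual_type(counts):
    """Order channel letters from most to least frequent, e.g. 'VKA'."""
    # Ties keep the dictionary's insertion order, because sorted() is stable.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return "".join(ranked)

label = perceptual_type({"V": 9, "K": 6, "A": 2})
print(label)  # VKA
```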


NYT - Preliminary Results

Cross Validation Scores (0.2 - 0.4)

[ 0.38905775 0.34556575 0.31498471]

Confusion Matrix

test_df.groupby(['pred_category', 'senses']).senses.count().unstack()


NYT - Preliminary Results

senses          AKV    AVK    VAK    KAV    VKA    KVA
pred_category
AKV             LOW    HIGH   HIGH   HIGH   HIGH   HIGH
AVK             HIGH   LOW    HIGH   HIGH   HIGH   HIGH
VAK             HIGH   HIGH   LOW    HIGH   HIGH   HIGH
KAV             HIGH   HIGH   HIGH   LOW    HIGH   HIGH
VKA             HIGH   HIGH   HIGH   HIGH   GOOD   HIGH
KVA             HIGH   HIGH   HIGH   HIGH   HIGH   GOOD


NYT - Vocabulary Update

    import nltk

    def nlp_text(mytext):
        # lower-case, tokenize, then keep alphabetic non-stopword tokens
        tokens = nltk.word_tokenize(mytext.lower())
        wrds = [w for w in tokens if w not in stopwords and w.isalpha()]
        return wrds

NLTK frequency distribution

    fd = nltk.FreqDist(nlp_text(text))
    fd_voc = fd.most_common()

    [('one', 31), ('us', 16), ...


NYT - Updated Results

df['senses'].value_counts()

VK    ~90%
KV    ~10%


How is the project?

Dead OR Alive


PLS - Data Collection

Source: American Rhetoric political speeches (www.americanrhetoric.com)

Method: Screen scraping


PLS - Results

Total of ~ 1000 records

VK    ~25%
KV    ~75%


Combination of NYT & PLS

Total number of records
~ 800

Updated filtering
std (V, K) / mean (V, K) > 0.25

Model
Logistic Regression

Cross validation scores
[ 0.94883721 0.99069767 0.98139535]
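The updated filter divides the spread of the two counts by their mean, so the threshold no longer depends on text length. A minimal sketch, again assuming the sample standard deviation; `keep_record_vk` and the example counts are hypothetical:

```python
from statistics import mean, stdev

RATIO_MIN = 0.25  # threshold from the slide

def keep_record_vk(v, k):
    """Keep a record when std(V, K) / mean(V, K) exceeds the threshold."""
    counts = (v, k)
    m = mean(counts)
    if m == 0:
        return False  # no vocabulary hits at all, so the ratio is undefined
    return stdev(counts) / m > RATIO_MIN

print(keep_record_vk(10, 6))    # clear preference for V
print(keep_record_vk(100, 60))  # same proportions, longer text, same verdict
print(keep_record_vk(10, 9))    # V and K nearly balanced
```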


Combination of NYT & PLS

Confusion matrix

senses          KV     VK
pred_category
KV              94      1
VK              10    110
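Accuracy, precision and recall can be read straight off this matrix. The sketch below does that arithmetic, assuming rows are predicted labels and columns are true labels, as the `pred_category` / `senses` headers suggest. The accuracy comes to 204/215, about 0.9488, which equals the first cross validation score, so the matrix may well come from that fold.

```python
# Rows are predicted labels, columns are true labels.
confusion = {
    "KV": {"KV": 94, "VK": 1},
    "VK": {"KV": 10, "VK": 110},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

def precision(label):
    """Share of records predicted as label that truly are label."""
    return confusion[label][label] / sum(confusion[label].values())

def recall(label):
    """Share of true label records that were predicted as label."""
    return confusion[label][label] / sum(row[label] for row in confusion.values())

print(f"accuracy = {correct}/{total} = {accuracy:.8f}")
for label in confusion:
    print(label, f"precision={precision(label):.3f}", f"recall={recall(label):.3f}")
```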


Criticism

Problems:
● Editors’ corrections introduce bias
● The model may only be detecting the difference between NYT and PLS style

Considerations:
● NYT articles from different categories
● Political speeches from different politicians
● Similar people select similar fields
● We learn better from people of a similar type
● Model with ngram_range = 1, i.e. single words only (less style dependency)


Explanation - NYT Results

Successful Journalist:
● Reading a lot (V)
● Typing, taking notes (V)
● Use diagrams, sketches (V)
● Watch, observe (V)
● Fast analysis (V)
● Active (K)
● Why? Why not? What if? (K)


Explanation - PLS Results

Successful Politician:
● Learn from activity (K)
● Address feelings (K)
● Active (K)
● Why and how things work (K)
● Attract active people (K)
● Fast analysis (V)
● Good memory (V)
