Talk To Me
Inna Starikova
Inna Starikova - Talk To Me - 16 March 2015 - Data Science - General Assembly
The Three Perceptual Channels
I SEE what you say.
SCREAMING color
HEATED argument
Visual
Auditory
Kinesthetic / Tactile
Originates from the idea that individuals differ in how they learn. Learning modalities adapted from Barbe, Swassing, and Milone, “Teaching through Modality Strengths: Concepts and Practices” (1979).
The Three Perceptual Channels
VKA AVK
The Six Types of People
Based on how people process information during communication
VAK
VKA AVK
AKV
KVA KAV
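The six types listed above are the six possible orderings of the three channels. A minimal sketch, assuming each type is simply a permutation of the letters V, A and K (the variable names are illustrative):

```python
from itertools import permutations

# Each perceptual type is one ordering of the three channels.
CHANNELS = "VAK"
types = ["".join(p) for p in permutations(CHANNELS)]
print(types)
```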
Purpose
Identify perceptual type by written text
Usage
AVK KVA
Media
Family
Influence
Axx Kxx Kxx
A = K = V
Project Plan
1. Collect initial data for model training
2. Analyse the data to apply classes
3. Train the model
4. Test the model on classified data to see the error
5. If needed, update model or classification of initial classes.
NYT - Data Collection
Source: New York Times articles (www.nytimes.com)
Method: Screen scraping
NYT - Data Collection
Daily Front Page Link
http://www.nytimes.com/indexes/YYYY/MM/DD/todayspaper/
Front Page HTML Source
<h3><a href="http://.../article-title.html">Article Title </a></h3>
NYT - Data Collection
Libraries:
pandas, requests, BeautifulSoup
Limitations:
~ 200 articles per request
~ 3 to 5 front pages
NYT - Data Collection
1. Use date range for front page link
2. Get daily front page HTML content
3. Extract a list of title href links
4. Use title links to get article HTML content
Total ~ 3500 articles
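The collection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the talk: it covers only step 1 (building the daily front page links) and step 3 (pulling href values out of `<h3><a ...>` headings), and it swaps in the standard library's `html.parser` for BeautifulSoup so it runs offline without extra dependencies. The names `front_page_urls`, `TitleLinkParser` and `title_links` are hypothetical; steps 2 and 4 would fetch each URL, for example with `requests.get(url).text`.

```python
from datetime import date, timedelta
from html.parser import HTMLParser

# URL pattern taken from the slides; YYYY/MM/DD is filled in per day.
FRONT_PAGE = "http://www.nytimes.com/indexes/{:%Y/%m/%d}/todayspaper/"

def front_page_urls(start, end):
    """Yield one front page URL per day from start to end (inclusive)."""
    day = start
    while day <= end:
        yield FRONT_PAGE.format(day)
        day += timedelta(days=1)

class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <h3> headings."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

def title_links(html):
    """Return the article links found in one front page's HTML."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links

# Usage with a made-up HTML fragment in the shape shown on the slide.
urls = list(front_page_urls(date(2015, 3, 1), date(2015, 3, 3)))
sample = ('<h3><a href="http://example.com/a.html">A</a></h3>'
          '<p><a href="http://example.com/skip.html">x</a></p>')
print(urls[0])
print(title_links(sample))
```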
NYT - Data Cleaning
● BeautifulSoup (HTML handling)
    from bs4 import BeautifulSoup
    bs_text = BeautifulSoup(html_text, 'html.parser')
    text_parts = bs_text.find_all(text=True)  # every text node in the page
    text = ' '.join(text_parts)
● Manual HTML string manipulation
● Removing duplicate rows (pd.DataFrame)
    # sort_values / keep='last' replace the older df.sort / take_last=True
    df = df.sort_values('href').drop_duplicates(subset='href', keep='last')
● Cleaning empty rows (pd.DataFrame)
    df = df[df.col_name != '']
NYT - Data Cleaning - Stopwords
Natural Language Toolkit: Stopwords
Setup
    from nltk.corpus import stopwords
Cache stopwords (improve processing speed)
    # rebinds the name to a plain set; set membership tests are fast
    stopwords = set(stopwords.words('english'))
Normalization
    text = text.lower()
Process
    text_stwrd = ' '.join([w for w in text.split() if w not in stopwords])
NYT - Classification - Vocabulary
Internet sources (~10 words)
http://tribehr.com/blog/communication-styles-auditory-visual-kinesthetic
Use synonyms (~ 60 words)
NYT - Classification - Filtering
mean(V, K, A) > 5
std(V, K, A) > 2

V    K    A    mean                  std
V1   K1   A1   mean1(V1, K1, A1)     std1(V1, K1, A1)
V2   K2   A2   mean2(V2, K2, A2)     std2(V2, K2, A2)
               mean(mean1, mean2)    std(std1, std2)

df['mean'].describe()
mean 6.359088
std 3.401020
25% 4.000000
50% 6.333333
75% 8.666667

df['std'].describe()
mean 2.015510
std 1.255688
25% 1.000000
50% 1.732051
75% 2.645751
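The filtering rule above can be illustrated with a small sketch. It assumes that V, K and A are per-article counts of vocabulary words for each channel, that the `senses` label orders the channels from most to least frequent, and that the standard deviation is the sample standard deviation. The word lists and function names are hypothetical placeholders, since the talk's actual vocabulary is not shown.

```python
from statistics import mean, stdev

# Hypothetical mini-vocabularies; the talk's actual word lists are not shown.
VOCAB = {
    "V": {"see", "look", "bright", "picture", "view"},
    "K": {"feel", "touch", "grasp", "warm", "hold"},
    "A": {"hear", "listen", "sound", "loud", "tell"},
}

def channel_counts(text):
    """Count how many words of the text fall into each channel vocabulary."""
    words = text.lower().split()
    return {ch: sum(w in vocab for w in words) for ch, vocab in VOCAB.items()}

def senses_label(counts):
    """Order channels from most to least frequent, e.g. 'VKA'."""
    return "".join(sorted(counts, key=lambda ch: -counts[ch]))

def passes_filter(counts, min_mean=5, min_std=2):
    """Keep an article only if it has enough channel words and enough spread."""
    values = list(counts.values())
    return mean(values) > min_mean and stdev(values) > min_std

counts = {"V": 9, "K": 6, "A": 3}
print(senses_label(counts), passes_filter(counts))
```

Articles with few channel words, or with nearly equal counts across channels, fail the filter and get no label.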
NYT - Classification
df['senses'].value_counts()
VKA 332
KVA 308
VAK 243
AVK 211
KAV 118
AKV 99
NYT - Preliminary Results
Cross Validation Scores (0.2 - 0.4)
[0.38905775 0.34556575 0.31498471]
Confusion Matrix
test_df.groupby(['pred_category', 'senses']).senses.count().unstack()
NYT - Preliminary Results
senses AKV AVK VAK KAV VKA KVA
pred_category
AKV LOW HIGH HIGH HIGH HIGH HIGH
AVK HIGH LOW HIGH HIGH HIGH HIGH
VAK HIGH HIGH LOW HIGH HIGH HIGH
KAV HIGH HIGH HIGH LOW HIGH HIGH
VKA HIGH HIGH HIGH HIGH GOOD HIGH
KVA HIGH HIGH HIGH HIGH HIGH GOOD
NYT - Vocabulary Update
    import nltk

    def nlp_text(mytext):
        tokens = nltk.word_tokenize(mytext.lower())
        # keep alphabetic tokens that are not stopwords
        wrds = [w for w in tokens if w not in stopwords and w.isalpha()]
        return wrds
NLTK frequency distribution
    fd = nltk.FreqDist(nlp_text(text))
    fd_voc = fd.most_common()
    [('one', 31), ('us', 16), ...
NYT - Updated Results
df['senses'].value_counts()
VK ~90%
KV ~10%
How is the project?
Dead OR Alive
PLS - Data Collection
Source: American Rhetoric political speeches (www.americanrhetoric.com)
Method: Screen scraping
PLS - Results
Total of ~ 1000 records
VK ~25%
KV ~75%
Combination of NYT & PLS
Total number of records
~ 800
Updated filtering
std(V, K) / mean(V, K) > 0.25
Model
Logistic Regression
Cross validation scores
[0.94883721 0.99069767 0.98139535]
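A setup of this kind might look like the sketch below. The slides name logistic regression and, later, `ngram_range`, but not the vectorizer, so `CountVectorizer` is an assumption here, as is the use of a scikit-learn pipeline. The twelve toy sentences are invented for illustration and are not from the NYT or PLS data, so the scores they produce say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (invented for illustration, not from the talk's corpus).
texts = [
    "see the bright picture", "look at the clear view", "watch the colorful scene",
    "a vivid image appears", "observe the shining light", "the view looks bright",
    "feel the warm touch", "grasp the solid handle", "hold on and push hard",
    "a rough heavy grip", "touch the soft surface", "feel the firm pressure",
]
labels = ["VK"] * 6 + ["KV"] * 6

# Unigram bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LogisticRegression(),
)

# Three folds, matching the three scores reported on the slide.
scores = cross_val_score(model, texts, labels, cv=3)
print(scores)
```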
Combination of NYT & PLS
Confusion matrix
senses KV VK
pred_category
KV 94 1
VK 10 110
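As a worked check on the matrix above, accuracy, precision and recall can be computed directly from the four counts. This sketch assumes rows are predicted categories and columns are the assigned `senses` labels, which is how a `groupby(['pred_category', 'senses'])` followed by `unstack()` would lay them out.

```python
# Counts copied from the confusion matrix: rows are predictions, columns are true labels.
confusion = {
    "KV": {"KV": 94, "VK": 1},
    "VK": {"KV": 10, "VK": 110},
}
labels = ["KV", "VK"]

total = sum(confusion[p][t] for p in labels for t in labels)
correct = sum(confusion[c][c] for c in labels)
accuracy = correct / total

def precision(cls):
    """Share of records predicted as cls that really are cls."""
    return confusion[cls][cls] / sum(confusion[cls].values())

def recall(cls):
    """Share of records that really are cls and were predicted as cls."""
    return confusion[cls][cls] / sum(confusion[p][cls] for p in labels)

print(f"accuracy = {accuracy:.4f}")
for cls in labels:
    print(f"{cls}: precision = {precision(cls):.4f}, recall = {recall(cls):.4f}")
```

The accuracy works out to 204 / 215, about 0.9488, which coincides with the first cross validation score listed for the combined data set.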
Criticism
Problems:
● Editors’ corrections introduce bias
● The classes may differ only in NYT versus PLS style
Considerations:
● NYT articles from different categories
● Political speeches from different politicians
● Similar people select similar fields
● We learn better from people of a similar type
● Model with ngram_range = 1 (less style dependency)
Explanation - NYT Results
Successful Journalist:
● Reading a lot (V)
● Typing, taking notes (V)
● Use diagrams, sketches (V)
● Watch, observe (V)
● Fast analysis (V)
● Active (K)
● Why? Why not? What if? (K)
Explanation - PLS Results
Successful Politician:
● Learn from activity (K)
● Address feelings (K)
● Active (K)
● Why and how things work (K)
● Attract active people (K)
● Fast analysis (V)
● Good memory (V)