Text analytics intro
Intro 2 text analytics | Ben Taylor @bentaylordata
Text Analytics Are Awesome!
Thank you to our Sponsors!
HIREVUE | TALENT INTERACTION
Agenda
1 Text handling, introduction
2 Sentiment
3 SPAM
4 Levenshtein distance (word, sentence, cloud)
5 Map Reduce / Clustering
6 Interview text analytics
Text handling
Input not expected?
[Diagram: Input → Model (M) → Output]
[Diagram: Input → Model (M)]
[Diagram: Input → Model (M) → Output, Stderr: "You're an idiot & I don't like you anymore"]
Need to map unstructured text to a summary metric
Sentiment
How are you feeling?
Let’s make this easy.
Problem statement: Expletives + @skullcandy mention? Good or bad?
Negative Sentiment
1048940088: "I've got two pairs of Ink'd earbuds by @Skullcandy and they both broke in two weeks. I $#@&ing hate @Skullcandy! #$#@&You"
1054044204: "$#@& only one headphone stopped working stupid $#@&ing headphones y is it only one headphone i blame you @skullcandy"
1376767884: "@skullcandy never buyin another pair of skull candy headphones this is the fourth pair in the last 2 months that $#@&ed up"
141343855: "My headphones blew $#@& you skullcandy -___-"
16352011: "BAHHHHH My SkullCandys are $#@&ing up AGAIN!"
1376767884: "@skullcandy $#@& skullcandy"
Positive Sentiment
161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"
1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"
1117713458: "@skullcandy $#@&in bass is badass"
1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"
1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"
132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."
Neutral Sentiment
1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"
Conclusion
Sentiment Classification   Count
Negative                   6
Positive                   6
Neutral                    1
46% chance tweet is negative (6 of 13), now what?
Welcome to the majority of the sentiment solutions on the market: single-word naïve Bayesian classification
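For readers who have not seen it, here is a minimal sketch of single-word naive Bayesian classification. It assumes whitespace tokenization and add-one smoothing. The training sentences are invented toy data, not the tweets from the slides, and the function names are hypothetical.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per class and documents per class from (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the class with the highest log posterior, one word at a time."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the class.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set (assumption, for illustration only).
TRAINING = [
    ("negative", "both pairs broke in two weeks"),
    ("negative", "one headphone stopped working"),
    ("positive", "the bass is truly amazing"),
    ("positive", "these headphones have awesome bass"),
]

word_counts, doc_counts = train(TRAINING)
label = classify("my headphone broke", word_counts, doc_counts)
print(label)
```

Because every word is scored independently, an expletive that appears equally often in both classes contributes nothing, which is the weakness the tweets above illustrate.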
Positive Sentiment (second pass)
161547390: "Getting some skullcandy fix's. #tight #skullcandy #$#@&ingpumped"
1306207039: "@skullcandy @VegasJarhead @justine_mom $#@& yeah!"
1117713458: "@skullcandy $#@&in bass is badass"
1117713458: "@skullcandy ur headphones are bad ass and have awsome $#@&in bass"
1086228384: "Just bough a pair of Skullcandy supreme sound Hesh's $#@&ING AWSOME!!! the bass is truly amazing :)"
132303540: "@K$#@&INGP I thought you were a man not a pussy. Try Skullcandy. Hit me back and I'll hook you up."
1104061464: "@autoerotique @skullcandy #crushers First pair died after 2 days. Day 2 for new pair. The Alarm is thrashing my head, un$#@&me these rock"
Conclusion
Sentiment Classification   Count
Negative                   6
Positive                   ~0
Neutral                    ~0
~100% chance tweet is negative with tuple assistance. How to find complex tuples automatically!?
Bayesian bootstrap matrix
[Figure: matrix with both axes labeled "Unique words in training cloud"]
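The slides do not say how the matrix is built. The sketch below is only one plausible reading, not the author's method: count every pair of unique words that co-occur in a tweet, per class, and rank pairs by a smoothed negative-to-positive ratio. The function names and the toy tweets are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_counts(texts):
    """Count unordered pairs of unique words that co-occur in the same text."""
    counts = Counter()
    for text in texts:
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

def rank_pairs(negative_texts, positive_texts):
    """Rank word pairs by how much more often they occur in negative texts."""
    neg = pair_counts(negative_texts)
    pos = pair_counts(positive_texts)
    ratios = {}
    for pair in set(neg) | set(pos):
        # Add-one smoothing keeps the ratio finite for one-sided pairs.
        ratios[pair] = (neg[pair] + 1) / (pos[pair] + 1)
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Toy data (assumption, for illustration only).
negative = ["headphones broke again", "my headphones broke"]
positive = ["headphones are awesome"]

ranked = rank_pairs(negative, positive)
print(ranked[0])
```

The word "headphones" alone appears in both classes and says little, while the pair ("broke", "headphones") appears only in the negative texts and ranks first.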
Basic sentiment output
Credit: Ben Peters
Keyword    Negative   Positive
warranty   28.7       1
cant       11.8       1
back       11.8       1
break      11.8       1
after      11.1       1
what       9.1        1
never      9.1        1
Don't      9.1        1
second     8.4        1
side       8.4        1
SPAM
I can't handle this
Lost future customer
SPAM examples:
>80%
SPAM list
Keyword           Spam   Good
@nikesb           52.0   1
@lrgskate         52.0   1
live              34.0   1
know              1      28.8
have              1      22.3
pair              1      16.3
earbud            16.1   1
Non-ascii-chars   12.4   1
some              1      11.9
check             1      11.6
Credit: Ben Peters
Training... Where do you get your training set?
What about @ and # tags? Misspellings? SPAM?
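A minimal tokenizer sketch for the tag question, assuming @mentions and #hashtags should survive as single tokens. It also adds a flag for non-ASCII characters, which the SPAM list above shows as a feature. The regular expression and function names are assumptions, not the tooling used for the talk.

```python
import re

# Optional leading @ or #, then word characters, then an optional
# apostrophe suffix so "don't" stays one token.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?")

def tokenize(tweet):
    """Lowercase tokens, keeping @mentions and #hashtags intact."""
    return [token.lower() for token in TOKEN_RE.findall(tweet)]

def has_non_ascii(tweet):
    """True if any character falls outside the 7-bit ASCII range."""
    return any(ord(ch) > 127 for ch in tweet)

tokens = tokenize("@skullcandy ur headphones are bad ass #tight")
print(tokens)
print(has_non_ascii("naïve"))
```

Misspellings such as "awsome" pass through unchanged here; mapping them to a canonical spelling is where an edit distance becomes useful.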
Manual trainer
http://54.186.199.209/
Credit: Ben Peters
Levenshtein
Now things are getting interesting
The things we take for granted
You type: Awsome
Computer: It's actually spelled Awesome
① kitten → sitten (substitution of "s" for "k")
② sitten → sittin (substitution of "i" for "e")
③ sittin → sitting (insertion of "g" at the end)
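The three steps above give a distance of 3. A short sketch of the standard dynamic-programming computation, keeping two rows of the table at a time (the function name is a placeholder):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Awsome", "Awesome"))  # 1
```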
Levenshtein word level
Ref: I am going skiing tomorrow
Hyp: I am going skiing on Saturday
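The same recurrence works on words instead of characters. The sketch below assumes whitespace tokenization and defines word error rate as edits divided by reference length, a common definition that may differ from the `wer` function used elsewhere in the slides.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance where the unit of editing is a whole word."""
    ref_words, hyp_words = ref.split(), hyp.split()
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def word_error_rate(ref, hyp):
    """Word-level edits per reference word (assumed definition)."""
    return word_edit_distance(ref, hyp) / len(ref.split())

ref = "I am going skiing tomorrow"
hyp = "I am going skiing on Saturday"
print(word_edit_distance(ref, hyp))  # 2: one substitution, one insertion
print(word_error_rate(ref, hyp))
```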
Levenshtein word-cloud level
Ref: alphanumeric_sort(word_cloud_1)
     alphanumeric_sort(unique(word_cloud_1))
Hyp: alphanumeric_sort(word_cloud_2)
     alphanumeric_sort(unique(word_cloud_2))
>> wer(str1,str1)
ans = 0
>> wer(strjoin(sort(strsplit(str1,' ')),' '),str1)
ans = 15
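The two `wer` calls above show why the sort matters: the same words in a different order are 15 edits apart. A Python sketch of the cloud-level comparison, assuming lowercase whitespace tokens; the function names are hypothetical and this is not the `wer` function from the slides.

```python
def sequence_distance(a, b):
    """Levenshtein distance between two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cloud_distance(text1, text2):
    """Compare word clouds as sorted lists of unique words, ignoring order and repeats."""
    cloud1 = sorted(set(text1.lower().split()))
    cloud2 = sorted(set(text2.lower().split()))
    return sequence_distance(cloud1, cloud2)

# Same words, different order: nonzero at word level, zero at cloud level.
print(sequence_distance("b a c".split(), "c b a".split()))  # 2
print(cloud_distance("b a c", "c b a"))                     # 0
```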
MapReduce
Great for text processing, i.e. word counts
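A single-process sketch of the word-count pattern, to show the map, shuffle and reduce steps. A real MapReduce framework distributes these phases across machines; the function names here are placeholders.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

counts = word_count(["my headphones broke", "my bass"])
print(counts)
```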
CLUSTERING
Now things are getting interesting
Group of tweets? Once we have categorized tweets we can build word clouds!!!
Category A (could be negative sentiment, low selling areas, etc.)
Category B (could be positive sentiment, high selling areas, etc.)
[Figure: word clouds for Category A and Category B, labeled "Levenshtein wordcloud similarity"]
Cluster 1 example: Camping, Virgin, Gaming, Battlefield
Cluster 2 example: Skiing, winter, Stringray
Cluster 3 example: MMA, Boxing, Skateboarding
Twitter Surgery
[Figure: subtraction diagram ("-" and "="); operands not recoverable]
Training a blacklist filter
Acting…
Getting…
Holding…
Going…
Brings…
Turning…
Blacklist dictionary
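The slides do not describe how the blacklist is trained. The sketch below only shows the filtering step, using the six words listed above as the dictionary. The function name and the punctuation handling are assumptions.

```python
# Blacklist taken from the words listed on the slide, lowercased.
BLACKLIST = {"acting", "getting", "holding", "going", "brings", "turning"}

def strip_blacklisted(text, blacklist=BLACKLIST):
    """Drop words found in the blacklist, ignoring case and trailing punctuation."""
    kept = [word for word in text.split()
            if word.lower().strip(".,!?") not in blacklist]
    return " ".join(kept)

print(strip_blacklisted("Going skiing tomorrow"))
```

Filtering such words before building word clouds keeps generic verbs from dominating the similarity between clusters.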