Social Media Analytics: the Value Proposition
Rohini K. Srihari
KDD 2010 Workshop on Social Media Analytics
July 25, 2010
Outline
What is Social Media?
Value Proposition: Why mine social media?
• Business Analytics
• Counterterrorism
Challenges
Technology, Challenges
Multilingual social media mining
Future
Data/Text Mining
Analyze observational data to find unsuspected relationships, and summarize the data in novel ways that are understandable and useful to the data owner
Information Discovery: non-trivial, implicit, previously unknown relationships
Example of a trivial relationship: those who are pregnant are female
Summarize: as patterns and models (usually probabilistic)
Usefulness: meaningful, leading to some advantage, usually economic
Analysis: automatic or semi-automatic process (knowledge extraction)
Extracting useful information from large data sets
Market Size
• Business Analytics market projected to be $28 billion in 2011 (IDC Report)
• Social Analytics taking leading position of interest within organizations
Integrating Social Media Analytics and Business Intelligence
Source: HCL India
Customer Relationship Management
Data sources are primarily internal
• Call center transcripts
• E-mail
• Customer feedback
Cost avoidance
• Product exchange mitigation
• Early warning detection on new products
Increase in customer satisfaction and loyalty
Insight towards new products, product features
Identification of possible marketing opportunities
e-Service Chat Monitoring
Operator: How can I assist you today?
Customer: I need help with operating your coffee maker I bought from Amazon.com yesterday.
Operator: Certainly. What problem are you facing?
Customer: I fill in the coffee powder, water, and then press the red button on the side, and nothing happens.
Operator: The red button enables the ‘clean coffee maker’ process. You will need to use the white knob on the other side to brew coffee.
Customer: I see.
Customer: BTW, in the Nespresso cappuccino machine I recently bought, it was the red button for start.
Operator: Is there anything else I can assist with today? [SEND]
Alert: COMPETITOR PRODUCT MENTION
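The alert in the chat above could be raised by something as simple as a watch-list match on customer turns. The sketch below is only an illustration under that assumption: the function name, the watch list, and the turn format are hypothetical, and the slides do not say how the monitoring system actually detects mentions.

```python
import re

# Hypothetical watch list; a real deployment would load this from a product catalog.
COMPETITOR_TERMS = ["Nespresso"]

def competitor_alerts(turns, terms=COMPETITOR_TERMS):
    """Return (turn_index, term) pairs for customer turns that mention a watched term."""
    patterns = [(t, re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)) for t in terms]
    alerts = []
    for i, (speaker, text) in enumerate(turns):
        if speaker != "Customer":
            continue  # only customer messages are monitored in this sketch
        for term, pattern in patterns:
            if pattern.search(text):
                alerts.append((i, term))
    return alerts

chat = [
    ("Operator", "How can I assist you today?"),
    ("Customer", "I need help with operating your coffee maker."),
    ("Customer", "BTW, in the Nespresso cappuccino machine I recently bought, "
                 "it was the red button for start."),
]
alerts = competitor_alerts(chat)
for index, term in alerts:
    print(f"Alert: COMPETITOR PRODUCT MENTION ({term}) in turn {index}")
```

A keyword list misses misspellings and indirect references, which is where the entity extraction discussed later in the talk would come in.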
Reputation Management
Data sources are primarily external, e.g.
• www.youtube.com
• www.epinions.com
• tripadvisor.com (travel-related website)
Consumer Brand Analytics
• What are people saying about our brand?
Marketing Communications
• Significant spending on marketing, advertising: companies trying to position their products
• Brand analytics helps to determine whether such campaigns are effective
Mining Product Reviews
Application is Industrial Design
• Automatically mine product reviews for information on product features, new requests, etc.
• Focus on wheelchairs
Features Extracted
• Easy to use
• Fit into a car
• Comfortable chair
• Light weight
• Convenient to fold
• Sturdy
• Good price
Viral Marketing
Jure Leskovec (Stanford), Lada Adamic (U of Michigan), Bernardo A. Huberman (HP Labs)
Personalized recommendations
Cross-selling: “people who bought x also bought y”
Collaborative filtering: “based on ratings of users like you…” (Delicious, Digg.com)
Viral marketing
68% of consumers consult friends and family before purchasing home electronics (Burke 2003)
Success rate: # of purchases following a recommendation / # recommenders
Books overall have a 3% success rate
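As a worked instance of the success-rate definition above, the ratio can be computed directly. The counts below are made up purely to reproduce a 3% rate; they are not figures from the cited study.

```python
def success_rate(purchases, recommenders):
    """Purchases that followed a recommendation, divided by the number of recommenders."""
    if recommenders == 0:
        raise ValueError("success rate is undefined without recommenders")
    return purchases / recommenders

# Illustrative counts only: 3 purchases across 100 recommenders.
rate = success_rate(3, 100)
print(f"{rate:.0%}")
```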
500 million active users!
▪ More than 20 million users update their status at least once each day
▪ More than 850 million photos uploaded to the site each month
▪ >1 billion pieces of content (web links, blog posts, photos, etc.) shared each week
Many different groups clamoring for data and text analytics:
▪ FB Engineers
▪ Advertisers
▪ Page owners
▪ Platform/Connect developers
▪ Marketers
▪ Academics
An aside: Social Media Marketing
http://www.socialmediaexaminer.com/new-studies-show-value-of-social-media/
Lead Generation
Breakdown of respondents’ top benefits of social networking:
• 50%: Generating leads
• 45%: Keeping up with the industry
• 44%: Monitoring online conversation
• 38%: Finding vendors/suppliers
Online Forum Users Are Enthusiastic Brand Advocates
• 79.2% of forum contributors help a friend or family member make a decision about a product purchase, compared with 47.6% of non-contributors and 53.8% overall.
• 65% of forum contributors share advice (offline and in person) based on information that they’ve read online, compared with 35% of non-contributors and 40.8% overall.
• 57.7% of forum contributors proactively recommend someone make a particular purchase, compared with 16.9% of non-contributors and 24.9% overall.
Only 47% of Companies Experimenting With Social Media
• Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000 companies will manage an online community.
• ComBlu’s study, The State of Online Branded Communities, shows that most companies do not understand how to engage within online communities and have no real idea of what their customers want on these sites.
Citizen Response
E-Rulemaking
• The use of digital technologies by government agencies in rulemaking and decision-making processes
• Solicit citizen feedback on bills being debated in Congress
• What new issues are being raised, what aspects of a bill are popular or unpopular
• Better to mine social media than to use focus groups?
Political Campaigns
• Why do people support a candidate: is it really based on issues?
Use Case: Understanding and Visualizing Consumer Responses
Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping
Twitter: Real-Time Citizen Journalism
• Mumbai terror attack regarded as coming of age of Twitter
• citizen journalism provided more valuable information than wire services, broadcast news
• information about places to avoid, well being of relatives, friends, etc.
• many redundant posts, users have to wade through hundreds of posts to locate useful information
• Goal: to mine this data in real-time and produce well organized summaries
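One small step toward the summarization goal above is dropping near-duplicate posts such as retweets. The sketch below uses Jaccard similarity over word sets, which is a stand-in technique chosen for illustration rather than the method used in the talk; the threshold and the sample posts are assumptions.

```python
import re

def tokens(text):
    """Lowercased word set of a post; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(posts, threshold=0.6):
    """Keep a post only if it is not too similar to any post already kept."""
    kept, kept_tokens = [], []
    for post in posts:
        t = tokens(post)
        if all(jaccard(t, seen) < threshold for seen in kept_tokens):
            kept.append(post)
            kept_tokens.append(t)
    return kept

# Invented sample posts, not real tweets.
posts = [
    "Avoid the station area, roads are closed",
    "RT avoid the station area roads are closed",
    "Blood donors needed at the city hospital",
]
unique_posts = drop_near_duplicates(posts)
```

Comparing every new post against every kept post is quadratic, so a real-time system would need an approximate index; this version is only meant to show the idea.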
Law Enforcement, Homeland Security
• Facebook
  • gang members frequently boast about their activities on their Facebook pages
• Chat rooms
  • stalkers, pedophiles
• Twitter
  • protest rallies being planned
  • who, what, where, when
• Craigslist
G20 Summit Protest
Human Behaviour Analysis
Process social media content, provide tools for analysts to:
Identify social networks: groups, members
Identify topics of discussion and sentiment
• E.g. angry at govt., wanting retaliation, peacemakers
• Thought influencers
Identify social goals through analysis of verbal communication
• Manipulation: Persuasion, threats, coercion
• Religious supremacy: religious analogues
• recruitment
Social Media Content
Link Diagrams
Predictive Modeling
Analyzing Social Media Data
Content Analysis
• Text analysis, multimedia analysis
Structure Analysis
Usage Analysis
• Search engine optimization
• What keywords are driving customers to your site, competitor sites
• Query logs, site traffic
Ideally combine all three of these!
Solution Framework
Mark Logic, Oracle, MySQL
RDF Triple Stores, CouchDB
Thetus, I2
Palantir, Attensity, Themis, Autonomy, Jodange, Lexalytics, Cymfony, Blogpulse
Kapow
Enterprise Content
Content Acquisition
Pre-selected, validated sites
• Epinions.com, Amazon.com, NYT blogs, reader comments
• Tripadvisor.com, Craigslist
• Twitter, Facebook
Blog Search Engines
• Google Blog Search http://blogsearch.google.com/
• Technorati http://technorati.com/
• Blogpulse http://blogpulse.com/
• BoardReader http://boardreader.com/
• http://www.omgili.com/
Spidering
Search Service
Lucene Index Storage
Data Collection: Spidering
• Spider uses breadth-first and depth-first (BFS and DFS) traversal of the crawl space, with URL ordering based on URL tokens, anchor text, and link levels.
• Automated discovery of proxy servers to distribute collection and increase reliability.
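The breadth-first half of the traversal described above can be sketched over an in-memory link graph. Everything here is an assumption for illustration: the graph stands in for fetched pages, the page names are invented, and the ordering by URL tokens and anchor text mentioned in the slide is not modeled; only the link level is tracked.

```python
from collections import deque

def bfs_crawl_order(link_graph, seeds, max_level=2):
    """Return (url, link_level) pairs in breadth-first order, stopping at max_level."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    order = []
    while queue:
        url, level = queue.popleft()
        order.append((url, level))
        if level == max_level:
            continue  # do not expand links beyond the level limit
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order

# Hypothetical link structure: page -> pages it links to.
link_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
}
order = bfs_crawl_order(link_graph, ["a"])
```

Swapping `popleft()` for `pop()` turns the same loop into a depth-first traversal, and replacing the deque with a priority queue is one way to add score-based URL ordering.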
“Dark Web”: the portion of the World Wide Web used to help achieve the sinister objectives of terrorists and extremists.
Content Analysis Model Based
Develop models that generalize characteristics of data
• Machine learning: supervised, semi-supervised, unsupervised
• E.g., sequence labeling, classification
• N-gram language models
• Linguistic: based on rules of English grammar
• Information Extraction
Pattern Mining
• Frequency analysis, local patterns
• Google n-gram data: what words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo
• Query log analysis: learn spelling corrections, learn lists of named entities, learn relationships
• Discover trends: flu, cough, fever: frequency of queries in certain regions, change from the norm
Combine both approaches
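The "change from the norm" idea for query frequencies can be illustrated with a z-score against a region's recent history. This is a minimal sketch assuming daily counts per region are already available; the numbers are invented and the slides do not specify which statistic is actually used.

```python
from statistics import mean, stdev

def deviation_from_norm(history, today):
    """z-score of today's query count relative to the historical daily counts."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two data points
    if sigma == 0:
        return 0.0  # a flat history gives no scale to measure deviation against
    return (today - mu) / sigma

# Invented daily counts of "flu" queries for one region.
history = [100, 110, 90, 105, 95]
z = deviation_from_norm(history, today=140)
if z > 3:
    print(f"Unusual query volume (z = {z:.1f})")
```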
Reliability of Data
How much trust in data? (Forrester)
• Email from people you know: 77%
• Consumer product ratings/reviews: 60%
• Message board posts: 21%
• Personal blog: 18%, company blog: 16%
Splog: spam in weblogs
Off-topic posts
• Comments on blog posts, forums quickly turn into personal rants, completely off-topic
Possible Remedies
• Focus on sites where data is known to be more reliable
• Use technology to filter out spam, splog and off-topic posts
Informal Language
Loss of Functional Indicators
Missing punctuation
Missing or raNDOm case information
Whole phrases reduced to acronyms
Casual, Phonetic Spelling
tha, teh = the
Explicit Sentiment Commentary
Happy Birthdaaaayyyy!!!1!1!
must go <sigh>
:-P grrr…..
Mistaken auto-correction or replacement
Co-operation = Cupertino
The Queen = Queen Elizabeth, “hundreds of worker bees commanded by Queen Elizabeth”
Twitter Conventions
alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? http://ow.ly/2dZf7 #oldspice #sm #socialmedia
RT, hashtags #, url shortening
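The conventions in the example tweet can be pulled apart with a few regular expressions. This is a rough sketch: the patterns are simplifications chosen for this one example and are not Twitter's official entity-parsing rules.

```python
import re

tweet = ("alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? "
         "http://ow.ly/2dZf7 #oldspice #sm #socialmedia")

def parse_tweet(text):
    """Split a tweet into retweet flag, mentions, hashtags, and (shortened) URLs."""
    return {
        "is_retweet": re.search(r"\bRT\b", text) is not None,
        "mentions": re.findall(r"@(\w+)", text),
        "hashtags": re.findall(r"#(\w+)", text),
        "urls": re.findall(r"https?://\S+", text),
    }

parsed = parse_tweet(tweet)
```

Resolving a shortened URL to its destination needs a network request, so it is left out here.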
Word Inventions
refudiate, wee-wee’d up
momager, rickRoll
L33t, IMHO, meh
Solutions:
• spelling correction
• acronym look-up
• machine learning: treat it as a machine translation problem!
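The first two solutions above, spelling correction and acronym look-up, can be sketched as dictionary look-ups. The tables here are tiny hypothetical samples seeded with the slide's own examples ("tha", "teh", "IMHO"), and the rule for collapsing elongated words is a crude heuristic added for illustration; the machine-translation approach is not shown.

```python
import re

# Hypothetical look-up tables; real ones would be far larger or learned from data.
SPELLING = {"tha": "the", "teh": "the"}
ACRONYMS = {"imho": "in my humble opinion"}

def normalize(text):
    """Apply elongation collapsing, spelling look-up, and acronym look-up per token."""
    # Collapse any character repeated three or more times down to two ("gooood" -> "good").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    words = []
    for token in text.split():
        key = token.lower()
        if key in SPELLING:
            words.append(SPELLING[key])
        elif key in ACRONYMS:
            words.append(ACRONYMS[key])
        else:
            words.append(token)
    return " ".join(words)

cleaned = normalize("teh movie was gooood IMHO")
```

Collapsing to two characters is wrong for words like "sooo", which is one reason a learned model is attractive for this task.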
Legal Issues
Privacy of data
• UK has lawful intercept program
• What about results of data mining?
Liability
• Major issue for pharmaceutical companies: if they discover a report of a drug side effect, they are required to report it
• Analysts making positive public statements about company earnings, yet contradicting this on blogs, Facebook pages
Workplace Issues
• Time spent on social media sites during work hours leading to lower productivity
Accuracy of Analysis
Text analysis is based on natural language processing which is a useful, but imperfect technology
“Bill Gates, the CEO of Microsoft was initially very happy about its site location in Seattle, but now he has other thoughts. He is very displeased with the pollution…. Also, its employees are upset with the construction work…around its vicinity. In all, he wants to abandon the current site…..”
Who is expressing an opinion?
What is the opinion about?
Is it positive or negative?
Validate performance accuracy through benchmarks on specially constructed data sets
1 - http://gretawire.blogs.foxnews.com/ouch-this-is-not-fair-to-president-obama-yes-an-accident-but-one-that-needs-to-be-corrected/#ixzz0uKumt1wi
Sentiment Analysis
I think, Obama needs to begin to take the blame for his failed policies -- his statement "that his policies are getting us out of this mess" are a big lie. (1)
Opinion Holder
Topic
Target
SENTIMENT Attributes: ID: ex1, TargetID: t1, Polarity: Negative
Aims to determine the attitude of a speaker or a writer with respect to some target or topic.
Opinion summary
In product reviews, we are interested in generating a feature-based summary for a product.
Digital_camera_1:
Feature: picture quality
Positive: 253
<individual review sentences>
Negative: 6
<individual review sentences>
Feature: size
Positive: 134
<individual review sentences>
Negative: 10
<individual review sentences>
…
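A feature-based summary like the one above is essentially a grouping of opinion sentences by feature and polarity. The sketch below assumes an upstream step has already produced (feature, polarity, sentence) triples; the function names and the review sentences are invented for illustration.

```python
from collections import defaultdict

def feature_summary(opinions):
    """Group review sentences by product feature and polarity."""
    summary = defaultdict(lambda: {"Positive": [], "Negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return summary

def render(product, summary):
    """Format the summary in the layout used on the slide."""
    lines = [f"{product}:"]
    for feature, by_polarity in summary.items():
        lines.append(f"Feature: {feature}")
        for polarity in ("Positive", "Negative"):
            sentences = by_polarity[polarity]
            lines.append(f"{polarity}: {len(sentences)}")
            lines.extend(f"  {s}" for s in sentences)
    return "\n".join(lines)

# Invented review sentences with features and polarity already assigned.
opinions = [
    ("picture quality", "Positive", "The pictures are sharp."),
    ("picture quality", "Positive", "Great colors."),
    ("picture quality", "Negative", "Photos look grainy indoors."),
    ("size", "Positive", "Fits in my pocket."),
]
summary = feature_summary(opinions)
report = render("Digital_camera_1", summary)
print(report)
```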
Scalability: Massively Distributed/Parallel Computing
Hadoop
• Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage
• Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster: if your data doubles in size, just buy twice as many computers
• Hadoop is now an Apache project led by the Grid Computing team at Yahoo!
HIVE
• SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop
• Developed at Facebook, now an Apache subproject
Facebook Analytics: How many people are discussing being laid off; plot percentage of total posts by state
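The layoff question above fits the Map-Reduce pattern: the map step emits a per-state pair for each post, and the reduce step aggregates per state. The sketch below simulates that flow in a single Python process and does not use Hadoop or Hive; the post format, the "laid off" substring test, and the reading of "percentage of total posts" as a per-state percentage are all assumptions.

```python
from collections import defaultdict

def map_post(post):
    """Map step: emit (state, (layoff_mention, 1)) for one (state, text) post."""
    state, text = post
    hit = 1 if "laid off" in text.lower() else 0
    yield state, (hit, 1)

def reduce_state(state, values):
    """Reduce step: percentage of a state's posts that mention being laid off."""
    hits = sum(h for h, _ in values)
    total = sum(n for _, n in values)
    return state, 100.0 * hits / total

def run(posts):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for post in posts:
        for key, value in map_post(post):
            grouped[key].append(value)
    return dict(reduce_state(key, values) for key, values in grouped.items())

# Invented posts tagged with the poster's state.
posts = [
    ("NY", "I got laid off today"),
    ("NY", "Nice weather"),
    ("CA", "Laid off again"),
    ("CA", "Lunch was great"),
    ("CA", "Traffic is terrible"),
    ("CA", "Going to the beach"),
]
percent_by_state = run(posts)
```

On a cluster the map and reduce functions run on different machines and the framework handles the grouping, which is what lets the job scale with the number of machines.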
Language Usage Statistics[1]
Urdu-speaking Internet users: 12,000,000 (2006), ~1.6% of 42.4%
[1] Source: Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009. Copyright 2009, Miniwatts Marketing Group
English is not the only language on the internet
Multilingual Social Media Mining
How did people in Egypt, Israel and Pakistan react to the latest presidential speech?
Opinion Extraction
• Topic: What is the opinion about?
• Opinion Holder: Who is expressing it?
• What is the intensity of the opinion?
• In what context is it being expressed?
Emotion Detection
• What kind of emotion is being expressed? Goes beyond just positive or negative emotion
Required to perform behavioral analysis, cross-cultural analysis
Faceted Search: Sentiment about Topic
People are filled with anger and sorrow because of the policies made by Musharaf.
OPINION HOLDER – Writer, People
TARGET – Musharaf’s policies (Musharaf is an implied target)
Multilingual Text Analysis
Dealing with script, coding variations
Even low-level text analysis becomes difficult
• Chinese: no white space between words
• Arabic: complex diacriticals
Language Training Resources
• Lexicons, annotated corpora, etc.
• If sufficient training data exists, new languages can be adapted to fairly easily
• E.g. core Russian in 3 weeks!
Treat language porting as a special case of domain porting
Ideally, should involve creation of new data sources, not new code
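To see why missing white space makes even low-level analysis hard, consider segmenting Chinese text with a lexicon. The sketch below uses greedy forward maximum matching, a classic baseline named here as a stand-in and not necessarily what the speaker's system uses; the two-word lexicon is a hypothetical sample. It also illustrates the porting point above: the code is language-independent and only the lexicon changes.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    max_len = max(len(word) for word in lexicon)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                segments.append(candidate)
                i += n
                break
    return segments

# Hypothetical two-entry lexicon: "unofficial" and "supper".
lexicon = {"非正式", "晚餐"}
segments = max_match("共进非正式晚餐", lexicon)
```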
斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐
Babelfish translation: Slovenia premier the sand blowing, Council of Europe President Baluozuo and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
Context Aware Translation
Name translation output:
<NeGPE english="Slovenia">斯洛文尼亚</NeGPE> 总理<NePer english="Jansa">扬沙</NePer> ,<NeOrg english="European Commission">欧洲 委员会</NeOrg> 主席<NePer english="Barroso">巴罗佐</NePer> 和<NeGPE english="European Union">欧盟</NeGPE> 外交 政策 负责人<NePer english="Solana">索拉纳</NePer> 与<NePer english="Medvedev">梅德韦杰夫</NePer> 共 进 非正式 晚餐 。
Powered by Semantex™ extracted entities, Babelfish translates as:
Slovenia Premier Jansa, Council of Europe President Barroso and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.
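One plausible reading of the pipeline above is that tagged names are swapped for their English forms before the sentence reaches the general-purpose translator. The sketch below does that substitution on markup shaped like the name translation output; the regular expression and function names are assumptions, and how Semantex and Babelfish are actually connected is not described in the slides.

```python
import re

# Matches tags of the form <NeXxx english="...">source text</NeXxx>.
ENTITY = re.compile(r'<(Ne\w+) english="([^"]*)">(.*?)</\1>')

def extract_entities(tagged):
    """Return (entity_type, source_text, english_name) for each tagged entity."""
    return [(m.group(1), m.group(3), m.group(2)) for m in ENTITY.finditer(tagged)]

def substitute_names(tagged):
    """Replace each tagged entity with its English name, leaving other text untouched."""
    return ENTITY.sub(lambda m: m.group(2), tagged)

tagged = ('<NeGPE english="Slovenia">斯洛文尼亚</NeGPE> 总理'
          '<NePer english="Jansa">扬沙</NePer>')
entities = extract_entities(tagged)
prepared = substitute_names(tagged)
```

The remaining untagged text would still go through machine translation; only the names are protected from literal renderings such as "the sand blowing".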
Babelfish Translation vs. Context Aware Translation
Mining Wikipedia for Lexicons
• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing
COLABA: Colloquial Arabic Blog Analysis
– Proliferation of open source, social media
– Dominance of non-English content
– Use of dialects and colloquial language
– Limited supply of multilingual analysts
Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”
Arabic Dialects are not handled well in current machine translation systems.
COLABA enables MSA tools to interpret dialects correctly.
Tools made for MSA fail on Arabic dialects
Arabic Variant | Arabic Source Text | Google Translate
Egyptian | ليه اتقطعت، الكهربابس؟ كده | Atqtat electrical wires, Why are Posted?
Levantine | كهربا، مفيش شكلوهيك؟ ليش | Cklo Mafeesh كهربا, Lech heck?
Iraqi | خير؟ كهرباء، ماكو شو | Xu MACON electricity, good?
MSA | ماذا كهرباء، اليوجدحصل؟ | Does not have electricity, what happened?
Code Mixing, Switching
Use of Latin script: lack of transliteration standards makes it difficult to process
Spanglish, Hinglish, Urdish, etc.
Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open. [It is sad to see that those words that even a non muslim would fear to utter until yesterday, this man has brought it out in the open]
Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system
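The second solution above, routing longer English spans to an English NLP system, needs a way to find those spans first. The sketch below does this with a vocabulary look-up per token, a deliberately simple stand-in for the learned language ID the slide describes; the vocabulary and the minimum span length are assumptions.

```python
def english_spans(tokens, english_vocab, min_len=4):
    """Return (start, end) token index pairs for runs of at least min_len English tokens."""
    spans = []
    start = None
    for i, token in enumerate(tokens + [None]):  # the None sentinel flushes the final run
        is_english = token is not None and token.lower() in english_vocab
        if is_english and start is None:
            start = i
        elif not is_english and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    return spans

# Hypothetical mini-vocabulary; a real system would use a full lexicon or a classifier.
english_vocab = {"this", "man", "has", "brought", "it", "out", "in", "the", "open"}
tokens = "dartay thay abhi this man has brought it out in the open".split()
spans = english_spans(tokens, english_vocab)
longest = " ".join(tokens[spans[0][0]:spans[0][1]])
```

A look-up like this fails on romanized words that happen to be spelled like English words, which is why the slide proposes combining evidence from two taggers with machine learning.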
Resource Poor Languages
Bootstrap Learning: process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier
Useful when there is not enough annotated data
Requirement: needs seed data
[Figure: bootstrap loop with components labeled Seed, Training, Data, Correct Samples, Corrections]
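The bootstrap loop defined above can be sketched with a toy classifier. The nearest-centroid model over a single numeric feature, the margin-based confidence, and all data values are assumptions made to keep the example self-contained; the manual corrections shown in the figure are not modeled.

```python
def train(labeled):
    """Toy nearest-centroid classifier: the mean feature value for each label."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(model, x):
    """Return (label, margin); the margin is how much closer the best centroid is."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    return best, abs(x - model[ranked[1]]) - abs(x - model[best])

def bootstrap(seed, unlabeled, min_margin=1.5, max_rounds=10):
    """Add confidently self-labeled examples to the training set, then retrain."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, rest = [], []
        for x in pool:
            label, margin = predict(model, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return train(labeled), labeled

# Invented seed data and unlabeled pool.
seed = [(1.0, "neg"), (9.0, "pos")]
model, labeled = bootstrap(seed, [2.0, 8.0, 4.5, 6.0])
```

The confidence threshold matters: set too low, the classifier reinforces its own mistakes, which is where the human corrections in the figure would help.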
The Road Ahead?
Strengths
• free form facilitates capturing the true voice of customer, wisdom of crowd
• can be expressed through voice, text messaging on mobile phones, etc.
Weaknesses
• language analysis and mining are challenging
• susceptible to spam, self-serving use by companies
• behaviour, predictive models need more research
Threats
• privacy and security issues: possible to assimilate detailed knowledge about a person’s activities, whereabouts
• can lead to anti-social behaviour!
Opportunities
• promise of collective problem solving: coordination, cooperation
• mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity