Social Media Analytics: The Value Proposition

Social Media Analytics: the Value Proposition Rohini K. Srihari KDD 2010 Workshop on Social Media Analytics July 25, 2010


Rohini K. Srihari's presentation at the KDD 2010 Workshop on Social Media Analytics.

Overview:
- What is Social Media?
- Value Proposition: Why mine social media?
- Business Analytics
- Counterterrorism
- Challenges
- Technology, Challenges
- Multilingual social media mining


Outline

What is Social Media?

Value Proposition: Why mine social media?
• Business Analytics
• Counterterrorism

Challenges

Technology, Challenges

Multilingual social media mining

Future

Social Media Data → Actionable Intelligence

Consumer Generated, Not Edited, Not Authenticated

Data/Text Mining

Analyze Observational Data to find unsuspected relationships and Summarize data in novel ways that are understandable and useful to data owner

Information Discovery

non-trivial, implicit, previously unknown relationships

Example of a trivial relationship: those who are pregnant are female

Summarize

as Patterns and Models (usually probabilistic)

Usefulness: results are meaningful and lead to some advantage, usually economic

Analysis:

Automatic/Semi-Automatic Process (Knowledge Extraction)

Extracting useful information from large data sets

Value Proposition

Market Size
• Business Analytics market projected to be $28 billion in 2011 (IDC Report)
• Social Analytics taking leading position of interest within organizations

Integrating Social Media Analytics and Business Intelligence

Source: HCL India

Customer Relationship Management

Data sources are primarily internal
• Call center transcripts
• E-mail
• Customer feedback

Cost avoidance
• Product exchange mitigation
• Early warning detection on new products

Increase in customer satisfaction and loyalty

Insight towards new products, product features

Identification of possible marketing opportunities

e-Service Chat Monitoring

Operator: How can I assist you today?
Customer: I need help with operating your coffee maker I bought from Amazon.com yesterday.
Operator: Certainly. What problem are you facing?
Customer: I fill in the coffee powder, water, and then press the red button on the side, and nothing happens.
Operator: The red button enables the ‘clean coffee maker’ process. You will need to use the white knob on the other side to brew coffee.
Customer: I see.
Customer: BTW, in the Nespresso cappuccino machine I recently bought, it was the red button for start.
Operator (draft reply): Is there anything else I can assist with today? [SEND]

Alert: COMPETITOR PRODUCT MENTION
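The competitor alert in the chat above could be approximated with a simple watch-list match. This is only a minimal sketch: the `COMPETITOR_TERMS` list, the `competitor_alerts` helper, and the "keurig" entry are hypothetical, and the slides do not say how the real alerting was implemented.

```python
import re

# Hypothetical watch list; "nespresso" comes from the chat, "keurig" is illustrative.
COMPETITOR_TERMS = ["nespresso", "keurig"]

def competitor_alerts(turns, terms=COMPETITOR_TERMS):
    """Return (turn_index, term) pairs for customer turns that mention a watched term."""
    alerts = []
    for i, (speaker, text) in enumerate(turns):
        if speaker != "Customer":
            continue  # only customer turns are monitored in this sketch
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                alerts.append((i, term))
    return alerts

chat = [
    ("Operator", "How can I assist you today?"),
    ("Customer", "I need help with operating your coffee maker."),
    ("Customer", "BTW, in the Nespresso cappuccino machine I recently bought, "
                 "it was the red button for start."),
]
alerts = competitor_alerts(chat)
```

A keyword list like this misses paraphrases and misspellings, which is one reason the later slides turn to entity extraction and text normalization.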

Reputation Management

Data sources are primarily external, e.g.
• www.youtube.com
• www.epinions.com
• tripadvisor.com (travel related website)

Consumer Brand Analytics
• What are people saying about our brand?

Marketing Communications
• Significant spending on marketing, advertising: companies trying to position their products
• Brand analytics helps to determine whether such campaigns are effective

Mining Product Reviews

Application is Industrial Design
• Automatically mine product reviews for information on product features, new requests, etc.
• Focus on wheelchairs

Features Extracted
• Easy to use
• Fit into a car
• Comfortable chair
• Light weight
• Convenient to fold
• Sturdy
• Good price

Viral Marketing

Jure Leskovec (Stanford), Lada Adamic (U of Michigan), Bernardo A. Huberman (HP Labs)

Personalized recommendations

Cross-selling: “people who bought x also bought y”

Collaborative filtering: “based on ratings of users like you…”

Delicious, Digg.com

Viral marketing

68% of consumers consult friends and family before purchasing home electronics (Burke 2003)

Success rate: # of purchases following a recommendation / # recommenders

Books overall have a 3% success rate
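The success-rate definition above is a plain ratio. A minimal sketch, assuming the two counts are already available; the function name and the example counts are illustrative and are not figures from the cited study:

```python
def success_rate(purchases_after_recommendation, recommenders):
    """Success rate = purchases following a recommendation / number of recommenders."""
    if recommenders <= 0:
        raise ValueError("recommenders must be positive")
    return purchases_after_recommendation / recommenders

# Illustrative counts only: 30 purchases from 1000 recommenders gives a 3% rate.
rate = success_rate(30, 1000)
```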

500 million active users!
▪ More than 20 million users update their status at least once each day
▪ More than 850 million photos uploaded to the site each month
▪ >1 billion pieces of content (web links, blog posts, photos, etc.) shared each week

Many different groups clamoring for data and text analytics:
▪ FB Engineers
▪ Advertisers
▪ Page owners
▪ Platform/Connect developers
▪ Marketers
▪ Academics

An aside: Social Media Marketing

http://www.socialmediaexaminer.com/new-studies-show-value-of-social-media/

Lead Generation

Breakdown of respondents’ top benefits of social networking:
• 50%: Generating leads
• 45%: Keeping up with the industry
• 44%: Monitoring online conversation
• 38%: Finding vendors/suppliers

Online Forum Users Are Enthusiastic Brand Advocates
• 79.2% of forum contributors help a friend or family member make a decision about a product purchase – compared with 47.6% of non-contributors and 53.8% overall.
• 65% of forum contributors share advice (offline and in person) based on information that they’ve read online – compared with 35% of non-contributors and 40.8% overall.
• 57.7% of forum contributors proactively recommend someone make a particular purchase – compared with 16.9% of non-contributors and 24.9% overall.

Only 47% of Companies Experimenting With Social Media
• Gartner study predicts that by the end of 2010, more than 60% of Fortune 1000 companies will manage an online community.
• ComBlu’s study, The State of Online Branded Communities, shows that most companies do not understand how to engage within online communities and have no real idea of what their customers want on these sites.

Citizen Response

E-RuleMaking
• the use of digital technologies by government agencies in rulemaking, decision making processes
• solicit citizen feedback on bills being debated in Congress
• What new issues are being raised, what aspects of bill are popular, unpopular
• Better to mine social media than using focus groups?

Political Campaigns
• Why do people support a candidate: is it really based on issues?

Use Case: Understanding and Visualizing Consumer Responses


Extracting Entities and Sentiment to Power Alerting, Link Diagrams, and Geo-Mapping

Twitter: Real-Time Citizen Journalism


• Mumbai terror attack regarded as coming of age of Twitter

• citizen journalism provided more valuable information than wire services, broadcast news

• information about places to avoid, well being of relatives, friends, etc.

• many redundant posts, users have to wade through hundreds of posts to locate useful information

• Goal: to mine this data in real-time and produce well organized summaries

Law Enforcement, Homeland Security


• Facebook
  • gang members frequently boast about their activities on their Facebook pages
• Chat rooms
  • Stalkers, pedophiles
• Twitter
  • protest rallies being planned
  • who, what, where, when
• Craigslist

G20 Summit Protest

Human Behaviour Analysis

Process social media content, provide tools for analysts to:
• Identify social networks: groups, members
• Identify topics of discussion and sentiment
  • E.g. angry at govt., wanting retaliation, peacemakers
  • Thought influencers
• Identify social goals through analysis of verbal communication
  • Manipulation: persuasion, threats, coercion
  • Religious supremacy: religious analogues
  • Recruitment

Social Media Content

Link Diagrams

Predictive Modeling

Technology, Challenges

Analyzing Social Media Data

Content Analysis
• Text analysis, multimedia analysis

Structure Analysis

Usage Analysis
• Search engine optimization
• What keywords are driving customers to your site, competitor sites
• Query logs, site traffic

Ideally combine all three of these!

Solution Framework

Mark Logic; Oracle, MySQL

RDF Triple Stores; CouchDB

Thetus; I2

Palantir; Attensity; Themis; Autonomy; Jodange, Lexalytics, Cymfony, Blogpulse

Kapow

Enterprise Content

Content Acquisition

Pre-selected, validated sites
• Epinions.com, Amazon.com, NYT blogs, reader comments
• Tripadvisor.com, Craigslist
• Twitter, Facebook

Blog Search Engines
• Google Blog Search http://blogsearch.google.com/
• Technorati http://technorati.com/
• Blogpulse http://blogpulse.com/
• BoardReader http://boardreader.com/
• http://www.omgili.com/

Spidering

Search Service

Lucene Index Storage

Data Collection: Spidering

Spider uses breadth-first and depth-first (BFS and DFS) traversal of the crawl space, with URL ordering based on URL tokens, anchor text, and link levels.

• Automated discovery of proxy servers to distribute collection and increase reliability.

“Dark Web”: the portion of the World Wide Web used to help achieve the sinister objectives of terrorists and extremists.
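The breadth-first traversal with URL ordering described above can be sketched on a toy link graph. Everything here is an assumption for illustration: the `LINKS` dictionary stands in for fetched pages, and `url_score` is a hypothetical token-based ordering, not the spider's actual ranking.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched pages.
LINKS = {
    "http://example.org/": ["http://example.org/about", "http://example.org/forum"],
    "http://example.org/forum": ["http://example.org/forum/thread1",
                                 "http://example.org/login"],
    "http://example.org/about": [],
    "http://example.org/forum/thread1": [],
    "http://example.org/login": [],
}

def url_score(url, keywords=("forum", "thread")):
    """Count keyword hits in the URL (a stand-in for token-based URL ordering)."""
    return sum(1 for k in keywords if k in url)

def bfs_crawl(seed, links, max_depth=2):
    """Visit pages breadth first; a page's out-links are queued best score first."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # link level limit reached, do not expand further
        for nxt in sorted(links.get(url, []), key=url_score, reverse=True):
            if nxt not in seen:
                seen.add(nxt)
                visited.append(nxt)
                queue.append((nxt, depth + 1))
    return visited

order = bfs_crawl("http://example.org/", LINKS)
```

Swapping the `deque` for a stack (pop from the same end that is pushed) would turn the same loop into a depth-first crawl.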

Content Analysis

Model Based
• Develop models that generalize characteristics of data
• Machine learning: supervised, semi-supervised, unsupervised
  • E.g., sequence labeling, classification
  • N-gram language models
• Linguistic: based on rules of English grammar
  • Information Extraction

Pattern Mining
• Frequency analysis, local patterns
• Google n-gram data
  • What words are used in conjunction with Buffalo, Buffalo Sabres, University at Buffalo
• Query log analysis
  • Learn spelling corrections, learn lists of named entities, learn relationships
  • Discover trends
  • Flu, cough, fever: frequency of queries in certain regions, change from the norm

Combine both approaches
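The "change from the norm" idea for query logs above can be sketched as a z-score check per region. This is an illustrative assumption, not the method used in the talk: the counts are made up, "Albany" is a hypothetical second region, and the threshold of 3 standard deviations is arbitrary.

```python
from statistics import mean, stdev

def unusual_regions(history, current, threshold=3.0):
    """Flag regions whose current query count is far above their own historical norm."""
    flagged = []
    for region, counts in history.items():
        mu, sigma = mean(counts), stdev(counts)
        if sigma == 0:
            continue  # no variation in the history, z-score is undefined
        z = (current[region] - mu) / sigma
        if z > threshold:
            flagged.append(region)
    return flagged

# Made-up daily counts of "flu" queries per region.
history = {"Buffalo": [10, 12, 11, 9, 10], "Albany": [20, 22, 19, 21, 20]}
current = {"Buffalo": 30, "Albany": 21}
flagged = unusual_regions(history, current)
```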

Reliability of Data

How much trust in data? (Forrester)
• Email from people you know: 77%
• Consumer product ratings/reviews: 60%
• Message board posts: 21%
• Personal blog: 18%, company blog: 16%

Splog: spam in weblogs

Off-topic posts
• Comments on blog posts, forums quickly turn into personal rants, completely off-topic

Possible Remedies
• Focus on sites where data is known to be more reliable
• Use technology to filter out spam, splog and off-topic posts

Informal Language

Loss of Functional Indicators

Missing punctuation

Missing or raNDOm case information

Whole phrases reduced to acronyms

Casual, Phonetic Spelling

tha, teh = the

Explicit Sentiment Commentary

Happy Birthdaaaayyyy!!!1!1!

must go <sigh>

:-P grrr…..

Mistaken auto-correction or replacement

Co-operation = Cupertino

The Queen = Queen Elizabeth, “hundreds of worker bees commanded by Queen Elizabeth”

Twitter Conventions

alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? http://ow.ly/2dZf7 #oldspice #sm #socialmedia

RT, hashtags #, url shortening
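The conventions in the tweet above (RT, @-mentions, hashtags, shortened URLs) can be pulled apart with a few regular expressions. This is a rough sketch; the `parse_tweet` helper and its patterns are assumptions and will not cover every variant seen on Twitter.

```python
import re

def parse_tweet(text):
    """Extract retweet flag, mentions, hashtags, and URLs from a tweet."""
    return {
        "is_retweet": bool(re.search(r"\bRT\s+@\w+", text)),
        "mentions": re.findall(r"@(\w+)", text),
        "hashtags": re.findall(r"#(\w+)", text),
        "urls": re.findall(r"https?://\S+", text),
    }

tweet = ("alanbr82 RT @royjwells: New Blog Post - Will Old Spice Achieve a ROI? "
         "http://ow.ly/2dZf7 #oldspice #sm #socialmedia")
parsed = parse_tweet(tweet)
```

Shortened URLs such as the ow.ly link would still need to be resolved to their targets before the linked content can be analyzed.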

Word Inventions

refudiate, wee-wee’d up

momager, rickRoll

L33t, IMHO, meh

Solutions:

• spelling correction

• acronym look-up

• machine learning: treat it as a machine translation problem!
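The first two solutions above, spelling correction and acronym look-up, can be sketched as dictionary passes over the tokens. The tables here are tiny hypothetical stand-ins: the "tha/teh" and "IMHO" entries come from the slides, while "btw" is taken from the chat example. The machine-translation approach is not shown.

```python
import re

# Tiny illustrative tables; real systems would use much larger resources.
SPELLING = {"tha": "the", "teh": "the"}
ACRONYMS = {"imho": "in my humble opinion", "btw": "by the way"}

def normalize(text):
    """Lowercase, tokenize, then apply spelling correction and acronym expansion."""
    tokens = re.findall(r"[a-z']+", text.lower())
    out = []
    for tok in tokens:
        tok = SPELLING.get(tok, tok)        # spelling correction
        out.append(ACRONYMS.get(tok, tok))  # acronym look-up
    return " ".join(out)

cleaned = normalize("IMHO teh coffee maker is meh")
```

Note that this sketch throws away case and punctuation, which is itself a loss of functional indicators; a production normalizer would need to preserve them.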

Legal Issues

Privacy of data
• UK has lawful intercept program
• What about results of data mining?

Liability
• Major issue for pharmaceutical companies: if they discover report of side effect of drug, they are required to report it
• Analysts making positive public statements about company earnings, yet contradicting this on blogs, Facebook pages

Workplace Issues
• Time spent on social media sites during work hours leading to lower productivity

Accuracy of Analysis

Text analysis is based on natural language processing which is a useful, but imperfect technology

“Bill Gates, the CEO of Microsoft was initially very happy about its site location in Seattle, but now he has other thoughts. He is very displeased with the pollution…. Also, its employees are upset with the construction work…around its vicinity. In all, he wants to abandon the current site…..”

Who is expressing an opinion?

What is the opinion about?

Is it positive or negative?

Validate performance accuracy through benchmarks on specially constructed data sets

Sentiment Analysis

Aims to determine the attitude of a speaker or a writer with respect to some target or topic.

Example: I think, Obama needs to begin to take the blame for his failed policies -- his statement "that his policies are getting us out of this mess" are a big lie [1].

Annotation labels: Opinion Holder, Topic, Target

SENTIMENT attributes: ID: ex1, TargetID: t1, Polarity: Negative

[1] http://gretawire.blogs.foxnews.com/ouch-this-is-not-fair-to-president-obama-yes-an-accident-but-one-that-needs-to-be-corrected/#ixzz0uKumt1wi

Opinion summary

In product reviews, we are interested in generating a feature-based summary for a product.

Digital_camera_1:

Feature: picture quality

Positive: 253

<individual review sentences>

Negative: 6

<individual review sentences>

Feature: size

Positive: 134

<individual review sentences>

Negative: 10

<individual review sentences>
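The feature-based summary layout above can be produced by grouping already-classified opinion sentences. A minimal sketch, assuming some upstream step has produced (feature, polarity, sentence) triples; the helper names and the sample sentences are hypothetical.

```python
from collections import defaultdict

def summarize(opinions):
    """Group (feature, polarity, sentence) triples into a feature-based summary."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return summary

def render(product, summary):
    """Format the summary in the same shape as the slide: feature, then counts."""
    lines = [product + ":"]
    for feature, groups in summary.items():
        lines.append("Feature: " + feature)
        for polarity in ("positive", "negative"):
            lines.append("  %s: %d" % (polarity.capitalize(), len(groups[polarity])))
    return "\n".join(lines)

# Hypothetical review sentences, already tagged with feature and polarity.
opinions = [
    ("picture quality", "positive", "The pictures are sharp."),
    ("picture quality", "negative", "Photos look grainy indoors."),
    ("size", "positive", "It fits in my pocket."),
]
summary = summarize(opinions)
report = render("Digital_camera_1", summary)
```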

Scalability: Massively Distributed/Parallel Computing

Hadoop
• Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage
• Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster: if your data doubles in size, just buy twice as many computers
• Hadoop is now an Apache project led by the Grid Computing team at Yahoo!

HIVE
• SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop
• Developed at Facebook, now an Apache subproject
• Facebook Analytics: How many people are discussing being laid off; plot percentage of total posts by state
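The Map-Reduce pattern and the "laid off" question above can be imitated in a single process to show the shape of the computation. This sketch makes several assumptions: posts are (state, text) pairs, a plain substring match stands in for real text analysis, and the percentage is taken per state. It does not use Hadoop or Hive.

```python
from collections import defaultdict

def map_post(post):
    """Map step: emit (state, (mentions_layoff, 1)) for one post."""
    state, text = post
    mentions = 1 if "laid off" in text.lower() else 0
    yield state, (mentions, 1)

def reduce_state(values):
    """Reduce step: percentage of a state's posts that mention being laid off."""
    mentions = sum(m for m, _ in values)
    total = sum(t for _, t in values)
    return 100.0 * mentions / total

def run_job(posts):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for post in posts:
        for state, value in map_post(post):
            grouped[state].append(value)
    return {state: reduce_state(values) for state, values in grouped.items()}

# Made-up posts for illustration.
posts = [("NY", "I got laid off today"), ("NY", "Nice weather"),
         ("CA", "Laid off again"), ("CA", "Lunch time"),
         ("CA", "Traffic is bad"), ("CA", "New phone")]
percent_by_state = run_job(posts)
```

Because each post is mapped independently and each state is reduced independently, both steps can be spread across machines, which is where the linear scaling claim comes from.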

Multilingual Applications

Language Usage Statistics[1]

Urdu speaking Internet users - 12,000,000 (2006) ~ 1.6% of 42.4%

[1] Source: Internet World Stats. Based on 1,733,993,741 estimated internet users for Sept 30, 2009. Copyright 2009, Miniwatts Marketing Group

English is not the only language on the internet

Multilingual Social Media Mining

How did people in Egypt, Israel and Pakistan react to the latest presidential speech?

Opinion Extraction
• Topic: What is the opinion about?
• Opinion Holder: Who is expressing it?
• What is the intensity of the opinion?
• In what context is it being expressed?

Emotion Detection
• What kind of emotion is being expressed? Goes beyond just the positive or negative emotion

Required to perform behavioral analysis, cross cultural analysis

Faceted Search: Sentiment about Topic

People are filled with anger and sorrow because of the policies made by Musharaf.

OPINION HOLDER – Writer, People

TARGET – Musharaf’s policies (Musharaf is an implied target)

Multilingual Text Analysis

Dealing with script, coding variations

Even low-level text analysis becomes difficult
• Chinese: no white space between words
• Arabic: complex diacriticals

Language Training Resources
• Lexicons, annotated corpora, etc.
• If sufficient training data exists, new languages can be adapted to fairly easily
• E.g. core Russian in 3 weeks!

Treat language porting as a special case of domain porting

Ideally, should involve creation of new data sources, not new code
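To illustrate why missing white space makes Chinese hard, here is a sketch of forward maximum matching, a simple dictionary-based segmentation baseline. It is swapped in purely for illustration and is not claimed to be the method used in the talk; the toy `LEXICON` is a hypothetical stand-in for the lexicons mentioned above.

```python
# Toy lexicon; a real one would hold many thousands of entries.
LEXICON = {"斯洛文尼亚", "总理", "扬沙", "欧洲", "委员会", "主席"}

def segment(text, lexicon, max_len=5):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

words = segment("斯洛文尼亚总理扬沙", LEXICON)
```

The design mirrors the slide's point about porting: coverage improves by adding lexicon entries (new data), not by changing the code.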

Chinese Text Analysis


斯洛文尼亚总理扬沙,欧洲委员会主席巴罗佐和欧盟外交政策负责人索拉纳与梅德韦杰夫共进非正式晚餐

Slovenia premier the sand blowing, Council of Europe President Baluozuo and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.

Context Aware Translation

Name translation output:

<NeGPE english="Slovenia">斯洛文尼亚</NeGPE> 总理<NePer english="Jansa">扬沙</NePer> ,<NeOrg english="European Commission">欧洲 委员会</NeOrg> 主席<NePer english="Barroso">巴罗佐</NePer> 和<NeGPE english="European Union">欧盟</NeGPE> 外交 政策 负责人<NePer english="Solana">索拉纳</NePer> 与<NePer english="Medvedev">梅德韦杰夫</NePer> 共 进 非正式 晚餐 。

Powered by Semantex™ extracted entities, Babelfish translates as:

Slovenia Premier Jansa, Council of Europe President Barroso and European Union foreign policy person in charge Solana and Medvedev have the unofficial supper.


Mining Wikipedia for Lexicons

• Translation lexicons automatically extracted from Chinese Wikipedia, use cross language links to add English translations
• Easy to regenerate with new versions of Wikipedia
• Chinese Wikipedia is constantly growing

COLABA: Colloquial Arabic Blog Analysis
– Proliferation of open source, social media
– Dominance of non-English content
– Use of dialects and colloquial language
– Limited supply of multilingual analysts

Human translation for all Arabic variants below is the same: “There is no electricity, what happened?”

Arabic dialects are not handled well in current machine translation systems.

COLABA enables MSA (Modern Standard Arabic) tools to interpret dialects correctly.

Tools made for MSA fail on Arabic dialects


Arabic Variant | Arabic Source Text | Google Translate
Egyptian | ليه اتقطعت، الكهربابس؟ كده | Atqtat electrical wires, Why are Posted?
Levantine | كهربا، مفيش شكلوهيك؟ ليش | Cklo Mafeesh كهربا, Lech heck?
Iraqi | خير؟ كهرباء، ماكو شو | Xu MACON electricity, good?
MSA | ماذا كهرباء، اليوجدحصل؟ | Does not have electricity, what happened?

Code Mixing, Switching
• Use of Latin script: lack of transliteration standards makes it difficult to process
• Spanglish, Hinglish, Urdish, etc.

Afsoos key baat hai . kal tak jo batain Non Muslim bhi kartay hoay dartay thay abhi this man has brought it out in the open. [It is sad to see that those words that even a non muslim would fear to utter until yesterday, this man has brought it out in the open]

Solutions:
• Apply “romanized” POS tagger, English tagger in tandem: use machine learning to combine evidence and generate final tag, language ID
• For longer English spans, use English NLP system

Resource Poor Languages

Bootstrap Learning: process of improving the performance of a trained classifier by iteratively adding data that is labeled by the classifier itself to the training set, and retraining the classifier

Useful when there is not enough annotated data

Requirement: needs seed data

[Diagram: Seed, Training, Data, Correct Samples, Corrections]
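The bootstrap learning loop defined above can be sketched with a deliberately simple classifier. Everything here is an assumption for illustration: a one-dimensional nearest-centroid model stands in for a real text classifier, the data points are made up, and the confidence rule (`min_margin`) is a hypothetical choice.

```python
def train(labeled):
    """Fit a nearest-centroid model: the mean feature value per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Return (label, margin); margin is how much closer the best centroid is."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(x - model[runner_up]) - abs(x - model[best])

def bootstrap(seed, unlabeled, min_margin=1.5, rounds=5):
    """Iteratively add confidently self-labeled points to the training set and retrain."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = [(x, predict(model, x)) for x in pool]
        added = [(x, label) for x, (label, margin) in scored if margin >= min_margin]
        if not added:
            break  # nothing confident enough, stop early
        labeled.extend(added)
        added_xs = {x for x, _ in added}
        pool = [x for x in pool if x not in added_xs]
    return train(labeled), labeled

# Seed data: one labeled example per class (the "needs seed data" requirement).
seed = [(1.0, "neg"), (9.0, "pos")]
model, labeled = bootstrap(seed, [2.0, 8.0, 4.5, 6.0])
```

In this toy run the ambiguous point 4.5 is never added, which reflects the usual safeguard in self-training: only confident predictions should feed back into the training set, otherwise errors compound.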

The Road Ahead?

Strengths
• free form facilitates capturing the true voice of customer, wisdom of crowd
• can be expressed through voice, text messaging on mobile phones, etc.

Weaknesses
• language analysis and mining are challenging
• susceptible to spam, self-serving use by companies
• Behaviour, predictive models need more research

Threats
• privacy and security issues: possible to assimilate detailed knowledge about person’s activities, whereabouts
• can lead to anti-social behaviour!

Opportunities
• promise of collective problem solving: coordination, cooperation
• mobile use supports dealing with societal problems, disaster situations: social network is geospatial proximity

THANKS! QUESTIONS?