Processing Twitter Data with MongoDB
Xiaoxiao Liu
Source: cis.csuohio.edu/~sschung/cis612/612Presentation_Liu_Corrected.pdf

Page 1

Processing Twitter Data with MongoDB

Xiaoxiao Liu

Page 2

Issue with Facebook Data

● Originally, I planned to do this project with Facebook data.

- Facebook Graph API

- Third-party Java library: restFB

● I was interested in doing social network analysis, so the information I needed included user profiles, the users' friends' information, and the relationships between these users.

Page 3

However.....

Limitation of Graph API:

As stated by Facebook: “This will only return any friends who have used (via Facebook Login) the app making the request.”

(In this case, "the app" is my application making the Graph API requests.)

Page 4

Only one friend showed up :(

Page 5

Only myself showed up!

Page 6

[Diagram: a User node fanning out to Friend 1, Friend 2, …, Friend n; requesting each friend's friends ("friends of friends") raises an authorization exception]

Page 7

What else can I do?

● Twitter!

- Mid-term election

- Tweets related to voting

Page 8

Data Source: Twitter

● Twitter Rest APIs

- The REST APIs provide programmatic access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON.

– Rate limits:

- Search is rate-limited to 180 queries per 15-minute window for the time being, but Twitter may adjust that over time.

Page 9

The Search API

● The Twitter Search API is part of Twitter's v1.1 REST API. It allows queries against the indices of recent or popular Tweets, and behaves similarly to, but not exactly like, the Search feature available in Twitter's mobile or web clients.

Page 10

● Geolocation:

The search operator "near" isn't available in the API, but there is a more precise way to restrict a query to a given location: the geocode parameter, specified with the template "latitude,longitude,radius".
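As a small sketch of assembling that template (the coordinates below are made-up examples, and the helper is mine, not part of any library):

```java
import java.util.Locale;

public class GeocodeParam {
    // Builds the "latitude,longitude,radius" template used by the
    // Search API's geocode parameter. The radius unit must be "mi" or "km".
    static String geocode(double lat, double lon, double radius, String unit) {
        if (!unit.equals("mi") && !unit.equals("km")) {
            throw new IllegalArgumentException("radius unit must be mi or km");
        }
        return String.format(Locale.US, "%.4f,%.4f,%.1f%s", lat, lon, radius, unit);
    }

    public static void main(String[] args) {
        // Hypothetical coordinates near Cleveland, OH with a 25-mile radius.
        System.out.println(geocode(41.4993, -81.6944, 25, "mi"));
    }
}
```

The resulting string (e.g. "41.4993,-81.6944,25.0mi") is passed as the geocode query parameter.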

Page 11

Twitter4J

I used a third-party Java library called Twitter4J. This library makes it easier to integrate a Java application with the Twitter service.

To use this library, simply download it and add the .jar file to the classpath.

Page 12

● QueryString: the search keyword

● QueryDate: search for Tweets sent on a certain day

● Report back how many Tweets were gathered
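Twitter's search syntax supports since: and until: date operators, so the two inputs above can be combined into one query string. A minimal sketch (the keyword and dates are illustrative; in Twitter4J the string would go into a twitter4j.Query before calling search()):

```java
public class TweetQuery {
    // Combines a keyword with Twitter's since:/until: search operators to
    // restrict results to Tweets sent on one day. Dates are YYYY-MM-DD;
    // until: is exclusive, so one day spans date D to D+1.
    static String build(String keyword, String sinceDate, String untilDate) {
        return keyword + " since:" + sinceDate + " until:" + untilDate;
    }

    public static void main(String[] args) {
        // e.g. Tweets containing "governor" sent on 2014-11-03
        System.out.println(build("governor", "2014-11-03", "2014-11-04"));
    }
}
```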

Page 13

Search Keywords

● 11/1/2014 – 11/4/2014 (Election Day)

- quinn (the Democratic candidate's last name)

- rauner (the Republican candidate's last name)

- democrat

- republican

- governor

● 11/3/2014 – 11/4/2014 (Election Day)

- election

Page 14

I stored the data in a txt file with a weird format

Page 15

Why MongoDB?

● My needs:

My input data is basically tweets.

I need to run word count.

I need to query the tweets with different keywords.

I do not want to separate one tweet into several columns.

● MongoDB is great for modeling many kinds of entities:

Form data: MongoDB makes it easy to evolve the structure of form data over time

Blogs / user-generated content: can keep data with complex relationships together in one object

Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas

System configuration: just a nice object graph of configuration values, which is very natural in MongoDB

Log data of any kind: structured log data is the future

Graphs: just objects and pointers – a perfect fit

Location based data: MongoDB understands geo-spatial coordinates and natively supports geo-spatial indexing

Page 16

MongoDB

● Document-Oriented Storage

JSON-style documents with dynamic schemas offer simplicity and power.

● Full Index Support

Index on any attribute, just like you're used to.

● Querying

Rich, document-based queries.

● Map/Reduce

Flexible aggregation and data processing.

Page 17

- I wanted to re-run my Java code to gather the tweets again, this time storing them in JSON format.

- Unfortunately, it did not work out: "You cannot use the Search API to find Tweets older than about a week."

- So I wrote another Java application to convert the txt file to a JSON file.
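As a sketch of such a converter (the original txt layout isn't shown, so this assumes a hypothetical tab-separated user<TAB>tweet line format), each input line becomes one JSON document:

```java
public class TxtToJson {
    // Escapes the characters that must be escaped inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    // Converts one tab-separated "user<TAB>tweet" line (a hypothetical
    // format; the real file's layout isn't shown) into a JSON document.
    static String toJson(String line) {
        String[] parts = line.split("\t", 2);
        return "{\"user_name\": \"" + escape(parts[0])
             + "\", \"tweet\": \"" + escape(parts[1]) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("xyz\twhatever tweet text"));
    }
}
```

Writing one JSON object per line is convenient here because mongoimport accepts exactly that line-delimited format.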

Page 18

{'user_name': 'xyz', 'tweet': 'whatever tweet text'}

Page 19

Import data into MongoDB:

mongoimport --db mydb --collection tweets --file tweets.json

Page 20

{“user_name”: “xyz”, “tweet”: “whatever tweet text”}

Page 21

● Run mongo shell

● Structure/Schema

Page 22

● Run MapReduce to count words
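MongoDB's mapReduce emits a (word, 1) pair per token and then sums the values per key. The same logic, sketched in plain Java (the whitespace tokenization is a simplification; the actual map/reduce functions aren't shown in the slides):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Mirrors the map step (emit one (word, 1) per lowercased token)
    // and the reduce step (sum the emitted values per key).
    static Map<String, Integer> count(String[] tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            for (String word : tweet.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tweets = { "Vote for Quinn", "vote democrat vote" };
        System.out.println(count(tweets).get("vote"));  // 3
    }
}
```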

Page 23

Relevant keywords:
- voting
- vote
- wage
- citizens
- #democrats
- #politics
- #rockthevote

Possibly relevant keywords:
- shit
- stupid
- protect
- fuck

Page 24

Interesting Finds

● Robert Quinn kisses his bicep after that quarterback sack.

(keywords: bicep, quarterback)

Page 25

Interesting Finds

● @EliseStefanik REPUBLICAN WOMEN Set to Make History Tonight http://t.co/eQOWGBznv8 via @gatewaypundit @JoniErnst @EliseStefanik @MiaBLo…

Page 26

● @m_silverberg

-Wifi for media at the Bruce Rauner party is $50 a pop...

-Every TV station in Illinois about to go live at 5 from Bruce Rauner's election night party.

Page 27

User Who Sent the Most Tweets

● Code

● Result

Relevant users:

@grammy620: Vote for @JeanneShaheen and this will continue! http://t.co/AleJxTqS1n CLOSE OUR BORDERS! #NHsen Stop the Obama Agenda

@DJGalaxieIL: Vote for Quinn tomorrow!!!!!!!!!!!!!!!!!!!

@Williamjkelly: @progressIL Why I'm NOT drinking the Rauner Kool-Aid http://t.co/kI0H0ohlSN

@haydeevilma06: RT @FitzGeraldForOH: You're ready to vote, and we're ready to help you find out where! http://t.co/iOj3wFnf3I
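Finding the most active user is a group-and-count over the user_name field (in MongoDB this would typically be an aggregation; the slide's actual code isn't shown). The plain-Java equivalent, with made-up handles:

```java
import java.util.HashMap;
import java.util.Map;

public class TopUser {
    // Groups tweets by author and returns the author with the most
    // tweets, i.e. the same group-and-count a MongoDB aggregation does.
    static String mostActive(String[] userNames) {
        Map<String, Integer> perUser = new HashMap<>();
        for (String user : userNames) {
            perUser.merge(user, 1, Integer::sum);
        }
        String best = null;
        for (Map.Entry<String, Integer> e : perUser.entrySet()) {
            if (best == null || e.getValue() > perUser.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] authors = { "@a", "@b", "@a", "@c", "@a" };
        System.out.println(mostActive(authors));  // @a
    }
}
```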

Page 28

Word Count for Keyword "democrat"

● code

Page 29

● Result

Page 30

Word Count for Keyword “republican”

● Result

@Tigerfists88: Pres. ✰#Obama Brings The Jobless Rate From 10.1% to 5.9% despite republican obstacles http://t.co/852t9ANDF1 #TheyMad #news #p2 #TFB Obama

Page 31

The “Big Data” Ecosystem at LinkedIn

Roshan Sumbaly, Jay Kreps, and Sam Shah

Page 32

● This paper describes the systems that engender effortless ingress and egress out of the Hadoop system and presents case studies of how data mining applications are built at LinkedIn.

● Kafka, Azkaban

● Ingress, egress

Page 33

● For egress, three main mechanisms are necessary:

– 70% is key-value access – Voldemort

– 20% is stream-oriented access – Kafka

– Multidimensional or OLAP access – Avatara

Given the high velocity of feature development and the difficulty in accurately gauging capacity needs, these systems are all horizontally scalable.

These systems are run as a multitenant service where no real stringent capacity planning needs to be done: rebalancing data is a relatively cheap operation, engendering rapid capacity changes as needs arise.

Page 34

Page 35

Ingress

● Kafka is a distributed publish-subscribe system that persists messages in a write-ahead log, partitioned and distributed over multiple brokers.

● It allows data publishers to add records to a log.

● Each of these logs is referred to as a topic.

● Example: search. The search service would produce these records and publish them to a topic named "SearchQueries", where any number of subscribers might read these messages.

● All Kafka topics support multiple subscribers, as it is common to have many different subscribing systems.

● Kafka supports distributing data consumption within each of these subscribing systems, because many of these feeds are too large to be handled by a single machine.
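A toy sketch of the topic/offset model described above (an in-memory stand-in of my own, not the Kafka API): each topic is an append-only log, and each subscriber tracks its own read offset, so many consumers can read the same topic independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopicLog {
    // Append-only log per topic, as in Kafka's write-ahead-log model.
    private final Map<String, List<String>> topics = new HashMap<>();
    // Each (topic, subscriber) pair keeps its own read offset.
    private final Map<String, Integer> offsets = new HashMap<>();

    void publish(String topic, String record) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(record);
    }

    // Returns the next unread record for this subscriber, or null.
    String poll(String topic, String subscriber) {
        List<String> log = topics.getOrDefault(topic, new ArrayList<>());
        String key = topic + "/" + subscriber;
        int offset = offsets.getOrDefault(key, 0);
        if (offset >= log.size()) return null;
        offsets.put(key, offset + 1);
        return log.get(offset);
    }

    public static void main(String[] args) {
        TopicLog broker = new TopicLog();
        broker.publish("SearchQueries", "q1");
        broker.publish("SearchQueries", "q2");
        // Two independent subscribers each see the full stream.
        System.out.println(broker.poll("SearchQueries", "hadoop-etl"));  // q1
        System.out.println(broker.poll("SearchQueries", "monitoring"));  // q1
    }
}
```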

Page 36

Ingress: Data Evolution

● Two solutions:

1. Simply load the data streams in whatever form they appear.

2. Manually map the source data into a stable, well-thought-out schema and perform whatever transformations are necessary to support it.

● LinkedIn's solution:

retain the same structure throughout the data pipeline and enforce compatibility and other correctness conventions on changes to this structure.

– Maintain a schema for each topic in a single consolidated schema registry.

– If data is published to a topic with an incompatible schema, it is rejected.

– If it is published with a new backwards-compatible schema, it evolves automatically.

– Each schema also goes through a review process to help ensure consistency with the rest of the activity data model.
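A toy illustration of the backwards-compatibility rule (a simplification of what such a schema registry would check, not LinkedIn's actual logic, and the field names are invented): a proposed schema is compatible if every field of the registered schema is still present; it may add new fields.

```java
import java.util.Set;

public class SchemaRegistryCheck {
    // A proposed schema is accepted as backwards compatible when it
    // still contains every field of the currently registered schema.
    // Simplified: real registries also check types and defaults.
    static boolean isBackwardsCompatible(Set<String> registered, Set<String> proposed) {
        return proposed.containsAll(registered);
    }

    public static void main(String[] args) {
        Set<String> current = Set.of("memberId", "query");
        // Adding a field evolves automatically...
        System.out.println(isBackwardsCompatible(current, Set.of("memberId", "query", "pageKey")));  // true
        // ...but dropping one would be rejected.
        System.out.println(isBackwardsCompatible(current, Set.of("memberId")));  // false
    }
}
```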

Page 37

Ingress: Hadoop Load

● The activity data generated and stored on Kafka is pulled into Hadoop using a map-only job that runs every 10 minutes on a dedicated ETL Hadoop cluster as part of an Azkaban workflow.

● First, the job reads the Kafka log offsets and checks for any new topics.

● Then, it starts a fixed number of mapper tasks to pull the data into HDFS partition files, and finally registers it with LinkedIn's various systems.

● The ETL workflow also runs an aggregator job every day to combine and deduplicate the data saved throughout the day into another HDFS location, and runs predefined retention policies on a per-topic basis. (This combining and cleanup prevents having many small files.)

Page 38

Egress

● The results of workflows are usually pushed to other systems, either back for online serving or as derived data sets for further consumption.

● The workflows append an extra job at the end of their pipeline for data delivery out of Hadoop.

Page 39

Egress: Key-Value

● Voldemort is a distributed key-value store, akin to Amazon's Dynamo, with a simple get(key) and put(key, value) interface.

● Tuples are grouped together into logical stores.

● Each key is replicated to multiple nodes depending on the preconfigured replication factor of its corresponding store.

● Every node is further split into logical partitions.
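A toy sketch of that replication rule (hash-based placement with made-up node counts; not Voldemort's actual routing code): a key hashes to a starting node, and the store's replication factor determines how many successive nodes on the ring hold it.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyPlacement {
    // Picks the nodes responsible for a key: hash to a starting node,
    // then take `replicationFactor` successive nodes on the ring.
    static List<Integer> nodesFor(String key, int nodeCount, int replicationFactor) {
        int start = Math.floorMod(key.hashCode(), nodeCount);
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            nodes.add((start + i) % nodeCount);
        }
        return nodes;
    }

    public static void main(String[] args) {
        // 8 nodes, replication factor 3: each key lives on 3 distinct nodes.
        System.out.println(nodesFor("member:42", 8, 3));
    }
}
```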

Page 40

Egress: Stream

● The ability to publish data to Kafka is implemented as a Hadoop OutputFormat.

● Each MapReduce slot acts as a Kafka producer that emits messages, throttling as necessary to avoid overwhelming the Kafka brokers.

● As Kafka is a pull-based queue, the consuming application can read messages at its own pace.

Page 41

Egress: OLAP

● Avatara is a system that moves cube generation to a high-throughput offline system and query serving to a low-latency system.

● By separating the two systems, some freshness of data is lost, but each can be scaled independently.

● This independence also shields the query layer from the performance impact of concurrent cube computation.

Page 42

Applications

● Key-value

– People You May Know

– Collaborative Filtering

– Skill Endorsements

– Related Searches

Page 43

Applications

● Stream

– News Feed Updates

– Email

– Relationship Strength

Page 44

Applications

● OLAP

– Who's Viewed My Profile?

– Who's Viewed This Job?

Page 45

Thank you!