Flume with Twitter Integration

CS157B - Big Data Management

Date: 03/3/2014

Professor: Thanh Tran

by Swathi Kotturu

ETL Using Flume

What is Flume?

Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data.

Flume and its integration with Hadoop can be used to capture streaming Twitter data, which can be filtered by keywords and locations.

More About Flume

It has a very simple architecture based on streaming data flows.

Flume takes events from a source and passes them through a channel (here, a memory channel) to a sink, which writes them into HDFS. In the Twitter setup below, filtering by keyword is configured on the source.

Flume Agents

Flume can deploy any number of agents. An agent is a container for a Flume data flow. It can run any number of sources, sinks, and channels.

Each agent must have at least one source, one channel, and one sink.

Flume Sources

Sources are not necessarily restricted to log data.

It is possible to use Flume to transport event data such as network traffic data, social-media-generated data, and e-mail messages.

The events can be HTTP POSTs, RPC calls, strings in stdout, and so on.

After an event occurs, Flume sources write the event to a channel as a transaction.
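To make the idea of writing events "as a transaction" concrete, here is a minimal Python sketch. It is not Flume's API: the `SketchChannel` class and the `write_batch` helper are hypothetical names invented for illustration, and the sketch only assumes that a batch of events becomes visible to a sink either in full or not at all.

```python
class SketchChannel:
    """Toy stand-in for a channel: events become visible only on commit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.committed = []  # events a sink could take
        self.staged = []     # events inside the open transaction

    def put(self, event):
        # Stage the event; it is not yet visible to any sink.
        self.staged.append(event)

    def commit(self):
        # Reject the whole batch if it would overflow the channel.
        if len(self.committed) + len(self.staged) > self.capacity:
            self.rollback()
            raise OverflowError("channel full, batch rolled back")
        self.committed.extend(self.staged)
        self.staged = []

    def rollback(self):
        # Discard everything staged in the open transaction.
        self.staged = []


def write_batch(channel, events):
    """Write all events as one transaction; return True on success."""
    try:
        for event in events:
            channel.put(event)
        channel.commit()
        return True
    except OverflowError:
        return False


channel = SketchChannel(capacity=3)
first = write_batch(channel, ["tweet-1", "tweet-2"])
second = write_batch(channel, ["tweet-3", "tweet-4"])  # would exceed capacity
```

In this sketch the second batch is rejected as a whole, so the channel never holds half of a batch.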

Flume Channels

Channels are internal passive stores with specific characteristics. They allow a source and a sink to run asynchronously.

Two Main Types of Channels

Memory Channels

- Volatile channel that buffers events in memory only. If the JVM crashes, all data is lost.

File Channels

- Persistent channel that is stored to disk.
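The way a channel decouples a source from a sink can be sketched in a few lines of Python. This is only an analogy under stated assumptions: Python's standard `queue.Queue` stands in for a memory channel, and `run_pipeline`, `source`, and `sink` are hypothetical names, not Flume components. Like a memory channel, the queue lives only in process memory, so its contents would be lost if the process died.

```python
import queue
import threading


def run_pipeline(events, capacity=2):
    """Move events from a source thread to a sink thread through a bounded buffer."""
    channel = queue.Queue(maxsize=capacity)  # bounded in-memory buffer
    delivered = []
    stop = object()  # sentinel telling the sink that the source is done

    def source():
        for event in events:
            channel.put(event)  # blocks while the buffer is full
        channel.put(stop)

    def sink():
        while True:
            event = channel.get()  # blocks while the buffer is empty
            if event is stop:
                break
            delivered.append(event)

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return delivered


result = run_pipeline(["e1", "e2", "e3", "e4", "e5"])
```

Neither thread calls the other directly; each one only talks to the buffer, which is the property that lets the two sides run at different speeds.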

You can run multiple agents and servers to collect data in parallel.

Get Twitter Access

Flume in Cloudera

Download flume-sources-1.0-SNAPSHOT.jar and add it to the Flume classpath:

http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar

In the Cloudera Manager, you can add the class path:

"Services" -> "flume1" -> "Configuration" -> "Agent(Default)" -> "Advanced" -> "Java Configuration Options for Flume Agent", add:

--classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar


Flume in Cloudera (cont.)

You also have to exclude the pre-installed file that came with Flume by renaming it with a .org suffix. The file is search-contrib-1.0.0-jar-with-dependencies.jar and is in the /usr/lib/flume-ng/lib/ path.

mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org

Using Hue, create a Flume user and give it read and write access in HDFS.

Flume in Cloudera (cont.)

From the Cloudera Manager, go to

"Services" -> "flume1" -> "Configuration" -> "Agent(Default)" -> "Agent Name".

Set the Agent Name to TwitterAgent, the same name used as the prefix in the configuration file.

Flume in Cloudera (cont.)

Also set the Configuration File to the following, and make sure to replace the consumerKey, consumerSecret, accessToken, and accessTokenSecret placeholders with your own values:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = flu, runny nose, tissue, sick, ill, cough

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
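Because the agent name and channel name are repeated across many property keys, a typo in one of them is easy to miss. The following Python sketch shows one way such a file could be sanity-checked before it is pasted into Cloudera Manager. It is not part of Flume: `parse_properties` and `undeclared_channels` are hypothetical helpers, and the check assumes only the key layout shown in the configuration above.

```python
def parse_properties(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props, agent):
    """Return (component, channel) pairs that reference an undeclared channel."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        problems += [(source, ch) for ch in used if ch not in declared]
    for sink in props.get(f"{agent}.sinks", "").split():
        ch = props.get(f"{agent}.sinks.{sink}.channel", "")
        if ch not in declared:
            problems.append((sink, ch))
    return problems


# A shortened copy of the configuration above, enough for the wiring check.
CONFIG = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
"""

props = parse_properties(CONFIG)
problems = undeclared_channels(props, "TwitterAgent")
```

An empty `problems` list means every source and sink points at a channel that the agent declares.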

Restart the Flume agent.

Example Tweet

We loaded raw tweets into HDFS, where they are represented as chunks of JSON.
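As a rough illustration of what reading those JSON chunks involves, here is a short Python sketch. It assumes that each tweet was written as one JSON object per line, which is worth verifying against your own files. The two sample tweets are invented for illustration, `summarize` is a hypothetical helper, and only the `created_at`, `text`, and `user.screen_name` fields of the Twitter JSON format are used.

```python
import json

# Invented sample data: two tweets, one JSON object per line.
SAMPLE = "\n".join([
    '{"created_at": "Mon Mar 03 10:00:00 +0000 2014", '
    '"text": "Stuck in bed with the flu", '
    '"user": {"screen_name": "example_user_1"}}',
    '{"created_at": "Mon Mar 03 10:05:00 +0000 2014", '
    '"text": "Need more tissue", '
    '"user": {"screen_name": "example_user_2"}}',
])


def summarize(lines):
    """Return (screen_name, text) pairs from JSON-per-line tweet data."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        tweet = json.loads(line)
        rows.append((tweet["user"]["screen_name"], tweet["text"]))
    return rows


rows = summarize(SAMPLE.splitlines())
```

The nested `user` object is the reason a plain delimited-text reader is not enough for this data.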

Next Steps

Tell Hive how to read the data.

You will need hive-serdes-1.0-SNAPSHOT.jar:

http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar

This is needed because Hive is set up to read delimited row formats, but in this case it needs to read JSON.

Flume Resources

Learn More

https://dev.twitter.com/docs/streaming-apis/parameters

https://cwiki.apache.org/confluence/display/FLUME/Home

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

Thank you!

Q/A
