Intro to Big Data - Orlando Code Camp 2014


Description

Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.

Transcript of Intro to Big Data - Orlando Code Camp 2014

Page 1

Dipping Your Toes into the Big Data Pool

Orlando CodeCamp 2014

John Ternent

VP Application Development

TravelClick

Page 2

About Me

20+ years as a consultant, software engineer, architect, and tech executive.

Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.

Presently leading development efforts for TravelClick Channel Management team.

Twitter : @jaternent

Page 3

Poll : Big Data

How many people are comfortable with the definition?

How many people are “doing” Big Data?

Page 4

Big Data in the Media

The Four V's of Big Data:

Volume (Scale)
Variety (Forms)
Velocity (Streaming)
Veracity (Uncertainty)

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 5

A New Definition

Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.

“It depends on how capital your B and D are in Big Data…”

What is Big Data to you?

Page 6

The Big Data Ecosystem

Data Sources : Sqoop, Flume

Data Storage : HDFS, HBase

Data Manipulation : Pig, MapReduce

Data Management : Zookeeper, Avro, Oozie

Data Analysis : Hive, Mahout, Impala

Page 7

The Full Hadoop Ecosystem?

Page 8

Great, but What IS Hadoop?

An open-source implementation of Google's MapReduce framework

Distributed processing on commodity hardware

Distributed file system (HDFS) with high failure tolerance

Supports activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
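The MapReduce idea behind Hadoop can be sketched in a few lines of plain Python. This is not Hadoop itself, just the shape of the computation: a mapper emits key/value pairs, the framework sorts and shuffles them by key, and a reducer aggregates each key group. A hypothetical word-count example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the counts for a single key.
    return (word, sum(counts))

def run_job(lines):
    # Simulate the shuffle phase: sort mapper output by key,
    # then hand each key group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(k, (c for _, c in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["big data is big", "data is data"]))
# [('big', 2), ('data', 3), ('is', 2)]
```

In real Hadoop the mapper and reducer run as distributed tasks over HDFS blocks; the local simulation above only illustrates the programming model.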

Page 9

Candidate Architecture

Data Sources : log files, SQL DBs, text feeds, search; structured, unstructured, and semi-structured

HDFS

Data Manipulation : MapReduce, Pig, Hive, Impala

Analytic Products : search, R/SAS, Mahout, SQL Server, DW/data mart

Page 10

Example : Log File Processing

xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916

Page 11

Example : Log File Processing

A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray);
B = FOREACH A GENERATE FLATTEN((tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)')) as (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray, req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ \\/(\\S+)[\\?]* \\S+', 1) as req;
C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as day, GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as hour, req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min, COUNT(C) as count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
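For intuition, the same shape of pipeline (extract fields with a regex, group, and aggregate min/max/count of service time) can be sketched in plain Python. The regex below is a simplified, hypothetical version of the one in the Pig script, capturing only the fields this sketch uses, and `sample` is shortened mock data in the same format:

```python
import re
from collections import defaultdict

# Simplified pattern: timestamp, request, status, and the first trailing
# numeric field (service time) from access-log lines like the ones above.
LOG = re.compile(r'\[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d+) \S+ "[^"]*" "[^"]*" (\d+)')

def aggregate(lines):
    # Group by (request, status) and track min/max/count of service time,
    # like the GROUP ... GENERATE MAX/MIN/COUNT step in the Pig script.
    stats = defaultdict(lambda: [float('inf'), 0, 0])  # [min, max, count]
    for line in lines:
        m = LOG.search(line)
        if not m:
            continue  # analogous to FILTER B BY ts IS NOT NULL
        _, req, status, svc = m.groups()
        s = stats[(req, int(status))]
        svc = int(svc)
        s[0] = min(s[0], svc)
        s[1] = max(s[1], svc)
        s[2] += 1
    return dict(stats)

sample = [
    'x - - [15/Jul/2013:04:03:01 -0400] "POST /u HTTP/1.1" 500 378 "-" "Z" 53051 65921 617',
    'x - - [15/Jul/2013:04:04:00 -0400] "POST /u HTTP/1.1" 200 104 "-" "Z" 617786 4587 360',
]
print(aggregate(sample))
```

The difference, of course, is that Pig compiles this into MapReduce jobs that scale across a cluster, while the sketch runs on one machine.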

Page 12

Another Real-World Example

2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}

2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}

Roughly 100 million of these per week. 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
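Each of these lines is a syslog-style prefix followed by a JSON payload, so the standard library is enough to pick them apart before handing them to any Big Data tooling. A minimal sketch (field names taken from the sample lines above; the shortened example line is mock data):

```python
import json

def parse_event(line):
    # Split off the "2013-08-10T... INFO (6): " prefix at the first '{'
    # and parse the remainder as JSON.
    _prefix, _, body = line.partition('{')
    return json.loads('{' + body)

line = ('2013-08-10T04:03:50-04:00 INFO (6): '
        '{"eventType":3,"hotelId":8186,"channelId":9173,'
        '"queueName":"expedia-dx","roomCount":1,"serverLoad":1.18}')
event = parse_event(line)
print(event['queueName'], event['hotelId'])
# expedia-dx 8186
```

At 100 million events a week, the per-line parsing is trivial; the interesting problem is distributing it, which is exactly what the Hadoop toolchain is for.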

Page 13

Pig Example - Pros and Cons

Pros:
Don’t need to ETL into a database; everything runs off the file system
Same development for one file as for 10,000 files
Horizontally scalable
UDFs allow fine-grained control
Flexible

Cons:
Language can be difficult to work with
MapReduce touches ALL the things to get the answer (compare to indexed search)

Page 14

Unstructured and Semi-Structured Data

Big Data tools can help with the analysis of data that would be more challenging in a relational database:
Twitter feeds (Natural Language Processing)
Social network analysis

Big Data approaches to search are making search tools more accessible and useful than ever:
ElasticSearch

Page 15

ElasticSearch/Kibana

[Diagram] JSON documents arrive over REST and logs flow in via logstash; both are indexed in ElasticSearch (optionally backed by the Hadoop file system), and Kibana visualizes the results.

Page 16

Analytics with Big Data

Apache Mahout : machine learning on Hadoop (recommendation, classification, clustering)

RHadoop : R MapReduce implementation on HDFS

Tableau : visualization on HDFS/Hive

Main point : you don’t have to roll your own for everything; many tools now use HDFS natively.

Page 17

Return to SQL

Many SQL dialects are being/have been ported to Hadoop

Hive : create DDL tables on top of HDFS structures

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?")
STORED AS TEXTFILE;

SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
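The query itself is ordinary SQL; Hive's job is to run it as MapReduce over files in HDFS. You can sanity-check the same aggregation shape locally, for example with SQLite as an in-memory stand-in (the host values below are made up; SQLite is not part of the Hadoop stack):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE apachelog (host TEXT, status TEXT)')
conn.executemany('INSERT INTO apachelog VALUES (?, ?)',
                 [('10.0.0.1', '200'), ('10.0.0.1', '500'), ('10.0.0.2', '200')])

# Same shape as the Hive query: request count per host.
rows = conn.execute(
    'SELECT host, COUNT(*) FROM apachelog GROUP BY host ORDER BY host').fetchall()
print(rows)
# [('10.0.0.1', 2), ('10.0.0.2', 1)]
```

The payoff of Hive is that the identical GROUP BY works unchanged when `apachelog` is terabytes of raw text files rather than rows you loaded by hand.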

Page 18

Cloudera Impala

Moves SQL processing onto each distributed node

Written for performance

Distribution and reduction of the query handled by the Impala engine

Page 19

Big Data Tradeoffs

Time tradeoff – loading/building/indexing vs. runtime

ACID properties – different distribution models may compromise one or more of these properties

Be aware of what tradeoffs you’re making

TANSTAAFL – massive scalability, commodity hardware, but at what price?

Tool sophistication

Page 20

NoSQL – “Not Only SQL”

Sacrificing ACID properties for different scalability benefits:
Key/Value Store : SimpleDB, Riak, Redis
Column Family Store : Cassandra, HBase
Document Database : CouchDB, MongoDB
Graph Database : Neo4J

General properties:
High horizontal scalability
Fast access
Simple data structures
Caching

Page 21

Getting Started

Play in the sandbox – Hadoop/Hive/Pig local mode or AWS. Randy Zwitch has a great tutorial on this :

http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/

Using Airline data : http://stat-computing.org/dataexpo/2009/the-data.html

Kaggle competitions (data science)

Lots of big data sets are available; look for machine learning repositories

Page 22

Getting Started

Books for Developers

Books for Managers

Page 23

MOOCs

Unprecedented access to very high-quality online courses, including

Udacity : Data Science Track (Intro to Data Science; Data Wrangling with MongoDB; Intro to Hadoop and MapReduce)

Coursera : Machine Learning course; Data Science Certificate Track (R, Python)

Waikato University : Weka

Page 24

Bonus Round : Data Science

Page 25

Outro

We live in exciting times!

Confluence of data, processing power, and algorithmic sophistication.

More data is available to make better decisions more easily than at any other time in human history.