Intro to Big Data - Orlando Code Camp 2014


Description

Very high-level introduction to Big Data technologies, with an emphasis on how folks can get started easily.

Transcript of Intro to Big Data - Orlando Code Camp 2014

Page 1

Dipping Your Toes into the Big Data Pool

Orlando CodeCamp 2014

John Ternent

VP Application Development

TravelClick

Page 2

About Me

20+ years as a consultant, software engineer, architect, and tech executive.

Mostly data-focused: RDBMS, object databases, and big data/NoSQL/analytics/data science.

Presently leading development efforts for TravelClick Channel Management team.

Twitter : @jaternent

Page 3

Poll : Big Data

How many people are comfortable with the definition?

How many people are “doing” Big Data?

Page 4

Big Data in the Media

The Four V's of Big Data:

Volume (Scale)
Variety (Forms)
Velocity (Streaming)
Veracity (Uncertainty)

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 5

A New Definition

Big Data is about a tool set and approach that allows for non-linear scalability of solutions to data problems.

“It depends on how capital your B and D are in Big Data…”

What is Big Data to you?

Page 6

The Big Data Ecosystem

Data Sources : Sqoop, Flume

Data Storage : HDFS, HBase

Data Manipulation : Pig, MapReduce

Data Management : Zookeeper, Avro, Oozie

Data Analysis : Hive, Mahout, Impala

Page 7

The Full Hadoop Ecosystem?

Page 8

Great, but What IS Hadoop?

An open-source implementation of Google's MapReduce framework

Distributed processing on commodity hardware

Distributed file system (HDFS) with high failure tolerance

Supports activity directly on top of the distributed file system (MapReduce jobs, Impala, Hive queries, etc.)
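The MapReduce idea behind Hadoop can be sketched in a few lines of plain Python. This is not Hadoop itself, just the shape of the computation: a mapper emits key/value pairs, the framework sorts and shuffles them by key, and a reducer aggregates each key group. A hypothetical word-count example:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Sum the counts for a single key.
    return (word, sum(counts))

def run_job(lines):
    # Simulate the shuffle phase: sort mapper output by key,
    # then hand each key group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(k, (c for _, c in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["big data is big", "data is data"]))
# [('big', 2), ('data', 3), ('is', 2)]
```

In real Hadoop the mapper and reducer run as distributed tasks over HDFS blocks; the local simulation above only illustrates the programming model.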

Page 9

Candidate Architecture

Data Sources : log files, SQL DBs, text feeds, search; structured, unstructured, and semi-structured

HDFS

Data Manipulation : MapReduce, Pig, Hive, Impala

Analytic Products : search, R/SAS, Mahout, SQL Server, DW/data mart

Page 10

Example : Log File Processing

xxx.16.23.133 - - [15/Jul/2013:04:03:01 -0400] "POST /update-channels HTTP/1.1" 500 378 "-" "Zend_Http_Client" 53051 65921 617
- - - - [15/Jul/2013:04:03:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 544 94 590
xxx.16.23.133 - - [15/Jul/2013:04:04:00 -0400] "POST /update-channels HTTP/1.1" 200 104 "-" "Zend_Http_Client" 617786 4587 360
- - - [15/Jul/2013:04:04:02 -0400] "GET /server-status?auto HTTP/1.1" 200 411 "-" "collectd/5.1.0" 568 94 590
- - - [15/Jul/2013:04:05:02 -0400] "GET /server-status?auto HTTP/1.1" 200 412 "-" "collectd/5.1.0" 560 94 591
xxx.16.23.70 - - [15/Jul/2013:04:05:09 -0400] "POST /fetch-channels HTTP/1.1" 200 3718 "-" "-" 452811 536 3975
xxx.16.23.70 - - [15/Jul/2013:04:05:10 -0400] "POST /fetch-channels HTTP/1.1" 200 6598 "-" "-" 333213 536 6855
xxx.16.23.70 - - [15/Jul/2013:04:05:11 -0400] "POST /fetch-channels HTTP/1.1" 200 5533 "-" "-" 282445 536 5790
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 8266 "-" "-" 462575 536 8542
xxx.16.23.70 - - [15/Jul/2013:04:05:12 -0400] "POST /fetch-channels HTTP/1.1" 200 42640 "-" "-" 1773203 536 42916

Page 11

Example : Log File Processing

A = LOAD '/Users/jternent/Documents/logs/api*' USING TextLoader as (line:chararray);
B = FOREACH A GENERATE FLATTEN((tuple(chararray, chararray, chararray, chararray, chararray, int, int, chararray, chararray, int, int, int))REGEX_EXTRACT_ALL(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" (\\d+) (\\d+) (\\d+)')) as (forwarded_ip:chararray, rem_log:chararray, rem_user:chararray, ts:chararray, req_url:chararray, result:int, resp_size:int, referrer:chararray, user_agent:chararray, svc_time:int, rec_bytes:int, resp_bytes:int);
B1 = FILTER B BY ts IS NOT NULL;
B2 = FILTER B BY req_url MATCHES '.*[fetch|update].*';
B3 = FOREACH B2 GENERATE *, REGEX_EXTRACT(req_url, '^\\w+ \\/(\\S+)[\\?]* \\S+', 1) as req;
C = FOREACH B3 GENERATE forwarded_ip, GetMonth(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as month, GetDay(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as day, GetHour(ToDate(ts,'d/MMM/yyyy:HH:mm:ss Z')) as hour, req, result, svc_time;
D = GROUP C BY (month, day, hour, req, result);
E = FOREACH D GENERATE FLATTEN(group), MAX(C.svc_time) as max, MIN(C.svc_time) as min, COUNT(C) as count;
STORE E INTO '/Users/jternent/Documents/logs/ezy-logs-output' USING PigStorage();
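For intuition, the same shape of pipeline (extract fields with a regex, group, and aggregate min/max/count of service time) can be sketched in plain Python. The regex below is a simplified, hypothetical version of the one in the Pig script, capturing only the fields this sketch uses, and `sample` is shortened mock data in the same format:

```python
import re
from collections import defaultdict

# Simplified pattern: timestamp, request, status, and the first trailing
# numeric field (service time) from access-log lines like the ones above.
LOG = re.compile(r'\[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d+) \S+ "[^"]*" "[^"]*" (\d+)')

def aggregate(lines):
    # Group by (request, status) and track min/max/count of service time,
    # like the GROUP ... GENERATE MAX/MIN/COUNT step in the Pig script.
    stats = defaultdict(lambda: [float('inf'), 0, 0])  # [min, max, count]
    for line in lines:
        m = LOG.search(line)
        if not m:
            continue  # analogous to FILTER B BY ts IS NOT NULL
        _, req, status, svc = m.groups()
        s = stats[(req, int(status))]
        svc = int(svc)
        s[0] = min(s[0], svc)
        s[1] = max(s[1], svc)
        s[2] += 1
    return dict(stats)

sample = [
    'x - - [15/Jul/2013:04:03:01 -0400] "POST /u HTTP/1.1" 500 378 "-" "Z" 53051 65921 617',
    'x - - [15/Jul/2013:04:04:00 -0400] "POST /u HTTP/1.1" 200 104 "-" "Z" 617786 4587 360',
]
print(aggregate(sample))
```

The difference, of course, is that Pig compiles this into MapReduce jobs that scale across a cluster, while the sketch runs on one machine.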

Page 12

Another Real-World Example

2013-08-10T04:03:50-04:00 INFO (6): {"eventType":3,"eventTime":"Aug 10, 2013 4:03:50 AM","hotelId":8186,"channelId":9173,"submissionId":1376121011,"sessionId":null,"documentId":"9173SS8186_13761210111434582378cds.txt","queueName":"expedia-dx","roomCount":1,"submissionDayCount":1,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":2,"submissionStatusCode":0}

2013-08-10T04:03:53-04:00 INFO (6): {"eventType":2,"eventTime":"Aug 10, 2013 4:03:53 AM","hotelId":8525,"channelId":50091,"submissionId":1376116653,"sessionId":null,"documentId":"50091SS8525_13761166531434520293cds.txt","queueName":"expedia-dx","roomCount":5,"submissionDayCount":2,"serverName":"orldc-auto-11.ezyield.com","serverLoad":1.18,"queueSize":0,"submissionStatus":1,"submissionStatusCode":null}

Roughly 100 million of these per week. 25MB zipped per server per day (15 servers right now), 750MB uncompressed.
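Each of these lines is a syslog-style prefix followed by a JSON payload, so the standard library is enough to pick them apart before handing them to any Big Data tooling. A minimal sketch (field names taken from the sample lines above; the shortened example line is mock data):

```python
import json

def parse_event(line):
    # Split off the "2013-08-10T... INFO (6): " prefix at the first '{'
    # and parse the remainder as JSON.
    _prefix, _, body = line.partition('{')
    return json.loads('{' + body)

line = ('2013-08-10T04:03:50-04:00 INFO (6): '
        '{"eventType":3,"hotelId":8186,"channelId":9173,'
        '"queueName":"expedia-dx","roomCount":1,"serverLoad":1.18}')
event = parse_event(line)
print(event['queueName'], event['hotelId'])
# expedia-dx 8186
```

At 100 million events a week, the per-line parsing is trivial; the interesting problem is distributing it, which is exactly what the Hadoop toolchain is for.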

Page 13

Pig Example - Pros and Cons

Pros:
Don’t need to ETL into a database; everything runs off the file system
Same development for one file as for 10,000 files
Horizontally scalable
UDFs allow fine-grained control
Flexible

Cons:
Language can be difficult to work with
MapReduce touches ALL the things to get the answer (compare to indexed search)

Page 14

Unstructured and Semi-Structured Data

Big Data tools can help with the analysis of data that would be more challenging in a relational database:
Twitter feeds (Natural Language Processing)
Social network analysis

Big Data approaches to search are making search tools more accessible and useful than ever:
ElasticSearch

Page 15

ElasticSearch/Kibana

[Diagram] JSON documents arrive over REST and logs flow in via logstash; both are indexed in ElasticSearch (optionally backed by the Hadoop file system), and Kibana visualizes the results.

Page 16

Analytics with Big Data

Apache Mahout : machine learning on Hadoop (recommendation, classification, clustering)

RHadoop : R MapReduce implementation on HDFS

Tableau : visualization on HDFS/Hive

Main point : you don’t have to roll your own for everything; many tools now use HDFS natively.

Page 17

Return to SQL

Many SQL dialects are being/have been ported to Hadoop

Hive : create DDL tables on top of HDFS structures

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?")
STORED AS TEXTFILE;

SELECT host, COUNT(*)
FROM apachelog
GROUP BY host;
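The query itself is ordinary SQL; Hive's job is to run it as MapReduce over files in HDFS. You can sanity-check the same aggregation shape locally, for example with SQLite as an in-memory stand-in (the host values below are made up; SQLite is not part of the Hadoop stack):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE apachelog (host TEXT, status TEXT)')
conn.executemany('INSERT INTO apachelog VALUES (?, ?)',
                 [('10.0.0.1', '200'), ('10.0.0.1', '500'), ('10.0.0.2', '200')])

# Same shape as the Hive query: request count per host.
rows = conn.execute(
    'SELECT host, COUNT(*) FROM apachelog GROUP BY host ORDER BY host').fetchall()
print(rows)
# [('10.0.0.1', 2), ('10.0.0.2', 1)]
```

The payoff of Hive is that the identical GROUP BY works unchanged when `apachelog` is terabytes of raw text files rather than rows you loaded by hand.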

Page 18

Cloudera Impala

Moves SQL processing onto each distributed node

Written for performance

Distribution and reduction of the query handled by the Impala engine

Page 19

Big Data Tradeoffs

Time tradeoff – loading/building/indexing vs. runtime

ACID properties – different distribution models may compromise one or more of these properties

Be aware of what tradeoffs you’re making

TANSTAAFL – massive scalability, commodity hardware, but at what price?

Tool sophistication

Page 20

NoSQL – “Not Only SQL”

Sacrificing ACID properties for different scalability benefits:
Key/Value Store : SimpleDB, Riak, Redis
Column Family Store : Cassandra, HBase
Document Database : CouchDB, MongoDB
Graph Database : Neo4J

General properties:
High horizontal scalability
Fast access
Simple data structures
Caching

Page 21

Getting Started

Play in the sandbox – Hadoop/Hive/Pig local mode or AWS. Randy Zwitch has a great tutorial on this :

http://randyzwitch.com/big-data-hadoop-amazon-ec2-cloudera-part-1/

Using Airline data : http://stat-computing.org/dataexpo/2009/the-data.html

Kaggle competitions (data science)

Lots of big data sets are available; look for machine learning repositories

Page 22

Getting Started

Books for Developers

Books for Managers

Page 23

MOOCs

Unprecedented access to very high-quality online courses, including

Udacity : Data Science Track (Intro to Data Science; Data Wrangling with MongoDB; Intro to Hadoop and MapReduce)

Coursera : Machine Learning course; Data Science Certificate Track (R, Python)

Waikato University : Weka

Page 24

Bonus Round : Data Science

Page 25

Outro

We live in exciting times!

Confluence of data, processing power, and algorithmic sophistication.

More data is available to make better decisions more easily than at any other time in human history.