Big Data Overview Part 1

Big Data Overview – Part 1 Wm. Barrett Simms [email protected] @wbsimms

description

Big Data has become the new buzzword like “Agile” and “Cloud”. Like those two others, it’s a transformative technology. We’ll be discussing:

• What is it?
• Technology key words
• HDFS
• Hadoop
• MapReduce

This will be part 1 of 2 (at least). This first talk will not be overly technical. We’ll go over the concepts and terms you’ll encounter when considering a big data solution.

Transcript of Big Data Overview Part 1

Page 1: Big Data Overview Part 1

Big Data Overview – Part 1

Wm. Barrett Simms

[email protected]

@wbsimms

Page 2: Big Data Overview Part 1

Opening remarks

• Sponsors
  • Pluralsight
    • Free month gift card giveaway. Enter your name in the pot!
  • DevExpress
    • $250 in developer JustCode tools.
  • O’Reilly
    • Book giveaway. Enter your name in the pot!

• Boston Code Camp 22 (November 22nd)
  • http://www.bostoncodecamp.com/

• Thanks to 3thought for the space

Page 3: Big Data Overview Part 1

About Me

Software Developer

Agile Team Member

Team Lead

Agile Advocate

SDLC Implementer

Page 4: Big Data Overview Part 1

SDLC

Page 5: Big Data Overview Part 1

Big Data

“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”

- Wikipedia

Page 6: Big Data Overview Part 1

The 3 Vs

• Volume
  • A few gigabytes to petabytes

• Velocity
  • Arrives quickly

• Variety
  • Multiple sources

Page 7: Big Data Overview Part 1

Volume

• Traditional SQL architectures don’t scale to very large data volumes

• Maybe this isn’t so true… but the MPP (massively parallel processing) systems are expensive

Page 8: Big Data Overview Part 1

An example problem (Volume)

• You own a chain of stores

• … with 25,000 stores and 100,000 POS systems

• Need information on inventory changes
  • By region
  • By store

Page 9: Big Data Overview Part 1

Velocity

• Traditional solutions don’t handle fast inbound data

• Maybe this isn’t so true…but you lose data.

Page 10: Big Data Overview Part 1

Another example (Velocity)

• You host a website

• … on 10,000 servers

• Monitor logs for errors

Page 11: Big Data Overview Part 1

Variety

• Most traditional solutions don’t handle a variety of data types well

• Maybe this isn’t so true…But you need to write a custom importer for every type.

Page 12: Big Data Overview Part 1

A final example (Variety)

• You own a business

• With sales and marketing teams

• … in different regions around the world

• Correlate sales numbers against marketing expenses

Page 13: Big Data Overview Part 1

The First Problem : Computing Power

[Diagram: five rows, each showing the steps First, Second, Third]

Limited by cores (Scaling up)

Page 14: Big Data Overview Part 1

Solution: Scale out (not up!)

Server 1 Server 2

Server 3 Server 4

Coordinator

Page 15: Big Data Overview Part 1

Coordination

Job Coordinator

Runner

Runner

Runner
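A rough sketch of the coordinator/runner idea, assuming a single machine where threads stand in for the separate servers. The names `coordinate` and `runner` are made up for illustration, and this is not how Hadoop actually schedules jobs:

```python
from concurrent.futures import ThreadPoolExecutor


def runner(chunk):
    """Hypothetical runner: does its share of the work (here, a sum)."""
    return sum(chunk)


def coordinate(data, runner_count=3):
    """Hypothetical job coordinator: split the data, hand one chunk to
    each runner, then combine the partial results."""
    # Deal the items out round-robin so every runner gets a similar share.
    chunks = [data[i::runner_count] for i in range(runner_count)]
    with ThreadPoolExecutor(max_workers=runner_count) as pool:
        partials = list(pool.map(runner, chunks))
    return sum(partials)


total = coordinate(list(range(1, 101)))
```

The coordinator never does the heavy work itself. It only splits, dispatches, and combines, which is why adding runners (scaling out) raises capacity.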

Page 16: Big Data Overview Part 1

MapReduce

• A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. – Wikipedia

WHAT?

Page 17: Big Data Overview Part 1

Map and Reduce

• Map
  • Process data, returning key/value pairs

• Reduce
  • Aggregate/filter key/value pairs into a result

[Diagram: Data -> Map (two parallel mappers) -> Reduce -> Result]
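The two phases can be sketched in plain Python. This is an in-memory toy under stated assumptions, not Hadoop's API: `map_reduce`, `emit_words`, and `count` are hypothetical names, and the word-count data is invented for illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn over every record, group the emitted pairs by key,
    then call reduce_fn once per key."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def emit_words(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def count(key, values):
    """Reduce: add up all the values collected for one key."""
    return sum(values)


word_counts = map_reduce(["big data", "Big cluster"], emit_words, count)
```

On a real cluster the grouping step in the middle is done by the framework, so you only write the map and reduce functions.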

Page 18: Big Data Overview Part 1

Mapping

• Easy example

• Store Sales
  • Find most sales per store in 2010

Year   Month   Store Id   SalesTotal
2010   1       13         1,000
2010   3       43         12,000
2010   3       21         21,000
2010   4       13         3,000
2010   2       56         4,000
2010   6       32         12,000
2010   7       1          4,000
2010   2       23         2,000

Page 19: Big Data Overview Part 1

Solution – Map

1. Mapper feeds document rows to your program

2. You return key value pairs

StoreId Sales

21 2,000

23 3,000

2 1,000

21 23,000
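A minimal sketch of what "your program" might look like for the map step, assuming each row arrives as a whitespace-separated line of year, month, store id, and sales total. The function name `map_sales` and the row format are assumptions for illustration:

```python
def map_sales(row):
    """Hypothetical mapper: take one 'Year Month StoreId SalesTotal' row
    and emit a (store_id, sales) pair, for 2010 rows only."""
    year, _month, store_id, sales_total = row.split()
    if year == "2010":
        # Strip the thousands separator before converting to a number.
        yield int(store_id), int(sales_total.replace(",", ""))


rows = [
    "2010 1 13 1,000",
    "2010 3 43 12,000",
    "2010 3 21 21,000",
    "2010 4 13 3,000",
]
pairs = [pair for row in rows for pair in map_sales(row)]
```

The mapper sees one row at a time and keeps no state, which is what lets the framework run many copies of it in parallel.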

Page 20: Big Data Overview Part 1

Solution - Reduce

• Data is merged
  • Merged into Key/Values:

{21, [2,000, 23,000]}

{23, [3,000]}

{2, [1,000]}

• You process each row
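A sketch of the reduce step over the merged rows above. It assumes "most sales" means the largest single sale per store; swap `max` for `sum` if a total is wanted. The name `reduce_sales` is hypothetical:

```python
def reduce_sales(store_id, sales):
    """Hypothetical reducer: keep the largest sale seen for one store."""
    return store_id, max(sales)


# The merged key/value rows from the slide.
merged = {21: [2000, 23000], 23: [3000], 2: [1000]}
results = dict(
    reduce_sales(store_id, sales) for store_id, sales in merged.items()
)
```

Each call handles one key with all of its values, so different keys can be reduced on different machines.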

Page 21: Big Data Overview Part 1

Data Access

• Each process needs access to data

[Diagram: two panels, “Typical” and “Desired”]

Page 22: Big Data Overview Part 1

HDFS

• Hadoop Distributed File System
  • Open-source implementation modeled on the Google File System (GFS)

Hard drives last about 1,000 days. So, if you have 1K hard drives, you’ll lose one per day.
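The arithmetic behind that claim, as a back-of-the-envelope sketch. It assumes failures are spread evenly over each drive's lifetime, which is a simplification, and the function name is made up:

```python
def expected_failures_per_day(drive_count, lifetime_days=1000):
    """Average drive failures per day, assuming failures are spread
    evenly over each drive's lifetime (a simplification)."""
    return drive_count / lifetime_days


per_day = expected_failures_per_day(1000)
per_day_large = expected_failures_per_day(10_000)
```

With 1,000 drives the average works out to one failure per day, and it grows linearly with the size of the cluster. That is why the file system has to treat drive loss as routine.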

Page 23: Big Data Overview Part 1

The ecosystem

• Hive
  • SQL-like query language
  • Define and enforce schema

• Pig
  • Data-flow scripting language (Pig Latin)

• Sqoop
  • SQL/Hadoop integration

• Oozie
  • Scheduling

• Mahout
  • Machine Learning interface

• Storm
  • Stream-based MapReduce

… and many others

Page 24: Big Data Overview Part 1

Vendors

• Hortonworks
  • Single click install of Sandbox

• Cloudera
  • Downloadable VM

• Syncfusion
  • Single click install of Syncfusion Big Data

• Amazon AWS
  • Elastic MapReduce

• Microsoft Azure
  • HDInsight

Page 25: Big Data Overview Part 1

Contact Me

Barrett Simms

[email protected]

http://wbsimms.com

Twitter: @wbsimms

Phone: 781.405.4686