Big Data Overview Part 1
Transcript of Big Data Overview Part 1
Opening remarks
• Sponsors
  • Pluralsight: free month gift card give-away. Enter your name in the pot!
  • DevExpress: $250 in JustCode developer tools.
  • O’Reilly: book give-away. Enter your name in the pot!
• Boston Code Camp 22 (November 22nd)
  • http://www.bostoncodecamp.com/
• Thanks to 3thought for the space
About Me
Software Developer
Agile Team Member
Team Lead
Agile Advocate
SDLC Implementer
Big Data
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”
- Wikipedia
The 3 Vs
• Volume
  • A few gigabytes up to petabytes
• Velocity
  • Data arrives quickly
• Variety
  • Multiple sources and formats
Volume
• Traditional SQL architectures don’t scale to very large data volumes
• Maybe this isn’t so true… but MPP (massively parallel processing) systems are expensive
An example problem (Volume)
• You own a chain of stores
• … with 25,000 stores and 100,000 POS systems
• Need information on inventory changes
  • By region
  • By store
Velocity
• Traditional solutions don’t handle fast inbound data
• Maybe this isn’t so true…but you lose data.
Another example (Velocity)
• You host a website
• … on 10,000 servers
• Monitor logs for errors
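To make the velocity example concrete, here is a minimal sketch in plain Python (the log format is hypothetical) of filtering an inbound stream of log lines for errors without loading everything into memory:

```python
# Stream-style log filtering (hypothetical log format).
# A generator processes lines as they arrive, which is the property
# a high-velocity pipeline needs.

def error_lines(lines):
    """Yield only the lines that report an error."""
    for line in lines:
        if "ERROR" in line:
            yield line

# Example usage with a small in-memory "stream":
stream = [
    "2014-11-01 10:00:01 INFO  request served",
    "2014-11-01 10:00:02 ERROR disk full on /var/log",
    "2014-11-01 10:00:03 INFO  request served",
]
errors = list(error_lines(stream))
```

In production the input would be 10,000 servers’ worth of live log traffic, not a list, but the shape of the consumer is the same.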
Variety
• Most traditional solutions don’t handle a variety of data types well
• Maybe this isn’t so true… but you need to write a custom importer for every data type.
A final example (Variety)
• You own a business
• With sales and marketing teams
• … in different regions around the world
• Correlate sales numbers against marketing expenses
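As a sketch of what that correlation might look like once the varied sources are normalized into a common shape (the region names and figures below are made up for illustration):

```python
# Join sales figures against marketing spend by region and compute
# a simple sales-per-marketing-dollar ratio. All data is illustrative.

sales = {"EMEA": 120_000, "APAC": 80_000, "AMER": 200_000}
marketing = {"EMEA": 30_000, "APAC": 10_000, "AMER": 50_000}

ratios = {
    region: sales[region] / marketing[region]
    for region in sales.keys() & marketing.keys()
}
# e.g. ratios["APAC"] == 8.0
```

The hard part in practice is not this join; it is getting spreadsheets, CRM exports, and regional databases into one consistent format first.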
The First Problem : Computing Power
[Diagram: jobs running in sequence (first, second, third) on a handful of cores]
Limited by cores (Scaling up)
Solution: Scale out (not up!)
[Diagram: a job coordinator distributing work to runners on Server 1 through Server 4]
MapReduce
• A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. – Wikipedia
WHAT?
Map and Reduce
• Map
  • Process the data, returning key/value pairs
• Reduce
  • Aggregate/filter the key/value pairs into a result
[Diagram: two Data blocks feeding two Map tasks, whose key/value output is combined by Reduce into the Result]
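A minimal in-memory sketch of the model (plain Python, no Hadoop), using word counting as the classic example:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) key/value pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Aggregate the pairs into word -> total count."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big cluster", "big cluster"]
result = reduce_phase(map_phase(docs))
# result == {"big": 3, "data": 1, "cluster": 2}
```

A real framework runs many map tasks and many reduce tasks in parallel across the cluster, but the two functions you write look just like these.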
Mapping
• Easy example
• Store sales
  • Find the highest sales total per store in 2010
Year  Month  StoreId  SalesTotal
2010  1      13       1,000
2010  3      43       12,000
2010  3      21       21,000
2010  4      13       3,000
2010  2      56       4,000
2010  6      32       12,000
2010  7      1        4,000
2010  2      23       2,000
Solution – Map
1. Mapper feeds document rows to your program
2. You return key value pairs
StoreId  Sales
21       2,000
23       3,000
2        1,000
21       23,000
Solution - Reduce
• Data is merged into key/value groups:
{21, [2,000, 23,000]}
{23, [3,000]}
{2, [1,000]}
• You process each row
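Putting the two phases together for the store-sales problem, as an in-memory sketch (the rows are the slide’s sample data):

```python
from collections import defaultdict

rows = [
    (2010, 1, 13, 1000), (2010, 3, 43, 12000), (2010, 3, 21, 21000),
    (2010, 4, 13, 3000), (2010, 2, 56, 4000),  (2010, 6, 32, 12000),
    (2010, 7, 1, 4000),  (2010, 2, 23, 2000),
]

# Map: emit (store_id, sales_total) for rows in the target year.
pairs = [(store, sales) for (year, month, store, sales) in rows if year == 2010]

# Shuffle/merge: group the values by key, as the framework does for you.
grouped = defaultdict(list)
for store, sales in pairs:
    grouped[store].append(sales)

# Reduce: keep the highest sales figure per store.
best = {store: max(values) for store, values in grouped.items()}
# e.g. best[13] == 3000
```

The grouping step in the middle is what Hadoop’s shuffle phase performs between your map code and your reduce code.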
Data Access
• Each process needs access to data
[Diagram: typical data access (central storage) vs. desired data access (data local to each process)]
HDFS
• Hadoop Distributed File System
  • An open-source implementation inspired by the Google File System (GFS)
Hard drives last about 1,000 days. So, if you have 1K hard drives, you’ll lose one per day.
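The slide’s back-of-the-envelope failure math, spelled out (the 1,000-day lifetime is the slide’s assumption, not a measured figure):

```python
# If a drive lasts ~1,000 days, a fleet sees roughly
# fleet_size / lifetime_days failures per day on average.
lifetime_days = 1000
fleet_size = 1000
expected_failures_per_day = fleet_size / lifetime_days  # 1.0
```

This is why HDFS replicates every block across multiple machines: at cluster scale, drive failure is a daily routine, not an exception.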
The ecosystem
• Hive
  • SQL-like query language
  • Defines and enforces a schema
• Pig
  • High-level dataflow scripting language (Pig Latin)
• Sqoop
  • SQL/Hadoop data transfer
• Oozie
  • Workflow scheduling
• Mahout
  • Machine-learning library
• Storm
  • Stream-based processing
… and Many Others
Vendors
• Hortonworks
  • Single-click install of the Sandbox
• Cloudera
  • Downloadable VM
• Syncfusion
  • Single-click install of Syncfusion Big Data
• Amazon AWS
  • Elastic MapReduce
• Microsoft Azure
  • HDInsight