Big Data Overview Part 1
Transcript of Big Data Overview Part 1
Opening remarks
• Sponsors
  • Pluralsight: free month gift card give-away. Enter your name in the pot!
  • DevExpress: $250 in JustCode developer tools.
  • O’Reilly: book give-away. Enter your name in the pot!
• Boston Code Camp 22 (November 22nd)
  • http://www.bostoncodecamp.com/
• Thanks to 3thought for the space
About Me
Software Developer
Agile Team Member
Team Lead
Agile Advocate
SDLC Implementer
Big Data
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”
- Wikipedia
The 3 Vs
• Volume
  • A few gigabytes up to petabytes
• Velocity
  • Data arrives quickly
• Variety
  • Multiple sources and formats
Volume
• Traditional SQL architectures don’t scale to very large data volumes
• Maybe this isn’t so true… but MPP (massively parallel processing) systems are expensive
An example problem (Volume)
• You own a chain of stores
• … with 25,000 stores and 100,000 POS systems
• Need information on inventory changes
  • By region
  • By store
Velocity
• Traditional solutions don’t handle fast inbound data
• Maybe this isn’t so true…but you lose data.
Another example (Velocity)
• You host a website
• … on 10,000 servers
• Monitor logs for errors
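To make the velocity example concrete, here is a minimal sketch in plain Python (the log format is hypothetical) of filtering an inbound stream of log lines for errors without loading everything into memory:

```python
# Stream-style log filtering (hypothetical log format).
# A generator processes lines as they arrive, which is the property
# a high-velocity pipeline needs.

def error_lines(lines):
    """Yield only the lines that report an error."""
    for line in lines:
        if "ERROR" in line:
            yield line

# Example usage with a small in-memory "stream":
stream = [
    "2014-11-01 10:00:01 INFO  request served",
    "2014-11-01 10:00:02 ERROR disk full on /var/log",
    "2014-11-01 10:00:03 INFO  request served",
]
errors = list(error_lines(stream))
```

In production the input would be 10,000 servers’ worth of live log traffic, not a list, but the shape of the consumer is the same.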
Variety
• Most traditional solutions don’t handle a variety of data types well
• Maybe this isn’t so true… but you need to write a custom importer for every data type.
A final example (Variety)
• You own a business
• With sales and marketing teams
• … in different regions around the world
• Correlate sales numbers against marketing expenses
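As a sketch of what that correlation might look like once the varied sources are normalized into a common shape (the region names and figures below are made up for illustration):

```python
# Join sales figures against marketing spend by region and compute
# a simple sales-per-marketing-dollar ratio. All data is illustrative.

sales = {"EMEA": 120_000, "APAC": 80_000, "AMER": 200_000}
marketing = {"EMEA": 30_000, "APAC": 10_000, "AMER": 50_000}

ratios = {
    region: sales[region] / marketing[region]
    for region in sales.keys() & marketing.keys()
}
# e.g. ratios["APAC"] == 8.0
```

The hard part in practice is not this join; it is getting spreadsheets, CRM exports, and regional databases into one consistent format first.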
The First Problem : Computing Power
[Diagram: jobs running in sequence (first, second, third) on a handful of cores]
Limited by cores (Scaling up)
Solution: Scale out (not up!)
[Diagram: a job coordinator distributing work to runners on Server 1 through Server 4]
MapReduce
• A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. – Wikipedia
WHAT?
Map and Reduce
• Map
  • Process the data, returning key/value pairs
• Reduce
  • Aggregate/filter the key/value pairs into a result
[Diagram: two Data blocks feeding two Map tasks, whose key/value output is combined by Reduce into the Result]
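A minimal in-memory sketch of the model (plain Python, no Hadoop), using word counting as the classic example:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) key/value pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Aggregate the pairs into word -> total count."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big cluster", "big cluster"]
result = reduce_phase(map_phase(docs))
# result == {"big": 3, "data": 1, "cluster": 2}
```

A real framework runs many map tasks and many reduce tasks in parallel across the cluster, but the two functions you write look just like these.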
Mapping
• Easy example
• Store sales
  • Find the highest sales total per store in 2010
Year  Month  StoreId  SalesTotal
2010  1      13       1,000
2010  3      43       12,000
2010  3      21       21,000
2010  4      13       3,000
2010  2      56       4,000
2010  6      32       12,000
2010  7      1        4,000
2010  2      23       2,000
Solution – Map
1. Mapper feeds document rows to your program
2. You return key value pairs
StoreId  Sales
21       2,000
23       3,000
2        1,000
21       23,000
Solution - Reduce
• Data is merged into key/value groups:
{21, [2,000, 23,000]}
{23, [3,000]}
{2, [1,000]}
• You process each row
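Putting the two phases together for the store-sales problem, as an in-memory sketch (the rows are the slide’s sample data):

```python
from collections import defaultdict

rows = [
    (2010, 1, 13, 1000), (2010, 3, 43, 12000), (2010, 3, 21, 21000),
    (2010, 4, 13, 3000), (2010, 2, 56, 4000),  (2010, 6, 32, 12000),
    (2010, 7, 1, 4000),  (2010, 2, 23, 2000),
]

# Map: emit (store_id, sales_total) for rows in the target year.
pairs = [(store, sales) for (year, month, store, sales) in rows if year == 2010]

# Shuffle/merge: group the values by key, as the framework does for you.
grouped = defaultdict(list)
for store, sales in pairs:
    grouped[store].append(sales)

# Reduce: keep the highest sales figure per store.
best = {store: max(values) for store, values in grouped.items()}
# e.g. best[13] == 3000
```

The grouping step in the middle is what Hadoop’s shuffle phase performs between your map code and your reduce code.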
Data Access
• Each process needs access to data
[Diagram: typical data access (central storage) vs. desired data access (data local to each process)]
HDFS
• Hadoop Distributed File System
  • An open-source implementation inspired by the Google File System (GFS)
Hard drives last about 1,000 days. So, if you have 1K hard drives, you’ll lose one per day.
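The slide’s back-of-the-envelope failure math, spelled out (the 1,000-day lifetime is the slide’s assumption, not a measured figure):

```python
# If a drive lasts ~1,000 days, a fleet sees roughly
# fleet_size / lifetime_days failures per day on average.
lifetime_days = 1000
fleet_size = 1000
expected_failures_per_day = fleet_size / lifetime_days  # 1.0
```

This is why HDFS replicates every block across multiple machines: at cluster scale, drive failure is a daily routine, not an exception.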
The ecosystem
• Hive
  • SQL-like query language
  • Defines and enforces a schema
• Pig
  • High-level dataflow scripting language (Pig Latin)
• Sqoop
  • SQL/Hadoop data transfer
• Oozie
  • Workflow scheduling
• Mahout
  • Machine-learning library
• Storm
  • Stream-based processing
… and Many Others
Vendors
• Hortonworks
  • Single-click install of the Sandbox
• Cloudera
  • Downloadable VM
• Syncfusion
  • Single-click install of Syncfusion Big Data
• Amazon AWS
  • Elastic MapReduce
• Microsoft Azure
  • HDInsight