Big Data for Everyone
Twitter: #bd4e
Introduction to Big Data
Steve Watt, Hadoop Strategy
@wattsteve #bd4e
What is “Big Data”?
“Every two days we create as much information as we did from the dawn of civilization up until 2003” – Eric Schmidt, Google
Current state of affairs:
- Explosion of user-generated content
- Storage is really cheap, so we can store what we want
- Traditional data stores have reached critical mass

Issues:
- Enterprise Amnesia
- Traditional architectures become brittle and slow when tasked with processing data at petabyte scale
- How do we process unstructured data?
How were these issues addressed?
2004 – Google publishes seminal whitepapers on Map/Reduce and the Google File System, a new programming paradigm to process data at Internet Scale
The whitepapers describe the use of Massive Parallelism to allow a system to scale horizontally, achieving linear performance improvements
This approach is well suited to a cloud model, whereby additional instances can be commissioned or decommissioned with an immediate effect on performance.
The approaches described in the Google white papers were incorporated into the open source Apache Hadoop project.
What is Apache Hadoop ?
It is a cluster technology with a single master and multiple slaves, designed for commodity hardware
It consists of two runtimes, the Hadoop distributed file system (HDFS) and Map/Reduce
As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy
Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on the machines in the cluster, each processing the data stored on its local machine (data locality).
Hadoop may execute or re-execute a job on any node in the cluster.
Node failures are automatically handled by the framework.
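The map, shuffle, and reduce phases described above can be sketched in plain Ruby. This is an illustrative local simulation of a word count (in-memory input, no Hadoop cluster involved); a real job runs the same phases in parallel across machines:

```ruby
# Local simulation of the Map/Reduce flow Hadoop runs across a cluster.
# Input here is an in-memory array standing in for files on HDFS.
lines = ["big data for everyone", "big data at scale"]

# Map phase: each input line is turned into (word, 1) pairs.
pairs = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: the framework groups the pairs by key between map and reduce.
grouped = pairs.group_by(&:first)

# Reduce phase: sum the counts for each word.
counts = grouped.map { |word, ps| [word, ps.map(&:last).sum] }.to_h
```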
The Big Data Ecosystem
- Provisioning: ClusterChef / Apache Whirr / EC2
- Offline Systems (Analytics): Hadoop
- Online Systems (OLTP @ Scale): NoSQL (Cassandra / HBase)
- Import/Export Tooling: Nutch / SQOOP / Flume
- Scripting: Pig / WuKong / Cascading
- DBA: Hive / Karmasphere
- Non-Programmer: BigSheets / DataMeer
- Human Consumption: Visualizations
- Running on: Commodity Hardware
Offline customer scenario
Eric Sammer, Solution Architect
@esammer #bd4e
Use Case: Product Recommendations
“We can provide a better experience (and make more money) if we provide meaningful product recommendations.”
We need data:
- What products did a user buy?
- What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?
- What are the attributes of the user? (e.g. income, gender, friends)
- What are our margins on products, inventory, upcoming promotions?
Problems
That’s a lot of data! (2 months of activity + all purchase data + all user data)
- Activity: ~20 GB per day x ~60 days = ~1.2 TB
- User data: ~2 GB
- Purchase data: ~5 GB
- Misc: inventory, product costs, promotion schedules
Distilling data to aggregates would reduce fidelity.
Easy to see how looking at more data could improve recommendations.
How do we keep this information current?
The Answer
Calculate all qualifying products once a day for each user and store them for quick display
Use Hadoop to process data in parallel on hundreds of machines
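As a toy illustration of that daily batch computation, here is a "users who bought X also bought Y" count in the Map/Reduce style. This is not the actual pipeline from the talk; the purchase records and names are made up:

```ruby
# Hypothetical sketch: precompute co-purchase counts from purchase
# records, the kind of daily batch job the slides describe.
purchases = [
  ["alice", "book"], ["alice", "lamp"],
  ["bob",   "book"], ["bob",   "lamp"], ["bob", "mug"]
]

# Map: group purchases by user, then emit every co-purchased pair.
by_user = purchases.group_by(&:first)
pairs = by_user.values.flat_map do |rows|
  rows.map(&:last).combination(2).to_a
end

# Reduce: count each pair; the top-scoring pairs become the
# recommendations stored for quick display.
counts = pairs.group_by(&:itself).transform_values(&:size)
```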
Online customer scenario
Matt Pfeil, CEO
@mattz62 #bd4e
What is Apache Cassandra?
Use Case: Managing Email
“My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can’t keep up. I need to scale horizontally and get incredibly fast real time performance.”
The Problem
How do we achieve scalability, redundancy, high performance?
How do we store billions of files on commodity hardware?
How do we increase capacity by simply adding machines? (No SANs!)
How do we make it FAST?
Requirements
Storage for Email:
- Billions of emails (<100 KB average)
- 2M users x 100 MB of storage each = ~190 TB
- Growth of 50% every 6 months
- Durable
Requirements for the storage system:
- No master / single point of failure
- Linear scalability + redundancy
- Multiple active data centers
- Many reads, many writes
- Millisecond response times
- Commodity hardware
Solution
- 800 TB of storage
- ~1.75 million reads or writes/sec (no cache!)
- 130 machines
- Read/write at both data centers
- No “Master” data center
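One reason "just add machines" works is consistent hashing, the partitioning scheme Cassandra uses to spread keys across a masterless cluster: each node owns an arc of a hash ring, and a key's hash decides which node stores it. A toy Ruby sketch (node names and the key are illustrative):

```ruby
require "digest"

# Toy consistent-hash ring: nodes are placed on the ring by hashing
# their names; a key lives on the first node at or past its own hash.
nodes = ["node-a", "node-b", "node-c"]
ring  = nodes.map { |n| [Digest::MD5.hexdigest(n).to_i(16), n] }.sort

def owner(ring, key)
  h = Digest::MD5.hexdigest(key).to_i(16)
  entry = ring.find { |pos, _| pos >= h } || ring.first # wrap around
  entry.last
end
```

Adding a machine claims only one arc of the ring, so just a fraction of keys move; that is what makes linear, masterless capacity growth possible.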
Where to next? The Adjacent Possible
Flip Kromer, CTO
@mrflip #bd4e
Something about Anything
Everything about Something
Bigger than One Computer
Bigger than Frontal Lobe
Bigger than Excel
what’s coming to help
myth of the “data base”
Hold your data
Ask questions
Managing & Shipping
Hadoop FTW
Cassandra, HBase, ElasticSearch, ...
Integration is still too hard
Dev Ops
Reliable Decoupling: Flume, Graphite
Data flutters by (label)
Elephants make sturdy piles (GROUP)
Number becomes thought (process_group)
Hadoop
class TwStP < Streamer
  def process(line)
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)

Twitter Parser in a Tweet
pure functionality
Data Stores in Production
Cassandra
HBase
ElasticSearch
MySQL
Redis
TokyoTyrant
SimpleDB
MongoDB
sqlite
whisper (graphite)
file system
S3
Dev Ops: Rethink Hard
Still Blind
Visual Grammar to see it: NYTimes, Stamen, Ben Fry
Interactive tools: Tableau, Spotfire, bloom.io, d3.js, Gephi
Human-Scale Tools
Data-as-a-Service: Infochimps, SimpleGeo, Drawn to Scale
Business Intelligence: Familiar Paradigm, New Scale
BigSheets, Datameer
Panel Discussion
Stu Hood, Software Engineer
@stuhood #bd4e
Thanks for coming!
Stu Hood @stuhood
Flip Kromer @mrflip
Matt Pfeil @mattz62
Eric Sammer @esammer
Steve Watt @wattsteve