BW Tech Meetup: Hadoop and The rise of Big Data
Uploaded by mindgrub-technologies · Category: Technology
![Page 2: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/2.jpg)
About Don
![Page 3: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/3.jpg)
Hadoop
• Distributed platform, up to thousands of nodes
• Data storage and application framework
• Started at Yahoo!
• Open source
• Based on a few Google papers (2003, 2004)
• Runs on commodity hardware
I’M HERE TO TELL YOU WHY HADOOP IS AWESOME
![Page 4: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/4.jpg)
Hadoop users
• Yahoo!
• Facebook
• eBay
• AOL
• Riot Games
• ComScore
• Twitter
• LinkedIn
Hadoop companies
• Cloudera, Hortonworks, EMC/Greenplum, IBM
• Numerous startups
![Page 5: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/5.jpg)
Buzzword glossary
• Unstructured & Structured Data
• NoSQL
• Big Data (volume, velocity, variety)
• Data Science
• Cloud computing
![Page 6: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/6.jpg)
Hadoop component overview
• Core components:
  – HDFS (Hadoop Distributed File System)
  – MapReduce (data analysis framework)
• Ecosystem:
  – HBase (key-value store)
  – Pig (high-level data analysis language)
  – Hive (SQL-like data analysis language)
  – ZooKeeper (stores metadata)
  – Other stuff
![Page 7: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/7.jpg)
Use cases
• Text processing
  – Indexing, counting, processing
• Large-scale reports
• Data science
• Mixing data sources (data lakes)
• Ad targeting
• Image/Video/Audio processing
• Cybersecurity
![Page 8: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/8.jpg)
HDFS
• Stores files in folders (that’s it)
  – Nobody cares what’s in your files
• Chunks large files into blocks (~64 MB to 1 GB)
• Blocks are scattered all over the place
• 3 replicas of each block (better safe than sorry)
• One NameNode (might be sorry)
  – Knows which computers blocks live on
  – Knows which blocks belong to which files
• One DataNode per computer (slaves!)
  – Hosts the file blocks
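The bookkeeping described on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the `#blk_` identifier format, and the round-robin placement are all assumptions made for illustration. Real HDFS placement also takes racks and free space into account.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the low end of the range on the slide
REPLICATION = 3                # three replicas per block, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(path, file_size, datanodes, replication=REPLICATION):
    """Toy NameNode metadata: block id -> DataNodes holding a replica.

    Placement is plain round-robin (an assumption for illustration).
    """
    if len(datanodes) < replication:
        raise ValueError("need at least as many DataNodes as replicas")
    block_map = {}
    for index, _length in enumerate(split_into_blocks(file_size)):
        start = index % len(datanodes)
        # Hypothetical block id format; each replica goes to a different node.
        block_map[f"{path}#blk_{index}"] = [
            datanodes[(start + k) % len(datanodes)] for k in range(replication)
        ]
    return block_map
```

Calling `place_blocks("/logs/a.log", 200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])` yields four blocks, each mapped to three distinct DataNodes. Losing any single DataNode still leaves two copies of every block, while the single dictionary stands in for the one NameNode that "might be sorry".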
![Page 9: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/9.jpg)
HDFS Demonstration
![Page 10: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/10.jpg)
MapReduce
• Analyzes data in HDFS where the data is
• Jobs are split into Mappers and Reducers
• JobTracker – keeps track of running jobs
• TaskTracker – one per computer, executes tasks
• Mappers (you code this)
  – Load data from HDFS
  – Filter, transform, parse
  – Output (key, value) pairs
• Reducers (you code this, too)
  – Group by the mapper’s output key
  – Aggregate, count, statistics
  – Output to HDFS
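A minimal single-process sketch of the mapper/reducer split, using word count as the example. It only imitates the flow (map, group by key, reduce) in plain Python; the function names are made up for illustration and this is not the Hadoop API, which is Java and runs the phases across many machines.

```python
from collections import defaultdict


def mapper(line):
    """Parse one input line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_job(lines):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

Only `mapper` and `reducer` correspond to the parts marked "you code this"; everything in `run_job` is what the framework does on your behalf.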
![Page 11: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/11.jpg)
MapReduce Demonstration
![Page 12: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/12.jpg)
Hadoop ecosystem
• HDFS and MapReduce don’t do everything
• Pig – high-level language
    grpd = GROUP logs BY userAgent;
    counts = FOREACH grpd GENERATE group, AVG(logs.timeMicroSec)/1.0E+06 AS loadTimeSec;
    byCount = ORDER counts BY loadTimeSec DESC;
    top = LIMIT byCount 15;
• Hive – high-level SQL language
    SELECT grp, SUM(col2), COUNT(*) FROM table1 GROUP BY grp;
• HBase – key/value store
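For readers who don't know Pig, here is roughly what the Pig script on this slide computes, written as plain Python. The record layout (dicts with `userAgent` and `timeMicroSec` keys) and the function name are assumptions; the slide does not show how `logs` is loaded.

```python
from collections import defaultdict


def slowest_user_agents(logs, limit=15):
    """Average load time in seconds per user agent, slowest first."""
    times = defaultdict(list)
    for record in logs:                               # GROUP logs BY userAgent
        times[record["userAgent"]].append(record["timeMicroSec"])
    averages = [
        (agent, sum(values) / len(values) / 1.0e6)    # AVG(...) / 1.0E+06
        for agent, values in times.items()
    ]
    averages.sort(key=lambda pair: pair[1], reverse=True)  # ORDER ... DESC
    return averages[:limit]                           # LIMIT 15
```

The point of Pig and Hive is that a few declarative lines like the ones on the slide get compiled into MapReduce jobs, so nobody has to hand-write the grouping and sorting.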
![Page 13: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/13.jpg)
Cool thing #1: Linear Scalability
• HDFS and MapReduce scale linearly
• If you have twice as many computers, things run twice as fast
• If you have twice as much data, things take twice as long
• If you have twice as many computers, you can store twice as much data
• This stays true (some minor caveats)
• DATA LOCALITY!!
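The linear-scaling claims above amount to simple arithmetic. The sketch below assumes perfectly linear scaling and uses made-up per-node numbers (100 GB per node-hour, 10 TB of disk per node) purely as placeholders; they are not measurements.

```python
def estimated_runtime_hours(data_gb, nodes, gb_per_node_hour=100.0):
    """Hours to process data_gb on `nodes` machines under ideal linear scaling.

    gb_per_node_hour is a hypothetical throughput figure, not a benchmark.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return data_gb / (nodes * gb_per_node_hour)


def usable_capacity_tb(nodes, tb_per_node=10.0, replication=3):
    """Usable storage after replication; tb_per_node is a placeholder value."""
    return nodes * tb_per_node / replication
```

With these placeholder numbers, 1,000 GB on 10 nodes takes 1 hour, the same data on 20 nodes takes 0.5 hours, and 2,000 GB on 10 nodes takes 2 hours. The "minor caveats" on the slide are exactly what this model leaves out.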
![Page 14: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/14.jpg)
Cool thing #2: Schema on Read
LOAD DATA ???? PROFIT!!
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
Before: ETL, schema design, tossing out original data
NOW: Keep original data around! Have multiple views of the same data! Store first, figure out what to do with it later!
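A small sketch of what schema on read looks like in practice: the raw lines are stored untouched, and each analysis applies its own interpretation when it reads them. The sample events, field names, and function names are invented for illustration.

```python
import json

# Raw data is kept exactly as it arrived (hypothetical sample events).
RAW_EVENTS = [
    '{"user": "ann", "agent": "Firefox", "ms": 120}',
    '{"user": "bob", "agent": "Chrome", "ms": 340}',
    '{"user": "ann", "agent": "Chrome", "ms": 200}',
]


def load_times_by_agent(raw_lines):
    """View 1: parse at read time into load times grouped by browser."""
    view = {}
    for line in raw_lines:
        record = json.loads(line)
        view.setdefault(record["agent"], []).append(record["ms"])
    return view


def visits_by_user(raw_lines):
    """View 2: the same raw lines, read as a per-user visit count."""
    counts = {}
    for line in raw_lines:
        record = json.loads(line)
        counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts
```

Neither view required deciding on a schema before the data was stored, and a third view can be added later without reloading anything.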
![Page 15: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/15.jpg)
Cool thing #3: Transparent Parallelism
Network programming?
Inter-process communication?
Threading?
Distributed stuff?
Fault tolerance?
Code deployment?
RPC?
Message passing?
Locking?
Data center fires?
With MapReduce, I DON’T CARE
… I just have to fit my solution into this tiny box
(Diagram labels: MapReduce, Solution)
![Page 16: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/16.jpg)
Cool thing #4: Cheap
• Commodity hardware (meh)
• Open source (people cost more though)
• Add more hardware later
![Page 17: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/17.jpg)
How to get started
• Install Hadoop in a Linux VM
  – Wait how is this helpful?? Hadoop is distributed!
• Use Google (seriously)
• Some prerequisites: Java, Linux, Data, Time
![Page 18: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/18.jpg)
Stuff Hadoop is good at
• Batch processing
• Processing lots of data
• Outputting lots of data
• Storing lots of historical data
• Flexible analysis of data
• Dealing with unstructured or structured data
![Page 19: BW Tech Meetup: Hadoop and The rise of Big Data](https://reader036.fdocuments.us/reader036/viewer/2022062511/54be2f244a795949228b45ab/html5/thumbnails/19.jpg)
Stuff Hadoop is not good at
• Hadoop is a freight truck, not a sports car
• Updating data (think “append-only”)
• Being easy to use
  – Java
  – Administration
• Hadoop is not good storage (don’t throw away your EMC stuff!)