Hadoop for the Absolute Beginner
Transcript of Hadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
Ike Ellis, MVP
Agenda
• What is Big Data?
• Why is it a problem?
• What is Hadoop?
  – MapReduce
  – HDFS
• Pig
• Hive
• Sqoop
• HCatalog
• The Players
• Maybe data visualization (depending on time)
• Q&A
What is Big Data?
• Trendy? Buzz words? Process?
• Big data is “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” – Wikipedia
• So how do you know your data is big data?
• When your existing data processing methodologies are no longer good enough.
Traditional Data Warehouse Stack
There are a lot of moving pieces back there…
• Sometimes, that’s our biggest challenge
  – Simple question – massive data
• Do we really need to go through the pain of that huge stack?
Big Data Characteristics
• Volume
  – Large amount of data
• Velocity
  – Needs to be processed quickly
• Variety
  – Excel, SQL, OData feeds, CSVs, web downloads, JSON
• Variability
  – Different semantics, in terms of meaning or context
• Value
Big Data Examples
• Structured Data
  – Pre-defined schema
  – Highly structured
  – Relational
• Semi-structured Data
  – Inconsistent structure
  – Cannot be stored in rows and tables in a typical database
  – Logs, tweets, data feeds, GPS coordinates
• Unstructured Data
  – Lacks structure
  – Free-form text
  – Customer feedback forms
  – Audio
  – Video
The Problem
• So you have some data
The Problem
• And you want to clean and/or analyze it
So you use the technology that you know
• Excel
• SQL Server
• SQL Server Integration Services
• SQL Server Reporting Services
But what happens if it’s TONS of data
• Like all the real estate transactions in the US for the last ten years?
• Or GPS data from every bike in your bike rental store?
• Or every swing and every pitch from every baseball game since 1890?
Or what happens when the analysis is very complicated?
• Tell me when earthquakes happen!
• Tell me how shoppers view my website!
• Tell me how to win my next election!
So you use SQL Server, and have a lot of data, so….
• YOU SCALE UP!
• But SQL Server can only have so much RAM, CPU, disk I/O, and network I/O
• So you hit a wall, probably with disk I/O
• So you….
Scale Out!
• Add servers until the pain goes away….
All analysis is done away from the data servers
But that’s easier said than done
• What’s the process?
• You take one large task and break it up into lots of smaller tasks
  – How do you break them up?
  – Once they’re broken up and processed, how do you put them back together?
  – How do you make sure you break them up evenly so they all execute at the same rate?
  – And really, you’re breaking up two things:
    • Physical data
    • Computational analysis
  – If one small task fails, how do you restart it? Log it? Recover from the failure?
  – If one SQL Server fails, how do you divert all the new tasks away from it?
  – How do you load balance?
• So you end up writing a lot of plumbing code… and even when you get done… you have one GIANT PROBLEM!
Data Movement
Data moves to achieve fault tolerance, to segment data, to reassemble data, to derive data, to output data, and so on… and the network (and disk) is SLOW… you’ve saturated it.
Oh, and another problem
• In SQL, the performance difference between a query over 1 MB of data and one over 1 TB of data is significant
• The performance difference between a query on one server and on 20 servers is also significant
So to summarize and repeat
• Drive seek time…. BIG PROBLEM
• Drive channel latency… BIG PROBLEM
• Data + processing time… BIG PROBLEM
• Network pipe I/O saturation… BIG PROBLEM
• Lots of human problems
  – Building a data warehouse stack is a difficult challenge
• Semi-structured data is difficult to handle
  – As data changes, it becomes less structured and less valuable
  – Flexible structures often give us fits
Enter Hadoop
• Why write your own framework to handle fault tolerance, logging, data partitioning, and heavy analysis when you can just use this one?
What is Hadoop?
• Hadoop is a distributed storage and processing technology for large-scale applications
  – HDFS
    • Self-healing, distributed file system. Breaks files into blocks and stores them redundantly across the cluster
  – MapReduce
    • Framework for running large data processing jobs in parallel across many nodes and combining the results
• Open source
• Distributed data replication
• Commodity hardware
• Disparate hardware
• Data and analysis co-location
• Scalability
• Reliable error handling
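The HDFS idea above can be sketched in a few lines of plain Python. This is a toy illustration of block splitting and replication, not the real HDFS API; the block size, node names, and function names are invented for the example:

```python
# Toy sketch of the HDFS idea: split a file into fixed-size blocks and
# store each block redundantly on several nodes. Illustration only --
# not the real HDFS API (real block sizes are 64-128 MB, not 8 bytes).

BLOCK_SIZE = 8       # bytes per block (tiny, just for the demo)
REPLICATION = 3      # copies of each block (HDFS's default is 3)
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        idx: [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        for idx in range(len(blocks))
    }

data = b"all the real estate transactions in the US"
blocks = split_into_blocks(data)          # 6 blocks of up to 8 bytes
placement = place_blocks(blocks, NODES)   # each block lives on 3 nodes
# Losing any one node still leaves two copies of every block.
```

Because every block exists on several machines, the cluster can re-replicate from the surviving copies when a node dies; that is the “self-healing” property described above.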
Hadoop Ecosystem
Under the covers
Hadoop works by keeping the compute next to the data (to minimize network I/O costs)
MapReduce
Segmentation Problem
MapReduce Process – Very simple example
Programming MapReduce
• Steps
  – Define the inputs
    • Usually some files in HDFS/HBase (or Azure Blob Storage)
  – Write a map function
  – Write a reduce function
  – Define the outputs
    • Usually some files in HDFS/HBase (or Azure Blob Storage)
• Lots of options for both inputs and outputs
• Functions are usually written in Java
  – Or Python
  – Even .NET (C#, F#)
Scalability
• Hadoop scales linearly with data size
  – Or analysis complexity
  – Scales to hundreds of petabytes
• Data-parallel or compute-parallel
• Extensive machine learning on <100 GB of image data
• Simple SQL queries on >100 TB of clickstream data
• Hadoop works for both!
Hadoop allows you to write a query like this
SELECT productname, SUM(costpergoods)
FROM salesorders
GROUP BY productname

• Over a ton of data, or a little data, and have it perform about the same
• If it slows down, throw more nodes at it
• Map is like the GROUP BY
• While reduce is like the aggregate
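That analogy can be made concrete in plain Python (the table and column names echo the hypothetical query above; this is not Hadoop’s actual API):

```python
from collections import defaultdict

# Hypothetical salesorders rows mirroring the query above: (productname, costpergoods).
salesorders = [("widget", 2.0), ("gadget", 5.0), ("widget", 3.0)]

def map_phase(rows):
    # Map emits (key, value) pairs -- the key plays the role of GROUP BY.
    for product, cost in rows:
        yield product, cost

def shuffle(pairs):
    # The framework groups values by key between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce applies the aggregate (here SUM) to each key's values.
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(salesorders)))
# totals == {"widget": 5.0, "gadget": 5.0}
```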
Why use Hadoop?
• Who wants to write all that plumbing?
  – Segmenting data
  – Making it redundant and fault tolerant
  – Overcoming job failure
  – Logging
  – All those data providers
  – All the custom scripting languages and tooling
  – Synchronization
  – Scale-free programming model
• Wide adoption
• You specify the map() and reduce() functions
  – Let the framework do the rest
What is Hadoop Good For?
• Enormous datasets
• Log analysis
• Calculating statistics on enormous datasets
• Running large simulations
• ETL
• Machine learning
• Building inverted indexes
• Sorting
  – World record
• Distributed search
• Tokenization
• Image processing
• No fancy hardware… good in the cloud
• And so much more!
What is Hadoop Bad For?
• Low latency (not current data)
• Sequential algorithms
  – Recursion
• Joins (sometimes)
• When all the data is structured and can fit on one database server by scaling up
  – It is NOT a replacement for a good RDBMS
Relational vs Hadoop
Another Problem
• MapReduce functions are written in Java, Python, .NET, and a few other languages
• Those languages are widely known
• Except by analysts and DBAs, the exact kind of people who struggle with big data
• Enter Pig & Hive
  – Abstraction over MapReduce
  – Sits on top of MapReduce
  – Spawns MapReduce jobs
What MapReduce Functions look like
function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit (word, sum)
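The same word-count pseudocode runs almost verbatim as plain Python. This is a single-process sketch (hypothetical function names, not Hadoop’s Java API); the shuffle step that a real cluster performs between map and reduce is written out explicitly:

```python
from collections import defaultdict

# A runnable Python rendering of the word-count pseudocode above.
# Plain Python, not Hadoop's actual API -- the framework normally
# performs the shuffle/grouping step between map and reduce.

def map_fn(name, document):
    # name: document name; document: document contents
    for word in document.split():
        yield word, 1

def reduce_fn(word, partial_counts):
    yield word, sum(partial_counts)

def run_job(documents):
    grouped = defaultdict(list)
    for name, doc in documents.items():
        for word, count in map_fn(name, doc):   # map phase
            grouped[word].append(count)          # shuffle: group by key
    result = {}
    for word, counts in grouped.items():         # reduce phase
        for key, total in reduce_fn(word, counts):
            result[key] = total
    return result

counts = run_job({"doc1": "big data is big", "doc2": "data"})
# counts == {"big": 2, "data": 2, "is": 1}
```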
Introduction to Pig
• Pig – ETL for big data
  – Structure
  – Pig Latin
• Parallel data processing for Hadoop
• Not trying to get you to learn Pig. Just want you to want to learn it.
Here’s what SQL looks like
SELECT customername, COUNT(orderdate) AS totalOrders
FROM salesOrders so
JOIN customers c ON so.custid = c.custid
GROUP BY customername
Pig
trx = LOAD 'transaction' AS (customer, orderamount);
grouped = GROUP trx BY customer;
ttl = FOREACH grouped GENERATE group, SUM(trx.orderamount) AS tp;
cust = LOAD 'customers' AS (customer, postalcode);
result = JOIN ttl BY group, cust BY customer;
DUMP result;

Executes one step at a time
Pig is like SSIS
• One step at a time. One thing executes, then the next in the script, acting on the variable declarations above it
How Pig Works
• Pig Latin goes to the pre-processor
• The pre-processor creates MapReduce jobs that get submitted to the JobTracker
Pig Components
• Data Types
• Inputs & Outputs
• Relational Operators
• UDFs
• Scripts & Testing
Pig Data Types
• Scalar
  – Int
  – Long
  – Float
  – Double
  – CharArray
  – ByteArray
• Complex
  – Map (key/value pair)
  – Tuple (fixed-size ordered collection)
  – Bag (collection of tuples)
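As a rough analogy only (Python, not Pig syntax), the complex types correspond to familiar structures:

```python
# Rough Python analogies for Pig's complex types (illustration, not Pig syntax):
pig_map = {"city": "San Diego", "state": "CA"}     # map: key/value pairs
pig_tuple = ("widget", 19.99, 3)                    # tuple: fixed-size, ordered
pig_bag = [("widget", 19.99), ("gadget", 5.00)]     # bag: a collection of tuples

# A Pig relation (like `trx` in the earlier script) is essentially a bag of tuples.
```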
Pig: Inputs/Outputs
• Load
  – PigStorage
  – TextLoader
  – HBaseStorage
• Store
  – PigStorage
  – HBaseStorage
• Dump
  – Dumps to the console
  – Don’t dump a ton of data… uh oh…
Pig: Relational Operators
• Foreach – projection operator; applies an expression to every row in the pipeline
  – Flatten – used with complex types; PIVOT
• Filter – WHERE
• Group, Cogroup – GROUP BY (Cogroup on multiple keys)
• Order By
• Distinct
• Join (inner, outer, cross)
• Limit – TOP
• Sample – random sample
• Parallel – level of parallelism on the reducer side
• Union
Pig: UDFs
• Written in Java/Python
• String manipulation, math, complex type operations, parsing
Pig: Useful commands
• Describe – shows the schema
• Explain – shows the logical and physical MapReduce plan
• Illustrate – runs a sample of your data to test your script
• Stats – produced after every run; includes start/end times, # of records, MapReduce info
• Supports parameter substitution and parameter files
• Supports macros and functions (define)
• Supports includes for script organization
Pig Demo
Introduction to HIVE
• Very popular
• Hive Query Language
• Defining tables, views, partitioning
• Querying and integration
• VERY SQL-LIKE
• Developed by Facebook
• Data warehouse for Hadoop
• Based on the SQL-92 specification
SQL vs Hive
• Almost useless to compare the two, because they are so similar
• CREATE TABLE, internal/external
• Hive is schema-on-read
  – It defines a schema over your data that already exists in HDFS
Hive is not a replacement for SQL
• So don’t throw out SQL just yet
• Hive is for batch processing large data sets that may span hundreds, or even thousands, of machines
  – Not for row-level updates
• Hive has high overhead when starting a job: it translates queries to MapReduce, and that takes time
• Hive does not cache data
• Hive performance tuning is mainly Hadoop performance tuning
• Similar query languages, but different architectures for different purposes
• Way too slow for OLTP workloads
Hive Components
• Data Types
• DDL
• DML
• Queries
• Views, Indexes, Partitions
• UDFs
Hive Data Types
• Scalar
  – TinyInt
  – SmallInt
  – Int
  – BigInt
  – Boolean
  – Float
  – Double
  – TimeStamp
  – String
  – Binary
• Complex
  – Struct
  – Array (collection)
  – Map (key/value pair)
What is a Hive Table?
• CREATE DATABASE NewDB
  – LOCATION 'hdfs\hua\NewDB'
• CREATE TABLE
• A Hive table consists of:
  – Data: typically a file in HDFS
  – Schema: in the form of metadata stored in a relational database
• Schema and data are separate
  – A schema can be defined for existing data
  – Data can be added or removed independently
  – Hive can be pointed at existing data
• You have to define a schema if you have existing data in HDFS that you want to use in Hive
How does Hive work?
• Hive as a translation tool
  – Compiles and executes queries
  – Hive translates the SQL query to a MapReduce job
• Hive as a structuring tool
  – Creates a schema around the data in HDFS
  – Tables are stored in directories
• Hive tables have rows, columns, and data types
• Hive Metastore
  – Namespace with a set of tables
  – Holds table definitions
• Partitioning
  – Choose a partition key
  – Specify the key when you load data
Define a Hive Table
CREATE TABLE myTable (name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE;
Loading Data
• Use LOAD DATA to import data into a Hive table:
  LOAD DATA LOCAL INPATH 'input/mydata/data.txt'
  INTO TABLE myTable;
• The files are not modified by Hive – they are loaded as-is
• Use the keyword OVERWRITE to write over a file of the same name
• Hive can read all the files in a particular directory
• The schema is checked when the data is queried
  – If a row does not match the schema, it will be read as null
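Schema-on-read can be sketched in plain Python (an illustration, not Hive internals): the raw file is left untouched, the schema is applied only when rows are read, and mismatches become nulls:

```python
# Illustration of schema-on-read (not Hive internals): apply the schema
# (name string, age int) at query time; rows that don't fit become None.

RAW_FILE_LINES = ["alice;34", "bob;29", "corrupt-line"]  # ';'-delimited raw data

def read_with_schema(lines):
    rows = []
    for line in lines:
        parts = line.split(";")
        if len(parts) == 2:
            try:
                rows.append((parts[0], int(parts[1])))
                continue
            except ValueError:
                pass
        rows.append(None)  # row doesn't match the schema -> read as null
    return rows

rows = read_with_schema(RAW_FILE_LINES)
# rows == [("alice", 34), ("bob", 29), None]
```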
Querying Data
• SELECT
  – WHERE
  – UNION ALL/DISTINCT
  – GROUP BY
  – HAVING
  – LIMIT
  – REGEX
• Subqueries
• JOIN
  – INNER
  – OUTER
• ORDER BY
  – Uses a single reducer
• SORT BY
  – Multiple reducers, with a sorted file from each
Hive Demo
Pig Vs Hive
• Famous Yahoo blog post
  – http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html
• Pig
  – ETL
  – For preparing data for easier analysis
  – Good for SQL authors who take the time to learn something new
  – Unless you store it, all data goes away when the script is finished
• Hive
  – Analysis
    • When you have to answer a specific question
  – Good for SQL authors
  – Excel connectivity
  – Persists data in the Hadoop data store
![Page 59: Hadoop for the Absolute Beginner](https://reader035.fdocuments.us/reader035/viewer/2022081516/54c6c5044a7959f72f8b4579/html5/thumbnails/59.jpg)
Sqoop
• SQL to Hadoop
  – SQL Server/Oracle/anything with a JDBC driver
• Import
  – From an RDBMS into HDFS
• Export
  – From HDFS into an RDBMS
• Other commands
  – Create a Hive table
  – Evaluate an import statement
HUE
• Hadoop User Experience
HCatalog
• Metadata and table management system for Hadoop
• Provides a shared schema and data type mechanism for various Hadoop tools (Pig, Hive, MapReduce)
  – Enables interoperability across data processing tools
  – Enables users to choose the best tools for their environments
• Provides a table abstraction so that users need not be concerned with how data is stored
  – Presents users with a relational view of data
HCatalog DDL
• CREATE/ALTER/DROP TABLE
• SHOW TABLES
• SHOW FUNCTIONS
• DESCRIBE
• Supports a subset of Hive DDL
Why do we have HCat?
• Tools don’t tend to agree on
  – What a schema is
  – What data types are
  – How data is stored
• The HCatalog solution
  – Provides one consistent data model for the various Hadoop tools
  – Provides a shared schema
  – Allows users to see when shared data is available
HCatalog – HBase Integration
• Connects HBase tables to HCatalog
• Uses various Hadoop tools
• Provides flexibility with data in HBase or HDFS
HCat Demo
HBase
• NoSQL database
• Modeled after Google BigTable
• Written in Java
• Runs on top of HDFS
• Features
  – Compression
  – In-memory operations
  – Bloom filters
• Can serve as input or output for MapReduce jobs
• Facebook’s messaging platform uses it
Yarn
• Apache Hadoop next-generation MapReduce
• Yet Another Resource Negotiator
• Separates the resource management and processing components
  – Breaks up the JobTracker
• YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS, beyond MapReduce
Impala
• Cloudera
• Real-time queries for Hadoop
• Low-latency queries using SQL against HDFS or HBase
Storm
• Free and open-source distributed real-time computation system
• Makes it easy to process unbounded streams of data
• Storm is fast
  – A million tuples processed per second per node
The Players
• HortonWorks
• Cloudera
• MapR
• Microsoft HDInsight
• Microsoft PDW
• IBM
• Oracle
• Amazon
• Rackspace
• Google
The Future
• Hadoop features will push into RDBMS systems
• RDBMS features will continue to push into Hadoop
• Tons of third-party vendors and open source projects have applications for Hadoop and for RDBMS/Hadoop integration
• Lots of buy-in, lots of progress, lots of changes
How to Learn Hadoop
• Lots of YouTube videos online• HortonWorks, MapR, and Cloudara all have good
videos for free• HortonWorks sandbox• Azure HDInsight VMs• Hadoop: The Definitive Guide• Tons of blog posts• Lots of open source projects
Ike Ellis
• www.ikeellis.com
• SQL PASS Book Readers – VC Leader
• @Ike_Ellis
• 619.922.9801
• Microsoft MVP
• Quick Tips – YouTube
• San Diego TIG Founder and Chairman
• San Diego .NET User Group Steering Committee Member