Page 1:

November 6-9, Seattle, WA

SQLCAT: Big Data – All Abuzz About Hive

Dipti Sangani, SQL Big Data PM, Microsoft, Dipti.Sangani@microsoft.com

Cindy Gross, SQLCAT BI/Big Data PM, Microsoft, http://blogs.msdn.com/cindygross, @SQLCindy

Ed Katibah, SQLCAT Spatial PM, Microsoft, http://blogs.msdn.com/b/edkatibah/, @Spatial_Ed


Page 2:

BIG AGENDA

Hive, Hadoop, Big Data

Analytics to Insights

Page 3:

A NEW SET OF QUESTIONS

SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?

LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?

ADVANCED ANALYTICS: How do I better predict future outcomes?

Page 4:

NEW OPPORTUNITIES

Revenue Growth

Increases ad revenue by processing 3.5 billion events per day

Massive Volumes

Processes 464 billion rows per quarter, with average query time under 10 secs.

Business Innovation

Measures and ranks online user influence by processing 3 billion signals per day

Cloud Connectivity

Connects across 15 social networks via the cloud for data and API access

Operational Efficiencies

Uses sentiment analysis and web analytics for its internal cloud

GE

Real-Time Insight

Improves operational decision making for IT managers and users

Page 5:

MANAGE ANY DATA, ANY SIZE, ANYWHERE

Relational, Non-Relational, Streaming

Unified Monitoring, Management & Security

Data Movement

Page 6:

VVVVROOM!

Variability – Multiple interpretations

Velocity – Need decisions fast

Variety – Many formats

Volume – Beyond what the current environment can handle

Page 7:

BIG DATA

Schema on Read Not Write

Scale Out for Pay As You Go

BASE Not ACID

MapReduce, Streaming, Machine Learning, Massively Parallel Processing

Too Big, Complex, or Expensive for Current Environment

Page 8:

BIG DATA REQUIRES AN END-TO-END APPROACH

INSIGHT: Self-Service, Collaboration, Corporate Apps, Devices

DATA ENRICHMENT: Discover, Combine, Refine

DATA MANAGEMENT: Relational, Non-relational, Analytical, Streaming

Page 9:

Hadoop architecture

Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Business Intelligence (Excel, Power View …)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
Active Directory (Security)
System Center

Page 10:

HIVE ARCHITECTURE

Hive
Hive Web Interface (HWI)
Command Line Interface (CLI)
JDBC, ODBC
Thrift Server
Metastore
HiveQL
Compiler, Optimizer, Executor

Hadoop
Head Node / Name Node
Data Nodes / Task Nodes

Page 11:

DEMO: Analyzing a Frankenstorm

Page 12:

Behind the Scenes

Page 13:

GET HDINSIGHT

Sign up for Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)

Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)

Page 14:

CREATE TABLE

CREATE EXTERNAL TABLE censusP (
    State_FIPS int,
    County_FIPS int,
    Population bigint,
    Pop_Age_Over_69 bigint,
    Total_Households bigint,
    Median_Household_Income bigint,
    KeyID string
)
COMMENT 'US Census Data'
PARTITIONED BY (Year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

ALTER TABLE censusP ADD PARTITION (Year = '2010') LOCATION '/user/demo/census/2010';
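As a hedged follow-up to the statements above, further years could be attached the same way and then inspected. This is only a sketch: the directory /user/demo/census/2000 is a hypothetical path that is not part of the session, and it assumes tab-delimited files with the same column layout already exist there.

```sql
-- Hypothetical second partition; the 2000 directory is an assumption, not from the slides.
ALTER TABLE censusP ADD PARTITION (Year = '2000') LOCATION '/user/demo/census/2000';

-- List the partitions the metastore now knows about.
SHOW PARTITIONS censusP;

-- The partition column can be queried like any other column.
SELECT Year, COUNT(*) AS row_count
FROM censusP
GROUP BY Year;
```

Because the table is EXTERNAL, adding a partition only registers the directory in the metastore; no data is moved or copied.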

Page 15:

INSIDE A HIVE TABLE

DATA TYPES
EXTERNAL / INTERNAL
PARTITIONED BY | CLUSTERED BY | SKEWED BY
Terminators
ROW FORMAT DELIMITED | SERDE
STORED AS
FIELDS / COLLECTION ITEMS / MAP KEYS TERMINATED BY
LOCATION

Page 16:

METADATA

Metadata is stored in a MetaStore database such as Derby, SQL Azure, or SQL Server

View
SHOW TABLES 'ce.*';
DESCRIBE census;
DESCRIBE census.population;
DESCRIBE EXTENDED census;
DESCRIBE FORMATTED census;
SHOW FUNCTIONS "x.*";
SHOW FORMATTED INDEXES ON census;

Page 17:

DATA TYPES

Primitives
Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
Characters: String
Special: Binary, Timestamp

Collections
STRUCT<City:String, State:String> | Struct('Boise', 'Idaho')
ARRAY<String> | Array('Boise', 'Idaho')
MAP<String, String> | Map('City', 'Boise', 'State', 'Idaho')
UNIONTYPE<BigInt, String, Float>

Properties
No fixed lengths
NULL handling depends on SerDe
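To make the collection types concrete, here is a minimal HiveQL sketch. The table name places, its columns, and the chosen delimiters are hypothetical and not from the session; only the type syntax and the TERMINATED BY clauses come from the slides.

```sql
-- Hypothetical table mixing primitive and collection columns.
CREATE TABLE places (
    name     string,
    address  STRUCT<City:string, State:string>,
    tags     ARRAY<string>,
    props    MAP<string, string>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'            -- separates top-level columns
    COLLECTION ITEMS TERMINATED BY ','   -- separates struct fields, array items, map entries
    MAP KEYS TERMINATED BY ':'           -- separates a map key from its value
STORED AS TEXTFILE;

-- Struct fields use dot notation, arrays use a zero-based index, maps use a key lookup.
SELECT name,
       address.City,
       tags[0],
       props['State']
FROM places;
```

Under these assumed delimiters, one line of the text file would hold four tab-separated fields, for example a name, then Boise,Idaho for the struct, then a comma-separated tag list, then City:Boise,State:Idaho for the map.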

Page 18:

STORAGE – EXTERNAL AND INTERNAL

CREATE EXTERNAL TABLE census (…) LOCATION '/user/demo/census';

Alternative locations:
LOCATION 'hdfs:///user/demo/census';
LOCATION 'asv://user/demo/census';

Use EXTERNAL when
Data is also used outside of Hive
Data needs to remain even after a DROP TABLE
You use a custom location such as ASV
Hive should not own the data and control settings, directories, etc.
You are not creating the table based on an existing table (AS SELECT)

Notes
ASV = Azure Storage Vault (blob store)
INTERNAL is NOT a keyword; just leave off EXTERNAL
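The difference in ownership shows up most clearly at DROP TABLE time. The sketch below assumes two hypothetical tables, census_ext and census_managed, with a shortened three-column layout chosen only for illustration.

```sql
-- External table: Hive records metadata only; the files stay where they already are.
CREATE EXTERNAL TABLE census_ext (
    State_FIPS int,
    County_FIPS int,
    Population bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/demo/census';

-- Removes the metastore entry; the files under /user/demo/census remain.
DROP TABLE census_ext;

-- Managed ("internal") table: same DDL without the EXTERNAL keyword.
-- Hive places the data in its own warehouse directory.
CREATE TABLE census_managed (
    State_FIPS int,
    County_FIPS int,
    Population bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Removes the metastore entry and deletes the table's data directory.
DROP TABLE census_managed;
```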

Page 19:

STORAGE – PARTITION AND BUCKET

CREATE EXTERNAL TABLE census (…)
PARTITIONED BY (Year string)
CLUSTERED BY (population) INTO 256 BUCKETS

Partition
Directory for each distinct combination of string partition values
Partition key name cannot be defined in the table itself
Allows partition elimination
Useful in range searches
Can slow performance if the partition is not referenced in the query

Buckets
Split data based on the hash of a column
One HDFS file per bucket within the partition sub-directory
Performance may improve for aggregates and join queries
Sampling
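The two queries below sketch how partitioning and bucketing are typically exploited. They assume the census table was created as on this slide (partitioned by Year, clustered by population into 256 buckets) and that it has state_fips and county_fips columns like the censusP table shown earlier; neither assumption is confirmed by the session.

```sql
-- Partition elimination: Year is the partition column, so only the
-- directory registered for the 2010 partition needs to be scanned.
SELECT state_fips, SUM(population) AS total_population
FROM census
WHERE Year = '2010'
GROUP BY state_fips;

-- Bucket sampling: read roughly 1/256 of the rows by selecting one bucket.
-- Sampling ON the clustering column lets Hive use the bucket layout.
SELECT state_fips, county_fips, population
FROM census TABLESAMPLE (BUCKET 1 OUT OF 256 ON population) s
WHERE Year = '2010';
```

The sampling shortcut only helps if the data was actually written into buckets when it was loaded; declaring CLUSTERED BY on files that were produced elsewhere does not reorganize them.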

Page 20:

STORAGE – FILE FORMATS AND SERDES

CREATE EXTERNAL TABLE census (…)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE | RCFILE | SEQUENCEFILE | AVRO

Format
TEXTFILE is common, useful when data is shared and all alphanumeric
Extensible storage formats via custom input and output formats
Extensible on-disk/in-memory representation via custom SerDes

Page 21:

CREATE INDEX

CREATE INDEX census_population ON TABLE census (population)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE census_population_index;

ALTER INDEX census_population ON census REBUILD;

Key Points
No keys
Index data is another table
Requires REBUILD to include new data
SHOW FORMATTED INDEXES ON MyTable;

Indexing May Help
Avoid many small partitions
GROUP BY

Page 22:

CREATE VIEW

CREATE VIEW censusBigPop (state_fips, county_fips, population) AS
SELECT state_fips, county_fips, population
FROM census
WHERE population > 500000
ORDER BY population;

Sample Code
SELECT * FROM censusBigPop;
DESCRIBE FORMATTED censusBigPop;

Key Points
Not materialized
Can have ORDER BY or LIMIT

Page 23:

QUERY

SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income > 100000
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;

Key Points
Minimal caching, statistics, or optimizer
Generally reads entire data set for every query

Performance
The order of columns and tables can make a difference to performance
Use partition elimination for range filtering

Page 24:

SORTING

ORDER BY: One reducer does the final sort, which can be a big bottleneck

SORT BY: Sorted only within each reducer, much faster

DISTRIBUTE BY: Determines how map data is distributed to reducers

SORT BY + DISTRIBUTE BY = CLUSTER BY: Can mimic ORDER BY, with better performance if the distribution is even
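The four clauses can be compared side by side on the same table. This is a sketch that assumes the census table has state_fips, county_fips, and population columns, as in the earlier slides.

```sql
-- ORDER BY: one global order, produced by a single reducer.
SELECT state_fips, county_fips, population
FROM census
ORDER BY population DESC;

-- SORT BY: each reducer sorts its own output; there is no global order across reducers.
SELECT state_fips, county_fips, population
FROM census
SORT BY population DESC;

-- DISTRIBUTE BY + SORT BY: all rows for a state go to the same reducer,
-- then each reducer sorts its rows by state and descending population.
SELECT state_fips, county_fips, population
FROM census
DISTRIBUTE BY state_fips
SORT BY state_fips, population DESC;

-- CLUSTER BY: shorthand for DISTRIBUTE BY and SORT BY on the same column(s).
SELECT state_fips, county_fips, population
FROM census
CLUSTER BY state_fips;
```

With more than one reducer, only the ORDER BY form guarantees a total order over the whole result; the others trade that guarantee for parallelism.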

Page 25:

JOINS

Supported Hive Join Types
Equality
OUTER: LEFT, RIGHT, FULL
LEFT SEMI

Not Supported
Non-Equality
IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
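The suggested rewrite of an IN subquery can be sketched as follows. The table states_of_interest and its state_fips column are hypothetical names introduced only for this illustration.

```sql
-- The form familiar from other SQL dialects, listed above as not supported:
--   SELECT c.state_fips, c.county_fips, c.population
--   FROM census c
--   WHERE c.state_fips IN (SELECT s.state_fips FROM states_of_interest s);

-- Equivalent LEFT SEMI JOIN: returns each census row at most once
-- when a matching state exists in states_of_interest.
SELECT c.state_fips, c.county_fips, c.population
FROM census c
LEFT SEMI JOIN states_of_interest s
    ON (c.state_fips = s.state_fips);
```

With LEFT SEMI JOIN the right-hand table may only be referenced in the ON clause, not in the SELECT list or WHERE clause, which mirrors the fact that an IN subquery only tests for existence.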

Page 26:

JOINS

Characteristics
Multiple MapReduce jobs unless the same join columns are used in all tables
Put the largest table last in the query to save memory
Joins are done left to right in query order
JOIN ON is completely evaluated before WHERE starts
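A minimal sketch of how these characteristics shape a query follows. The tables county_hospitals and county_schools and their columns are hypothetical; KeyID comes from the censusP definition earlier in the deck, and census is assumed to be the largest of the three tables.

```sql
-- Both joins use h.keyid, so the three-way join can run as a single MapReduce job.
-- The smaller tables come first and are buffered; census, the largest, is last
-- so that its rows are streamed through the reducers instead of held in memory.
SELECT c.state_fips,
       c.county_fips,
       c.population,
       h.hospital_count,
       s.school_count
FROM county_hospitals h
JOIN county_schools s ON (s.keyid = h.keyid)
JOIN census c         ON (c.keyid = h.keyid)
WHERE c.population > 100000;   -- applied after the joins have been evaluated
```

If the second join used a different column of county_hospitals, Hive would generally need a second MapReduce job for it.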

Page 27:

EXPLAIN

EXPLAIN SELECT * FROM census;
EXPLAIN SELECT * FROM census WHERE population > 100000;
EXPLAIN EXTENDED SELECT * FROM census;

Characteristics
Does not execute the query
Shows parsing
Lists stages, temp files, dependencies, modes, output operators, etc.

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1

Page 28:

CONFIGURE HIVE

Configuration
Hive default configuration: <install-dir>/conf/hive-default.xml
Configuration variables: <install-dir>/conf/hive-site.xml
Hive configuration directory: HIVE_CONF_DIR environment variable
Log4j configuration: <install-dir>/conf/hive-log4j.properties
Typical log: c:\Hadoop\hive-0.9.0\logs\hive.log

Page 29:

WHY USE HIVE

BUZZ!
Cross-pollinate your existing SQL skills!
Makes Hadoop cross-correlations, joins, filters easier
Allows storage of intermediate results for faster/easier querying
Batch-based processing
Individual queries are still often slower than in a relational database
E2E insight may be much faster

Page 30:

BI ON BIG DATA

Gain Insights
Mash up Hive + other data in Excel
Hive data source to PowerPivot for in-memory analytics
Power View on top of PowerPivot for spectacular visualizations leading to insights
Securely share on SharePoint for collaboration, re-use, centralized data

Microsoft on top of Hadoop / Hive includes
PowerPivot
Power View
Analysis Services
PDW
StreamInsight
SQL Server
SQL Azure
Excel

Page 31:

BIG DEAL

Hive, Hadoop, Big Data

Analytics to Insights

Page 32:

NEXT STEPS

Get Involved

Read a bit:
http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx
Programming Hive Book
http://blogs.msdn.com/cindygross

Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
Download: Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
Think about how you can fit Big Data into your company data strategy
Suggest uses, be prepared to combat misuses

Page 33:

BIG DATA REFERENCES

Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php

Page 34:

MICROSOFT BIG DATA AT PASS SUMMIT

BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset”
Wednesday 10:15 AM | Darwin Schweitzer

BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
Wednesday 10:15 AM | Cindy Gross, Dipti Sangani, Ed Katibah

AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop
Thursday 10:15 AM | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez

BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI
Thursday 1:30 PM | Denny Lee, Dave Mariani

AD-316-M Harnessing Big Data with Hadoop
Friday 8:00 AM | Mike Flasko

DBA-410-S Big Data Meets SQL Server
Friday 9:45 AM | David DeWitt

AD-315-M NoSQL and Big Data Programmability
Friday 4:15 PM | Michael Rys

Manage

Enrich

Insight

Page 35:

Win prizes with new online evaluations

Build experience with Hands On Labs

NEW: TCC 304

Attend David DeWitt’s spotlight session Big Data Meets SQL Server

DBA-410-S, Room 6E, Friday, 9:45 AM

Be SQL Server 2012 Certified with onsite testing

Room 212-214

Find hidden session announcements by following:

@sqlserver #sqlpass

Visit the SQL Clinic and new “I MADE THAT!” Developer Chalk talks

NEW: 4C-3 & 4C-4

Don’t Miss!

Page 37:

Thank you for attending this session and the 2012 PASS Summit in Seattle

Page 38:

Please fill out evaluations!