November 6-9, Seattle, WA SQLCAT: Big Data – All Abuzz About Hive Dipti Sangani SQL Big Data PM...

November 6-9, Seattle, WA

SQLCAT: Big Data – All Abuzz About Hive

Cindy Gross, SQLCAT BI/Big Data PM, Microsoft
http://blogs.msdn.com/cindygross
https://twitter.com/sqlcindy

Ed Katibah, SQLCAT Spatial PM, Microsoft
http://blogs.msdn.com/b/edkatibah/
https://twitter.com/spatial_ed


BIG AGENDA

Hive, Hadoop, Big Data

Analytics to Insights

A NEW SET OF QUESTIONS

SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?
LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?
ADVANCED ANALYTICS: How do I better predict future outcomes?

NEW OPPORTUNITIES

Revenue Growth

Increases ad revenue by processing 3.5 billion events per day

Massive Volumes

Processes 464 billion rows per quarter, with average query time under 10 secs.

Business Innovation

Measures and ranks online user influence by processing 3 billion signals per day

Cloud Connectivity

Connects across 15 social networks via the cloud for data and API access

Operational Efficiencies

Uses sentiment analysis and web analytics for its internal cloud

GE

Real-Time Insight

Improves operational decision making for IT managers and users

Relational, Non-Relational, Streaming

MANAGE ANY DATA, ANY SIZE, ANYWHERE

Unified Monitoring, Management & Security

Data Movement

VVVVROOM!

Variability – Multiple interpretations

Velocity – Need decisions fast

Variety – Many formats

Volume – beyond what environment can handle

BIG DATA

Schema on Read Not Write

Scale Out for Pay As You Go

BASE Not ACID

MapReduce, Streaming, Machine Learning, Massively Parallel Processing

Too Big, Complex, or Expensive for Current Environment

BIG DATA REQUIRES AN END-TO-END APPROACH

INSIGHT: Self-Service, Collaboration, Corporate Apps, Devices

DATA ENRICHMENT: Discover, Combine, Refine

DATA MANAGEMENT: Relational, Non-relational, Streaming, Analytical

Hadoop architecture.

Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Business Intelligence (Excel, Power View, …)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
Active Directory (Security)
System Center

HIVE ARCHITECTURE

Hive
  Hive Web Interface (HWI)
  Command Line Interface (CLI)
  JDBC, ODBC
  Thrift Server
  Metastore
  HiveQL
  Compiler, Optimizer, Executor

Hadoop
  Head Node / Name Node
  Data Nodes / Task Nodes

DEMO:

Analyzing a Frankenstorm

Behind the Scenes

GET HDINSIGHT

Sign up for Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)

Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)

CREATE TABLE

CREATE EXTERNAL TABLE censusP (
    State_FIPS int,
    County_FIPS int,
    Population bigint,
    Pop_Age_Over_69 bigint,
    Total_Households bigint,
    Median_Household_Income bigint,
    KeyID string)
COMMENT 'US Census Data'
PARTITIONED BY (Year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

ALTER TABLE censusP ADD PARTITION (Year = '2010')
LOCATION '/user/demo/census/2010';
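The censusP definition only describes how to interpret files that already sit in the partition directory; nothing is checked when the files are written. The Python sketch below illustrates that schema-on-read idea. The sample rows are invented for illustration, and the parsing is a simplified stand-in for what a delimited-text SerDe does, not Hive's actual code.

```python
# Column names and types copied from the censusP DDL.
SCHEMA = [
    ("State_FIPS", int),
    ("County_FIPS", int),
    ("Population", int),
    ("Pop_Age_Over_69", int),
    ("Total_Households", int),
    ("Median_Household_Income", int),
    ("KeyID", str),
]


def read_row(line, partition_year):
    """Apply the schema to one tab-delimited line at read time."""
    fields = line.rstrip("\n").split("\t")
    row = {}
    for i, (name, cast) in enumerate(SCHEMA):
        raw = fields[i] if i < len(fields) else None
        try:
            row[name] = cast(raw) if raw is not None else None
        except ValueError:
            # Assumption: a field that does not fit its declared type reads as NULL.
            row[name] = None
    # The partition column comes from the directory, not from the file contents.
    row["Year"] = partition_year
    return row


# Hypothetical data lines; the values are invented for this example.
parsed = read_row("16\t1\t392365\t30000\t150000\t55000\tID-16-001", "2010")
bad = read_row("16\tnot-a-number\t392365\t30000\t150000\t55000\tID-16-001", "2010")
```

The malformed second line still loads; only the field that fails its type conversion comes back empty, which is the kind of surprise schema on read can produce.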

INSIDE A HIVE TABLE

DATA TYPES
EXTERNAL / INTERNAL
PARTITIONED BY | CLUSTERED BY | SKEWED BY
Terminators
  ROW FORMAT DELIMITED | SERDE
  STORED AS
  FIELDS/COLLECTION ITEMS/MAP KEYS TERMINATED BY
LOCATION

METADATA

Metadata is stored in a MetaStore database such as Derby, SQL Azure, or SQL Server

View
  SHOW TABLES 'ce.*';
  DESCRIBE census;
  DESCRIBE census.population;
  DESCRIBE EXTENDED census;
  DESCRIBE FORMATTED census;
  SHOW FUNCTIONS "x.*";
  SHOW FORMATTED INDEXES ON census;

DATA TYPES

Primitives
  Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
  Characters: String
  Special: Binary, Timestamp

Collections
  STRUCT<City:String, State:String> | Struct ('Boise', 'Idaho')
  ARRAY <String> | Array ('Boise', 'Idaho')
  MAP <String, String> | Map ('City', 'Boise', 'State', 'Idaho')
  UNIONTYPE <BigInt, String, Float>

Properties
  No fixed lengths
  NULL handling depends on SerDe

STORAGE – EXTERNAL AND INTERNAL

CREATE EXTERNAL TABLE census (…)
    LOCATION '/user/demo/census';

Alternative LOCATION values:
    LOCATION 'hdfs:///user/demo/census';
    LOCATION 'asv://user/demo/census';

Use EXTERNAL when
  Data is also used outside of Hive
  Data needs to remain even after a DROP TABLE
  Using a custom location such as ASV
  Hive should not own the data or control settings, directories, etc.
  Not creating the table based on an existing table (AS SELECT)

And
  ASV = Azure Storage Vault (blob store)
  INTERNAL is NOT a keyword, just leave off EXTERNAL

STORAGE – PARTITION AND BUCKET

CREATE EXTERNAL TABLE census (…)
PARTITIONED BY (Year string)
CLUSTERED BY (population) INTO 256 BUCKETS

Partition
  Directory for each distinct combination of string partition values
  Partition key name cannot be defined in table itself
  Allows partition elimination
  Useful in range searches
  Can slow performance if partition is not referenced in query
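Partition elimination can be pictured as choosing which directories to read before any data is touched. The Python sketch below models that choice. The 2000 partition and the helper name directories_to_scan are hypothetical and exist only for this illustration; only the 2010 path appears in the deck.

```python
# Hypothetical partition layout: one directory per Year value, as registered
# by ALTER TABLE ... ADD PARTITION (Year = '...') LOCATION '...'.
PARTITIONS = {
    "2000": "/user/demo/census/2000",
    "2010": "/user/demo/census/2010",
}


def directories_to_scan(year_filter=None):
    """Return the directories a query must read.

    year_filter is a predicate on the partition value, or None when the
    query does not mention the partition column at all.
    """
    if year_filter is None:
        # No predicate on Year: every partition directory is read.
        return sorted(PARTITIONS.values())
    return sorted(path for year, path in PARTITIONS.items() if year_filter(year))


# WHERE Year = '2010' touches one directory.
pruned = directories_to_scan(lambda year: year == "2010")
# A range filter works too; partition values are strings, so this is a string comparison.
ranged = directories_to_scan(lambda year: year >= "2005")
# A query without a Year predicate reads everything.
full_scan = directories_to_scan()
```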

Buckets
  Split data based on hash of a column
  One HDFS file per bucket within partition sub-directory
  Performance may improve for aggregates and join queries
  Sampling
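Bucketing can be sketched as "hash the clustering column, take it modulo the bucket count, and write each group to its own file". The Python below is a minimal model under one loud assumption: the integer value itself stands in for the hash, whereas Hive applies its own hash function. The rows are invented.

```python
NUM_BUCKETS = 256  # matches the 256 buckets in the CLUSTERED BY clause


def bucket_for(population, num_buckets=NUM_BUCKETS):
    """Pick a bucket from a hash of the clustering column.

    Assumption: the integer value itself stands in for the hash;
    Hive uses its own hash function on the column value.
    """
    return population % num_buckets


def assign_buckets(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket number; each group would become one file."""
    buckets = {}
    for row in rows:
        buckets.setdefault(bucket_for(row["population"], num_buckets), []).append(row)
    return buckets


def sample_bucket(buckets, bucket_number):
    """Sampling reads a single bucket instead of the whole table."""
    return buckets.get(bucket_number, [])


# Invented rows for illustration only.
rows = [{"county": n, "population": p} for n, p in enumerate([256, 512, 1000, 1001, 7])]
buckets = assign_buckets(rows)
```

Because rows with the same hashed value always land in the same bucket, a join or aggregate on that column can work bucket by bucket, and a sample can read one bucket file rather than the full table.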

STORAGE – FILE FORMATS AND SERDES

CREATE EXTERNAL TABLE census (…)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE | RCFILE | SEQUENCEFILE | AVRO

Format
  TEXTFILE is common, useful when data is shared and all alphanumeric
  Extensible storage formats via custom input, output formats
  Extensible on disk/in-memory representation via custom SerDes

CREATE INDEX

CREATE INDEX census_population ON TABLE census (population)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE census_population_index;

ALTER INDEX census_population ON census REBUILD;

Key Points
  No keys
  Index data is another table
  Requires REBUILD to include new data
  SHOW FORMATTED INDEXES ON MyTable;

Indexing May Help
  Avoid many small partitions
  GROUP BY

CREATE VIEW

CREATE VIEW censusBigPop (state_fips, county_fips, population) AS
SELECT state_fips, county_fips, population
FROM census
WHERE population > 500000
ORDER BY population;

Sample Code
  SELECT * FROM censusBigPop;
  DESCRIBE FORMATTED censusBigPop;

Key Points
  Not materialized
  Can have ORDER BY or LIMIT

QUERY

SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income > 100000
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;

Key Points
  Minimal caching, statistics, or optimizer
  Generally reads entire data set for every query

Performance
  The order of columns and tables can make a difference to performance
  Use partition elimination for range filtering

SORTING

ORDER BY: One reducer does final sort, can be a big bottleneck

SORT BY: Sorted only within each reducer, much faster

DISTRIBUTE BY: Determines how map data is distributed to reducers

SORT BY + DISTRIBUTE BY = CLUSTER BY: Can mimic ORDER BY, better perf if even distribution
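The difference between these clauses is easier to see with a toy model of reducers. The Python sketch below is not Hive code: the modulo partitioner, the reducer count, and the rows are all assumptions made for illustration. It shows that DISTRIBUTE BY plus SORT BY yields sorted output per reducer, while only ORDER BY gives a single total order.

```python
def distribute_by(rows, key, num_reducers):
    """DISTRIBUTE BY: rows with the same key go to the same reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys, with modulo standing in for the hash partitioner.
        reducers[row[key] % num_reducers].append(row)
    return reducers


def sort_by(reducers, key):
    """SORT BY: each reducer sorts only its own rows."""
    return [sorted(part, key=lambda row: row[key]) for part in reducers]


def order_by(rows, key):
    """ORDER BY: a single reducer sorts everything for a total order."""
    return sorted(rows, key=lambda row: row[key])


# Invented rows for illustration only.
rows = [{"state_fips": s, "population": p}
        for s, p in [(16, 900), (53, 100), (16, 300), (41, 700), (53, 500)]]

# DISTRIBUTE BY state_fips SORT BY population, with 2 reducers.
per_reducer = sort_by(distribute_by(rows, "state_fips", 2), "population")
# ORDER BY population.
total = order_by(rows, "population")

# Reading the reducer outputs one after another is not a total order.
concatenated = [row["population"] for part in per_reducer for row in part]
```

When the distribute and sort columns are the same, the pair collapses into CLUSTER BY; the per-reducer work is spread evenly only if the key values themselves are evenly distributed.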

JOINS

Supported Hive Join Types
  Equality
  OUTER - LEFT, RIGHT, FULL
  LEFT SEMI

Not Supported
  Non-Equality
  IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
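A LEFT SEMI JOIN can replace an IN or EXISTS subquery because it returns each left-side row at most once and exposes no right-side columns. The Python sketch below models those semantics only; the function name, table contents, and population figures are invented for illustration.

```python
def left_semi_join(left_rows, right_rows, key):
    """Return left rows that have at least one match on key in right_rows.

    Each left row appears at most once and no right-side columns are
    returned, which is what makes this a stand-in for IN / EXISTS.
    """
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


# Invented rows for illustration only.
census = [
    {"state_fips": 16, "county_fips": 1, "population": 392365},
    {"state_fips": 53, "county_fips": 33, "population": 1931249},
    {"state_fips": 41, "county_fips": 51, "population": 735334},
]
# Duplicate keys on the right side must not duplicate left rows.
states_of_interest = [{"state_fips": 16}, {"state_fips": 53}, {"state_fips": 53}]

matched = left_semi_join(census, states_of_interest, "state_fips")
```

An ordinary inner join against states_of_interest would return the state 53 row twice; the semi join does not, matching what an IN subquery would give.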

JOINS

Characteristics
  Multiple MapReduce jobs unless same join columns in all tables
  Put largest table last in query to save memory
  Joins are done left to right in query order
  JOIN ON completely evaluated before WHERE starts

EXPLAIN

EXPLAIN SELECT * FROM census;
EXPLAIN SELECT * FROM census WHERE population > 100000;
EXPLAIN EXTENDED SELECT * FROM census;

Characteristics
  Does not execute the query
  Shows parsing
  Lists stages, temp files, dependencies, modes, output operators, etc.

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))

STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1

CONFIGURE HIVE

Configuration
  Hive default configuration: <install-dir>/conf/hive-default.xml
  Configuration variables: <install-dir>/conf/hive-site.xml
  Hive configuration directory: HIVE_CONF_DIR environment variable
  Log4j configuration: <install-dir>/conf/hive-log4j.properties
  Typical log: c:\Hadoop\hive-0.9.0\logs\hive.log

WHY USE HIVE

BUZZ!
  Cross-pollinate your existing SQL skills!
  Makes Hadoop cross-correlations, joins, filters easier
  Allows storage of intermediate results for faster/easier querying
  Batch based processing
  Individual queries still often slower than a relational database
  E2E insight may be much faster

BI ON BIG DATA

Gain Insights
  Mash-up Hive + other data in Excel
  Hive data source to PowerPivot for in-memory analytics
  Power View on top of PowerPivot for spectacular visualizations leading to insights
  Securely share on SharePoint for collaboration, re-use, centralized data

Microsoft on top of Hadoop / Hive includes
  PowerPivot, Power View, Analysis Services, PDW, StreamInsight, SQL Server, SQL Azure, Excel

BIG DEAL

Hive, Hadoop, Big Data

Analytics to Insights

NEXT STEPS

Read a bit
  http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx
  Programming Hive Book http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335
  http://blogs.msdn.com/cindygross

Get Involved
  Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
  Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
  Think about how you can fit Big Data into your company data strategy
  Suggest uses, be prepared to combat misuses





BIG DATA REFERENCES

Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php



MICROSOFT BIG DATA AT PASS SUMMIT

BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset”
Wednesday, 10:15 AM | Darwin Schweitzer

BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
Wednesday, 10:15 AM | Cindy Gross, Dipti Sangani, Ed Katibah

AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop
Thursday, 10:15 AM | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez

BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI
Thursday, 1:30 PM | Denny Lee, Dave Mariani

AD-316-M Harnessing Big Data with Hadoop
Friday, 8:00 AM | Mike Flasko

DBA-410-S Big Data Meets SQL Server
Friday, 9:45 AM | David DeWitt

AD-315-M NoSQL and Big Data Programmability
Friday, 4:15 PM | Michael Rys

Manage

Enrich

Insight

http://www.sqlpass.org/summit/2012/Sessions/SessionDetails.aspx?sid=3045













Win prizes with new online evaluations

Build experience with Hands On Labs

NEW: TCC 304

Attend David DeWitt’s spotlight session Big Data Meets SQL Server

DBA-410-S, Room 6E, Friday, 9:45 AM

Be SQL Server 2012 Certified with onsite testing

Room 212-214

Find hidden session announcements by following:

@sqlserver #sqlpass

Visit the SQL Clinic and new “I MADE THAT!” Developer Chalk talks

NEW: 4C-3 & 4C-4

Don’t Miss!

PASS Resources

Free SQL Server and BI training
Free 1-day Training Events
Regional Event
Local and Virtual User Groups
Free Online Technical Training
Learning Center

This is Community

http://www.sqlpass.org/Events/24HoursofPASS.aspx
http://www.sqlsaturday.com/
http://www.sqlpass.org/Events/PASSSQLRally.aspx
http://www.sqlpass.org/LearningCenter.aspx
http://www.sqlpass.org/PASSChapters.aspx
http://www.sqlpass.org/PASSChapters/VirtualChapters.aspx
http://www.sqlpass.org/summit/2012/

Thank you for attending this session and the 2012 PASS Summit in Seattle

Dipti Sangani, SQL Big Data PM

Cindy Gross, SQLCAT BI/Big Data PM, Microsoft
http://blogs.msdn.com/cindygross

Ed Katibah, SQLCAT Spatial PM, Microsoft
http://blogs.msdn.com/b/edkatibah/

Please fill out evaluations!






