November 6-9, Seattle, WA
SQLCAT: Big Data – All Abuzz About Hive
Cindy Gross, SQLCAT BI/Big Data PM
Microsoft, http://blogs.msdn.com/cindygross
Ed Katibah, SQLCAT Spatial PM
Microsoft, http://blogs.msdn.com/b/edkatibah/
BIG AGENDA
Hive, Hadoop, Big Data
Analytics to Insights
A NEW SET OF QUESTIONS
SOCIAL & WEB ANALYTICS: What’s the social sentiment for my brand or products?
LIVE DATA FEEDS: How do I optimize my fleet based on weather and traffic patterns?
ADVANCED ANALYTICS: How do I better predict future outcomes?
NEW OPPORTUNITIES
Revenue Growth
Increases ad revenue by processing 3.5 billion events per day
Massive Volumes
Processes 464 billion rows per quarter, with average query time under 10 secs.
Business Innovation
Measures and ranks online user influence by processing 3 billion signals per day
Cloud Connectivity
Connects across 15 social networks via the cloud for data and API access
Operational Efficiencies
Uses sentiment analysis and web analytics for its internal cloud
GE
Real-Time Insight
Improves operational decision making for IT managers and users
Relational, Non-Relational, Streaming
MANAGE ANY DATA, ANY SIZE, ANYWHERE
Unified Monitoring, Management & Security
Data Movement
VVVVROOM!
Variability – Multiple interpretations
Velocity – Need decisions fast
Variety – Many formats
Volume – beyond what environment can handle
BIG DATA
Schema on Read Not Write
Scale Out for Pay As You Go
BASE Not ACID
MapReduce, Streaming, Machine Learning, Massively Parallel Processing
Too Big, Complex, or Expensive for Current Environment
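"Schema on Read Not Write" can be illustrated with a minimal Python sketch: raw tab-delimited lines are stored untouched, and column names and types are applied only when the data is read. The column names, sample rows, and the rule that bad values become None are illustrative assumptions, not Hive's actual SerDe logic.

```python
# Schema-on-read sketch: the raw file is stored as-is; types are applied at query time.
# Column names and rows are illustrative assumptions, not data from the talk.

RAW_LINES = [
    "53\t33\t1931249",
    "16\t1\t392365",
    "16\tbad\t100",      # malformed value: accepted at write time, None at read time
]

SCHEMA = [("state_fips", int), ("county_fips", int), ("population", int)]


def read_with_schema(lines, schema):
    """Parse each raw line against the schema; unparseable fields become None."""
    rows = []
    for line in lines:
        fields = line.split("\t")
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None   # similar in spirit to a NULL for bad data
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
total = sum(r["population"] for r in rows if r["population"] is not None)
print(total)
```

Nothing is validated when the lines are written; a different schema could be applied to the same lines later without rewriting them.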
BIG DATA REQUIRES AN END-TO-END APPROACH
DATA MANAGEMENT: Relational, Non-relational, Analytical, Streaming
DATA ENRICHMENT: Discover, Combine, Refine
INSIGHT: Self-Service, Collaboration, Corporate Apps, Devices
Hadoop architecture
Distributed Storage (HDFS)
Distributed Processing (MapReduce)
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Metadata (HCatalog)
Data Integration (ODBC / SQOOP / REST)
Business Intelligence (Excel, Power View …)
Machine Learning (Mahout)
Graph (Pegasus)
Stats processing (RHadoop)
Pipeline / workflow (Oozie)
Log file aggregation (Flume)
Active Directory (Security)
System Center
HIVE ARCHITECTURE
Hive
Hive Web Interface (HWI)
Command Line Interface (CLI)
Thrift Server (JDBC, ODBC)
Metastore
HiveQL: Compiler, Optimizer, Executor
Hadoop
Head Node / Name Node
Data Nodes / Task Nodes
DEMO:
Analyzing a Frankenstorm
Behind the Scenes
GET HDINSIGHT
Sign up for Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
CREATE TABLE
CREATE EXTERNAL TABLE censusP (
  State_FIPS int,
  County_FIPS int,
  Population bigint,
  Pop_Age_Over_69 bigint,
  Total_Households bigint,
  Median_Household_Income bigint,
  KeyID string)
COMMENT 'US Census Data'
PARTITIONED BY (Year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
ALTER TABLE censusP ADD PARTITION (Year = '2010') LOCATION '/user/demo/census/2010';
INSIDE A HIVE TABLE
DATA TYPES
EXTERNAL / INTERNAL
PARTITIONED BY | CLUSTERED BY | SKEWED BY
Terminators
ROW FORMAT DELIMITED | SERDE
STORED AS
FIELDS/COLLECTION ITEMS/MAP KEYS TERMINATED BY
LOCATION
METADATA
Metadata is stored in a MetaStore database such as Derby, SQL Azure, or SQL Server
View
SHOW TABLES 'ce.*';
DESCRIBE census;
DESCRIBE census.population;
DESCRIBE EXTENDED census;
DESCRIBE FORMATTED census;
SHOW FUNCTIONS "x.*";
SHOW FORMATTED INDEXES ON census;
DATA TYPES
Primitives
Numbers: Int, SmallInt, TinyInt, BigInt, Float, Double
Characters: String
Special: Binary, Timestamp
Collections
STRUCT<City:String, State:String> | Struct ('Boise', 'Idaho')
ARRAY <String> | Array ('Boise', 'Idaho')
MAP <String, String> | Map ('City', 'Boise', 'State', 'Idaho')
UNIONTYPE <BigInt, String, Float>
Properties
No fixed lengths
NULL handling depends on SerDe
STORAGE – EXTERNAL AND INTERNAL
CREATE EXTERNAL TABLE census(…)
LOCATION '/user/demo/census';
LOCATION 'hdfs:///user/demo/census';
LOCATION 'asv://user/demo/census';
Use EXTERNAL when
Data also used outside of Hive
Data needs to remain even after a DROP TABLE
Use custom location such as ASV
Hive should not own data and control settings, directories, etc.
Not creating table based on existing table (AS SELECT)
And
ASV = Azure Storage Vault (blob store)
INTERNAL is NOT a keyword, just leave off EXTERNAL
STORAGE – PARTITION AND BUCKET
CREATE EXTERNAL TABLE census (…)
PARTITIONED BY (Year string)
CLUSTERED BY (population) INTO 256 BUCKETS
Partition
Directory for each distinct combination of string partition values
Partition key name cannot be defined in table itself
Allows partition elimination
Useful in range searches
Can slow performance if partition is not referenced in query
Buckets
Split data based on hash of a column
One HDFS file per bucket within partition sub-directory
Performance may improve for aggregates and join queries
Sampling
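The partition and bucket points above can be sketched in Python as a rough simulation: each row lands in a directory chosen by its partition value and in a bucket chosen by a hash of the clustering column, and a query that names the partition reads only that directory. The hash function is a stand-in rather than Hive's real one, the sample rows are hypothetical, and 4 buckets are used instead of 256 to keep the output small.

```python
# Sketch of Hive-style partitioning and bucketing (simplified; not Hive's real hash).
from collections import defaultdict

NUM_BUCKETS = 4  # the slide uses 256; a small number keeps the output readable


def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Assign a bucket from a hash of the clustering column (stand-in hash)."""
    return (value & 0x7FFFFFFF) % num_buckets


def layout(rows):
    """Map each row to a (partition directory, bucket number) pair."""
    files = defaultdict(list)
    for row in rows:
        directory = f"/user/demo/census/{row['year']}"
        files[(directory, bucket_for(row["population"]))].append(row)
    return files


def scan(files, year):
    """Partition elimination: only read the directory for the requested year."""
    wanted = f"/user/demo/census/{year}"
    return [r for (d, _), rs in files.items() if d == wanted for r in rs]


rows = [
    {"year": "2000", "population": 1737034},
    {"year": "2010", "population": 1931249},
    {"year": "2010", "population": 392365},
]
files = layout(rows)
print(sorted(files))
print(scan(files, "2010"))
```

A query that does not filter on the year has to visit every directory, which is the slowdown the slide warns about.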
STORAGE – FILE FORMATS AND SERDES
CREATE EXTERNAL TABLE census (…)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE, RCFILE, SEQUENCEFILE, AVRO
Format
TEXTFILE is common, useful when data is shared and all alphanumeric
Extensible storage formats via custom input, output formats
Extensible on disk/in-memory representation via custom SerDes
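As a rough illustration of what a delimited-text SerDe does when a row is read, the Python sketch below splits one stored line into a string, an array, and a map. It assumes Hive's default delimiters (\001 for fields, \002 for collection items, \003 for map keys); the three-column layout and the sample values are hypothetical.

```python
# Sketch of what a delimited-text SerDe does on read (simplified, hypothetical columns).
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"  # Hive's default ^A, ^B, ^C delimiters


def deserialize(line):
    """Split one stored line into (string, array<string>, map<string,string>)."""
    name, cities, attrs = line.split(FIELD)
    city_list = cities.split(ITEM)
    attr_map = dict(pair.split(KEY) for pair in attrs.split(ITEM))
    return name, city_list, attr_map


# Build a sample line the way it would sit in a TEXTFILE.
line = FIELD.join([
    "Idaho",
    ITEM.join(["Boise", "Nampa"]),
    ITEM.join(["City" + KEY + "Boise", "State" + KEY + "Idaho"]),
])
print(deserialize(line))
```

A custom SerDe replaces this parsing step, which is why the on-disk format can change without changing the queries.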
CREATE INDEX
CREATE INDEX census_population ON TABLE census (population)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE census_population_index;
ALTER INDEX census_population ON census REBUILD;
Key Points
No keys
Index data is another table
Requires REBUILD to include new data
SHOW FORMATTED INDEXES ON MyTable;
Indexing May Help
Avoid many small partitions
GROUP BY
CREATE VIEW
CREATE VIEW censusBigPop (state_fips, county_fips, population) AS
SELECT state_fips, county_fips, population
FROM census
WHERE population > 500000
ORDER BY population;
Sample Code
SELECT * FROM censusBigPop;
DESCRIBE FORMATTED censusBigPop;
Key Points
Not materialized
Can have ORDER BY or LIMIT
QUERY
SELECT c.state_fips, c.county_fips, c.population
FROM census c
WHERE c.median_household_income > 100000
GROUP BY c.state_fips, c.county_fips, c.population
ORDER BY county_fips
LIMIT 100;
Key Points
Minimal caching, statistics, or optimizer
Generally reads entire data set for every query
Performance
The order of columns and tables can affect performance
Use partition elimination for range filtering
SORTING
ORDER BY: One reducer does the final sort, which can be a big bottleneck
SORT BY: Sorted only within each reducer, much faster
DISTRIBUTE BY: Determines how map data is distributed to reducers
SORT BY + DISTRIBUTE BY = CLUSTER BY (when both use the same columns): Can mimic ORDER BY, with better performance if the distribution is even
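The difference between these clauses can be sketched with a small Python simulation of reducers. The reducer count, the round-robin assignment used for SORT BY, and the range-based routing rule are all illustrative assumptions; Hive itself distributes rows by hash, so the last case shows only the favorable situation where the distribution lines up with the sort order.

```python
# Simulation of ORDER BY, SORT BY, and DISTRIBUTE BY + SORT BY across reducers.
# Reducer count and the range-based routing rule are illustrative assumptions.

def order_by(values):
    """ORDER BY: a single reducer sorts everything, giving a total order."""
    return sorted(values)


def sort_by(values, num_reducers):
    """SORT BY: rows reach reducers arbitrarily (round-robin here); each sorts its share."""
    reducers = [values[i::num_reducers] for i in range(num_reducers)]
    return [sorted(r) for r in reducers]


def distribute_and_sort(values, route, num_reducers):
    """DISTRIBUTE BY + SORT BY: route() picks the reducer, then each reducer sorts."""
    reducers = [[] for _ in range(num_reducers)]
    for v in values:
        reducers[route(v)].append(v)
    return [sorted(r) for r in reducers]


data = [50, 10, 90, 30, 70, 20]
print(order_by(data))
print(sort_by(data, 2))
print(distribute_and_sort(data, lambda v: 0 if v < 50 else 1, 2))
```

With SORT BY alone, each reducer's output is sorted but the outputs overlap; once rows are routed so that reducer outputs do not overlap, concatenating them gives the same result as ORDER BY without a single-reducer bottleneck.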
JOINS
Supported Hive Join Types
Equality
OUTER - LEFT, RIGHT, FULL
LEFT SEMI
Not Supported
Non-Equality
IN/EXISTS subqueries (rewrite as LEFT SEMI JOIN)
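To show why LEFT SEMI JOIN can stand in for an IN/EXISTS subquery, here is a minimal Python sketch of its semantics: a left row is kept if at least one right row matches, only left columns are returned, and duplicate matches on the right do not duplicate the left row. The table contents and the name states_of_interest are hypothetical.

```python
# LEFT SEMI JOIN semantics: keep left rows that have at least one match on the right,
# return only left columns, and never duplicate a left row. Sample rows are hypothetical.

def left_semi_join(left, right, key):
    """Return left rows whose key value appears in at least one right row."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


census = [
    {"state_fips": 53, "population": 1931249},
    {"state_fips": 16, "population": 392365},
    {"state_fips": 41, "population": 735334},
]
# The right side lists 53 twice; a regular inner join would return that left row twice.
states_of_interest = [{"state_fips": 53}, {"state_fips": 53}, {"state_fips": 41}]

result = left_semi_join(census, states_of_interest, "state_fips")
print(result)
```

This is the same answer as filtering census with state_fips IN (a subquery over the right table).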
JOINS
Characteristics
Multiple MapReduce jobs unless same join columns in all tables
Put largest table last in query to save memory
Joins are done left to right in query order
JOIN ON completely evaluated before WHERE starts
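The point that JOIN ON is evaluated before WHERE matters most for outer joins. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (it is not Hive, and the tables and rows are hypothetical) to show that the same filter gives different results depending on whether it sits in the ON clause or the WHERE clause.

```python
# ON vs WHERE with a LEFT OUTER JOIN, using SQLite as a stand-in for Hive.
# Tables and rows are hypothetical; the outer-join semantics shown are standard SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE county (county_fips INTEGER, name TEXT);
    CREATE TABLE census (county_fips INTEGER, year TEXT, population INTEGER);
    INSERT INTO county VALUES (1, 'A'), (2, 'B');
    INSERT INTO census VALUES (1, '2010', 500), (2, '2000', 300);
""")

# Filter in ON: applied while joining, so unmatched counties survive with NULLs.
filter_in_on = conn.execute("""
    SELECT c.name, s.population
    FROM county c LEFT OUTER JOIN census s
      ON c.county_fips = s.county_fips AND s.year = '2010'
    ORDER BY c.name
""").fetchall()

# Filter in WHERE: applied after the join, so rows with no 2010 data are dropped.
filter_in_where = conn.execute("""
    SELECT c.name, s.population
    FROM county c LEFT OUTER JOIN census s
      ON c.county_fips = s.county_fips
    WHERE s.year = '2010'
    ORDER BY c.name
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

Putting the filter in ON keeps every county and also limits what has to be joined, while putting it in WHERE effectively turns the outer join into an inner join.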
EXPLAIN
EXPLAIN SELECT * FROM census;
EXPLAIN SELECT * FROM census WHERE population > 100000;
EXPLAIN EXTENDED SELECT * FROM census;
Characteristics
Does not execute the query
Shows parsing
Lists stages, temp files, dependencies, modes, output operators, etc.
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME census))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
CONFIGURE HIVE
Configuration
Hive default configuration: <install-dir>/conf/hive-default.xml
Configuration variables: <install-dir>/conf/hive-site.xml
Hive configuration directory: HIVE_CONF_DIR environment variable
Log4j configuration: <install-dir>/conf/hive-log4j.properties
Typical Log: c:\Hadoop\hive-0.9.0\logs\hive.log
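The relationship between the two XML files can be sketched in Python: defaults are read first, and any property that also appears in hive-site.xml overrides the default. The property names are real Hive settings, but the values and the helper names parse and effective_config are illustrative assumptions, and this is only a rough model of how the layering behaves.

```python
# Sketch of hive-site.xml values overriding hive-default.xml (values are illustrative).
import xml.etree.ElementTree as ET

DEFAULT_XML = """<configuration>
  <property><name>hive.exec.scratchdir</name><value>/tmp/hive</value></property>
  <property><name>hive.metastore.warehouse.dir</name><value>/user/hive/warehouse</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>hive.metastore.warehouse.dir</name><value>/hive/warehouse</value></property>
</configuration>"""


def parse(xml_text):
    """Read Hadoop-style <property><name/><value/></property> entries into a dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site-specific settings win."""
    config = parse(default_xml)
    config.update(parse(site_xml))
    return config


config = effective_config(DEFAULT_XML, SITE_XML)
print(config)
```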
WHY USE HIVE
BUZZ!
Cross-pollinate your existing SQL skills!
Makes Hadoop cross-correlations, joins, filters easier
Allows storage of intermediate results for faster/easier querying
Batch based processing
Individual queries still often slower than a relational database
E2E insight may be much faster
BI ON BIG DATA
Gain Insights
Mash-up Hive + other data in Excel
Hive data source to PowerPivot for in-memory analytics
Power View on top of PowerPivot for spectacular visualizations leading to insights
Securely share on SharePoint for collaboration, re-use, centralized data
Microsoft on top of Hadoop / Hive includes PowerPivot, Power View, Analysis Services, PDW, StreamInsight, SQL Server, SQL Azure, Excel
BIG DEAL
Hive, Hadoop, Big Data
Analytics to Insights
NEXT STEPS
Get Involved
Sign up: Windows Azure HDInsight Service http://HadoopOnAzure.com (Cloud CTP)
Download Microsoft HDInsight Server http://microsoft.com/bigdata (On-Prem CTP)
Think about how you can fit Big Data into your company data strategy
Suggest uses, be prepared to combat misuses
Read a bit
http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big-data-learning-resources.aspx
Programming Hive Book
http://blogs.msdn.com/cindygross
BIG DATA REFERENCES
Microsoft Big Data http://microsoft.com/bigdata
Denny Lee http://dennyglee.com/category/bigdata/
Carl Nolan http://tinyurl.com/6wbfxy9
Cindy Gross http://tinyurl.com/SmallBitesBigData
Hadoop: The Definitive Guide by Tom White
SQL Server Sqoop http://bit.ly/rulsjX
JavaScript http://bit.ly/wdaTv6
Twitter https://twitter.com/#!/search/%23bigdata
Hive http://hive.apache.org
Excel to Hadoop via Hive ODBC http://tinyurl.com/7c4qjjj
Hadoop On Azure Videos http://tinyurl.com/6munnx2
Klout http://tinyurl.com/6qu9php
MICROSOFT BIG DATA AT PASS SUMMIT
BIA-204-M MAD About Data: Solve Problems and Develop a “Data Driven Mindset” Wednesday 1015am | Darwin Schweitzer
BIA-306-M How Klout Changed the Landscape of Social Media with Hadoop and BI Thursday 130pm | Denny Lee, Dave Mariani
AD-316-M Harnessing Big Data with Hadoop Friday 8am | Mike Flasko
DBA-410-S Big Data Meets SQL Server Friday 945am | David DeWitt
AD-300-M Bootstrapping Data Warehousing in Azure for Use with Hadoop Thursday 1015am | Steve Howard, James Podgorski, Olivier Matrat, Rafael Fernandez
BIA-305-A SQLCAT: Big Data – All Abuzz About Hive Wednesday 1015am | Cindy Gross, Dipti Sangani, Ed Katibah
AD-315-M NoSQL and Big Data Programmability Friday 415pm | Michael Rys
Manage
Enrich
Insight
Don’t Miss!
Win prizes with new online evaluations
Build experience with Hands On Labs
NEW: TCC 304
Attend David DeWitt’s spotlight session Big Data Meets SQL Server
DBA-410-S, Room 6E, Friday, 9:45 AM
Be SQL Server 2012 Certified with onsite testing
Room 212-214
Find hidden session announcements by following:
@sqlserver #sqlpass
Visit the SQL Clinic and new “I MADE THAT!” Developer Chalk talks
NEW: 4C-3 & 4C-4
PASS Resources
Free SQL Server and BI training Free 1-day Training Events Regional Event
Local and Virtual User Groups Free Online Technical Training
Learning Center
This is Community
Thank you for attending this session and the 2012 PASS Summit in Seattle
Dipti Sangani, SQL Big Data PM
Cindy Gross, SQLCAT BI/Big Data PM
Microsoft, http://blogs.msdn.com/cindygross
Ed Katibah, SQLCAT Spatial PM
Microsoft, http://blogs.msdn.com/b/edkatibah/
Please fill out evaluations!