Big Data & SQL: The On-Ramp to Hadoop
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Big Data & SQL: The On-Ramp to Hadoop
Twitter Tag: #briefr
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Big Data
Analyst: Robin Bloor
Robin Bloor is Chief Analyst at The Bloor Group
[email protected] @robinbloor
HP Vertica
• Vertica was founded in 2005 by Michael Stonebraker and Andrew Palmer; it was acquired by HP in 2011
• HP Vertica Analytics Platform is a grid-based, column-oriented database management system
• The latest release, Version 7, offers new platform components that allow for Hadoop exploration and analysis using SQL
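As a rough illustration of what "column-oriented" means in the bullet above, the following minimal sketch pivots row records into per-column lists and aggregates over a single column. The sample rows and the `to_columns` helper are hypothetical and are not taken from the presentation or from HP Vertica; a real column store adds compression, sorting, and distribution on top of this idea.

```python
from collections import defaultdict

# Hypothetical sample rows; not taken from the presentation.
rows = [
    {"user": "a", "region": "east", "clicks": 3},
    {"user": "b", "region": "west", "clicks": 5},
    {"user": "c", "region": "east", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# which is the basic appeal of a columnar layout for analytics.
total_clicks = sum(columns["clicks"])
print(total_clicks)
```

In a row-oriented layout the same sum would have to read every field of every row, which is why analytic queries that touch a few columns tend to favor the columnar form.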
Guests
Eamon O’Neill, Manager, Product Management, HP Vertica
Eamon leads the product management efforts for the HP Vertica Analytics Platform. He has more than 15 years of high-tech product management experience and deep knowledge of mobile applications, software-defined networking and storage, database marketing, and distributed systems. Eamon had a founding role in the creation of the cloud services platform at BladeLogic (now BMC Software). In addition to BMC Software, Eamon held product management, software engineering, and business consulting roles at Hitachi Data Systems, Unica (now IBM), and Cambridge Technology Partners.
Jeff Healey, Director of Product Marketing, HP Vertica
Jeff leads the product marketing and customer marketing efforts for the HP Vertica Analytics Platform. Jeff has more than 15 years of high-tech marketing experience and deep knowledge of messaging, positioning, and content development. Jeff previously led product marketing initiatives for Axeda Corporation, an M2M platform for sensor data and the Internet of Things. Prior to Axeda, Jeff held product marketing, customer marketing, and lead editorial roles at The MathWorks, Macromedia (now Adobe), Sybase (now SAP), and The Boston Globe.
© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Big Data & SQL: The On-Ramp to Hadoop
With the HP Vertica Analytics Platform
Jeff Healey, Director of Product Marketing, HP Vertica
Eamon O’Neill, Director of Product Management, HP Vertica
A New Era of Accelerated Innovation
Forever changing how consumers and businesses interact, enabling new opportunities
Drivers: Growing Internet of Things (IoT), Pervasive Connectivity, Explosion of Information, Smart Device Expansion
Every 60 seconds:
• 695,000 status updates
• 98,000+ tweets
• 698,445 Google searches
• 1,820TB of data created
• 11 million instant messages
• 168 million+ emails sent
• 217 new mobile web users
2013, by 2020:
• Devices: 30 Billion(1)
• Data: 40 Trillion GB(2)
• Mobile Apps: 10 Million(3)
• … for 8 Billion(4)
(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org
The Time is Now
[Figure: Accuracy and Insight versus Data Volumes. Regions: Traditional Enterprise Data, Big Data, Dark Data. Source labels: CRM, ERP, Data Warehouse, Web, Social, Log Files, Machine Data, Semi-structured, Unstructured.]
HP Vertica Analytics Platform
[Architecture diagram. Component labels:]
• Vertica Flex Zone, Vertica Enterprise
• HP ConvergedSystem 300 for Vertica
• MPP Shared Nothing
• High Availability & Redundancy
• Column Store Optimizer & Execution Engine
• Database Designer
• Management Console
• JSON, CEF, Delimited
• HDFS, Hcatalog, Flume, Files
• API & SDK (supports R, C++, Java)
• Time Series
• Analytics Functions
• Distributed R
• SQL
• ODBC, JDBC
• Search Functionality
• Geospatial & Sentiment
• Key Value API
• HP BSM & Security
• Community, BI Ecosystem, 3rd party apps, Marketplace
The Richest, Most Open SQL on Hadoop
Challenge: Extracting data from Hadoop requires complex and brittle ETL processes
Solution: Hadoop Navigation and Analytics
Benefits:
• Navigate Hadoop data using its native catalog
• Quickly and easily load native data types from Hadoop to Vertica
• Avoid creating and maintaining time-consuming schemas
• Use the full power of HP Vertica SQL and analytics
• Choose your own Hadoop distribution
HP Vertica and MapR Solution
Optimized, interactive SQL-on-Hadoop solution for fastest value from big data
• Complete SQL-on-Hadoop Solution
• Broader Analytics Capabilities
• Lower TCO & Manageability
• Enterprise-Grade Reliability
[Figure: HP Vertica Analytics Platform on MapR]
Most Complete SQL on Hadoop
HP Vertica on MapR
• Full interactive ANSI SQL on Hadoop
• More complete SQL maturity
• Clients can leverage existing SQL skills
• Handling complex joins and advanced analytic functions, query optimization, and many concurrent users
• Certified integration with BI/visualization environments
• Dynamic handling of mixed workloads
Limited “Query on Hadoop” Options
• Supports a limited subset of HiveQL(1)
• HiveQL is not as mature as SQL
• Requires new skills
• Immature query optimization for planning efficient joins and for processing
• Onus on customer to integrate with BI/visualization environments
• Lack of workload management for high number of concurrent users
(1) HiveQL is a SQL-like dialect, a subset of ANSI SQL
Accelerating Drug Discovery
Innovative Healthcare Products Company
The problem:
• Analyzing gene variants using SNPs and microarray data
The solution:
• Hadoop to find the variants between a sample sequence and a reference genome
• HP Vertica to determine oncology targets
• Tools: Pipeline Pilot, Spotfire, R
The value:
• Queries went from 5 hours to 5 minutes
• Scale to 100s of TB of data
• More experiments => faster discoveries!
HP Vertica Flex Zone
Load, manage, and explore semi-structured data
Avoid creating and maintaining time-consuming schemas
• Faster SQL querying on semi-structured data
• Auto-schematization: semi-structured data loading
• Flexible parsers for JSON and delimited data
• One-step schema for blazing-fast performance
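The idea of loading semi-structured data without declaring a schema first can be sketched in a few lines. The example below is only an illustration under stated assumptions: the sample records and the `infer_schema` and `project` helpers are hypothetical, and nothing here reflects how HP Vertica Flex Zone is actually implemented. It derives a schema from the JSON records themselves and fills in missing fields at read time.

```python
import json

# Hypothetical semi-structured input: one JSON object per line,
# with fields that vary from record to record.
raw_lines = [
    '{"id": 1, "event": "click", "page": "/home"}',
    '{"id": 2, "event": "search", "query": "hadoop"}',
    '{"id": 3, "event": "click", "page": "/docs", "ms": 41.5}',
]


def infer_schema(lines):
    """Return {field: type name} for the union of fields seen."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            seen = type(value).__name__
            # Fall back to "str" if a field shows conflicting types.
            if schema.get(key, seen) != seen:
                seen = "str"
            schema[key] = seen
    return schema


def project(lines, fields):
    """Read records at query time, filling missing fields with None."""
    return [tuple(json.loads(line).get(f) for f in fields) for line in lines]


schema = infer_schema(raw_lines)
clicks = [r for r in project(raw_lines, ["id", "event", "page"]) if r[1] == "click"]
print(schema)
print(clicks)
```

The trade-off in this kind of approach is that parsing happens at read time, so a field that is queried often is usually worth promoting to a real typed column later.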
Exploring the Value of Dark Data
Leading online source for health and medical news and information
Challenges
• Takes 900 hours per year to ingest semi-structured data for analysis
• As requirements change, must again “re-structure” the data for exploration
• Must meet ever-increasing requests for analytic insight in short timeframes
HP Vertica Flex Zone Solution
• Slash development time by eliminating schema creation
• Explore data with existing BI/visualization tools for maximum insight
• Operationalize data in one single step for fast analytics
• Focus team on data analysis (not wrestling with data formats)
Thank you!
Jeff Healey [email protected] 617-386-4591
Eamon O’Neill [email protected] 617-386-4604
Perceptions & Questions
Analyst: Robin Bloor
The Data Reservoir
Robin Bloor, Ph.D.
Hadoop as the Data Reservoir
Big Data and the Data Reservoir
The Workload Paradigm Shift
• Previously, we viewed database workloads as an I/O optimization problem
• With analytics, the workload is a highly variable mix of I/O and calculation
• No databases were built precisely for this, not even Big Data databases
A Process, Not an Activity
• Data analytics is a multi-disciplinary end-to-end process
• Until recently it was a walled garden, but the walls were torn down by:
  • Data availability
  • Scalable technology
  • Open source tools
• Hadoop has a role here
The Hadoop Ecosystem
• Even though it may not seem so, Hadoop is in its infancy
• Hadoop’s popularity guarantees its future
• Its future is also guaranteed by its commercial ecosystem
• What do you see as the fundamental division of workload between Hadoop, Flex Zone, and Vertica?
• Which specific components of the Hadoop ecosystem do you recommend using?
• Do you support JSON? If so, for which contexts and in what way?
• Is there any special optimization in Vertica between query and analytical workloads?
• Please describe the discovery and definition of metadata from Hadoop, through Flex Zone, and into Vertica
• Why do you think Hadoop is important from a technical perspective?
Upcoming Topics
www.insideanalysis.com
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: BIG DATA
May: DATABASE
June: ANALYTICS & MACHINE LEARNING
THANK YOU for your ATTENTION!