Big Data & SQL: The On-Ramp to Hadoop

description

The Briefing Room with Dr. Robin Bloor and HP Vertica. Live Webcast on April 15, 2014.

Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=ad9b301880f27007e836560cf3cd8904

Hadoop has emerged as a chief solution for big data challenges, and businesses are eager to capture the potential value from the vast pools of newly available information assets. But when the data lake is composed of semi-structured data (click stream logs, sensor data, text files), access and performance can be more difficult to achieve. One way to clear the hurdle is by using an analytic platform, specifically one that has been designed to enable exploration of Hadoop data using standard SQL queries.

Register for this episode of The Briefing Room to hear from veteran Analyst Robin Bloor as he explains the role big data plays in enterprise analytics. He’ll be briefed by Jeff Healey and Eamon O'Neill of HP Vertica, who will tout their company’s Flex Zone, a new component of its Analytics Platform. They will discuss how Flex Zone empowers data scientists and business analysts by tapping into Hadoop via SQL, providing a one-stop shop for real-time analytics on massive volumes of data.

Visit InsideAnalysis.com for more information.

Transcript of Big Data & SQL: The On-Ramp to Hadoop

Page 1: Big Data & SQL: The On-Ramp to Hadoop

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Big Data & SQL: The On-Ramp to Hadoop

The Briefing Room

Big Data & SQL: The On-Ramp to Hadoop

Page 3: Big Data & SQL: The On-Ramp to Hadoop

Twitter Tag: #briefr

The Briefing Room

Welcome

Host: Eric Kavanagh

[email protected] @eric_kavanagh

Page 4: Big Data & SQL: The On-Ramp to Hadoop

Mission

• Reveal the essential characteristics of enterprise software, good and bad

• Provide a forum for detailed analysis of today’s innovative technologies

• Give vendors a chance to explain their product to savvy analysts

• Allow audience members to pose serious questions... and get answers!

Page 5: Big Data & SQL: The On-Ramp to Hadoop

Topics

This Month: BIG DATA

May: DATABASE

June: ANALYTICS & MACHINE LEARNING

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

Page 6: Big Data & SQL: The On-Ramp to Hadoop

Big Data

Page 7: Big Data & SQL: The On-Ramp to Hadoop

Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

[email protected] @robinbloor

Page 8: Big Data & SQL: The On-Ramp to Hadoop

HP Vertica

• Vertica was founded in 2005 by Michael Stonebraker and Andrew Palmer; it was acquired by HP in 2011

• HP Vertica Analytics Platform is a grid-based, column-oriented database management system

• The latest release, Version 7, offers new platform components that allow for Hadoop exploration and analysis using SQL

Page 9: Big Data & SQL: The On-Ramp to Hadoop

Guests

Eamon O'Neill, Manager, Product Management, HP Vertica

Eamon leads the product management efforts for the HP Vertica Analytics Platform. He has more than 15 years of high-tech product management experience and deep knowledge of mobile applications, software-defined networking and storage, database marketing, and distributed systems. Eamon had a founding role in the creation of the cloud services platform at BladeLogic (now BMC Software). In addition to BMC Software, Eamon held product management, software engineering, and business consulting roles at Hitachi Data Systems, Unica (now IBM), and Cambridge Technology Partners.

Jeff Healey, Director of Product Marketing, HP Vertica

Jeff leads the product marketing and customer marketing efforts for the HP Vertica Analytics Platform. Jeff has more than 15 years of high-tech marketing experience and deep knowledge in messaging, positioning, and content development. Jeff previously led product marketing initiatives for Axeda Corporation, an M2M platform for sensor data and the Internet of Things. Prior to Axeda, Jeff held product marketing, customer marketing, and lead editorial roles at The MathWorks, Macromedia (now Adobe), Sybase (now SAP), and The Boston Globe.

Page 10: Big Data & SQL: The On-Ramp to Hadoop

© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Big Data & SQL: The On-Ramp to Hadoop

With the HP Vertica Analytics Platform

Jeff Healey, Director of Product Marketing, HP Vertica

Eamon O’Neill, Director of Product Management, HP Vertica

Page 11: Big Data & SQL: The On-Ramp to Hadoop

A New Era of Accelerated Innovation

Forever changing how consumers and businesses interact, enabling new opportunities

Every 60 seconds

• 695,000 status updates

• 98,000+ tweets

• 698,445 Google searches

• 1,820 TB of data created

• 11 million instant messages

• 168 million+ emails sent

• 217 new mobile web users

Growing Internet of Things (IoT)

Pervasive Connectivity

Explosion of Information

Smart Device Expansion

2013 / By 2020

• Devices: 30 Billion(1)

• DATA: 40 Trillion GB(2)

• Mobile Apps: 10 Million(3)

• … for 8 Billion(4)

(1) IDC Directions 2013: Why the Datacenter of the Future Will Leverage a Converged Infrastructure, March 2013, Matt Eastwood; (2) & (3) IDC Predictions 2012: Competing for 2020, Document 231720, December 2011, Frank Gens; (4) http://en.wikipedia.org

Page 12: Big Data & SQL: The On-Ramp to Hadoop

The Time is Now

Chart axes: Data Volumes; Accuracy and Insight

Data sources shown: CRM, ERP, Data Warehouse, Web, Social, Log Files, Machine Data

Categories shown: Traditional Enterprise Data, Big Data, Dark Data, Semi-structured, Unstructured

Page 13: Big Data & SQL: The On-Ramp to Hadoop

HP Vertica Analytics Platform

Diagram components:

• MPP Shared Nothing

• High Availability & Redundancy

• Column Store Optimizer & Execution Engine

• Vertica Flex Zone

• Vertica Enterprise

• HP ConvergedSystem 300 for Vertica

• JSON, CEF, Delimited

• HDFS, HCatalog, Flume, Files

• Database Designer

• Management Console

• API & SDK (supports R, C++, Java)

• Time Series

• Analytics Functions

• Distributed R

• SQL

• ODBC JDBC

• Search Functionality

• Geospatial & Sentiment

• Key Value API

• HP BSM & Security

• Community, BI Ecosystem, 3rd party apps, Marketplace

Page 14: Big Data & SQL: The On-Ramp to Hadoop

The Richest, Most Open SQL on Hadoop

Challenge: Extracting data from Hadoop requires complex and brittle ETL processes

Solution: Hadoop Navigation and Analytics

Benefits:

• Navigate Hadoop data using its native catalog

• Quickly and easily load native data types from Hadoop to Vertica

• Avoid creating and maintaining time-consuming schemas

• Use the full power of HP Vertica SQL and analytics

• Choose your own Hadoop distribution

Page 16: Big Data & SQL: The On-Ramp to Hadoop

HP Vertica and MapR Solution

Optimized, interactive SQL-on-Hadoop solution for fastest value from big data

•  Complete SQL-on-Hadoop Solution

•  Broader Analytics Capabilities

•  Lower TCO & Manageability

•  Enterprise-Grade Reliability

HP Vertica Analytics Platform on MapR

Page 17: Big Data & SQL: The On-Ramp to Hadoop

Most Complete SQL on Hadoop

HP Vertica on MapR

• Full interactive ANSI SQL on Hadoop

• More complete SQL maturity

• Clients can leverage existing SQL skills

• Handling complex joins and advanced analytic functions, query optimization, and many concurrent users

• Certified integration with BI/visualization environments

• Dynamic handling of mixed workloads

Limited “Query on Hadoop” Options

• Supports a limited subset of HiveQL (1)

• HiveQL is not as mature as SQL

• Requires new skills

• Immature query optimization for planning efficient joins and for processing

• Onus on customer to integrate with BI/visualization environments

• Lack of workload management for high number of concurrent users

(1) HiveQL is a SQL-like dialect, a subset of ANSI SQL

Page 18: Big Data & SQL: The On-Ramp to Hadoop

Accelerating Drug Discovery

Innovative Healthcare Products Company

The problem:

• Analyzing gene variants using SNPs and Microarray data

The solution:

• Hadoop to find the variants between a sample sequence and a reference genome

• HP Vertica to determine oncology targets

• Tools: Pipeline Pilot, Spotfire, R

The value:

• Queries went from 5 hours to 5 minutes

• Scale to 100s of TB of data

• More experiments => faster discoveries!

Page 19: Big Data & SQL: The On-Ramp to Hadoop

HP Vertica Flex Zone

Load, manage, and explore semi-structured data

Avoid creating and maintaining time-consuming schemas

• Faster SQL querying on semi-structured data

• Flexible parsers for JSON and delimited data

• Auto-schematization for blazing-fast performance

• One-step schema

• Semi-structured data loading
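
The idea behind loading semi-structured data without declaring a schema first can be illustrated with a small sketch. This is not HP Vertica's implementation or API: it is a toy example that uses Python's built-in `sqlite3` and `json` modules, with hypothetical click-stream records and hypothetical helper names (`infer_columns`, `load_semi_structured`), to show columns being inferred from JSON keys at load time and then queried with ordinary SQL.

```python
import json
import sqlite3


def infer_columns(records):
    """Return the sorted union of top-level keys seen across all records."""
    columns = set()
    for record in records:
        columns.update(record.keys())
    return sorted(columns)


def load_semi_structured(conn, table, json_lines):
    """Parse JSON lines, infer the columns, create the table, insert the rows.

    Assumption for this sketch: every value is a scalar (string or number).
    Keys missing from a record are stored as NULL.
    """
    records = [json.loads(line) for line in json_lines if line.strip()]
    columns = infer_columns(records)
    quoted = ", ".join(f'"{name}"' for name in columns)
    conn.execute(f'CREATE TABLE "{table}" ({quoted})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(record.get(name) for name in columns) for record in records]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return columns


# Hypothetical click-stream records; note the second one carries an extra key.
sample_lines = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "a", "page": "/cart", "ms": 200}',
]

conn = sqlite3.connect(":memory:")
columns = load_semi_structured(conn, "clicks", sample_lines)

# Standard SQL over the inferred columns, with no schema written by hand.
result = conn.execute(
    "SELECT page, COUNT(*), AVG(ms) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(columns)
print(result)
```

The sketch infers the schema once, up front, from an in-memory batch. A production system would also have to handle nested values, type inference, and keys that appear only in later loads, none of which is attempted here.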

Page 20: Big Data & SQL: The On-Ramp to Hadoop

Exploring the Value of Dark Data

Leading online source for health and medical news and information

Challenges

• Takes 900 hours per year to ingest semi-structured data for analysis

• As requirements change, must again “re-structure” the data for exploration

• Must meet ever-increasing requests for analytic insight in short timeframes

HP Vertica Flex Zone Solution

• Slash development time by eliminating schema creation

• Explore data with existing BI/visualization tools for maximum insight

• Operationalize data in one single step for fast analytics

• Focus team on data analysis (not wrestling with data formats)

Page 21: Big Data & SQL: The On-Ramp to Hadoop

Thank you!

Jeff Healey [email protected] 617-386-4591

Eamon O’Neill [email protected] 617-386-4604

Page 22: Big Data & SQL: The On-Ramp to Hadoop

Perceptions & Questions

Analyst: Robin Bloor

Page 23: Big Data & SQL: The On-Ramp to Hadoop

The Data Reservoir

Robin Bloor, Ph.D.

Page 24: Big Data & SQL: The On-Ramp to Hadoop

Hadoop as the Data Reservoir

Page 25: Big Data & SQL: The On-Ramp to Hadoop

Big Data and the Data Reservoir

Page 26: Big Data & SQL: The On-Ramp to Hadoop

The Workload Paradigm Shift

• Previously, we viewed database workloads as an I/O optimization problem

• With analytics, the workload is a highly variable mix of I/O and calculation

• No databases were built precisely for this, not even Big Data databases

Page 27: Big Data & SQL: The On-Ramp to Hadoop

A Process, Not an Activity

• Data analytics is a multi-disciplinary end-to-end process

• Until recently it was a walled garden, but the walls were torn down by:
  • Data availability
  • Scalable technology
  • Open source tools

• Hadoop has a role here

Page 28: Big Data & SQL: The On-Ramp to Hadoop

The Hadoop Ecosystem

• Even though it may not seem so, Hadoop is in its infancy

• Hadoop’s popularity guarantees its future

• Its future is also guaranteed by its commercial ecosystem

Page 29: Big Data & SQL: The On-Ramp to Hadoop

• What do you see as the fundamental division of workload between Hadoop, Flex Zone and Vertica?

• Which specific components of the Hadoop ecosystem do you recommend using?

• Do you support JSON? If so, for which contexts and in what way?

Page 30: Big Data & SQL: The On-Ramp to Hadoop

• Is there any special optimization in Vertica between query and analytical workloads?

• Please describe the discovery and definition of metadata from Hadoop, through Flex Zone and into Vertica

• Why do you think Hadoop is important from a technical perspective?

Page 31: Big Data & SQL: The On-Ramp to Hadoop

Twitter Tag: #briefr

The Briefing Room

Page 32: Big Data & SQL: The On-Ramp to Hadoop

Upcoming Topics

www.insideanalysis.com

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: BIG DATA

May: DATABASE

June: ANALYTICS & MACHINE LEARNING

Page 33: Big Data & SQL: The On-Ramp to Hadoop

THANK YOU for your ATTENTION!