Hadoop and the Data Warehouse: Point/Counter Point

Grab some coffee and enjoy the pre-show banter before the top of the hour!

description

Robin Bloor and Teradata Live Webcast on April 22, 2014. Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6

Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating value from that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?

Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.

Visit InsideAnalysis.com for more information.

Transcript of Hadoop and the Data Warehouse: Point/Counter Point

Page 1: Hadoop and the Data Warehouse: Point/Counter Point

Grab some coffee and enjoy the pre-show banter before the top of the hour!

Page 2: Hadoop and the Data Warehouse: Point/Counter Point

The Briefing Room

Hadoop and the Data Warehouse: Point/Counter Point

Page 3: Hadoop and the Data Warehouse: Point/Counter Point

Twitter Tag: #briefr


Welcome

Host: Eric Kavanagh

@eric_kavanagh

Page 4: Hadoop and the Data Warehouse: Point/Counter Point

Mission

•  Reveal the essential characteristics of enterprise software, good and bad
•  Provide a forum for detailed analysis of today’s innovative technologies
•  Give vendors a chance to explain their product to savvy analysts
•  Allow audience members to pose serious questions... and get answers!

Page 5: Hadoop and the Data Warehouse: Point/Counter Point


Topics

This Month: BIG DATA

May: DATABASE

June: ANALYTICS & MACHINE LEARNING

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

Page 6: Hadoop and the Data Warehouse: Point/Counter Point


Page 7: Hadoop and the Data Warehouse: Point/Counter Point


Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group

@robinbloor

Page 8: Hadoop and the Data Warehouse: Point/Counter Point

Teradata

•  Teradata is known for its analytics data solutions with a focus on integrated data warehousing, big data analytics and business applications
•  It offers a broad suite of technology platforms and solutions and a wide range of data management applications
•  Teradata’s SQL-H allows users and applications to join Hadoop data to the Teradata Data Warehouse and the Aster Discovery Platform

Page 9: Hadoop and the Data Warehouse: Point/Counter Point


Guest: Dan Graham

Dan Graham is the Technical Marketing Director for Teradata. With over 30 years in IT, Dan joined Teradata Corporation in 1989 where he was the senior product manager for the DBC/1012 parallel database computer. He then joined IBM where he wrote product plans and launched the RS/6000 SP parallel server. He then became Strategy Executive for IBM’s Global Business Intelligence Solutions. As Enterprise Systems General Manager at Teradata, Dan was responsible for strategy, go-to-market success, and competitive differentiation for the Active Enterprise Data Warehouse platform. He currently leads Teradata’s technical marketing activities.

Page 10: Hadoop and the Data Warehouse: Point/Counter Point

Point, Counterpoint Myths and Magic

HADOOP AND THE DATA WAREHOUSE

Page 11: Hadoop and the Data Warehouse: Point/Counter Point

11 Copyright Teradata

AGENDA

•  Words and modern terminology
•  Is Hadoop a data integration product?
•  Is Hadoop a data warehouse?
•  Hadoop – the magic
•  Teradata and Hadoop

http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/

Page 12: Hadoop and the Data Warehouse: Point/Counter Point


Our Host, the Word Smith

@RobinBloor

Page 13: Hadoop and the Data Warehouse: Point/Counter Point


Page 14: Hadoop and the Data Warehouse: Point/Counter Point

Now we have something that can provide us “Real-time” in Hadoop

•  At least most of the time
•  Queries are significantly faster but not always instantaneous
   >  Simple selects → A couple of seconds
   >  Join queries → 10s of seconds

Source: Slideshare, Real Time Interactive Queries IN HADOOP: Big Data Warehousing Meetup, June 2013

Page 15: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop Translator

Term: Real time query
Hadoop meaning:
•  Self-service interactive queries that run in under minutes, preferably < 10s of seconds
BI/DW meaning:
•  Query responses in milliseconds
•  + Advanced query prioritization

Term: SQL
Hadoop meaning:
•  Subset of ANSI 92 SQL
•  Primary data types
•  UDFs
BI/DW meaning:
•  ANSI 2008 SQL +
•  Some/all ANSI SQL 2011
•  All SQL data types
•  Integrity constraints, window functions, UDFs, triggers, XML
•  ACID transactions (start transaction, commit, rollback)
•  Geospatial, temporal

Term: OLAP
Hadoop meaning:
•  Any query < 10 seconds
BI/DW meaning:
•  Subsecond multi-dimensional aggregate queries
•  Roll-up, drill-down hierarchies
•  MOLAP and ROLAP

See: Wikipedia

Page 16: Hadoop and the Data Warehouse: Point/Counter Point

Shoop, Shoop Hadoop!

•  Real-Time Query: Real-Time is really business time. It is almost always performance critical (otherwise why would you engineer for it?).
•  SQL sophistication depends on what you want to use it for. SQL-92 is rather primitive. There are consequences – performance consequences.
•  The appropriateness of Hadoop Interactive (OLAP) capability is user dependent. But why would you use Hadoop for this?

Page 17: Hadoop and the Data Warehouse: Point/Counter Point

Current HDFS Availability & Data Integrity

•  Simple design, storage fault tolerance
   >  Storage: Rely on the OS’s file system rather than use raw disk
   >  Storage Fault Tolerance: multiple replicas, active monitoring
   >  Single NameNode Master
      –  Persistent state: multiple copies + checkpoints
      –  Restart on failure

•  How well did it work?
   >  Lost 19 out of 329 Million blocks on 10 clusters with 20K nodes in 2009
      –  7-9’s of reliability
      –  Fixed in 20 and 21.
   >  18 months Study: 22 failures on 25 clusters - 0.58 failures per year per cluster
      –  Only 8 would have benefitted from HA failover!! (0.23 failures per cluster year)
   >  NN is very robust and can take a lot of abuse
      –  NN is resilient against overload caused by misbehaving apps

Source: Slideshare, NameNode HA, 2011
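
The reliability figures on this slide can be rechecked with simple arithmetic. The sketch below is only a sanity check, not the method used in the original study. It assumes that the 18-month study observed all 25 clusters for the full 1.5 years and that "7-9's" means the number of leading nines in the fraction of blocks that survived. All variable names are illustrative.

```python
import math

# Figures quoted on the slide.
BLOCKS_LOST = 19
BLOCKS_TOTAL = 329_000_000
FAILURES = 22
FAILURES_HELPED_BY_HA = 8
CLUSTERS = 25

# Assumption: the 18-month study covers every cluster for the full 1.5 years.
STUDY_YEARS = 18 / 12

# Fraction of blocks lost, and how many "nines" of reliability that implies.
loss_fraction = BLOCKS_LOST / BLOCKS_TOTAL
nines = -math.log10(loss_fraction)

# Failure rates expressed per cluster-year of exposure.
cluster_years = CLUSTERS * STUDY_YEARS
failure_rate = FAILURES / cluster_years
ha_failure_rate = FAILURES_HELPED_BY_HA / cluster_years

print(f"block loss fraction: {loss_fraction:.2e}")
print(f"nines of block reliability: {nines:.2f}")
print(f"failures per cluster-year: {failure_rate:.3f}")
print(f"HA-relevant failures per cluster-year: {ha_failure_rate:.3f}")
```

Under these assumptions the block loss works out to a little over seven nines, and 22 failures over 37.5 cluster-years is about 0.59 per cluster-year, close to the 0.58 on the slide. The same division for the 8 HA-relevant failures gives about 0.21 rather than 0.23, which suggests the original study used a slightly different exposure figure for that number.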

Page 18: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop Translator

Term: High Availability
Hadoop meaning:
•  Data replication
•  Name node failover
BI/DW meaning:
•  Redundant access paths (network, nodes, disks)
•  RAID storage, high quality hardware
•  Minimized planned downtime
•  No single point of failure
•  HA administration tools, event alerts tracking and auto recovery
•  Backups

Term: Fault tolerant
Hadoop meaning:
•  Query automatically restarts on another node without resubmission using replicated data
BI/DW meaning:
•  Nonstop system (no unplanned system halt or reboot)
•  Extreme hardware reliability
•  99.999% uptime
•  Fault isolation and containment
•  Graceful degradation
•  Rolling upgrades
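
To make the "99.999% uptime" entry concrete, the following sketch converts an availability target into allowed downtime per year. It assumes a 365-day year and counts all downtime equally. The function name is hypothetical and is not taken from any Teradata or Hadoop tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (5 -> 99.999%)."""
    unavailable_fraction = 10 ** -nines
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare three, four and five nines side by side.
for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes per year")
```

Five nines leaves roughly five and a quarter minutes of downtime per year, which is why the BI/DW column pairs it with rolling upgrades and minimized planned downtime.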

Page 19: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop Falling Over!

•  Hadoop was built for the recovery of large batch on large commodity grids.
   >  The goal was not to lose the work
   >  This is really about disk failure

•  HA/FT is always configured according to workload characteristics. Enterprise HA is best thought of as “transactional” and OLTP, at the least, if not a real-time event.

Page 20: Hadoop and the Data Warehouse: Point/Counter Point

Is Hadoop a Data Integration Platform?

•  Yes
   >  “Lots of customers doing ETL in Hadoop”
   >  Data refineries
   >  Unstructured data
      –  Weblogs and sensor data
   >  Data Hub/Data Lake

•  No
   >  No built-ins
      –  Data quality tools
      –  Transformations
   >  All do-it-yourself code
   >  No ETL process management
   >  No metadata repository

Page 21: Hadoop and the Data Warehouse: Point/Counter Point


•  Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).

•  Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer's code, one data element at a time, or one program at a time.

•  Because Hadoop work streams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.

Hadoop Is Not a Data Integration Solution

29 January 2013

Page 22: Hadoop and the Data Warehouse: Point/Counter Point

Unstructured Data in the Data Warehouse

•  Facebook, Twitter, LinkedIn
•  Sensor data
•  XML
•  Web logs
•  JSON
•  eMail
•  Documents
•  Images
•  Not so much
   >  Audio
   >  Video

[Bar chart, 2013; vertical axis 0% to 20%; categories: Social, Docs, eMail, JPGs, A/V, Sensor, XML, Web logs]

Sources: Derived from TDWI, Wikibon, Gartner, IDC

Page 23: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop For Data Integration!

•  Hadoop serves a useful function as a data reservoir.
   >  The revenge of the ISAM file
   >  Some ETL
   >  Some cleansing
   >  Some analytics

•  Personally, I would want drag and drop ETL, ELT. Those who write code maintain code.

Page 24: Hadoop and the Data Warehouse: Point/Counter Point


Is Hadoop a Data Warehouse?

Page 25: Hadoop and the Data Warehouse: Point/Counter Point


At Facebook, we have unique storage scalability challenges when it comes to our data warehouse. Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB. In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure. There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS. The most widely used system at Facebook for large data transformations on raw logs is Hive, a query engine based on Corona Map Reduce used for processing and creating large tables in our data warehouse. In this post, we will focus primarily on how we evolved the Hive storage format to compress raw data as efficiently as possible into the on-disk data format.

Scaling the Facebook Data Warehouse to 300 PB

April 10, 2014 https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

Page 26: Hadoop and the Data Warehouse: Point/Counter Point

What is a Data Warehouse?

•  A data design pattern, an architecture
   >  Size doesn’t matter
   >  A perpetual evolution

•  Definition: Gartner (2005) / Inmon (1992) / Wikipedia
   >  Subject oriented
      –  Detailed data + modeling of sales, inventory, finance, etc.
   >  Integrated logical model
      –  Merged data
      –  Consistent, standardized data formats and values
   >  Nonvolatile
      –  Data stored unmodified for long periods of time
   >  Time variant
      –  Record versioning or temporal services
   >  Persistent storage, not virtual, not federated

Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'; Bill Inmon, Building the Data Warehouse, 1992, Wiley and Sons

Page 27: Hadoop and the Data Warehouse: Point/Counter Point


Subject Areas: A Model of ‘Our’ Business

Price history

Inventory Supplier

Contracts

Product/Services

Channels

E-Commerce

Labor

Associate

Customer

Sales transactions

Point of Sale

Shipment Carrier

Campaigns

Promotion

Warehouse

Each subject area has numerous large FACT tables (=big joins)

Page 28: Hadoop and the Data Warehouse: Point/Counter Point

You Wish You Had Redundant Data!

Source records:

App | Cust_ID   | First   | Last     | DOB      | Social      | Address
ERP | 30391-244 | William | Franks   | 04/12/00 | 563-49-1234 | 123 Oak, Atlanta
CRM | 30391244  | W.      | Franks   | 04/12/70 | 563491234   |
SCM | 30391244  | Bill    | Franks   | 04/12/70 |             | Atlanta
XYZ | 30391-244 | Frank   | Williams |          | 563491234   | 123 Oak St. #14

Match keys → ETL → Final integrated record:

Cust_ID  | First   | Last   | DOB      | Social    | Address
30391244 | William | Franks | 04/12/70 | 563491234 | 123 Oak St. #14
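
The slide's match-and-merge step can be sketched in a few lines of Python. This is a minimal illustration, not the ETL logic of any particular tool. It assumes keys match once punctuation is stripped, that each field is chosen by majority vote, and that ties are broken by a preferred source per field. The `TIE_BREAK_SOURCE` table and all function names are hypothetical, chosen only so that the output reproduces the final integrated record shown on the slide.

```python
from collections import Counter

# Source records transcribed from the slide; None marks a field the source left blank.
SOURCE_RECORDS = [
    {"app": "ERP", "cust_id": "30391-244", "first": "William", "last": "Franks",
     "dob": "04/12/00", "social": "563-49-1234", "address": "123 Oak, Atlanta"},
    {"app": "CRM", "cust_id": "30391244", "first": "W.", "last": "Franks",
     "dob": "04/12/70", "social": "563491234", "address": None},
    {"app": "SCM", "cust_id": "30391244", "first": "Bill", "last": "Franks",
     "dob": "04/12/70", "social": None, "address": "Atlanta"},
    {"app": "XYZ", "cust_id": "30391-244", "first": "Frank", "last": "Williams",
     "dob": None, "social": "563491234", "address": "123 Oak St. #14"},
]

# Hypothetical survivorship rule: which source wins when no value has a clear majority.
TIE_BREAK_SOURCE = {"first": "ERP", "address": "XYZ"}


def normalize_key(value):
    """Keep digits only so that '30391-244' and '30391244' match."""
    return "".join(ch for ch in value if ch.isdigit())


def pick_value(field, records):
    """Majority vote over non-blank values; on a tie, use the preferred source."""
    values = [(r["app"], r[field]) for r in records if r[field] is not None]
    counts = Counter(v for _, v in values)
    top_value, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) == 1:
        return top_value
    preferred = TIE_BREAK_SOURCE.get(field)
    for app, value in values:
        if app == preferred:
            return value
    return top_value


def integrate(records):
    """Group records by normalized key and build one merged record per key."""
    groups = {}
    for rec in records:
        cleaned = dict(rec)
        cleaned["cust_id"] = normalize_key(rec["cust_id"])
        if rec["social"] is not None:
            cleaned["social"] = normalize_key(rec["social"])
        groups.setdefault(cleaned["cust_id"], []).append(cleaned)
    merged = {}
    for key, recs in groups.items():
        merged[key] = {f: pick_value(f, recs)
                       for f in ("first", "last", "dob", "social", "address")}
        merged[key]["cust_id"] = key
    return merged


result = integrate(SOURCE_RECORDS)
print(result["30391244"])
```

The redundancy is what makes the merge possible: the 1970 birth date and the last name Franks win only because several sources agree on them, which is the point of the slide title.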

Page 29: Hadoop and the Data Warehouse: Point/Counter Point

What is a Data Mart?

•  A targeted project that will be finished
   >  A subset of data, not all the data
   >  Not for all of the people
•  Often heavily denormalized
•  Volatility
   >  Often completely reloaded
•  Time variance and currency
   >  Can restate the data “as of” a point in time
•  Virtualization option
   >  Can be a logical set of views, cubes

Source: Gartner: Of Data Warehouses, Operational Data Stores, Data Marts and Data 'Outhouses'; Inmon, Building the Data Warehouse, 1992, Wiley and Sons

Page 30: Hadoop and the Data Warehouse: Point/Counter Point


Why Hadoop Is Not a Data Warehouse

Page 31: Hadoop and the Data Warehouse: Point/Counter Point

Words Matter!

•  The meaning of data warehouse is changing:
   >  JSON (hierarchical capability)
   >  Network queries (possibly offload)
   >  Analytics

•  The meaning of data warehouse is extending. But it still includes “optimization.”

•  It’s no longer a data staging area, it’s a reservoir.

Page 32: Hadoop and the Data Warehouse: Point/Counter Point

Where Hadoop Excels

HADOOP MAGIC

Page 33: Hadoop and the Data Warehouse: Point/Counter Point

What is Hadoop?

•  Apache Hadoop framework
   >  Hadoop Common
      –  Libraries and utilities
   >  Hadoop Distributed File System (HDFS)
   >  Hadoop YARN
   >  Hadoop MapReduce

•  Looks like Hadoop too
   >  Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper, Oozie, Sqoop, Tez

•  SQL on Hadoop
   >  Presto, Drill, Shark, Impala, Hawq, JAQL

•  New Hadoop modules or replacements
   >  GPFS, Mesos, Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr, Tachyon, SOLR, Lucene

Page 34: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop 2.0

Applications Run Natively IN Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

•  BATCH (MapReduce)
•  INTERACTIVE (Tez)
•  STREAMING (Storm)
•  GRAPH (Giraph)
•  IN-MEMORY (Spark)
•  HPC MPI (OpenMPI)
•  ONLINE (HBase)
•  OTHER (ex. Search)

Page 35: Hadoop and the Data Warehouse: Point/Counter Point

Data Lake Benefits: The Landing Zone

•  Rapid ingest
   >  File copy vs database load
•  Temporary data
•  Data not ready for the data warehouse
•  Data that never → data warehouse
•  Archives
   >  Alternative to magnetic tape

Page 36: Hadoop and the Data Warehouse: Point/Counter Point

Hadoop Enables Another Data Platform

•  Ad hoc projects
   >  One-shot complex analytics
   >  Hurry up, short term efforts

•  Alternative analytics
   >  Not SQL-friendly algorithms
   >  Markov chains, random forest
   >  JPG, audio analysis

•  Sandbox – hunting in the dark
   >  Prototyping
   >  Data exploration
   >  Trial and error new algorithms

Page 37: Hadoop and the Data Warehouse: Point/Counter Point

What Hadoop Is For!

•  Data reservoir
•  Prototyping
•  Analytical or BI sandboxing (data wrangling)
•  Archive
•  File system API (HDFS)

Page 38: Hadoop and the Data Warehouse: Point/Counter Point

TERADATA UNIFIED DATA ARCHITECTURE: System Conceptual View

[Architecture diagram; recoverable labels listed below]

SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social

ACCESS, MANAGE, MOVE

Platforms: DATA PLATFORM, INTEGRATED DATA WAREHOUSE, INTEGRATED DISCOVERY PLATFORM

Products shown: HORTONWORKS, TERADATA DATABASE, TERADATA ASTER DATABASE

ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing

USERS: Marketing Executives, Operational Systems, Frontline Workers, Customers Partners, Engineers, Data Scientists, Business Analysts

Page 39: Hadoop and the Data Warehouse: Point/Counter Point

Teradata SQL-H

•  Joint R&D with Hortonworks
   >  Donated to Apache
•  Business user query with favorite BI tools
•  Join Hadoop data to
   >  Teradata Data Warehouse
   >  Aster Discovery Platform
•  Teradata 15.0
   >  Bi-directional SQL
   >  Push down filters to Hive
•  Fast, secure, reliable

[Diagram labels: Teradata SQL-H; Aster SQL-H; Hadoop Layer: HDFS, Pig, Hive, Hadoop MR, HCatalog; Data; Data Filtering]

Page 40: Hadoop and the Data Warehouse: Point/Counter Point

Teradata 15: Teradata QueryGrid™

[Diagram; recoverable labels listed below]

Users: Business users, Data Scientists

IDW: TERADATA DATABASE

Discovery: TERADATA ASTER DATABASE

SQL, SQL-MR, SQL-GR

Teradata Systems: TERADATA DATABASE, TERADATA ASTER DATABASE

Remote Data: OTHER DATABASES

LANGUAGES: SAS, Perl, Python, R, Ruby, etc.

HADOOP: Push-down to Hadoop

Page 41: Hadoop and the Data Warehouse: Point/Counter Point

Market Possibilities

•  The scale-out file system will not die (because it’s only an API)
•  YARN (& Cascading) will prosper
•  Hadoop will play a role in data flow
•  It will never replace the EDW, except by deception
•  The struggle for a unified architecture will continue

Page 42: Hadoop and the Data Warehouse: Point/Counter Point


Hadoop and the Data Warehouse: Competitive or Complementary?

http://www.teradata.com/analyst-reports/Hadoop-and-the-Data-Warehouse-Competitive-or-Complementary/

Page 43: Hadoop and the Data Warehouse: Point/Counter Point


Page 44: Hadoop and the Data Warehouse: Point/Counter Point


Upcoming Topics

www.insideanalysis.com

2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room

This Month: BIG DATA

May: DATABASE

June: ANALYTICS & MACHINE LEARNING

Page 45: Hadoop and the Data Warehouse: Point/Counter Point


THANK YOU for your ATTENTION!