Using Data Virtualization to Integrate With Big Data

The Role of Data Virtualization in a World of Big Data. June 6, 2012. Mark Madsen, @markmadsen, www.ThirdNature.net. Information Management Through Human History: new technology development (innovation) creates new methods to cope (maturation), which creates new information scale and availability (saturation), which creates… Copyright Third Nature, Inc.

description

Hadoop and big data don't sit as an island in organizations. Analyzing event streams and similar data requires integrating them with other data from systems in the organization. This isn't easy with big data systems today because their technologies and environments differ from those of traditional IT. Data virtualization is one way to smooth over the integration and allow Hadoop to access other data, or allow SQL-oriented tools to access Hadoop.

Transcript of Using Data Virtualization to Integrate With Big Data

Page 1: Using Data Virtualization to Integrate With Big Data

The Role of Data Virtualization in a World of Big Data

June 6, 2012

Mark Madsen, @markmadsen, www.ThirdNature.net

Information Management Through Human History

New technology development (innovation)

creates

New methods to cope (maturation)

creates

New information scale and availability (saturation)

creates…

Copyright Third Nature, Inc.

Page 2: Using Data Virtualization to Integrate With Big Data

Big Data

You keep using that word. I do not think it means what you think it means.

Page 3: Using Data Virtualization to Integrate With Big Data

What makes data “big”?

Hierarchical structures

Nested structures

Encoded values

Non‐standard (for a database) types

Deep structure

Very large amounts

Human authored text

“big” is better off being defined as “complex” or “hard to manage”


Page 4: Using Data Virtualization to Integrate With Big Data

You could store this data in the data warehouse but…

Old database technology has so many problems

Page 5: Using Data Virtualization to Integrate With Big Data

“Big Data”

New technology has so many problems

Page 6: Using Data Virtualization to Integrate With Big Data

Reality is multiple data stores and platforms

Separate, purpose-built databases and processing systems for different types of data and query / computing workloads are the norm for information delivery. Data flows between most of these environments.

[Diagram: BI, reporting, and dashboards; the data warehouse and other data stores, each drawn as a small sample table of names, salaries, and job titles; source environments (databases, documents, flat files, XML, queues, ERP, applications).]

Example “big data”: Web tracking data

USER_ID 301212631165031
SESSION_ID 590387153892659
VISIT_DATE 1/10/2010 0:00
SESSION_START_DATE 1:41:44 AM
PAGE_VIEW_DATE 1/10/2010 9:59
DESTINATION_URL https://www.phisherking.com/gifts/store/LogonForm?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay
REFERRAL_NAME Direct
REFERRAL_URL ‐
PAGE_ID PROD_24259_CARD
REL_PRODUCTS PROD_24654_CARD, PROD_3648_FLOWERS
SITE_LOCATION_NAME VALENTINE'S DAY MICROSITE
SITE_LOCATION_ID SHOP‐BY‐HOLIDAY VALENTINES DAY
IP_ADDRESS 67.189.110.179
BROWSER_OS_NAME MOZILLA/4.0 (COMPATIBLE; MSIE 7.0; AOL 9.0; WINDOWS NT 5.1; TRIDENT/4.0; GTB6; .NET CLR 1.1.4322)
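The DESTINATION_URL field is an example of the "encoded values" listed earlier: its query string packs several attributes into a single column. A minimal sketch of unpacking them with Python's standard library follows. The variable names are illustrative and not part of the original deck.

```python
from urllib.parse import urlsplit, parse_qs

# DESTINATION_URL copied from the sample web tracking record.
destination_url = (
    "https://www.phisherking.com/gifts/store/LogonForm"
    "?mmc=link‐src‐email‐_‐m100109‐_‐44IOJ1‐_‐shop"
    "&langId=‐1&storeId=1055&URL=BECGiftListItemDisplay"
)

parts = urlsplit(destination_url)

# parse_qs returns a list of values per key; keep the first value of each.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)         # the page that was requested
print(params["storeId"])  # one attribute hidden inside the URL
print(sorted(params))     # all attribute names found in the query string
```

Each of these attributes would be its own column in a conventional database, which is part of why such data is awkward to load as-is.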

Page 7: Using Data Virtualization to Integrate With Big Data


The event stream contains IDs, but no reference data…

Reference data is also known as dimensions or master data. This isn’t an OLTP database, so there is no reference data available from the source.
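To make the gap concrete, here is a small sketch of what an analyst has to do: match the IDs in each event against reference data held somewhere else. The IDs are borrowed from the sample record; the `product_dim` lookup and its `category` values are invented for illustration.

```python
# Event records carry only IDs (values borrowed from the sample record).
events = [
    {"USER_ID": "301212631165031", "PAGE_ID": "PROD_24259_CARD"},
    {"USER_ID": "301212631165031", "PAGE_ID": "PROD_3648_FLOWERS"},
]

# Hypothetical product dimension that would live in another system,
# such as the data warehouse. The attribute values are invented.
product_dim = {
    "PROD_24259_CARD": {"category": "Cards"},
}


def enrich(event, dimension):
    """Return a copy of the event with the category added (None if unknown)."""
    ref = dimension.get(event["PAGE_ID"])
    return {**event, "category": ref["category"] if ref else None}


enriched = [enrich(event, product_dim) for event in events]

for row in enriched:
    print(row["PAGE_ID"], row["category"])
```

The second event has no match, which is the situation the slide describes: without access to the reference data, the ID is all the analyst has.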

It would be logical to keep all the data in one place.

The typical situation for analysts

“I need that data now.”

“It will take 6 months.”

Page 8: Using Data Virtualization to Integrate With Big Data

There are two architectural approaches to facilitating analysis, depending on where the analyst works in the environment:

1. Back end integration: for analysts working within the big data (BD) environment, reaching out from the environment to get other data that's needed to make sense of information.

2. Front end integration: for analysts working in a more conventional BI / analysis environment, reaching in to the BD environment from other tools.

Solution: copy the data into Hadoop?

Just load it from the DW, if it’s there. Otherwise, dump and load the data from the sources.

This is great for one-time analysis, but what if you need to do it again next week, or need current values on a regular basis?

You can build custom extracts from each source. But…

• Poor tool support

• Problem of on-demand / current values

• Minimal data management possible in the Hadoop environment

• The analyst waits
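A custom extract of the kind described above might look like the following sketch. An in-memory SQLite database stands in for an OLTP source, and the table name, columns, and rows are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Stand-in for an OLTP source; a real source would need its own driver.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE product (page_id TEXT PRIMARY KEY, category TEXT)")
source.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("PROD_24259_CARD", "Cards"), ("PROD_3648_FLOWERS", "Flowers")],
)


def extract_to_tsv(conn, query, out):
    """Write the query results as tab-delimited text and return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in conn.execute(query):
        writer.writerow(row)
        count += 1
    return count


# A real job would write a file and then copy it into the Hadoop file system.
buffer = io.StringIO()
rows = extract_to_tsv(
    source, "SELECT page_id, category FROM product ORDER BY page_id", buffer
)

print(rows)
print(buffer.getvalue())
```

The output is a snapshot. Any change in the source after the extract runs is invisible until someone runs it again, which is the on-demand / current values problem in the list above.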

[Diagram: OLTP sources and the data warehouse]

Page 9: Using Data Virtualization to Integrate With Big Data


Alternative: data virtualization to enable access

A data virtualization layer can be used to make other sources (OLTP, the data warehouse) appear locally accessible to the analyst or Hadoop programmer. Then, two choices are possible:

▪ extract the data and load it into the local environment

▪ access it dynamically from within the environment
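The two choices can be sketched with a small wrapper class. `VirtualTable` is a hypothetical name, not the API of any data virtualization product, and an in-memory SQLite database stands in for the remote source.

```python
import sqlite3


class VirtualTable:
    """Hypothetical wrapper that makes a remote table look locally accessible."""

    def __init__(self, conn, table, key):
        self.conn, self.table, self.key = conn, table, key

    def extract(self):
        """Choice 1: copy every row into the local environment."""
        cur = self.conn.execute(f"SELECT * FROM {self.table}")
        cols = [c[0] for c in cur.description]
        return [dict(zip(cols, row)) for row in cur]

    def lookup(self, value):
        """Choice 2: fetch the current row from the source on demand."""
        cur = self.conn.execute(
            f"SELECT * FROM {self.table} WHERE {self.key} = ?", (value,)
        )
        row = cur.fetchone()
        cols = [c[0] for c in cur.description]
        return dict(zip(cols, row)) if row else None


# Stand-in for the data warehouse; the table and values are invented.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE product (page_id TEXT, category TEXT)")
warehouse.execute("INSERT INTO product VALUES ('PROD_24259_CARD', 'Cards')")

products = VirtualTable(warehouse, "product", "page_id")
snapshot = products.extract()

# The source changes after the extract was taken.
warehouse.execute("UPDATE product SET category = 'Greeting Cards'")
current = products.lookup("PROD_24259_CARD")

print(snapshot[0]["category"])  # value at extract time
print(current["category"])      # value in the source right now
```

Extracting gives fast local reads at the price of staleness; dynamic access gives current values at the price of a round trip to the source for each lookup.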


Alternative: data virtualization to bridge stores

A data virtualization layer can be used to bridge the database and big data environments, hiding the back end complexities.

This allows access to raw or processed data from Hadoop alongside data from other environments, with some benefits: no reliance on limited Hive connectors, no client‐side data merging, and no difficult metadata layer integrations.
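As a rough analogy for bridging stores, the sketch below uses one SQLite connection with two attached in-memory databases. The `hadoop` and `dw` names, tables, and numbers are stand-ins invented for the example; a real virtualization layer would federate genuinely separate systems.

```python
import sqlite3

# One connection plays the role of the virtualization layer.
dv = sqlite3.connect(":memory:")
dv.execute("ATTACH DATABASE ':memory:' AS hadoop")  # stand-in for Hadoop output
dv.execute("ATTACH DATABASE ':memory:' AS dw")      # stand-in for the warehouse

dv.execute("CREATE TABLE hadoop.page_views (page_id TEXT, views INTEGER)")
dv.execute("CREATE TABLE dw.product (page_id TEXT, category TEXT)")

dv.executemany(
    "INSERT INTO hadoop.page_views VALUES (?, ?)",
    [("PROD_24259_CARD", 3), ("PROD_24654_CARD", 2), ("PROD_3648_FLOWERS", 5)],
)
dv.executemany(
    "INSERT INTO dw.product VALUES (?, ?)",
    [
        ("PROD_24259_CARD", "Cards"),
        ("PROD_24654_CARD", "Cards"),
        ("PROD_3648_FLOWERS", "Flowers"),
    ],
)

# The client issues one query; the merge happens in the layer, not the client.
query = """
SELECT p.category, SUM(v.views) AS views
FROM hadoop.page_views AS v
JOIN dw.product AS p ON p.page_id = v.page_id
GROUP BY p.category
ORDER BY p.category
"""
result = dv.execute(query).fetchall()
print(result)
```

The point of the sketch is the shape of the interaction: the SQL-oriented tool sees a single schema and never performs client-side data merging.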

Page 10: Using Data Virtualization to Integrate With Big Data

Data virtualization can simplify access across the entire data environment, “big” or not

DV also enables shared metadata across environments, avoiding both the cost of model integration and the practice of burying that metadata in source code.

[Diagram: the same data environment (source environments: databases, documents, flat files, XML, queues, ERP, applications; the data warehouse and other data stores; BI, reporting, and dashboards), now with a data virtualization layer on the front end and a DV layer on the back end.]

Bridge the data environment to uses beyond BI

The use cases now include interactive applications, lower-latency data, and complex analytics, and they extend beyond read‐only queries.

Page 11: Using Data Virtualization to Integrate With Big Data

About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, analytics and information management. Mark is an award-winning author, architect and former CTO whose work has been featured in numerous industry publications. During his career Mark received awards from the American Productivity & Quality Center, TDWI, Computerworld and the Smithsonian Institute. He is an international speaker, contributing editor at Intelligent Enterprise, and manages the open source channel at the Business Intelligence Network. For more information or to contact Mark, visit http://ThirdNature.net.

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in business intelligence, analytics and performance management. If your question is related to BI, analytics, information strategy or data, then you're in the right place.

Our goal is to help companies take advantage of information-driven management practices and applications. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.