MicroStrategy Hadoop Gateway

Comparing this native gateway with other big data connectors

Benjamin Reyes, Product Management, Data

Copyright © 2018 MicroStrategy Incorporated. All Rights Reserved.

Agenda

Comparing this native gateway with other big data connectors

• What is the MicroStrategy Hadoop Gateway?

• Benefits of using a native connector vs. other types of connectors

• The MicroStrategy Hadoop Gateway architecture

• In-memory vs. Live Connect datasets

• Filtering, aggregating and wrangling data

• How to install and configure the MicroStrategy Hadoop Gateway

• How to secure the MicroStrategy Hadoop Gateway with Kerberos authentication

• Real-world examples

• Q&A

What is the MicroStrategy Hadoop Gateway?

High-performance native access to data in Hadoop

At a high level:

• A high-performance, native gateway for querying and processing data stored in the Hadoop Distributed File System (HDFS)

• A Spark-based distributed data processing engine that runs directly on the Hadoop cluster.

• Enables parallel data transfer from the Hadoop nodes directly to the Intelligence Server, thus achieving much higher throughput than via SQL-on-Hadoop with ODBC

• Data processing tasks for data wrangling are distributed to the nodes of the Hadoop cluster, instead of being performed on the Intelligence Server.
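
The difference between a single result stream and parallel transfer from the worker nodes can be illustrated with a toy sketch. Everything in it is hypothetical: `fetch_partition`, the node names and the partition contents are stand-ins for illustration only and are not MicroStrategy or Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each worker node holds one partition of the rows.
PARTITIONS = {
    "worker-1": [1, 2, 3],
    "worker-2": [4, 5],
    "worker-3": [6, 7, 8, 9],
    "worker-4": [10],
}


def fetch_partition(node: str) -> list:
    """Stand-in for pulling one node's partition over its own connection."""
    return list(PARTITIONS[node])


def fetch_serial() -> list:
    """Single-stream model: partitions arrive one after another."""
    rows = []
    for node in sorted(PARTITIONS):
        rows.extend(fetch_partition(node))
    return rows


def fetch_parallel(max_workers: int = 4) -> list:
    """Parallel model: one transfer per node, all running concurrently."""
    nodes = sorted(PARTITIONS)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `nodes`.
        chunks = list(pool.map(fetch_partition, nodes))
    return [row for chunk in chunks for row in chunk]
```

Both functions return the same rows; the parallel version simply issues one transfer per node at the same time, which is the idea behind the higher throughput claimed above.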

MicroStrategy on Hadoop

Two approaches to analytics on Hadoop

ODBC/JDBC

• SQL-based access for reporting and dashboarding

• Leverage the Project Schema to build models on top of Hadoop, or use Data Import to create in-memory or live-connect datasets

• Build reports, documents and dashboards via live-connect or in-memory datasets

• Preferred method if requirements include:
  • Leveraging Hadoop-layer security at runtime
  • A Project Schema

Hadoop Gateway

• High-performance, parallelized native access to Hadoop

• Uses Data Import functionality to publish in-memory datasets. Since 10.9, users can also create live-connect datasets to access more detail data on the source

• Build reports, documents and dashboards via live-connect or in-memory datasets

• Preferred method if requirements include:
  • Data wrangling on Spark
  • Browsing and previewing Hadoop files via the Data Import interface

Supported Data File Formats

Import files directly from the Hadoop Distributed File System

Data file formats

• Avro: row-oriented

• Parquet: column-oriented

• ORC: Optimized Row Columnar

• CSV: text

• JSON

Hadoop Gateway Architecture

Built from the ground up for speed and scale

[Diagram: MicroStrategy Intelligence Server, MicroStrategy Hadoop Gateway, and a Hadoop Cluster with a YARN Resource Manager, a Name Node and four Worker Nodes]

The same diagram appears on three consecutive slides, captioned in turn:

• Browse files and preview data

• Requests are distributed to the corresponding nodes

• Parallel data transfer

In-memory vs. Live Connect Datasets

Response time vs. data volume

In-memory

• All the data is transferred to the Intelligence Server to populate a dataset in memory

• The amount of data the dataset can hold is limited by the amount of memory on the Intelligence Server

• Big Data scenarios commonly require aggregating or filtering data to limit the data brought into memory

[Diagram labels: Wrangle, Aggregate, Filter; MicroStrategy Hadoop Gateway; HDFS; MicroStrategy Intelligence Server; In-memory dataset; Interactive Dossier]

In-memory vs. Live Connect Datasets

Response time vs. data volume

Live Connect

• Supported in 10.9 and later, Live Connect lets datasets query data live from the source

• Enables access to the full breadth of detail data on the source, rather than only aggregated or filtered data

• Implies a trade-off between response time and breadth of detail data

All interactive queries are executed live on the source

[Diagram labels: MicroStrategy Hadoop Gateway; Interactive Dossier; HDFS; MicroStrategy Intelligence Server; Live-connect dataset; Hadoop Cluster]

Wrangle, Aggregate and Filter

Create extracts of data for fast in-memory analysis

Data Wrangling

• Lets users transform and refine their data for analytics and visualizations without relying on IT

• Wrangling functions are performed natively at the source, distributed on each HDFS node

• There are 30+ wrangling operations available for data preparation

• All wrangling steps can be saved as a script so that they can be reapplied when the dataset is updated with new data
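
The idea of saving wrangling steps as a replayable script can be sketched as an ordered list of functions. This is only an illustration: the step names, the row layout and `apply_script` are hypothetical and do not reflect the script format MicroStrategy actually stores.

```python
# Hypothetical wrangling steps; each takes a row (dict) and returns a new row.
def trim_whitespace(row: dict) -> dict:
    """Strip leading and trailing spaces from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def uppercase_field(field: str):
    """Build a step that upper-cases one named field."""
    def step(row: dict) -> dict:
        out = dict(row)
        out[field] = out[field].upper()
        return out
    return step


def apply_script(script, rows):
    """Replay every saved step, in order, on each row."""
    result = []
    for row in rows:
        for step in script:
            row = step(row)
        result.append(row)
    return result


# The "saved script": an ordered list of steps defined once.
SCRIPT = [trim_whitespace, uppercase_field("state")]

initial_load = [{"store": " A1 ", "state": "va"}]
refreshed_load = [{"store": "B2", "state": " md"}]
```

Calling `apply_script(SCRIPT, initial_load)` and later `apply_script(SCRIPT, refreshed_load)` applies identical preparation to both loads without redefining the steps.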

Wrangle, Aggregate and Filter

Create extracts of data for fast in-memory analysis

Aggregation

• Users can aggregate data from the source files directly on the Hadoop cluster nodes at scale, without moving data to the Intelligence Server. Examples:
  • Basic
  • Date and time
  • Math

• By aggregating data, users reduce the data volume to an amount appropriate for in-memory cubes.

• Separate datasets can be created for fast in-memory analytics and for detail data queries.
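
A minimal sketch of why aggregating on the cluster reduces what has to be moved: each node computes partial totals for its own partition, and only those small partial results are merged. The partition data and the function names are hypothetical and are not part of any MicroStrategy or Spark API.

```python
# Hypothetical transaction rows (day, amount), split across two nodes.
PARTITIONS = [
    [("2018-01-01", 10.0), ("2018-01-01", 5.0)],
    [("2018-01-01", 7.5), ("2018-01-02", 2.5)],
]


def partial_totals(partition):
    """Runs where the data lives: sum the amounts per day for one partition."""
    totals = {}
    for day, amount in partition:
        totals[day] = totals.get(day, 0.0) + amount
    return totals


def merge_totals(partials):
    """Combine the per-node partial sums into the final daily totals."""
    merged = {}
    for partial in partials:
        for day, amount in partial.items():
            merged[day] = merged.get(day, 0.0) + amount
    return merged


daily_totals = merge_totals(partial_totals(p) for p in PARTITIONS)
```

In this toy data, four transaction rows collapse to two daily totals, which is the kind of reduction that makes a dataset small enough for an in-memory cube.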

Wrangle, Aggregate and Filter

Create extracts of data for fast in-memory analysis

Filtering

• Users can also define filters to limit the number of rows to be brought into the system without compromising on the granularity of detail.

• Both aggregation and filtering expressions are pushed down to the cluster nodes to leverage the advantages of Spark distributed computing performance.
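
Filtering differs from aggregation in that the surviving rows keep their full detail. A small sketch, with hypothetical rows and a hypothetical `filter_rows` helper standing in for a filter expression evaluated on the cluster nodes:

```python
# Hypothetical detail rows as they might sit in a source file.
ROWS = [
    {"day": "2018-01-01", "region": "East", "amount": 10.0},
    {"day": "2018-01-01", "region": "West", "amount": 4.0},
    {"day": "2018-01-02", "region": "East", "amount": 6.5},
]


def filter_rows(rows, predicate):
    """Keep only the matching rows; each kept row is returned unchanged."""
    return [row for row in rows if predicate(row)]


# Fewer rows are brought in, but every column of each kept row is preserved.
east_only = filter_rows(ROWS, lambda row: row["region"] == "East")
```

Here the row count drops from three to two while each remaining row is still at transaction granularity.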

Demo: MicroStrategy Hadoop Gateway

• Browse files

• Wrangle data

• Publish in-memory dataset

• Blend with existing dataset

Demo: Aggregation and Filtering

• Browse files

• Aggregation and filtering

• Publish in-memory dataset

Installation and Deployment

MicroStrategy Hadoop Gateway

Automatically deploy gateway via Web for effortless deployment

• Use the gateway manager in MicroStrategy Web to create, modify, delete, deploy, undeploy, start and stop the Hadoop Gateway remotely

Installation and Deployment

Configuration and automatic deployment demo

Hadoop Properties:
• Hadoop NameNode: FQDN or IP
• HDFS Port: used to browse files, default 8020
• WebHDFS: used to preview files, default 50070

Gateway Properties:
• Host: machine on which to install the Gateway
• Port: Intelligence Server to Hadoop Gateway, default 30004

Spark Properties:
• YARN: Jar: path of the Spark assembly
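
The properties above can be collected into a small settings object with the stated defaults. This is a sketch only: `GatewayConfig`, its field names and the host names are hypothetical and are not MicroStrategy's configuration format. The URL helper follows the standard WebHDFS REST layout (`/webhdfs/v1/<path>?op=...`); whether the Gateway issues exactly these requests when previewing a file is an assumption.

```python
from dataclasses import dataclass
from urllib.parse import quote, urlencode


@dataclass
class GatewayConfig:
    """Hypothetical container for the settings listed on the slide."""
    namenode: str                # Hadoop NameNode: FQDN or IP
    hdfs_port: int = 8020        # used to browse files
    webhdfs_port: int = 50070    # used to preview files
    gateway_host: str = ""       # machine on which the Gateway is installed
    gateway_port: int = 30004    # Intelligence Server to Hadoop Gateway
    spark_jar: str = ""          # path of the Spark assembly

    def validate(self) -> None:
        """Reject an empty NameNode or an out-of-range port."""
        if not self.namenode:
            raise ValueError("namenode must be an FQDN or IP address")
        for name in ("hdfs_port", "webhdfs_port", "gateway_port"):
            port = getattr(self, name)
            if not 1 <= port <= 65535:
                raise ValueError(f"{name} out of range: {port}")


def webhdfs_url(cfg: GatewayConfig, path: str, op: str, **params) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    query = urlencode({"op": op, **params})
    return f"http://{cfg.namenode}:{cfg.webhdfs_port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical hosts and path, for illustration only.
cfg = GatewayConfig(namenode="nn.example.com", gateway_host="edge.example.com")
cfg.validate()
preview = webhdfs_url(cfg, "/data/sales.csv", "OPEN", length=1024)
```

Keeping the defaults in one place makes it easy to see which ports must be reachable: 8020 and 50070 on the NameNode side, and 30004 between the Intelligence Server and the Gateway host.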

Installation and Deployment

MicroStrategy Hadoop Gateway

Manual installation and deployment

• Automatic deployment remotely installs and deploys the gateway on the cluster node and requires a user with root privileges.

• In some cases, Hadoop administrators prefer to install the components manually, using their own tools to manage the application.

• Refer to the product documentation for step-by-step manual installation instructions and deployment commands.

• Also refer to the Hadoop Gateway FAQ on the MicroStrategy Community portal for more details.

Securing the Hadoop Gateway

Kerberos support

Authentication

• Support for Kerberos authentication: MIT Kerberos and Active Directory (LDAP)

• Support for Secure Sockets Layer (SSL) encryption

Authorization

• Integration with Ranger policies (Hortonworks)

• Integration with Sentry policies (Cloudera)

• The established policies are applied, enforcing user-level authorization.

[Diagram labels: User credentials; Security policies, Sentry (Cloudera) and Ranger (Hortonworks); Data; MicroStrategy Hadoop Gateway]

Customer Stories

Big Data Validation Program

Performance, Agility, Security

Hadoop Gateway Customer Validation

The Hadoop Gateway has been validated with some of the largest MSTR customers

Performance: validated at one of the largest multi-media companies

• Looking to publish cubes from a rapidly growing set of viewership data

• Big Data ODBC connections were unable to publish the cubes fast enough

• Took more than 6 hours to publish a cube via Hive

• Took less than an hour with Teradata

• Hadoop Gateway published in less than an hour

Agility: validated at one of the largest retailers

• Transaction-level data (12M+ rows/day) loaded into Parquet and Avro files

• Looking to give end users direct access to HDFS

• Previously needed to wait for files to load into Hive tables

• Hadoop Gateway outperformed Hive on Spark and Impala via ODBC drivers

• Data wrangling optimized with Hadoop Gateway

Security: validated at one of the largest financial organizations

• Looking to directly access secure data and publish cubes

• MIT Kerberos had been enabled on the cluster

• Secure Sockets Layer (SSL) encryption enabled

• Cluster had been enabled for High Availability

Hadoop Gateway Customer Validation

Hadoop Gateway performs well vs. ODBC

Hadoop Gateway vs. ODBC

• The Hadoop Gateway was directly compared to a large relational database at one of the largest digital media companies in the world

• While publishing these cubes, the Hadoop Gateway outperformed this database

Hadoop Gateway Customer Validation

Hadoop Gateway performs well vs. ODBC

Hadoop Gateway vs. ODBC with Data Wrangle

• The Hadoop Gateway was directly compared to Hive on Spark at one of the largest retailers in the world

• While publishing these cubes, the Hadoop Gateway outperformed Hive on Spark

• Data wrangling functions have been integrated with the Hadoop Gateway to reduce data movement and leverage the processing capacity of the Hadoop cluster

Q&A

Thank you