Best Practices for Big Data: Visualizing billions of rows with rapid response times


Transcript of Best Practices for Big Data : Visualizing billions of rows ...

Page 1: Best Practices for Big Data : Visualizing billions of rows ...

Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved.

Best Practices for Big Data: Visualizing billions of rows with rapid response times

Anthony Maresco

“Big Data Analytics at the Speed of Thought”

Page 2: Best Practices for Big Data : Visualizing billions of rows ...


Safe Harbor Notice

This presentation describes features that are under development by MicroStrategy. The objective of this presentation is to provide insight into MicroStrategy’s technology direction. The functionalities described herein may or may not be released as shown.

This presentation contains statements that may constitute “forward-looking statements” for purposes of the safe harbor provisions under the Private Securities Litigation Reform Act of 1995, including descriptions of technology and product features that are under development and estimates of future business prospects. Forward-looking statements inherently involve risks and uncertainties that could cause actual results of MicroStrategy Incorporated and its subsidiaries (collectively, the “Company”) to differ materially from the forward-looking statements. Factors that could contribute to such differences include: the Company’s ability to meet product development goals while aligning costs with anticipated revenues; the Company’s ability to develop, market, and deliver on a timely and cost-effective basis new or enhanced offerings that respond to technological change or new customer requirements; the extent and timing of market acceptance of the Company’s new offerings; continued acceptance of the Company’s other products in the marketplace; the timing of significant orders; competitive factors; general economic conditions; and other risks detailed in the Company’s Form 10-Q for the three months ended September 30, 2018 and other periodic reports filed with the Securities and Exchange Commission. By making these forward-looking statements, the Company undertakes no obligation to update these statements for revisions or changes after the date of this presentation.


Page 3: Best Practices for Big Data : Visualizing billions of rows ...

Topics

• Our Performance Scenario and Objective

• First-Order Best Practices

• A Note on Benchmarks

• Hadoop OLAP Type Tools

• Intelligent Cubes for Concurrency

• Last Things

• Summary

• Q&A

Page 4: Best Practices for Big Data : Visualizing billions of rows ...


Special Thanks To….

HF Chadeisson

Principal Solutions Architect

Page 5: Best Practices for Big Data : Visualizing billions of rows ...


Our Performance Scenario and Objectives

Page 6: Best Practices for Big Data : Visualizing billions of rows ...

Objective

• Compare performance impact of changes with a simple drill-down scenario using TPC-H SSB benchmark data

• Progress from slow to fast using a combination of Hadoop features and MicroStrategy features

• “Speed of Thought” Big Data Analytics with billions of rows

• Get an introduction to new developments that improve the picture

• Look at how some of these capabilities optimize Dossier interactive performance

Page 7: Best Practices for Big Data : Visualizing billions of rows ...

Minimize this….

Page 8: Best Practices for Big Data : Visualizing billions of rows ...

Johannes Kepler

“Ships and sails proper for the heavenly air should be fashioned. Then there will also be people, who do not shrink from the dreary vastness of space.”

Johannes Kepler writing to Galileo Galilei in 1609

Giordano Bruno was burned at the stake in 1600 for declaring the universe was infinite with countless suns and earths.

Page 9: Best Practices for Big Data : Visualizing billions of rows ...

Performance – Concurrency - Structure

• “Initially, having any way to use SQL against the Hadoop data was the goal, now there is an increasing requirement to connect business users … and give them the performance they expect with high levels of concurrency.”

• “Note that to meet this requirement, it is likely that users will need to have structured data stored in Hadoop (along with the original unstructured data), as good performance is more likely if a transformation is done once rather than per-query”

https://dzone.com/articles/sql-and-hadoop

Page 10: Best Practices for Big Data : Visualizing billions of rows ...

déjà vu All Over Again

• ROLAP vs. MOLAP

• In-memory

• Columnar

• Indexing

• Pre-Aggregation

• MPP Features added to the data stores

• And more…

• All taken to the next level to deal with Big Data

Page 11: Best Practices for Big Data : Visualizing billions of rows ...

Live Access to Data Sources with Unparalleled Performance
Quickly generate multi-pass SQL and leverage push-down functionality for optimized performance

Dynamic Sourcing
MicroStrategy lets users seamlessly run queries and dynamically drill across multiple sources. The server is able to auto-recognize tasks and intelligently direct queries against in-memory cubes when possible.

Generate Multipass SQL
MicroStrategy is able to easily and quickly generate multi-pass SQL queries to provide greater analytical power and minimize the amount of data that is pulled back to the mid-tier. The sophisticated SQL engine can deliver high performance for the most complex SQL computations.

Pushdown Architecture
MicroStrategy leverages the database to its full extent by pushing data joins and analytic calculations to the database when possible. Every connector is optimized for high performance by pushing down functions to leverage the power of the database in conjunction with the server.

MicroStrategy Analytics Platform

Page 12: Best Practices for Big Data : Visualizing billions of rows ...

Deliver a single version of the truth on top of your big data
Leverage the unified metadata repository to design reusable objects for rapid development, governance and scale

Architecture diagram: clients (Web, Desktop, iPhone, iPad, Android) work against a shared data model, reusable objects, documents, and applications, which connect to enterprise data and business applications (relational databases, Hadoop and big data, cloud-based data, personal or departmental data, and enterprise applications), with centralized security, centralized administration, scalability, and extensibility.

Page 13: Best Practices for Big Data : Visualizing billions of rows ...

What is “Speed of Thought” Analytics ?

• 3 seconds ?

• 7 seconds ?

• 10 seconds ?

• 15 seconds ?

• 30 seconds ?

• More ???

• All have been quoted!

Sometimes you have to wait… but you can set and manage SLAs for X% of your workload… and evolve…

Page 14: Best Practices for Big Data : Visualizing billions of rows ...

Data – Hardware – Software - Facts

• TPC-H SSB at various scales up to 1Terabyte

• ~6 billion rows in the largest fact table – also 1.2 and 3.3 billion row tests

• 17 billion row demo by Indexima

• 30 billion row demo provided by Kyvos during the conference

• 1.2 billion row Taxi dataset used to show Dossier performance with several techniques and products referenced

• 10 Node clusters with 8 data/worker nodes

• Worker nodes have 8 vCPU and 61 GB

• Use ORC file types

• Hortonworks HDP 2.6.3

• 10 Node cluster with HDP 3.1.0 provided preliminary information

Page 15: Best Practices for Big Data : Visualizing billions of rows ...

Considerations

• Plenty of new turf and it’s continually changing

• SQL is still key as NoSQL or NewSQL

• Add in semi-structured data and data mining

• SQL on Hadoop products are evolving quickly, with rapid increases in functionality and performance. Performance can double, triple or more in some cases within a 6-month period.

• Ultimate decisions depend on testing with your data

• Schema on read is key for iteration and flexibility in the back end

• Schemas are still important to operationalize analytics for large numbers of users

• Use of aggregation, aggregate-aware SQL generation, caching, cubes, and dynamic sourcing are keys to performance and scalability

• Memory Caching is a key infrastructure component

• Agility means “Everything is finished…..Nothing is finished…”

Page 16: Best Practices for Big Data : Visualizing billions of rows ...

Prescription

• Use the tools in the platform

• Use MicroStrategy to make them better

• Add additional tools & components where warranted

• Use MicroStrategy to make them better

• Use of aggregation, aggregation-aware SQL generation, caching, cubes, and dynamic sourcing are tools for performance and scalability

• Ensure you have the resources, knowledge, and time allocation for initial and ongoing tuning – sizing – monitoring – benchmarking

• With MicroStrategy you get performance, scalability, governance, and concurrency to securely distribute insight with Big Data Analytics to tens of thousands of people

• Additional Big Data techniques and tools are required when volume – velocity – variety get to a tipping point

Page 17: Best Practices for Big Data : Visualizing billions of rows ...


First-Order Best Practices

Page 18: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Minimize Joins

• SSB Version of TPC-H eliminates joins but is missing lookup tables

• Snowflake is preferred with both dimension and lookup tables

Diagram: TPC-H schema vs. SSB schema

Page 19: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Partitioning

• Hive and Impala recommend partitioning to minimize scanned rows

• This requires including a filter based on the partition in every query

• It is important to pick partition columns so there are the right number of partitions

• Partitioning can have other side effects

• In an ad-hoc environment, you may not always be using the partitioning filter

• In this study, partitioning was not used, so the relative impact of the other changes could be seen

• Bucketing can also be used in conjunction with partitioning (see the sketch below)
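A minimal HiveQL sketch of partitioning plus bucketing, using a hypothetical copy of the SSB lineorder table; the derived lo_orderyear partition column, the bucket count, and the column list are illustrative only.

-- Partition by order year and bucket by customer key (illustrative values)
CREATE TABLE lineorder_part (
  lo_orderkey   BIGINT,
  lo_custkey    BIGINT,
  lo_partkey    BIGINT,
  lo_orderdate  INT,
  lo_revenue    BIGINT
)
PARTITIONED BY (lo_orderyear INT)
CLUSTERED BY (lo_custkey) INTO 32 BUCKETS
STORED AS ORC;

-- Queries that filter on the partition column only scan the matching partitions
SELECT SUM(lo_revenue)
FROM lineorder_part
WHERE lo_orderyear = 1997;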

Page 20: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Statistics

• Always collect statistics on tables and columns in order to leverage cost based optimization

• Example for Table

ANALYZE TABLE table1 COMPUTE STATISTICS;

• Example for Columns

ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS;
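For reference, the partition-level variants in Hive and the Impala equivalent look like the following; the table, partition, and column names are placeholders.

-- Hive: statistics for one partition of a partitioned table
ANALYZE TABLE lineorder_part PARTITION (lo_orderyear=1997) COMPUTE STATISTICS;
ANALYZE TABLE lineorder_part PARTITION (lo_orderyear=1997) COMPUTE STATISTICS FOR COLUMNS;

-- Impala: table and column statistics in a single statement
COMPUTE STATS table1;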

Page 21: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Optimized File Types

• Parquet

• ORC

• AVRO
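As an illustration, a common pattern is to land raw data in a text-format staging table and rewrite it into one of these columnar formats; the table names and compression codec below are placeholders.

-- Rewrite a text staging table as ORC with ZLIB compression
CREATE TABLE lineorder_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM lineorder_staging_text;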

Page 22: Best Practices for Big Data : Visualizing billions of rows ...

Use the Latest Versions and Features

• Example: move from HiveServer2 to HiveServer2 Interactive LLAP

• At 550 Scale – 3.3 Billion Rows

• Times are mm:ss

Hive Server 2    Hive Server 2 Interactive LLAP
7:58             0:25
6:01             0:25
5:41             0:26

Page 23: Best Practices for Big Data : Visualizing billions of rows ...

SQL and Row Counts

Table        Rows
lineorder    5,999,989,709
dates        2,556
customer     30,000,000
part         200,000,000
supplier     2,000,000

select a14.d_year d_year,
       a13.c_region c_region,
       a12.p_category p_category,
       a12.p_brand1 p_brand1,
       a12.p_mfgr p_mfgr,
       sum(a11.lo_revenue) WJXBFS1
from lineorder a11
  join part a12
    on (a11.lo_partkey = a12.p_partkey)
  join customer a13
    on (a11.lo_custkey = a13.c_custkey)
  join dates a14
    on (a11.lo_orderdate = a14.d_datekey)
where (a13.c_region = 'EUROPE'
  and a12.p_brand1 = 'MFGR#427')
group by a14.d_year,
         a13.c_region,
         a12.p_category,
         a12.p_brand1,
         a12.p_mfgr

Page 24: Best Practices for Big Data : Visualizing billions of rows ...

Dynamic Runtime Filtering

https://hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/
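For reference, the Hive properties most commonly associated with this behavior are shown below; this is an illustrative subset, and the exact names and defaults should be checked against your Hive/HDP version.

SET hive.tez.dynamic.partition.pruning=true;   -- prune partitions at runtime using join keys
SET hive.tez.dynamic.semijoin.reduction=true;  -- build runtime bloom filters to skip non-matching rows early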

Page 25: Best Practices for Big Data : Visualizing billions of rows ...

Hive Server 2

Page 26: Best Practices for Big Data : Visualizing billions of rows ...

Hive Tez Query Processing

Page 27: Best Practices for Big Data : Visualizing billions of rows ...

Hive Server 2 Interactive LLAP Architecture

Low Latency Analytical Processing … Live Long and Process

Page 28: Best Practices for Big Data : Visualizing billions of rows ...

Impala Architecture

Page 29: Best Practices for Big Data : Visualizing billions of rows ...

Presto Architecture

https://adtmag.com/articles/2015/06/08/teradata-presto.aspx

Page 30: Best Practices for Big Data : Visualizing billions of rows ...

Ambari Settings for HiveServer 2 Interactive/LLAP
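The screenshot itself is not reproduced here. As a rough companion, a few of the Hive properties that typically accompany an LLAP deployment are listed below; this is an illustrative subset (set at the server or session level), and names and values should be verified against your HDP version.

SET hive.execution.mode=llap;        -- run query fragments inside the LLAP daemons
SET hive.llap.execution.mode=all;    -- allow all eligible operators to execute in LLAP
SET hive.llap.io.enabled=true;       -- use the LLAP in-memory columnar I/O layer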

Page 31: Best Practices for Big Data : Visualizing billions of rows ...

The Numbers Game

• Adding more nodes

• Adding more memory

• Does the whole table fit in memory ?

• How much does a petabyte of memory cost ?

• Caching tables

• Keeping the most important things in memory

• Constantly reviewing what is “most important”

• Changing what’s in memory

• Querying detail vs. querying an intermediate aggregate

• Using complex SQL

Page 32: Best Practices for Big Data : Visualizing billions of rows ...

More Nodes – Bigger Nodes

• Small cluster: 6 nodes (5 data nodes), each with 8 vCPU and 30 GB RAM

• Large cluster: 10 nodes (8 data nodes), each with 8 vCPU and 61 GB RAM

• Both tests used HiveServer 2 Interactive LLAP

• At 200 Scale – 1.2 billion rows

Small Cluster    Larger Cluster
00:40            00:11
00:45            00:11
00:46            00:11

Page 33: Best Practices for Big Data : Visualizing billions of rows ...

Astronomia Nova: Kepler’s Laws

• Hired by Tycho Brahe to analyze Mars orbital observations in 1600

• Appointed Imperial Mathematician by the Holy Roman Emperor Rudolph II in 1601 succeeding Brahe after his death

• He took 8 years to analyze the data and produce the report

• Proved Mars’ orbit was elliptical and that Earth was a planet and revolved around the Sun

• And he wasn’t burned at the stake…..

“All Planets move about the Sun in elliptical orbits, having the Sun as one of the foci.”

“New Astronomy Based on Causations or a Celestial Physics Derived from Investigations of the Motion of Mars Founded on the Observations of the Noble Tycho Brahe.”

“A radius vector joining any planet to the Sun sweeps out equal areas in equal lengths of time.”

Page 34: Best Practices for Big Data : Visualizing billions of rows ...

Kepler’s Dashboards and Dossier

Page 35: Best Practices for Big Data : Visualizing billions of rows ...

Aggregate Tables at 1 TB Scale - ~6 Billion Rows

Page 36: Best Practices for Big Data : Visualizing billions of rows ...

MicroStrategy Aggregate-Aware SQL

• Add one or more intermediate aggregates

• Pick aggregates strategically to help the most popular requests

• At 1 TB Scale - ~6 billion rows

Before After

0:40 0:02

0:40 0:01

0:37 0:01
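A minimal sketch of the kind of intermediate aggregate behind these numbers, rolled up from the drill-down SQL shown earlier; the table name and grouping columns are illustrative. In MicroStrategy, the aggregate table is mapped to the same attributes so the aggregate-aware SQL engine can select it automatically.

-- Pre-aggregate revenue at the year / region / brand grain (illustrative)
CREATE TABLE agg_revenue_year_region_brand STORED AS ORC AS
SELECT d.d_year,
       c.c_region,
       p.p_category,
       p.p_brand1,
       p.p_mfgr,
       SUM(l.lo_revenue) AS lo_revenue
FROM lineorder l
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey
GROUP BY d.d_year, c.c_region, p.p_category, p.p_brand1, p.p_mfgr;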

Page 37: Best Practices for Big Data : Visualizing billions of rows ...

Compare cost vs. how other tools use SQL
• Build lots of dashboards against a schema used by lots of users
• Optimize with a new aggregate table or other schema changes
• How many places do you have to change objects?

Diagram: with a competitor's SQL-based dashboards, objects must be changed in every dashboard (one X per dashboard); with MicroStrategy, the change is made in one place, the metadata (a single X).

Page 38: Best Practices for Big Data : Visualizing billions of rows ...

Review of Tuning Points

• Other related Systems

• Disk and Network

• Hadoop Tuning

• HDFS – YARN – Queues – Memory settings - JVM

• SQL on Hadoop Tuning

• Hive – Impala – Presto – Spark SQL – Drill

• Collecting Statistics and Data Modeling/De-normalization

• Storage Type : AVRO, Parquet, ORC

• MicroStrategy Tuning

• Governing, Memory, Queues

• Statistics

• Aggregate tables and cubes

Page 39: Best Practices for Big Data : Visualizing billions of rows ...

Sample of Tuning References

• https://www.cloudera.com/documentation/enterprise/5-13-x/topics/impala_performance.html

• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_hive-performance-tuning/bk_hive-performance-tuning.pdf

• http://my.safaribooksonline.com/book/operating-systems-and-server-administration/apache/9781491943199

• https://drill.apache.org/docs/performance-tuning/

• https://streever.atlassian.net/wiki/spaces/HADOOP/pages/2916360/Tuning+Yarn+Containers+-+Memory+Settings

• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_yarn-resource-management/content/about_yarn_resource_allocation.html

• https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_yarn_tuning.html

• https://docs.treasuredata.com/articles/performance-tuning

• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

• https://community.hortonworks.com/content/kbentry/14309/demystify-tez-tuning-step-by-step.html

• https://prestodb.io/docs/current/admin/tuning.html

Page 40: Best Practices for Big Data : Visualizing billions of rows ...


A Word About Benchmarks

Page 41: Best Practices for Big Data : Visualizing billions of rows ...

Example : Smackdown Summary

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown

Page 42: Best Practices for Big Data : Visualizing billions of rows ...

Details with LLAP

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown

Page 43: Best Practices for Big Data : Visualizing billions of rows ...

Sub-Second Analytics with Hive and Druid

https://hortonworks.com/blog/sub-second-analytics-hive-druid

Page 44: Best Practices for Big Data : Visualizing billions of rows ...

Impala TPC-DS Benchmark

https://blog.cloudera.com/blog/2017/04/apache-impala-leads-traditional-analytic-database/

Page 45: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop OLAP Type Tools

Page 46: Best Practices for Big Data : Visualizing billions of rows ...

MOLAP – ROLAP Cube Derived Products

• Products which aggregate and index

• MDX, SQL and REST APIs

• Techniques range from pre-aggregation to streaming to dynamic usage-based aggregation

• Exploit memory and optimized data structures

Examples: LLAP, Hive-Druid

Page 47: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration at 550 Scale

Page 48: Best Practices for Big Data : Visualizing billions of rows ...

Hive Druid with LLAP vs. LLAP Only

Hive Server 2 Interactive LLAP    Hive-Druid Integration
0:25                              0:01
0:25                              0:01
0:25                              0:01

• At 550 Scale – 3.3 billion rows

Page 49: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration at 1 TB Scale

Page 50: Best Practices for Big Data : Visualizing billions of rows ...

Hive Druid with LLAP vs. LLAP Only

Hive Server 2 Interactive LLAP    Hive-Druid Integration
0:40                              0:02
0:40                              0:01
0:37                              0:01

• At 1 TB Scale – ~6 billion rows

• Almost identical test – but different cluster than the 550 scale test

Page 51: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration

• Druid ingests rapidly, aggregates and

indexes

• The integration embeds it into a SQL

architecture to support ad-hoc access
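A rough sketch of what the integration looks like on the Hive side, using the Druid storage handler; the table, columns, and segment granularity are placeholders, Druid requires a timestamp column named __time, and the expression used to build it depends on how dates are represented in your schema.

-- Materialize an SSB-style rollup as a Druid datasource from Hive
CREATE TABLE ssb_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "MONTH")
AS
SELECT CAST(d.d_date AS TIMESTAMP) AS `__time`,  -- assumes a timestamp-compatible date column
       c.c_region,
       p.p_brand1,
       l.lo_revenue
FROM lineorder l
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey;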

Page 52: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Keys

https://hortonworks.com/blog/apache-hive-druid-part-1-3/

• Indexing

• Column-Store

• Pre-aggregation

• In-memory

Page 53: Best Practices for Big Data : Visualizing billions of rows ...

Yahoo History and Approach Was a Factor

• 24 Terabyte MSAS Cube at Yahoo

• Built a Hadoop version of MDX engine

• Uses XMLA/MDX as primary interface

• Also provides a SQL based interface over the cube

• Modeling to translate the underlying sources into the cube structure

• Sits between the client and the underlying sources

• Has Adaptive Caching and other options to optimize how and when cubes are built

• Both AtScale and Kyvos have Yahoo ties

Page 54: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Insights

• OLAP Engine for Hadoop

• Sub-second response times over Trillions of Rows of Data

• On premise or Cloud

• Recently released version 5

• MSAS Implementation of MDX

• Works with all Big Data storage platforms, both on premise and in the Cloud.

• Supports Cloudera, Hortonworks, MapR, as well as Apache Hadoop.

• Supports all cloud platforms including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

• Leverages Spark and MapReduce

Page 55: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Architecture

Page 56: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Transformation

Page 57: Best Practices for Big Data : Visualizing billions of rows ...

Dremio

• Data-as-a-Service Platform

• Horizontally scalable architecture

• Apache Arrow execution support from 1 to 1,000+ nodes, supporting cloud deployments with an elastic compute feature

• Data Reflections™ accelerate data and queries, up to 1,000x faster, supporting relational algebra

• Optimizes execution on specific data sources with native push down

• Cost based query planner generates plans for optimal execution with Data Reflections and native push down

Page 58: Best Practices for Big Data : Visualizing billions of rows ...

AtScale Architecture

https://www.slideshare.net/Pivotal/achieving-megascale-business-intelligence-through-speed-of-thought-analytics-on-hadoop

Page 59: Best Practices for Big Data : Visualizing billions of rows ...

Jethro

Jethro White Paper – October 2017

Page 60: Best Practices for Big Data : Visualizing billions of rows ...

Indexima And MicroStrategy Integration

Diagram: the MicroStrategy platform (powerful, comprehensive, unified; build, deploy, govern and maintain analytics and mobility applications across enterprise analytics, enterprise reporting, big data, data discovery, embedded analytics, enterprise mobility, mobile analytics, mobile productivity, external apps, and telemetry/IoT) connects to Indexima's Data Space, K-Store, and Hyper Index layer over data sources such as AWS S3 logs.

Page 61: Best Practices for Big Data : Visualizing billions of rows ...

Indexima Reduces Time to Data

Diagram (Indexima positioning): database access to raw data, aggregated data, and in-memory MicroStrategy cubes are compared on time to data (months, weeks, minutes, seconds), response time (slow vs. instant), infrastructure cost, the project effort between IT and business teams, restricted views, and ease of fine-grained and cross analysis.

Page 62: Best Practices for Big Data : Visualizing billions of rows ...


Intelligent Cubes for Concurrency

Page 63: Best Practices for Big Data : Visualizing billions of rows ...

Big Data Apps Leverage Dynamic Sourcing and SQL Engine

Diagram: query a multi-source schema model spanning the RDBMS and Hadoop, and drill down from Intelligent Cubes to Hadoop, using the agg-aware SQL engine, dynamic sourcing with cubes, caching, and SQL access.

Page 64: Best Practices for Big Data : Visualizing billions of rows ...

MicroStrategy Intelligent Cubes

• Add one or more Intelligent Cubes

• These tests use Dynamic Sourcing

• At 1 TB Scale - ~ 6 billion rows

Before After

0:40 0:01

0:40 0:01

0:37 0:01

Page 65: Best Practices for Big Data : Visualizing billions of rows ...

First for Speed

Page 66: Best Practices for Big Data : Visualizing billions of rows ...

Also for Concurrency

• Achieve more concurrency with Intelligent Cubes and Dynamic Sourcing

• Aggregate Table

• Jethro

• Hive-Druid

• Reduced communications to Cluster

• Memory closer to Visualization and other Action applications

• Fewer resources consumed on the cluster, leaving more for running other applications

• Why connect and transform the same way 10,000x ?

• Plenty of options to keep data as fresh as possible

Page 67: Best Practices for Big Data : Visualizing billions of rows ...

CONFIDENTIAL: The Information Contained In This Presentation Is Confidential And Proprietary To MicroStrategy. The Recipient Of This Document Agrees That They Will Not Disclose Its Contents To Any Third Party Or Otherwise Use This Presentation For Any Purpose Other Than An Evaluation Of MicroStrategy's Business Or Its Offerings. Reproduction or Distribution Is Prohibited.

Introduction to In-Memory BI

In-Memory Queries Exhibit Extraordinary Performance Characteristics

Avg. Wait Time = 1.70 sec
Max. Throughput of 4,598 user queries/min

Chart: user wait time and queries per minute plotted against the number of users / usage level (scale), with axis ticks running to 5,000 queries per minute and 50,000 users and a wait-time axis from 1s to 5s, showing extraordinarily consistent and fast response time up to a very high level of usage and users.

Test Configuration:
Intelligence Server: 16 Xeon CPUs, 144 GB memory
Web Server: 16 Xeon CPUs, 144 GB memory

Test Suite: 76 Grid/Graph reports, 5 dashboards

Database: 1 TB transactional data, 8 billion records

Page 68: Best Practices for Big Data : Visualizing billions of rows ...

Options for Adding Data

Page 69: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with Kyvos using Live Connection

Page 70: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with Existing Objects

Cube? Dynamic Sourcing? Aggregate Table? Live? It's transparently tunable.

Page 71: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with In-Memory Dataset

Page 73: Best Practices for Big Data : Visualizing billions of rows ...


Last Things

Page 74: Best Practices for Big Data : Visualizing billions of rows ...

One More…Replace with performance demo

Page 75: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Hadoop Gateway is a long running Spark application

Page 76: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Each request generates a job

Page 77: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Each job consists of stages of parallelized tasks

Sweet Spot: Build and wrangle very large MTDI cubes in a fraction of the ODBC time.

Page 78: Best Practices for Big Data : Visualizing billions of rows ...

Hadoop 3: The Next Generation for Hadoop
• First class support for long running YARN services

• First class support for Docker on YARN

• Erasure coding results in less storage overhead and lower costs

• Multiple namespaces for Namenode Federation, improving scalability

• Support for multiple standby Namenodes

• Timeline Service v2 improves scalability and reliability

• Enables scheduling of additional resources, including disks and GPUs, for better integration with containers, deep learning and machine learning

• Support for GPUs enhances the performance of computations required for Data Science and AI use cases

• Intra-node disk balancing

• Supports intra-queue preemption, allowing preemption between applications within a single queue to support job prioritization based on user limits and/or application priority

• First class FPGA support on YARN

• Since Hadoop 2.8.3, about 1 million new lines of code have been added

• Use of affinity and anti-affinity labels to control how micro-services are deployed

Page 79: Best Practices for Big Data : Visualizing billions of rows ...

Hive 3 Materialized Views and Druid

• Pre-computation of relevant aggregates and joins

• Materialized views support automatic query rewriting based on materializations

• Materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers

• Materialized views can exploit Hive LLAP acceleration

• Optimizer uses Apache Calcite to produce full and partial rewriting of query expressions comprising projections, filters, join and aggregation operations

• Note that Index Creation has been dropped in HDP 3/Hive 3
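A minimal HiveQL sketch of a materialized view that pre-computes the join and aggregation behind the drill-down query used earlier; the view name and columns are illustrative, and with rewriting enabled the optimizer can answer matching queries from the materialization instead of the base tables.

-- Pre-compute a join + aggregation as a materialized view (Hive 3)
CREATE MATERIALIZED VIEW mv_revenue_year_region_brand
STORED AS ORC AS
SELECT d.d_year,
       c.c_region,
       p.p_brand1,
       SUM(l.lo_revenue) AS lo_revenue
FROM lineorder l
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey
GROUP BY d.d_year, c.c_region, p.p_brand1;

-- Automatic query rewriting can be toggled per view
ALTER MATERIALIZED VIEW mv_revenue_year_region_brand ENABLE REWRITE;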

Page 80: Best Practices for Big Data : Visualizing billions of rows ...


MicroStrategy Consulting

Big Data Advisory

Best practice guidance to ensure you utilize the right connectors and gateways to bring Big Data to your enterprise.

MicroStrategy.com/Services

Page 81: Best Practices for Big Data : Visualizing billions of rows ...


Visit microstrategy.com/request-benefits to explore consulting services custom-built to help you become a more Intelligent Enterprise—and available at no cost to you.

Enterprise Support Program: Because we are vested in your success

Reinvesting in you.

Page 82: Best Practices for Big Data : Visualizing billions of rows ...


Summary

Page 83: Best Practices for Big Data : Visualizing billions of rows ...

Kepler’s Third Law And Sir Isaac Newton

• That took another 10 years after Astronomia Nova.

“The Squares of the sidereal periods (or revolution) of the planets are directly proportional to the cubes of their mean distances from the Sun.”

The Harmony of the World

• From Kepler’s laws, Newton was able to establish his Law of Universal Gravitation.

• Along the way he invented The Calculus and codified Classical Mechanics…

• That led to the Industrial Revolution….

Philosophiæ Naturalis Principia Mathematica

Page 84: Best Practices for Big Data : Visualizing billions of rows ...

From Kepler to My Summary….

Page 85: Best Practices for Big Data : Visualizing billions of rows ...

Summary

• Use snowflake schemas and pre-join where needed

• Use the latest versions of Hadoop software as soon as possible

• Make sure to use the latest features built for SQL on Hadoop performance

• Make sure to have a big enough cluster – big enough nodes – lots of memory

• Size – tune – monitor – benchmark at all levels – forever…

• Make use of strategic aggregate tables

• Leverage Index/Aggregate solutions

• Use intelligent cubes for performance and concurrency

• Leverage additional 3rd party tools for greater aggregation, indexing and memory exploitation

Page 86: Best Practices for Big Data : Visualizing billions of rows ...