Best Practices for Big Data: Visualizing billions of rows with rapid response times


Transcript of Best Practices for Big Data : Visualizing billions of rows ...

Page 1: Best Practices for Big Data : Visualizing billions of rows ...

Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved.

Best Practices for Big Data: Visualizing billions of rows with rapid response times

Anthony Maresco

“Big Data Analytics at the Speed of Thought”

Page 2: Best Practices for Big Data : Visualizing billions of rows ...


Safe Harbor Notice

This presentation describes features that are under development by MicroStrategy. The objective of this presentation is to provide insight into MicroStrategy’s technology direction. The functionalities described herein may or may not be released as shown.

This presentation contains statements that may constitute “forward-looking statements” for purposes of the safe harbor provisions under the Private Securities Litigation Reform Act of 1995, including descriptions of technology and product features that are under development and estimates of future business prospects. Forward-looking statements inherently involve risks and uncertainties that could cause actual results of MicroStrategy Incorporated and its subsidiaries (collectively, the “Company”) to differ materially from the forward-looking statements. Factors that could contribute to such differences include: the Company’s ability to meet product development goals while aligning costs with anticipated revenues; the Company’s ability to develop, market, and deliver on a timely and cost-effective basis new or enhanced offerings that respond to technological change or new customer requirements; the extent and timing of market acceptance of the Company’s new offerings; continued acceptance of the Company’s other products in the marketplace; the timing of significant orders; competitive factors; general economic conditions; and other risks detailed in the Company’s Form 10-Q for the three months ended September 30, 2018 and other periodic reports filed with the Securities and Exchange Commission. By making these forward-looking statements, the Company undertakes no obligation to update these statements for revisions or changes after the date of this presentation.


Page 3: Best Practices for Big Data : Visualizing billions of rows ...

Topics

• Our Performance Scenario and Objective

• First-Order Best Practices

• A Note on Benchmarks

• Hadoop OLAP Type Tools

• Intelligent Cubes for Concurrency

• Last Things

• Summary

• Q&A

Page 4: Best Practices for Big Data : Visualizing billions of rows ...


Special Thanks To….

HF Chadeisson

Principal Solutions Architect

Page 5: Best Practices for Big Data : Visualizing billions of rows ...


Our Performance Scenario and Objectives

Page 6: Best Practices for Big Data : Visualizing billions of rows ...

Objective

• Compare performance impact of changes with a simple drill-down scenario using TPC-H SSB benchmark data

• Progress from slow to fast using a combination of Hadoop features and MicroStrategy features

• “Speed of Thought” Big Data Analytics with billions of rows

• Get an introduction to new developments that improve the picture

• Look at how some of these capabilities optimize Dossier interactive performance

Page 7: Best Practices for Big Data : Visualizing billions of rows ...

Minimize this….

Page 8: Best Practices for Big Data : Visualizing billions of rows ...

Johannes Kepler

“Ships and sails proper for the heavenly air should be fashioned. Then there will also be people, who do not shrink from the dreary vastness of space.”

Johannes Kepler writing to Galileo Galilei in 1609

Giordano Bruno was burned at the stake in 1600 for declaring the universe was infinite with countless suns and earths.

Page 9: Best Practices for Big Data : Visualizing billions of rows ...

Performance – Concurrency - Structure

• “Initially, having any way to use SQL against the Hadoop data was the goal, now there is an increasing requirement to connect business users … and give them the performance they expect with high levels of concurrency.”

• “Note that to meet this requirement, it is likely that users will need to have structured data stored in Hadoop (along with the original unstructured data), as good performance is more likely if a transformation is done once rather than per-query”

https://dzone.com/articles/sql-and-hadoop

Page 10: Best Practices for Big Data : Visualizing billions of rows ...

déjà vu All Over Again

• ROLAP vs. MOLAP

• In-memory

• Columnar

• Indexing

• Pre-Aggregation

• MPP Features added to the data stores

• And more…

• All taken to the next level to deal with Big Data

Page 11: Best Practices for Big Data : Visualizing billions of rows ...

Live Access to Data Sources with Unparalleled Performance
Quickly generate multi-pass SQL and leverage push-down functionality for optimized performance

Dynamic Sourcing
MicroStrategy lets users seamlessly run queries and dynamically drill across multiple sources. The server is able to auto-recognize tasks and intelligently direct queries against in-memory cubes when possible.

Generate Multipass SQL
MicroStrategy is able to easily and quickly generate multi-pass SQL queries to provide greater analytical power and minimize the amount of data that is pulled back to the mid-tier. The sophisticated SQL engine can deliver high performance for the most complex SQL computations.

Pushdown Architecture
MicroStrategy leverages the database to its full extent by pushing data joins and analytic calculations to the database when possible. Every connector is optimized for high performance by pushing down functions to leverage the power of the database in conjunction with the server.

MicroStrategy Analytics Platform

Page 12: Best Practices for Big Data : Visualizing billions of rows ...

Deliver a single version of the truth on top of your big data
Leverage the unified metadata repository to design reusable objects for rapid development, governance and scale

Architecture diagram: clients (Web, Desktop, iPhone, iPad, Android) work against a shared data model, reusable objects, documents, and applications, which connect to enterprise data and business applications (relational databases, Hadoop and big data, cloud-based data, personal or departmental data, and enterprise applications), with centralized security, centralized administration, scalability, and extensibility.

Page 13: Best Practices for Big Data : Visualizing billions of rows ...

What is “Speed of Thought” Analytics ?

• 3 seconds ?

• 7 seconds ?

• 10 seconds ?

• 15 seconds ?

• 30 seconds ?

• More ???

• All have been quoted!

Sometimes you have to wait… but you can set and manage SLAs for X% of your workload… and evolve…

Page 14: Best Practices for Big Data : Visualizing billions of rows ...

Data – Hardware – Software - Facts

• TPC-H SSB at various scales up to 1Terabyte

• ~6 billion rows in the largest fact table – also 1.2 and 3.3 billion row tests

• 17 billion row demo by Indexima

• 30 billion row demo provided by Kyvos during the conference

• 1.2 billion row Taxi dataset used to show Dossier performance with several techniques and products referenced

• 10 Node clusters with 8 data/worker nodes

• Worker nodes have 8 vCPU and 61 GB

• Use ORC file types

• Hortonworks HDP 2.6.3

• 10 Node cluster with HDP 3.1.0 provided preliminary information

Page 15: Best Practices for Big Data : Visualizing billions of rows ...

Considerations

• Plenty of new turf and it’s continually changing

• SQL is still key as NoSQL or NewSQL

• Add in semi-structured data and data mining

• SQL on Hadoop products are evolving quickly, with rapid increases in functionality and performance. Performance can double, triple or more in some cases within a 6-month period.

• Ultimate decisions depend on testing with your data

• Schema on read is key for iteration and flexibility in the back end

• Schemas are still important to operationalize analytics for large numbers of users

• Use of aggregation, aggregate-aware SQL generation, caching, cubes, and dynamic sourcing are keys to performance and scalability

• Memory Caching is a key infrastructure component

• Agility means “Everything is finished…..Nothing is finished…”

Page 16: Best Practices for Big Data : Visualizing billions of rows ...

Prescription

• Use the tools in the platform

• Use MicroStrategy to make them better

• Add additional tools & components where warranted

• Use MicroStrategy to make them better

• Use of aggregation, aggregation-aware SQL generation, caching, cubes, and dynamic sourcing are tools for performance and scalability

• Ensure you have the resources, knowledge, and time allocation for initial and ongoing tuning – sizing – monitoring – benchmarking

• With MicroStrategy you get performance, scalability, governance, and concurrency to securely distribute insight with Big Data Analytics to tens of thousands of people

• Additional Big Data techniques and tools are required when volume – velocity – variety get to a tipping point

Page 17: Best Practices for Big Data : Visualizing billions of rows ...


First-Order Best Practices

Page 18: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Minimize Joins

• SSB Version of TPC-H eliminates joins but is missing lookup tables

• Snowflake is preferred with both dimension and lookup tables

Diagram: TPC-H schema vs. SSB schema

Page 19: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Partitioning

• Hive and Impala recommend partitioning to minimize scanned rows

• This requires including a filter based on the partition in every query

• It is important to pick partition columns so there are the right number of partitions

• Partitioning can have other side effects

• In an ad-hoc environment, you may not always be using the partitioning filter

• In this study, partitioning was not used, so the relative impact of the other changes could be seen

• Bucketing can also be used in conjunction with partitioning (see the sketch below)
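A minimal HiveQL sketch of partitioning plus bucketing, using a hypothetical copy of the SSB lineorder table; the derived lo_orderyear partition column, the bucket count, and the column list are illustrative only.

-- Partition by order year and bucket by customer key (illustrative values)
CREATE TABLE lineorder_part (
  lo_orderkey   BIGINT,
  lo_custkey    BIGINT,
  lo_partkey    BIGINT,
  lo_orderdate  INT,
  lo_revenue    BIGINT
)
PARTITIONED BY (lo_orderyear INT)
CLUSTERED BY (lo_custkey) INTO 32 BUCKETS
STORED AS ORC;

-- Queries that filter on the partition column only scan the matching partitions
SELECT SUM(lo_revenue)
FROM lineorder_part
WHERE lo_orderyear = 1997;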

Page 20: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Statistics

• Always collect statistics on tables and columns in order to leverage cost based optimization

• Example for Table

ANALYZE TABLE table1 COMPUTE STATISTICS;

• Example for Columns

ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS;
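For reference, the partition-level variants in Hive and the Impala equivalent look like the following; the table, partition, and column names are placeholders.

-- Hive: statistics for one partition of a partitioned table
ANALYZE TABLE lineorder_part PARTITION (lo_orderyear=1997) COMPUTE STATISTICS;
ANALYZE TABLE lineorder_part PARTITION (lo_orderyear=1997) COMPUTE STATISTICS FOR COLUMNS;

-- Impala: table and column statistics in a single statement
COMPUTE STATS table1;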

Page 21: Best Practices for Big Data : Visualizing billions of rows ...

Preliminaries : Optimized File Types

• Parquet

• ORC

• AVRO
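As an illustration, a common pattern is to land raw data in a text-format staging table and rewrite it into one of these columnar formats; the table names and compression codec below are placeholders.

-- Rewrite a text staging table as ORC with ZLIB compression
CREATE TABLE lineorder_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM lineorder_staging_text;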

Page 22: Best Practices for Big Data : Visualizing billions of rows ...

Use the Latest Versions and Features

• Example: move from HiveServer2 to HiveServer2 Interactive LLAP

• At 550 Scale – 3.3 Billion Rows

• Times are mm:ss

Hive Server 2    Hive Server 2 Interactive LLAP
7:58             0:25
6:01             0:25
5:41             0:26

Page 23: Best Practices for Big Data : Visualizing billions of rows ...

SQL and Row Counts

Table        Rows
lineorder    5,999,989,709
dates        2,556
customer     30,000,000
part         200,000,000
supplier     2,000,000

select a14.d_year d_year,
       a13.c_region c_region,
       a12.p_category p_category,
       a12.p_brand1 p_brand1,
       a12.p_mfgr p_mfgr,
       sum(a11.lo_revenue) WJXBFS1
from lineorder a11
  join part a12
    on (a11.lo_partkey = a12.p_partkey)
  join customer a13
    on (a11.lo_custkey = a13.c_custkey)
  join dates a14
    on (a11.lo_orderdate = a14.d_datekey)
where (a13.c_region = 'EUROPE'
  and a12.p_brand1 = 'MFGR#427')
group by a14.d_year,
         a13.c_region,
         a12.p_category,
         a12.p_brand1,
         a12.p_mfgr

Page 24: Best Practices for Big Data : Visualizing billions of rows ...

Dynamic Runtime Filtering

https://hortonworks.com/blog/top-5-performance-boosters-with-apache-hive-llap/
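For reference, the Hive properties most commonly associated with this behavior are shown below; this is an illustrative subset, and the exact names and defaults should be checked against your Hive/HDP version.

SET hive.tez.dynamic.partition.pruning=true;   -- prune partitions at runtime using join keys
SET hive.tez.dynamic.semijoin.reduction=true;  -- build runtime bloom filters to skip non-matching rows early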

Page 25: Best Practices for Big Data : Visualizing billions of rows ...

Hive Server 2

Page 26: Best Practices for Big Data : Visualizing billions of rows ...

Hive Tez Query Processing

Page 27: Best Practices for Big Data : Visualizing billions of rows ...

Hive Server 2 Interactive LLAP Architecture

Low Latency Analytical Processing … Live Long and Process

Page 28: Best Practices for Big Data : Visualizing billions of rows ...

Impala Architecture

Page 29: Best Practices for Big Data : Visualizing billions of rows ...

Presto Architecture

https://adtmag.com/articles/2015/06/08/teradata-presto.aspx

Page 30: Best Practices for Big Data : Visualizing billions of rows ...

Ambari Settings for HiveServer 2 Interactive/LLAP
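The screenshot itself is not reproduced here. As a rough companion, a few of the Hive properties that typically accompany an LLAP deployment are listed below; this is an illustrative subset (set at the server or session level), and names and values should be verified against your HDP version.

SET hive.execution.mode=llap;        -- run query fragments inside the LLAP daemons
SET hive.llap.execution.mode=all;    -- allow all eligible operators to execute in LLAP
SET hive.llap.io.enabled=true;       -- use the LLAP in-memory columnar I/O layer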

Page 31: Best Practices for Big Data : Visualizing billions of rows ...

The Numbers Game

• Adding more nodes

• Adding more memory

• Does the whole table fit in memory ?

• How much does a petabyte of memory cost ?

• Caching tables

• Keeping the most important things in memory

• Constantly reviewing what is “most important”

• Changing what’s in memory

• Querying detail vs. querying an intermediate aggregate

• Using complex SQL

Page 32: Best Practices for Big Data : Visualizing billions of rows ...

More Nodes – Bigger Nodes

• Small cluster: 6 nodes (5 data nodes), each with 8 vCPU and 30 GB RAM

• Large cluster: 10 nodes (8 data nodes), each with 8 vCPU and 61 GB RAM

• Both tests used HiveServer 2 Interactive LLAP

• At 200 Scale – 1.2 billion rows

Small Cluster    Larger Cluster
00:40            00:11
00:45            00:11
00:46            00:11

Page 33: Best Practices for Big Data : Visualizing billions of rows ...

Astronomia Nova: Kepler’s Laws

• Hired by Tycho Brahe to analyze Mars orbital observations in 1600

• Appointed Imperial Mathematician by the Holy Roman Emperor Rudolph II in 1601 succeeding Brahe after his death

• He took 8 years to analyze the data and produce the report

• Proved Mars’ orbit was elliptical and that Earth was a planet and revolved around the Sun

• And he wasn’t burned at the stake…..

“All Planets move about the Sun in elliptical orbits, having the Sun as one of the foci.”

“New Astronomy Based on Causations or a Celestial Physics Derived from Investigations of the Motion of Mars Founded on the Observations of the Noble Tycho Brahe.”

“A radius vector joining any planet to the Sun sweeps out equal areas in equal lengths of time.”

Page 34: Best Practices for Big Data : Visualizing billions of rows ...

Kepler’s Dashboards and Dossier

Page 35: Best Practices for Big Data : Visualizing billions of rows ...

Aggregate Tables at 1 TB Scale - ~6 Billion Rows

Page 36: Best Practices for Big Data : Visualizing billions of rows ...

MicroStrategy Aggregate-Aware SQL

• Add one or more intermediate aggregates

• Pick aggregates strategically to help the most popular requests

• At 1 TB Scale - ~6 billion rows

Before After

0:40 0:02

0:40 0:01

0:37 0:01
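A minimal sketch of the kind of intermediate aggregate behind these numbers, rolled up from the drill-down SQL shown earlier; the table name and grouping columns are illustrative. In MicroStrategy, the aggregate table is mapped to the same attributes so the aggregate-aware SQL engine can select it automatically.

-- Pre-aggregate revenue at the year / region / brand grain (illustrative)
CREATE TABLE agg_revenue_year_region_brand STORED AS ORC AS
SELECT d.d_year,
       c.c_region,
       p.p_category,
       p.p_brand1,
       p.p_mfgr,
       SUM(l.lo_revenue) AS lo_revenue
FROM lineorder l
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey
GROUP BY d.d_year, c.c_region, p.p_category, p.p_brand1, p.p_mfgr;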

Page 37: Best Practices for Big Data : Visualizing billions of rows ...

Compare cost vs. how other tools use SQL
• Build lots of dashboards against a schema used by lots of users
• Optimize with a new aggregate table or other schema changes
• How many places do you have to change objects?

Diagram: with a competitor's SQL-based dashboards, objects must be changed in every dashboard (one X per dashboard); with MicroStrategy, the change is made in one place, the metadata (a single X).

Page 38: Best Practices for Big Data : Visualizing billions of rows ...

Review of Tuning Points

• Other related Systems

• Disk and Network

• Hadoop Tuning

• HDFS – YARN – Queues – Memory settings - JVM

• SQL on Hadoop Tuning

• Hive – Impala – Presto – Spark SQL – Drill

• Collecting Statistics and Data Modeling/De-normalization

• Storage Type : AVRO, Parquet, ORC

• MicroStrategy Tuning

• Governing, Memory, Queues

• Statistics

• Aggregate tables and cubes

Page 39: Best Practices for Big Data : Visualizing billions of rows ...

Sample of Tuning References

• https://www.cloudera.com/documentation/enterprise/5-13-x/topics/impala_performance.html

• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_hive-performance-tuning/bk_hive-performance-tuning.pdf

• http://my.safaribooksonline.com/book/operating-systems-and-server-administration/apache/9781491943199

• https://drill.apache.org/docs/performance-tuning/

• https://streever.atlassian.net/wiki/spaces/HADOOP/pages/2916360/Tuning+Yarn+Containers+-+Memory+Settings

• https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_yarn-resource-management/content/about_yarn_resource_allocation.html

• https://www.cloudera.com/documentation/enterprise/5-13-x/topics/cdh_ig_yarn_tuning.html

• https://docs.treasuredata.com/articles/performance-tuning

• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

• https://community.hortonworks.com/content/kbentry/14309/demystify-tez-tuning-step-by-step.html

• https://prestodb.io/docs/current/admin/tuning.html

Page 40: Best Practices for Big Data : Visualizing billions of rows ...


A Word About Benchmarks

Page 41: Best Practices for Big Data : Visualizing billions of rows ...

Example : Smackdown Summary

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown

Page 42: Best Practices for Big Data : Visualizing billions of rows ...

Details with LLAP

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown

Page 43: Best Practices for Big Data : Visualizing billions of rows ...

Sub-Second Analytics with Hive and Druid

https://hortonworks.com/blog/sub-second-analytics-hive-druid

Page 44: Best Practices for Big Data : Visualizing billions of rows ...

Impala TPC-DS Benchmark

https://blog.cloudera.com/blog/2017/04/apache-impala-leads-traditional-analytic-database/

Page 45: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop OLAP Type Tools

Page 46: Best Practices for Big Data : Visualizing billions of rows ...

MOLAP – ROLAP Cube Derived Products

• Products which aggregate and index

• MDX, SQL and REST APIs

• Techniques range from pre-aggregation to streaming to dynamic usage-based aggregation

• Exploit memory and optimized data structures

Examples: LLAP, Hive-Druid

Page 47: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration at 550 Scale

Page 48: Best Practices for Big Data : Visualizing billions of rows ...

Hive Druid with LLAP vs. LLAP Only

Hive Server 2 Interactive LLAP    Hive-Druid Integration
0:25                              0:01
0:25                              0:01
0:25                              0:01

• At 550 Scale – 3.3 billion rows

Page 49: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration at 1 TB Scale

Page 50: Best Practices for Big Data : Visualizing billions of rows ...

Hive Druid with LLAP vs. LLAP Only

Hive Server 2 Interactive LLAP    Hive-Druid Integration
0:40                              0:02
0:40                              0:01
0:37                              0:01

• At 1 TB Scale – ~6 billion rows

• Almost identical test – but different cluster than the 550 scale test

Page 51: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Integration

• Druid ingests rapidly, aggregates and

indexes

• The integration embeds it into a SQL

architecture to support ad-hoc access
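A rough sketch of what the integration looks like on the Hive side, using the Druid storage handler; the table, columns, and segment granularity are placeholders, Druid requires a timestamp column named __time, and the expression used to build it depends on how dates are represented in your schema.

-- Materialize an SSB-style rollup as a Druid datasource from Hive
CREATE TABLE ssb_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "MONTH")
AS
SELECT CAST(d.d_date AS TIMESTAMP) AS `__time`,  -- assumes a timestamp-compatible date column
       c.c_region,
       p.p_brand1,
       l.lo_revenue
FROM lineorder l
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey;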

Page 52: Best Practices for Big Data : Visualizing billions of rows ...

Hive-Druid Keys

https://hortonworks.com/blog/apache-hive-druid-part-1-3/

• Indexing

• Column-Store

• Pre-aggregation

• In-memory

Page 53: Best Practices for Big Data : Visualizing billions of rows ...

Yahoo History and Approach Was a Factor

• 24 Terabyte MSAS Cube at Yahoo

• Built a Hadoop version of MDX engine

• Uses XMLA/MDX as primary interface

• Also provides a SQL based interface over the cube

• Modeling to translate the underlying sources into the cube structure

• Sits between the client and the underlying sources

• Has Adaptive Caching and other options to optimize how and when cubes are built

• Both AtScale and Kyvos have Yahoo ties

Page 54: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Insights

• OLAP Engine for Hadoop

• Sub-second response times over Trillions of Rows of Data

• On premise or Cloud

• Recently released version 5

• MSAS Implementation of MDX

• Works with all Big Data storage platforms, both on premise and in the Cloud.

• Supports Cloudera, Hortonworks, MapR, as well as Apache Hadoop.

• Supports all cloud platforms including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

• Leverages Spark and MapReduce

Page 55: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Architecture

Page 56: Best Practices for Big Data : Visualizing billions of rows ...

Kyvos Transformation

Page 57: Best Practices for Big Data : Visualizing billions of rows ...

Dremio

• Data-as-a-Service Platform

• Horizontally scalable architecture

• Apache Arrow execution support from 1 to 1,000+ nodes, supporting cloud deployments with an elastic compute feature

• Data Reflections™ accelerate data and queries, up to 1,000x faster, supporting relational algebra

• Optimizes execution on specific data sources with native push down

• Cost based query planner generates plans for optimal execution with Data Reflections and native push down

Page 58: Best Practices for Big Data : Visualizing billions of rows ...

AtScale Architecture

https://www.slideshare.net/Pivotal/achieving-megascale-business-intelligence-through-speed-of-thought-analytics-on-hadoop

Page 59: Best Practices for Big Data : Visualizing billions of rows ...

Jethro

Jethro White Paper – October 2017

Page 60: Best Practices for Big Data : Visualizing billions of rows ...

Indexima And MicroStrategy Integration

Diagram: the MicroStrategy platform (powerful, comprehensive, unified; build, deploy, govern and maintain analytics and mobility applications across enterprise analytics, enterprise reporting, big data, data discovery, embedded analytics, enterprise mobility, mobile analytics, mobile productivity, external apps, and telemetry/IoT) connects to Indexima's Data Space, K-Store, and Hyper Index layer over data sources such as AWS S3 logs.

Page 61: Best Practices for Big Data : Visualizing billions of rows ...

Indexima Reduces Time to Data

Diagram (Indexima positioning): database access to raw data, aggregated data, and in-memory MicroStrategy cubes are compared on time to data (months, weeks, minutes, seconds), response time (slow vs. instant), infrastructure cost, the project effort between IT and business teams, restricted views, and ease of fine-grained and cross analysis.

Page 62: Best Practices for Big Data : Visualizing billions of rows ...


Intelligent Cubes for Concurrency

Page 63: Best Practices for Big Data : Visualizing billions of rows ...

Big Data Apps Leverage Dynamic Sourcing and SQL Engine

Diagram: query a multi-source schema model spanning the RDBMS and Hadoop, and drill down from Intelligent Cubes to Hadoop, using the agg-aware SQL engine, dynamic sourcing with cubes, caching, and SQL access.

Page 64: Best Practices for Big Data : Visualizing billions of rows ...

MicroStrategy Intelligent Cubes

• Add one or more Intelligent Cubes

• These tests use Dynamic Sourcing

• At 1 TB Scale - ~ 6 billion rows

Before After

0:40 0:01

0:40 0:01

0:37 0:01

Page 65: Best Practices for Big Data : Visualizing billions of rows ...

First for Speed

Page 66: Best Practices for Big Data : Visualizing billions of rows ...

Also for Concurrency

• Achieve more concurrency with Intelligent Cubes and Dynamic Sourcing

• Aggregate Table

• Jethro

• Hive-Druid

• Reduced communications to Cluster

• Memory closer to Visualization and other Action applications

• Fewer resources consumed on the cluster, leaving more for running other applications

• Why connect and transform the same way 10,000x ?

• Plenty of options to keep data as fresh as possible

Page 67: Best Practices for Big Data : Visualizing billions of rows ...

CONFIDENTIAL: The Information Contained In This Presentation Is Confidential And Proprietary To MicroStrategy. The Recipient Of This Document Agrees That They Will Not Disclose Its Contents To Any Third Party Or Otherwise Use This Presentation For Any Purpose Other Than An Evaluation Of MicroStrategy's Business Or Its Offerings. Reproduction or Distribution Is Prohibited.

Introduction to In-Memory BI

In-Memory Queries Exhibit Extraordinary Performance Characteristics

Avg. Wait Time = 1.70 sec
Max. Throughput of 4,598 user queries/min

Chart: user wait time and queries per minute plotted against the number of users / usage level (scale), with axis ticks running to 5,000 queries per minute and 50,000 users and a wait-time axis from 1s to 5s, showing extraordinarily consistent and fast response time up to a very high level of usage and users.

Test Configuration:
Intelligence Server: 16 Xeon CPUs, 144 GB memory
Web Server: 16 Xeon CPUs, 144 GB memory

Test Suite: 76 Grid/Graph reports, 5 dashboards

Database: 1 TB transactional data, 8 billion records

Page 68: Best Practices for Big Data : Visualizing billions of rows ...

Options for Adding Data

Page 69: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with Kyvos using Live Connection

Page 70: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with Existing Objects

Cube? Dynamic Sourcing? Aggregate Table? Live? It's transparently tunable.

Page 71: Best Practices for Big Data : Visualizing billions of rows ...

Taxi with In-Memory Dataset

Page 73: Best Practices for Big Data : Visualizing billions of rows ...


Last Things

Page 74: Best Practices for Big Data : Visualizing billions of rows ...

One More…Replace with performance demo

Page 75: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Hadoop Gateway is a long running Spark application

Page 76: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Each request generates a job

Page 77: Best Practices for Big Data : Visualizing billions of rows ...


Hadoop Gateway on Spark


Leveraging Spark workflows for data analytics at scale

Each job consists of stages of parallelized tasks

Sweet Spot: Build and wrangle very large MTDI cubes in a fraction of the ODBC time.

Page 78: Best Practices for Big Data : Visualizing billions of rows ...

Hadoop 3: The Next Generation for Hadoop
• First class support for long running YARN services

• First class support for Docker on YARN

• Erasure coding results in less storage overhead and lower costs

• Multiple namespaces for Namenode Federation, improving scalability

• Support for multiple standby Namenodes

• Timeline Service v2 improves scalability and reliability

• Enables scheduling of additional resources, including disks and GPUs, for better integration with containers, deep learning and machine learning

• Support for GPUs enhances the performance of computations required for Data Science and AI use cases

• Intra-node disk balancing

• Supports intra-queue preemption, allowing preemption between applications within a single queue to support job prioritization based on user limits and/or application priority

• First class FPGA support on YARN

• Since Hadoop 2.8.3, about 1 million new lines of code have been added

• Use of affinity and anti-affinity labels to control how micro-services are deployed

Page 79: Best Practices for Big Data : Visualizing billions of rows ...

Hive 3 Materialized Views and Druid

• Pre-computation of relevant aggregates and joins

• Materialized views support automatic query rewriting based on materializations

• Materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers

• Materialized views can exploit Hive LLAP acceleration

• Optimizer uses Apache Calcite to produce full and partial rewriting of query expressions comprising projections, filters, join and aggregation operations

• Note that Index Creation has been dropped in HDP 3/Hive 3
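A minimal HiveQL sketch of a materialized view that pre-computes the join and aggregation behind the drill-down query used earlier; the view name and columns are illustrative, and with rewriting enabled the optimizer can answer matching queries from the materialization instead of the base tables.

-- Pre-compute a join + aggregation as a materialized view (Hive 3)
CREATE MATERIALIZED VIEW mv_revenue_year_region_brand
STORED AS ORC AS
SELECT d.d_year,
       c.c_region,
       p.p_brand1,
       SUM(l.lo_revenue) AS lo_revenue
FROM lineorder l
  JOIN customer c ON l.lo_custkey = c.c_custkey
  JOIN part p     ON l.lo_partkey = p.p_partkey
  JOIN dates d    ON l.lo_orderdate = d.d_datekey
GROUP BY d.d_year, c.c_region, p.p_brand1;

-- Automatic query rewriting can be toggled per view
ALTER MATERIALIZED VIEW mv_revenue_year_region_brand ENABLE REWRITE;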

Page 80: Best Practices for Big Data : Visualizing billions of rows ...


MicroStrategy Consulting

Big Data Advisory

Best practice guidance to ensure you utilize the right connectors and gateways to bring Big Data to your enterprise.

MicroStrategy.com/Services

Page 81: Best Practices for Big Data : Visualizing billions of rows ...


Visit microstrategy.com/request-benefits to explore consulting services custom-built to help you become a more Intelligent Enterprise—and available at no cost to you.

Enterprise Support Program: Because we are vested in your success

Reinvesting in you.

Page 82: Best Practices for Big Data : Visualizing billions of rows ...


Summary

Page 83: Best Practices for Big Data : Visualizing billions of rows ...

Kepler’s Third Law And Sir Isaac Newton

• That took another 10 years after Astronomia Nova.

“The Squares of the sidereal periods (or revolution) of the planets are directly proportional to the cubes of their mean distances from the Sun.”

The Harmony of the World

• From Kepler’s laws, Newton was able to establish his Law of Universal Gravitation.

• Along the way he invented The Calculus and codified Classical Mechanics…

• That led to the Industrial Revolution….

Philosophiæ Naturalis Principia Mathematica

Page 84: Best Practices for Big Data : Visualizing billions of rows ...

From Kepler to My Summary….

Page 85: Best Practices for Big Data : Visualizing billions of rows ...

Summary

• Use snowflake schemas and pre-join where needed

• Use the latest versions of Hadoop software as soon as possible

• Make sure to use the latest features built for SQL on Hadoop performance

• Make sure to have a big enough cluster – big enough nodes – lots of memory

• Size – tune – monitor – benchmark at all levels – forever…

• Make use of strategic aggregate tables

• Leverage Index/Aggregate solutions

• Use intelligent cubes for performance and concurrency

• Leverage additional 3rd party tools for greater aggregation, indexing and memory exploitation

Page 86: Best Practices for Big Data : Visualizing billions of rows ...