A Closer Look at Apache Kudu

By Andriy Zabavskyy Mar 2017

A species of antelope from BigData Zoo

Why Kudu

Analytics on Hadoop before Kudu

Fast Scans Fast Random Access

Weak side of combining Parquet and HBase

• Complex code to manage the flow and synchronization of data between the two systems.

• Manage consistent backups, security policies, and monitoring across multiple distinct systems.

Lambda Architecture Challenges

• In the real world, systems often need to accommodate:
  • Late-arriving data
  • Corrections on past records
  • Privacy-related deletions on data that has already been migrated to the immutable store

Happy Medium

• High throughput. Goal: within 2x of Impala
• Low latency for random reads and writes. Goal: 1 ms on SSD
• SQL and NoSQL style APIs

Fast Scans Fast Random Access

Data Model

Tables, Schemas, Keys

• Kudu is a storage system for tables of structured data

• Schema consisting of a finite number of columns

• Each column has a name and a type:
  • Boolean, Integers, Unixtime_Micros,
  • Floating, String, Binary

Keys

• Some ordered subset of those columns is specified to be the table’s primary key

• The primary key:
  • enforces a uniqueness constraint
  • acts as the sole index by which rows may be efficiently updated or deleted

Write Operations

• Users mutate the table using the Insert, Update, and Delete APIs
  • Note: the primary key must be fully specified
• Java, C++, and Python APIs

• No multi-row transactional APIs:
  • each mutation conceptually executes as its own transaction,
  • despite being automatically batched with other mutations for better performance.

Read Operations

• Scan operation:
  • any number of predicates to filter the results
  • two types of predicates:
    • comparisons between a column and a constant value,
    • and composite primary key ranges.

• A user may specify a projection for a scan.
  • A projection consists of a subset of columns to be retrieved.

Read/Write Python API Sample
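The code from this slide did not survive extraction. The sketch below follows the usage pattern of the kudu-python client and is not the presenter's original sample. The master host and port, the table name, and the column names `key` and `ts_val` are placeholders, the table is assumed to exist already, and a reachable cluster with the `kudu` package installed is required to actually call the function.

```python
from datetime import datetime


def kudu_write_and_read(master_host="kudu.master", master_port=7051,
                        table_name="python-example"):
    """Insert, update, and delete rows, then scan with a predicate.

    Assumes an existing table with an int64 primary key column `key`
    and a unixtime_micros column `ts_val` (hypothetical names).
    """
    import kudu  # kudu-python client, imported lazily so this module loads without it

    client = kudu.connect(host=master_host, port=master_port)
    table = client.table(table_name)

    # Writes go through a session; every operation names the full primary key.
    session = client.new_session()
    session.apply(table.new_insert({"key": 1, "ts_val": datetime(2017, 3, 1)}))
    session.apply(table.new_insert({"key": 2, "ts_val": datetime(2017, 1, 1)}))
    session.apply(table.new_update({"key": 1, "ts_val": datetime(2017, 1, 1)}))
    session.apply(table.new_delete({"key": 2}))
    try:
        session.flush()
    except kudu.KuduBadStatus:
        print(session.get_pending_errors())

    # Scan with a column-versus-constant predicate.
    scanner = table.scanner()
    scanner.add_predicate(table["ts_val"] == datetime(2017, 1, 1))
    return scanner.open().read_all_tuples()
```

Each applied operation is its own transaction even though the session batches them before the flush.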

Storage Layout

Storage Layout Goals

• Fast columnar scans
  • performance comparable to best-of-breed immutable data formats such as Parquet
  • scans served from efficiently encoded columnar data files
• Low-latency random updates
  • O(lg n) lookup complexity for random access
• Consistency of performance
  • The majority of users are willing to trade peak performance for predictability

MemRowSet

• In-memory concurrent B-tree
  • No removal from the tree: deletions are represented by MVCC records instead
  • No arbitrary in-place updates: only modifications that do not change the value size

• Link together leaf nodes for sequential scans

• Row-wise layout
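A minimal sketch of the idea, assuming a sorted Python list in place of the concurrent B-tree and integer timestamps. The class and method names are illustrative and are not Kudu's. Rows are never removed: deletes and updates are appended as timestamped records, and a scan replays only the records visible at its snapshot.

```python
import bisect


class MemRowSetSketch:
    """Sorted in-memory rows; deletes and updates are appended as MVCC records."""

    def __init__(self):
        self._keys = []   # primary keys kept in sorted order for sequential scans
        self._rows = {}   # key -> (insert_ts, row dict, [(ts, op, change), ...])

    def insert(self, key, row, ts):
        if key in self._rows:
            raise KeyError(f"duplicate primary key: {key}")
        bisect.insort(self._keys, key)
        self._rows[key] = (ts, dict(row), [])

    def update(self, key, changes, ts):
        # Mutations are assumed to arrive in increasing timestamp order.
        self._rows[key][2].append((ts, "update", dict(changes)))

    def delete(self, key, ts):
        # The entry stays in the structure; the record hides it from later snapshots.
        self._rows[key][2].append((ts, "delete", None))

    def scan(self, snapshot_ts):
        """Yield (key, row) in key order as of snapshot_ts."""
        for key in self._keys:
            insert_ts, base, mutations = self._rows[key]
            if insert_ts > snapshot_ts:
                continue
            row, deleted = dict(base), False
            for ts, op, change in mutations:
                if ts > snapshot_ts:
                    break
                if op == "delete":
                    deleted = True
                else:
                    row.update(change)
            if not deleted:
                yield key, row
```

The sketch does not handle re-inserting a key after it was deleted.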


DiskRowSet


• Column-organized
  • Each column is written to disk in a single contiguous block of data.
• The column itself is subdivided into small pages, which allows
  • granular random reads, and
  • an embedded B-tree index for seeking to a page by row offset.

Deltas

• A DeltaMemStore is a concurrent B-tree which shares the implementation of MemRowSets

• A DeltaMemStore flushes into a DeltaFile

• A DeltaFile is a simple binary column

Insert Path

• Each DiskRowSet stores a Bloom filter of the set of keys present

• For each DiskRowSet, Kudu stores the minimum and maximum primary key, so an insert only needs to consult the DiskRowSets whose key range may contain the new key
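The duplicate-key check on the insert path can be sketched as follows. This is an illustration under simplifying assumptions, not Kudu's implementation: the Bloom filter is a toy built on SHA-256, the rowsets are scanned linearly rather than indexed by key bounds, and all class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSetSketch:
    def __init__(self, keys):
        self.keys = set(keys)
        self.min_key, self.max_key = min(keys), max(keys)
        self.bloom = BloomFilter()
        for k in keys:
            self.bloom.add(k)
        self.lookups = 0

    def contains(self, key):
        self.lookups += 1   # stands in for an expensive on-disk key lookup
        return key in self.keys


def key_exists(rowsets, key):
    """Return True if any rowset already holds the key."""
    for rs in rowsets:
        if not (rs.min_key <= key <= rs.max_key):
            continue        # pruned by key bounds
        if not rs.bloom.might_contain(key):
            continue        # pruned by the Bloom filter
        if rs.contains(key):
            return True
    return False
```

Only rowsets that pass both cheap checks pay for a real lookup.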

Read Path

• Converts the key range predicate into a row offset range predicate

• Performs the scan one column at a time
  • Seeks the target column to the correct row offset
  • Consults the delta stores to see if any later updates have replaced cells with newer versions
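A sketch of these steps for a single rowset, assuming in-memory lists for the columns and a dictionary for the delta store. The function name and data shapes are illustrative only, and MVCC visibility of deltas is left out.

```python
import bisect


def scan_rowset(sorted_keys, columns, deltas, key_lo, key_hi, projection):
    """Scan keys in [key_lo, key_hi) and return the projected rows.

    sorted_keys: primary keys in rowset order
    columns:     column name -> list of base values, aligned with sorted_keys
    deltas:      (row_offset, column name) -> newer value
    """
    # 1. Convert the key range predicate into a row offset range.
    start = bisect.bisect_left(sorted_keys, key_lo)
    stop = bisect.bisect_left(sorted_keys, key_hi)

    # 2. Read one column at a time, then 3. apply any deltas for those offsets.
    batch = {}
    for name in projection:
        values = columns[name][start:stop]   # slice copies; base data is untouched
        for i in range(len(values)):
            values[i] = deltas.get((start + i, name), values[i])
        batch[name] = values

    return [dict(zip(projection, row)) for row in zip(*batch.values())]
```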

Delta Compaction

• Background maintenance manager periodically

• scans DiskRowSets to find any cases where a large number of deltas have accumulated, and

• schedules a delta compaction operation which merges those deltas back into the base data columns.
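A minimal sketch of the merge step, assuming deltas are simple (row offset, column, value) tuples and a fixed count threshold. The threshold and the function name are hypothetical; the real maintenance manager uses its own scheduling heuristics.

```python
def compact_deltas(base_columns, deltas, threshold=3):
    """Merge deltas into new base columns once enough have accumulated.

    base_columns: column name -> list of values
    deltas:       list of (row_offset, column name, new value), oldest first
    Returns (columns, remaining_deltas).
    """
    if len(deltas) < threshold:
        return base_columns, deltas       # not worth compacting yet

    merged = {name: list(values) for name, values in base_columns.items()}
    for row_offset, name, value in deltas:
        merged[name][row_offset] = value  # later deltas overwrite earlier ones
    return merged, []
```

After compaction a scan reads the new base columns directly and has no deltas left to apply.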

RowSet Compaction

• A key-based merge of two or more DiskRowSets
• The output is written back to new DiskRowSets, rolling every 32 MB
• RowSet compaction has two goals:
  • We take this opportunity to remove deleted rows.
  • This process reduces the number of DiskRowSets that overlap in key range
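The merge can be sketched with `heapq.merge`. For simplicity this sketch rolls to a new output after a fixed number of rows instead of every 32 MB, and it takes the deleted keys as a plain set; both are assumptions made for illustration.

```python
import heapq


def compact_rowsets(rowsets, deleted_keys, max_rows_per_output=2):
    """Key-based merge of rowsets into new, non-overlapping rowsets.

    rowsets: lists of (key, row) sorted by key, with keys unique across inputs
    """
    outputs, current = [], []
    for key, row in heapq.merge(*rowsets, key=lambda kv: kv[0]):
        if key in deleted_keys:
            continue                      # drop deleted rows during the merge
        current.append((key, row))
        if len(current) == max_rows_per_output:
            outputs.append(current)       # roll over to a new output rowset
            current = []
    if current:
        outputs.append(current)
    return outputs
```

Because the output is written in key order, the resulting rowsets cover disjoint key ranges.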

Kudu Trade-Offs

• Random updates will be slower
  • Kudu requires a key lookup before an update and a Bloom filter lookup before an insert
• Single-row seeks may be slower
  • The columnar design is optimized for scans
  • Especially slow at reading a row with many recent updates

Cluster Architecture

Cluster Roles


The Kudu Master

Kudu’s central master process has several key responsibilities:

• A catalog manager
  • keeping track of which tables and tablets exist, as well as their schemas, desired replication levels, and other metadata
• A cluster coordinator
  • keeping track of which servers in the cluster are alive and coordinating redistribution of data
• A tablet directory
  • keeping track of which tablet servers are hosting replicas of each tablet

Partitioning

Partitioning

• Tables in Kudu are horizontally partitioned.

• Kudu, like BigTable, calls these partitions tablets

• Kudu supports a flexible array of partitioning schemes

Partitioning: Hash

Img source: https://github.com/cloudera/kudu/blob/master/docs/images/hash-partitioning-example.png

Partitioning: Range

Img source: https://github.com/cloudera/kudu/blob/master/docs/images/range-partitioning-example.png

Partitioning: Hash plus Range

Img source: https://github.com/cloudera/kudu/blob/master/docs/images/hash-range-partitioning-example.png
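A rough sketch of how a row could be mapped to a tablet under combined hash and range partitioning. The column names `host` and `event_time` are hypothetical, and CRC32 is only a stand-in, since Kudu uses its own hash function internally.

```python
import bisect
import zlib


def tablet_for_row(host, event_time, num_hash_buckets, range_split_points):
    """Return (hash bucket, range partition index) for a row.

    range_split_points: sorted lower bounds that separate the range partitions
    """
    bucket = zlib.crc32(host.encode()) % num_hash_buckets
    range_index = bisect.bisect_right(range_split_points, event_time)
    return bucket, range_index


def tablet_count(num_hash_buckets, range_split_points):
    # Every hash bucket is combined with every range partition.
    return num_hash_buckets * (len(range_split_points) + 1)
```

With 4 hash buckets and two split points the table has 12 tablets; writes for one time range are spread across 4 of them.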

Partitioning Recommendations

• Partition larger tables, such as fact tables, so that each tablet holds about 1 GB of data

• Do not partition small tables such as dimensions
  • Note: Impala does not allow the partitioning clause to be omitted, so a single range partition has to be specified explicitly.

Dimension Table with One Partition

Replication

Replication Approach

• Kudu uses leader/follower (master-slave) replication

• Kudu employs the Raft [25] consensus algorithm to replicate its tablets
  • If a majority of replicas accept the write and log it to their own local write-ahead logs,
  • the write is considered durably replicated and thus can be committed on all replicas
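The majority rule comes down to simple arithmetic, sketched here with hypothetical function names. The set of acknowledgements is assumed to include the leader itself.

```python
def majority(num_replicas):
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def is_committed(acks, num_replicas):
    """acks: ids of replicas (leader included) that logged the write to their WAL."""
    return len(acks) >= majority(num_replicas)
```

With three replicas, a write acknowledged by the leader and one follower is committed even if the third replica is down.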

Raft: Replicated State Machine

• Replicated log ensures state machines execute the same commands in the same order
• Consensus module ensures proper log replication
• System makes progress as long as any majority of servers are up
• Visualization: https://raft.github.io/raftscope/index.html

Consistency Model

• Kudu provides clients the choice between two consistency modes for reads (scans):
  • READ_AT_SNAPSHOT
  • READ_LATEST

READ_LATEST consistency

• Monotonic reads are guaranteed (?); read-your-writes is not
• Corresponds to the "Read Committed" ACID isolation mode
• This is the default mode.

READ_LATEST consistency

• The server will always return committed writes at the time the request was received.

• This type of read is not repeatable.

READ_AT_SNAPSHOT Consistency

• Guarantees read-your-writes consistency from a single client

• Corresponds to the "Repeatable Read" ACID isolation mode.

READ_AT_SNAPSHOT Consistency

• The server attempts to perform a read at the provided timestamp

• In this mode reads are repeatable
  • at the expense of waiting for in-flight transactions whose timestamp is lower than the snapshot's timestamp to complete
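The difference between the two read modes can be illustrated with a toy versioned cell. This is a conceptual sketch with made-up names, not the Kudu client API, and it does not model waiting for in-flight transactions.

```python
class VersionedCell:
    """Keeps every committed version so reads can pick a point in time."""

    def __init__(self):
        self._versions = []   # (commit_ts, value) in increasing timestamp order

    def write(self, ts, value):
        self._versions.append((ts, value))

    def read_latest(self):
        # Whatever is committed when the request arrives; may differ between calls.
        return self._versions[-1][1] if self._versions else None

    def read_at_snapshot(self, snapshot_ts):
        # Only versions committed at or before the snapshot are visible.
        value = None
        for ts, v in self._versions:
            if ts > snapshot_ts:
                break
            value = v
        return value
```

Repeating `read_at_snapshot` with the same timestamp returns the same value regardless of later writes, while `read_latest` follows them.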

Write Consistency

• Writes to a single tablet are always internally consistent
• By default, Kudu does not provide an external consistency guarantee.
• However, for users who require a stronger guarantee, Kudu offers the option to manually propagate timestamps between clients

Replication Factor Limitation

• Since Kudu 1.2.0:
  • The replication factor of tables is now limited to a maximum of 7
  • In addition, it is no longer allowed to create a table with an even replication factor
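A small sketch of the rule together with the arithmetic behind it. The function name is hypothetical. Under majority-based replication an even factor tolerates no more failures than the next lower odd one: 4 replicas need 3 for a majority, so they survive one failure, the same as 3 replicas.

```python
def tolerated_failures(replication_factor, max_factor=7):
    """Validate a replication factor and return how many replica failures it survives."""
    if not 1 <= replication_factor <= max_factor:
        raise ValueError(f"replication factor must be between 1 and {max_factor}")
    if replication_factor % 2 == 0:
        raise ValueError("replication factor must be odd")
    return (replication_factor - 1) // 2
```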

Kudu and CAP Theorem

• Kudu is a CP type of storage engine.

• Writing to a tablet will be delayed if the server that hosts that tablet’s leader replica fails

• Kudu gains the following properties by using Raft consensus:
  • Leader elections are fast
  • Follower replicas don’t allow writes, but they do allow reads

Kudu Applicability

Applications for which Kudu is a viable solution

• Reporting applications where new data must be immediately available for end users

• Time-series applications with
  • queries across large amounts of historic data
  • granular queries about an individual entity

• Applications that use predictive models to make real-time decisions

Streaming Analytics

Case Study

Business Case

• A leader in health care compliance consulting and technology-driven managed services

• Cloud-based multi-services platform

• It offers
  • enhanced data security and scalability,
  • operational managed services, and access to business information

http://ihealthone.com/wp-content/uploads/2016/12/Healthcare_Compliance_Consultants-495x400.jpg

ETL Approach

Key Points:

• Leverage the Confluent platform with Schema Registry

• Apply a configuration-based approach:
  • Avro Schema in Schema Registry for the input schema
  • Impala Kudu SQL scripts for the target schema

• Stick to a Python app as the primary ETL code, but extend it:
  • Develop new abstractions to work with mapping rules

• Streaming processing for both facts and dimensions

Cons:

• Scaling needs extra effort

Data Flow

[Diagram labels: Event Topics, ETL Code, Analytics DWH. Configuration: Input Schema, Mapping Rules, Target Schema, Other Configurations]

Stream ETL using Pipeline Architecture

[Diagram labels: Data Reader, Mapper/Flattener, Types Adjuster, Data Enricher, DB Sinker, Cache Manager, Configuration]

Pipeline Modules:

• Data Reader: reads data from the source DB
• Mapper/Flattener: flattens the JSON tree-like structure into a flat one and maps the field names to the target ones
• Types Adjuster: adjusts/converts data types properly
• Data Enricher: enriches the data structure with new data:
  • Generates a surrogate key
  • Looks up data in the target DB (using the cache)
• DB Sinker: writes data into the target DB

Other Modules:

• Cache Manager: manages the cache with dimension data
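The module chain can be sketched as a single function per record. This is an illustration only: the field names, the mapping and type dictionaries, and the in-memory dimension cache are hypothetical stand-ins for the configuration and Cache Manager described above, and the reader and sinker stages are left out.

```python
import itertools


def flatten(record, parent=""):
    """Flatten nested dicts: {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def make_pipeline(mapping, types, dimension_cache):
    """Build a per-record processing function from configuration."""
    surrogate_keys = itertools.count(1)

    def process(record):
        # Mapper/Flattener: flatten, then rename fields to the target names.
        flat = flatten(record)
        row = {mapping[k]: v for k, v in flat.items() if k in mapping}
        # Types Adjuster: convert each field with its configured type.
        row = {k: types.get(k, str)(v) for k, v in row.items()}
        # Data Enricher: surrogate key plus a cached dimension lookup.
        row["id"] = next(surrogate_keys)
        row["product_id"] = dimension_cache.get(row.get("product_code"))
        return row

    return process
```

Keeping the mapping and types in configuration means a new event type needs new rules rather than new code.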

Key Types Benchmark

Kudu Numeric vs String Keys

• Reason:
  • Generating surrogate numeric keys adds an extra processing step and complexity to the overall ETL process

• Sample schema:
  • Dimensions:
    • Promotion dimension with 1,000 unique members, 30 categories
    • Products dimension with 50,000 unique members, 300 categories
  • Facts:
    • Fact table referencing the two dimensions above, with 1 million rows
    • Fact table referencing the two dimensions above, with 100 million rows

Benchmark Result

Lessons Learnt

Pain Points

• Frequent releases with many changes
• Data type limitations (especially in the Python library and Impala)
• Lack of sequences and constraints
• Lack of multi-row transactions

Limitations

• More than 50 columns is not recommended
• Primary key values are immutable
• The primary key, partitioning, and column types cannot be altered
• Existing partitions cannot be split

Modeling Recommendations: Star Schema

Dimensions:

• Replication factor equal to the number of nodes in the cluster
• 1 tablet per dimension

Facts:

• Aim for as many tablets as you have cores in the cluster

What Kudu is Not

What Kudu is Not

• Not a SQL interface itself
  • It’s just the storage layer: you should use Impala or SparkSQL

• Not an application that runs on HDFS
  • It’s an alternative, native Hadoop storage engine

• Not a replacement for HDFS or HBase
  • Select the right storage for the right use case
  • Cloudera will support and invest in all three

Kudu vs MPP Data Warehouse

Kudu vs MPP Data Warehouses

In Common:

• Fast analytics queries via SQL
• Ability to insert, update, delete data

Differences:

✓ Faster streaming inserts
✓ Improved Hadoop integration

✗ Slower batch inserts
✗ No transactional data loading, multi-row transactions, or indexing

Useful resources

• Community, Downloads, VM: https://kudu.apache.org

• Whitepaper: http://kudu.apache.org/kudu.pdf

• Slack channel: https://getkudu-slack.herokuapp.com

Questions?
