Apache Drill (ver. 0.1, check ver. 0.2)

Apache Drill Design proposal from OpenDremel team Camuel Gilyadov & Constantine Peresypkin, Email: [email protected]


Transcript of Apache Drill (ver. 0.1, check ver. 0.2)

Page 1: Apache Drill (ver. 0.1, check ver. 0.2)

Apache Drill design proposal from the OpenDremel team

Camuel Gilyadov & Constantine Peresypkin, Email: [email protected]

Page 2: OpenDremel Story: 2010

• Camuel Gilyadov started a Dremel implementation, named OpenDremel, in summer 2010.

• David Gruzman joined the effort a few months later, followed by Constantine Peresypkin.

• There wasn’t a comprehensive design or architecture. The goal was to get the hierarchical-columnar transformation working smoothly and in strict accordance with the Dremel paper. Several working implementations were published by us under the Apache License.

• Hong San was hired as the first full-timer to speed up development. The Metaxa milestone was set.

Page 3: OpenDremel Story: 2011

• The early OpenDremel design was found to be too naive, mainly due to Java underperformance in inner number-crunching loops.

• After fierce brainstorming, the project was restarted from scratch under the new name Dazo. With Dazo, a query plan is an arbitrary piece of executable native code, with a Java frontend.

• From then on we took inspiration from BigQuery rather than from the Dremel paper.

• We decided to use Google NaCl as the sandboxing technology, both to isolate queries and to meter resource consumption. The new sandbox was named ZeroVM.

• For storage we decided to use OpenStack Swift.

Page 4: OpenDremel Story: 2012

• With four people full-time and several others part-time, we still don’t have a fully integrated version, but we are satisfied with what we have achieved and convinced that the decisions behind Dazo were correct.

• We believe ZeroVM could be a disruptive technology in itself, revolutionizing the BigData@Cloud space.

• We are excited by the Apache Drill initiative and hope to be useful to it.

Page 5: Design Tenet #1

• Apache Drill must support multi-tenant semantics internally and must not be run inside guest VMs at all.

• It should be inspired by BigQuery and not only by the Dremel/PowerDrill/Tenzing papers.

• It is not practical to set up a dedicated cloud (billed hourly) just to run a query for a few seconds.

• The codebase must be clearly divided into a trusted part and an untrusted part. The trusted part must be kept to an absolute minimum and must be peer-reviewed, secured, audited and metered.

Page 6: Design Tenet #2

• Apache Drill must be extremely flexible and customizable.

• The schema-on-read concept must be supported. It must be possible to embed imperative, high-performance parser code into the query.

• SQL is no longer enough. It must be easy to add new query languages as plug-ins or as user-defined functions (UDFs).

• Additionally, various data formats must be supported, such as column stores, row stores, PAX, RCFile, etc.
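The schema-on-read idea above could be sketched roughly as follows. This is only an illustration in Python, not code from OpenDremel or Dazo: the names `RAW_RECORDS`, `json_parser`, `scan` and `total_bytes_for` are hypothetical, and JSON is assumed as the raw format purely for convenience. The point is that the store holds raw bytes with no declared schema, and the query carries its own parser.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw records are stored as-is; no schema is declared at load time.
RAW_RECORDS = [
    b'{"user": "ann", "bytes": 120}',
    b'{"user": "bob", "bytes": 300}',
    b'{"user": "ann", "bytes": 80}',
]

Row = Dict[str, object]
Parser = Callable[[bytes], Row]


def json_parser(record: bytes) -> Row:
    """Hypothetical user-supplied parser: imposes a schema at read time."""
    obj = json.loads(record)
    return {"user": str(obj["user"]), "bytes": int(obj["bytes"])}


def scan(records: Iterable[bytes], parser: Parser) -> Iterator[Row]:
    """Apply the query's own parser to each raw record while scanning."""
    for record in records:
        yield parser(record)


def total_bytes_for(user: str, records: Iterable[bytes], parser: Parser) -> int:
    """A toy query: sum the 'bytes' field over all rows of one user."""
    return sum(row["bytes"] for row in scan(records, parser) if row["user"] == user)


result = total_bytes_for("ann", RAW_RECORDS, json_parser)
print(result)
```

Swapping `json_parser` for a parser of another format would change how the same stored bytes are interpreted without touching `scan` or the stored data.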

Page 7: Design Tenet #2 (cont.)

• We suggest relaxing the query plan format to arbitrary distributed executable code, and the data format to an arbitrary opaque BLOB.

• This way new query languages and new data formats could be easily supported without changing the backend.

• As an added benefit, the backend becomes a generic, lightweight, homogeneous compute-storage cloud.

• Such an approach exhibits good separation of control. The cloud operator controls and bills for generic infrastructure, and the query engine is left completely in the control of the tenant/user.
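A minimal sketch of what such a generic backend might look like, assuming a toy in-memory store. `GenericBackend`, `count_csv_rows` and `count_fixed_records` are hypothetical names invented for this illustration; they are not part of Dazo, Swift or ZeroVM. The backend never inspects the blobs: it only runs the tenant's job next to each object and combines the partial results.

```python
from typing import Callable, Dict, Iterable, List, TypeVar

T = TypeVar("T")


class GenericBackend:
    """Hypothetical backend: stores opaque blobs and runs tenant-supplied jobs."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, name: str, blob: bytes) -> None:
        # The backend does not know or care what format the blob is in.
        self._objects[name] = blob

    def run(
        self,
        job: Callable[[bytes], T],
        names: Iterable[str],
        combine: Callable[[List[T]], T],
    ) -> T:
        # Run the job once per object (in a real system: on the node that
        # holds the object), then merge the partial results.
        partials = [job(self._objects[name]) for name in names]
        return combine(partials)


def count_csv_rows(blob: bytes) -> int:
    """Tenant job for a text format: one row per line."""
    return len(blob.decode("utf-8").splitlines())


def count_fixed_records(blob: bytes) -> int:
    """Tenant job for a binary format: fixed 4-byte records."""
    return len(blob) // 4


backend = GenericBackend()
backend.put("logs/part-0.csv", b"a,1\nb,2\n")
backend.put("logs/part-1.csv", b"c,3\n")
backend.put("ints/part-0.bin", bytes(12))

csv_rows = backend.run(count_csv_rows, ["logs/part-0.csv", "logs/part-1.csv"], sum)
int_records = backend.run(count_fixed_records, ["ints/part-0.bin"], sum)
print(csv_rows, int_records)
```

Two different data formats are served here by the same backend code, which is the property the tenet asks for; only the tenant's job functions differ.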

Page 8: Design Tenet #3

• Apache Drill requests/queries must be hyper-elastic, meaning able to exploit the compute capacity of thousands of servers for a short duration of just a few seconds. No resources must be kept spinning per user between queries or when idle.

• Traditional VMs are too heavyweight for that. Container approaches such as OpenVZ/LXC are not secure enough in a multi-tenancy context.

• We suggest making sandboxing pluggable and supporting ZeroVM (developed for OpenDremel) and LXC (fine for private clouds) to begin with.
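One way pluggable sandboxing could be structured is a small interface plus a registry, sketched below. Everything here is an assumption made for illustration: `Sandbox`, `RunReport`, `register`, `execute` and `InProcessSandbox` are hypothetical names, and the stand-in plug-in provides no isolation at all. A real deployment would register ZeroVM or LXC plug-ins behind the same interface; their actual APIs are not shown.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Type

Job = Callable[[bytes], Any]


@dataclass
class RunReport:
    """Result of one job plus the CPU time it consumed (for metering)."""
    result: Any
    cpu_seconds: float


class Sandbox(ABC):
    """Hypothetical plug-in interface: run one job over one blob."""

    @abstractmethod
    def run(self, job: Job, blob: bytes) -> RunReport:
        ...


# Registry mapping a sandbox name to its implementation class.
SANDBOXES: Dict[str, Type[Sandbox]] = {}


def register(name: str) -> Callable[[Type[Sandbox]], Type[Sandbox]]:
    def decorator(cls: Type[Sandbox]) -> Type[Sandbox]:
        SANDBOXES[name] = cls
        return cls
    return decorator


@register("in-process")
class InProcessSandbox(Sandbox):
    """Stand-in for illustration only: provides NO isolation."""

    def run(self, job: Job, blob: bytes) -> RunReport:
        start = time.process_time()
        result = job(blob)
        return RunReport(result, time.process_time() - start)


def execute(sandbox_name: str, job: Job, blob: bytes) -> RunReport:
    """Create a fresh sandbox per request and discard it afterwards."""
    sandbox = SANDBOXES[sandbox_name]()
    return sandbox.run(job, blob)


report = execute("in-process", len, b"hello")
print(report.result)
```

Creating the sandbox inside `execute` mirrors the requirement that nothing is kept running per user between queries.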

Page 9: Design Tenet #4

• Apache Drill must be efficient.

• Value-per-byte is extremely low with BigData.

• Overhead in the inner loop must be kept to a minimum.

• Java was found inefficient for general number crunching (such as data compression). The main problem with Java is that GC overhead is unavoidable for the whole data corpus being scanned. We went so far as to keep all data in byte arrays and auto-generate the transformation code; it still underperformed, and code complexity went through the roof.

Page 10: Suggested Architecture

[Figure: architecture diagram. A Browser / Client sends a Query to a single-tenant Frontend running inside a traditional guest VM (Query Compiler, JVM). The Frontend submits a Custom executable job to a multi-tenant Backend (scale-out object store and in-situ compute).]

Page 11: OpenDremel/Dazo

[Figure: the same architecture diagram (Query, Query Compiler, JVM, Custom executable job), annotated with the OpenDremel/Dazo components.]

• Client: two separate unfinished jQuery apps & a cmdline app, with no particular codenames.

• Frontend (Query Compiler, JVM): we call it Metaxa (historic reasons). BQL parser, unfinished compiler based on Apache Velocity.

• Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.

Page 12: What is Swift?

“Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.”

Page 13: Haven’t got it?

Swift is THE open-source implementation of Amazon S3

Page 14: What is ZeroVM?

Highly secure, low-overhead, low-latency container-style virtualization based on the Google Native Client project. The critical security code is transferred verbatim from the Chrome browser project and is therefore as secure as the Chrome browser. More info: http://ZeroVM.org and http://news.ycombinator.com/item?id=3746222

Page 15: ZeroVM highlights

1. Disposable VM per request
2. HyperElasticity per request
3. Embeddable into everything
4. High-performance (x86/ARM)
5. Erlang-inspired clustering
6. Written in pure C, no deps

Page 16: Haven’t got it?

ZeroVM is to Virtualization what SQLite is to Databases

Page 17: Where is the code?

• OpenDremel (1st generation design):
– http://code.google.com/p/dremel/source/browse?repo=dremel
– http://code.google.com/p/dremel/source/browse?repo=metaxa

• Dazo (2nd generation design):
– https://github.com/Dazo-org

Page 18: Thanks

Camuel Gilyadov, Email: [email protected]