NoSQL and SQL - Open Analytics Summit

download NoSQL and SQL - Open Analytics Summit

of 28

Transcript of NoSQL and SQL - Open Analytics Summit

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    1/28

    NoSQL and SQL Work Side-by-Side

    to Tackle Real-time Big Data Needs

    Allen Day

    MapR Technologies

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    2/28

    Me

    Allen Day

    Principal Data Scientist @ MapR

    Human Genomics / Bioinformatics

    (PhD, UCLA School of Medicine)

    @allenday

    [email protected]

    [email protected]

    mailto:[email protected]:[email protected]:[email protected]:[email protected]
  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    3/28

    You

    Im assuming that the typical attendee:

    is a software developer

    is interested and familiar with open source

    is familiar with Hadoop, relational DBs

    has heard of or has used some NoSQL technology

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    4/28

    Big Data Workloads

    Offline

    ETL

    Model creation & clustering & indexing

    Web Crawling

    Batch reporting

    Online

    Lightweight OLTP

    Classification & anomaly detection

    Stream processing

    Interactive reporting

    SQL

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    5/28

    What is NoSQL? Why use it?

    Traditional storage (relational DBs) are unable toaccommodate increasing # and variety ofobservations

    Culprits: sensors, event logs, electronic payments

    Solution: stay responsive by relaxing ACID storagerequirements

    Denormalize (#)

    Loosen schema (variety), loosen consistency

    This is the essence of NoSQL

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    6/28

    NoSQL Impact on Business Processes

    Traditional business intelligence (BI) tech stackassumes relational DB storage Company decisions depend on this (reports, charts)

    NoSQL collected data arent in relational DB Data volume/variety is still increasing

    Tech and methods are still in flux

    Decoupled data storage and decision supportsystems BI cant access freshest, largest data sets

    Very high opportunity cost to business

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    7/28

    Ideal Solution Features

    Scalable & Reliable

    Distributed replicated storage

    Distributed parallel processing

    BI application support

    Ad-hoc, interactive queries

    Real-time responsiveness

    Flexible

    Handles rapid storage and schema evolution

    Handles new analytics methods and functions

    Hadoop FS

    Map/Reduce, YARN{

    SQL Interface{Extensible for NoSQL,Advanced Analytics{

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    8/28

    From Ideals to Possibilities

    Migrate NoSQL data/processing to SQL High cost to marshal NoSQL data to SQL storage

    SQL systems lack advanced analytics capabilities

    Migrate SQL data to NoSQL Breaks compatibility for BI-dependent functions, e.g.

    financial reporting

    Limited support for relational operations (joins) high latency

    NoSQL tech is still in flux (continuity)

    Other Approaches? Yes. First lets consider a SQL/NoSQL use case

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    9/28

    Impala

    Interactive Queries & Hadoop

    low-latency

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    10/28

    Example Problem: Marketing Campaign

    Jane is an analyst at ane-commerce company

    How does she figure

    out good targetingsegments for the nextmarketing campaign?

    She has some ideasand lots of data

    User

    profiles

    Transaction

    information

    Access

    logs

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    11/28

    Traditional System Solution 1: RDBMS

    ETL the data from

    MongoDB and Hadoop

    into the RDBMS

    MongoDB data must beflattened, schematized,

    filtered and aggregated

    Hadoop data must be

    filtered and aggregated Query the data using

    any SQL-based tool

    User

    profiles

    Access

    logs

    Transaction

    information

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    12/28

    Traditional System Solution 2: Hadoop

    ETL the data fromOracle and MongoDBinto Hadoop

    MongoDB data must beflattened andschematized

    Work with theMapReduce team to

    write custom code togenerate the desiredanalyses

    User

    profiles

    Access

    logs

    Transaction

    information

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    13/28

    Traditional System Solution 3: Hive

    ETL the data from

    Oracle and MongoDB

    into Hadoop

    MongoDB data must beflattened and

    schematized

    But HiveQL queries are

    slow and BI toolsupport is limited

    Marshaling/Coding

    User

    profiles

    Access

    logs

    Transaction

    information

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    14/28

    What Would Google Do?

    Distributed

    File SystemNoSQL

    Interactive

    analysis

    Batch

    processing

    GFS BigTable Dremel MapReduce

    HDFS HBase ??? HadoopMapReduce

    Build Apache Drill to provide a true open source

    solution to interactive analysis of Big Data

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    15/28

    Apache Drill Overview

    Interactive analysis of Big Data using standardSQL

    Fast

    Low latency queries

    Complement native interfaces andMapReduce/Hive/Pig

    Open

    Community driven open source project

    Under Apache Software Foundation

    Modern Standard ANSI SQL:2003 (select/into)

    Nested data support

    Schema is optional

    Supports RDBMS, Hadoop and NoSQL

    Interactive queries

    Data analyst

    Reporting

    100 ms-20 min

    Data mining

    Modeling

    Large ETL

    20 min-20 hr

    MapReduce

    HivePig

    Apache Drill

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    16/28

    How Does It Work?

    Drillbit(Coordinator)

    SQL QueryParser

    Query Planner

    Drillbit(Executor)

    Drillbit(Executor)

    Drillbit(Executor)

    SELECT * FROM

    oracle.transactions,

    mongo.users,

    hdfs.events

    LIMIT 1

    Drill Client

    Tableau

    Drill ODBC Driver

    Micro-Strategy

    CrystalReports

    Driver

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    17/28

    How Does It Work?

    Drillbits run on each node, designed to

    maximize data locality

    Processing is done outside MapReduce

    paradigm (but possibly within YARN)

    Queries can be fed to any Drillbit

    Coordination, query planning, optimization,

    scheduling, and execution are distributed

    SELECT * FROM

    oracle.transactions,mongo.users,

    hdfs.events

    LIMIT 1

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    18/28

    Apache Drill: Key Features

    Full ANSI SQL:2003 support Use any SQL-based tool

    Nested data support Flattening is error-prone and often impossible

    Schema-less data source support Schema can change rapidly and may be record-specific

    Extensible

    DSLs, UDFs Custom operators (e.g. k-means clustering)

    Well-documented data source & file format APIs

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    19/28

    How Does Impala Fit In?Impala Strengths

    Beta currently available

    Easy install and setup on top of

    Cloudera

    Faster than Hive on some queries

    SQL-like query language

    Questions

    Open Source Lite

    Lacks RDBMS support

    Lacks NoSQL support beyond

    HBase

    Early row materializationincreases footprint and reduces

    performance

    Limited file format support

    Query results must fit in memory!

    Rigid schema is required

    No support for nested data

    SQL-like (not SQL)

    Many important features are coming soon.

    Architectural foundation is constrained. No community development.

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    20/28

    Drill Status: Alpha Available July

    Heavy active development by multiple organizations Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho

    Available

    Logical plan syntax and interpreter

    Reference interpreter In progress

    SQL interpreter

    Storage engine implementations for Accumulo, Cassandra, HBase and

    various file formats

    Significant community momentum

    Over 200 people on the Drill mailing list

    Over 200 members of the Bay Area Drill User Group

    Drill meetups across the US and Europe

    Beta: Q3

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    21/28

    Why Apache Drill Will Be Successful

    Resources Contributors have

    strong backgrounds

    from companies like

    Oracle, IBM Netezza,

    Informatica, Clustrix

    and Pentaho

    Community Development done in

    the open

    Active contributors

    from multiple

    companies

    Rapidly growing

    Architecture Full SQL

    New data support

    Extensible APIs

    Full Columnar

    Execution

    Beyond Hadoop

    Bottom Line: Apache Drill enables NoSQL and SQL

    Work Side-by-Side to Tackle Real-time Big Data Needs

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    22/28

    Me

    Allen Day

    Principal Data Scientist @ MapR

    @allenday

    [email protected] [email protected]

    mailto:[email protected]:[email protected]:[email protected]:[email protected]
  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    23/28

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    24/28

    ADDITIONAL SLIDES

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    25/28

    Full SQL (ANSI SQL:2003)

    Drill supports SQL (ANSI SQL:2003 standard)

    Correlated subqueries, analytic functions,

    SQL-like is not enough

    Use any SQL-based tool with Apache Drill

    Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL,

    Standard ODBC and JDBC drivers

    Drill%WorkerDrill%Worker

    Driver

    Client

    Drillbit

    SQL%Query%

    Parser

    Query%

    PlannerDrillbits

    Drill%ODBC%

    Driver

    Tableau

    MicroStrategy

    Excel

    SAP%Crystal%

    Reports

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    26/28

    Nested Data

    Nested data is becoming prevalent

    JSON, BSON, XML, Protocol Buffers, Avro, etc.

    The data source may or may not be aware

    MongoDB supports nested data natively

    A single HBase value could be a JSON document

    (compound nested type) Google Dremels innovation was efficient columnar

    storage and querying of nested data

    Flattening nested data is error-prone and oftenimpossible

    Think about repeated and optional fields at everylevel

    Apache Drill supports nested data

    Extensions to ANSI SQL:2003

    enum Gender {

    MALE, FEMALE

    }

    record User {

    string name;

    Gender gender;

    long followers;

    }

    {

    "name": "Homer",

    "gender": "Male",

    "followers": 100

    children: [

    {name: "Bart"},

    {name: "Lisa}

    ]

    }

    JSON

    Avro

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    27/28

    Schema is Optional

    Many data sources do not have rigid schemas Schemas change rapidly

    Each record may have a different schema, may be sparse/wide

    Apache Drill supports querying against unknown schemas

    Query any HBase, Cassandra or MongoDB table User can define the schema or let the system discover it

    automatically System of record may already have schema information

    No need to manage schema evolution

    Row Key CF contents CF anchor

    "com.cnn.www" contents:html = "" anchor:my.look.ca = "CNN.com"

    anchor:cnnsi.com = "CNN"

    "com.foxnews.www" contents:html = "" anchor:en.wikipedia.org = "Fox News"

  • 7/28/2019 NoSQL and SQL - Open Analytics Summit

    28/28

    Flexible and Extensible Architecture

    Apache Drill is designed for extensibility

    Well-documented APIs and interfaces

    Data sources and file formats Implement a custom scanner to support a new source/format

    Query languages SQL:2003 is the primary language

    Implement a custom Parser to support a Domain Specific Language

    UDFs

    Optimizers

    Drill will have a cost-based optimizer Clear surrounding APIs support easy optimizer exploration

    Operators Custom operators can be implemented (e.g. k-Means clustering)

    Operator push-down to data source (RDBMS)