Olap with Spark and Cassandra

OLAP WITH SPARK AND CASSANDRA EVAN CHAN JULY 2014


Transcript of Olap with Spark and Cassandra

Page 1: Olap with Spark and Cassandra

OLAP WITH SPARK AND CASSANDRA

EVAN CHAN
JULY 2014

Page 2: Olap with Spark and Cassandra

WHO AM I?
Principal Engineer, Socrata, Inc.
@evanfchan
http://github.com/velvia
Creator of Spark Job Server

Page 3: Olap with Spark and Cassandra

WE BUILD SOFTWARE TO MAKE DATA USEFUL TO MORE PEOPLE.

data.edmonton.ca finances.worldbank.org data.cityofchicago.org
data.seattle.gov data.oregon.gov data.wa.gov
www.metrochicagodata.org data.cityofboston.gov
info.samhsa.gov explore.data.gov data.cms.gov data.ok.gov
data.nola.gov data.illinois.gov data.colorado.gov
data.austintexas.gov data.undp.org www.opendatanyc.com
data.mo.gov data.nfpa.org data.raleighnc.gov dati.lombardia.it
data.montgomerycountymd.gov data.cityofnewyork.us
data.acgov.org data.baltimorecity.gov data.energystar.gov
data.somervillema.gov data.maryland.gov data.taxpayer.net
bronx.lehman.cuny.edu data.hawaii.gov data.sfgov.org

Page 4: Olap with Spark and Cassandra

WE ARE SWIMMING IN DATA!

Page 5: Olap with Spark and Cassandra

BIG DATA AT OOYALA
2.5 billion analytics pings a day = almost a trillion events a year.
Roll up tables - 30 million rows per day
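
The "almost a trillion" figure follows from the daily ping count. A minimal sketch of the arithmetic, assuming a flat daily volume over a 365-day year (both are assumptions, not stated on the slide):

```scala
object PingVolume {
  val pingsPerDay: Long = 2500000000L // 2.5 billion, from the slide
  val daysPerYear: Int = 365          // assumption: non-leap year, constant volume

  // Events per year if the daily volume stays flat: about 0.91 trillion.
  val pingsPerYear: Long = pingsPerDay * daysPerYear
}
```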

Page 6: Olap with Spark and Cassandra

BIG DATA AT SOCRATA
Hundreds of datasets, each one up to 30 million rows
Customer demand for billion row datasets

Page 7: Olap with Spark and Cassandra

HOW CAN WE ALLOW CUSTOMERS TO QUERY A YEAR'S WORTH OF DATA?

Flexible - complex queries included
Sometimes you can't denormalize your data enough

Fast - interactive speeds

Page 8: Olap with Spark and Cassandra

RDBMS? POSTGRES?
Start hitting latency limits at ~10 million rows
No robust and inexpensive solution for querying across shards
No robust way to scale horizontally
Complex and expensive to improve performance (e.g. rollup tables)

Page 9: Olap with Spark and Cassandra

OLAP CUBES?
Materialize summary for every possible combination
Too complicated and brittle
Takes forever to compute
Explodes storage and memory
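
To see why materializing every combination explodes storage, here is a rough sketch that counts the summaries a full cube would need. The function names and the example cardinalities are hypothetical and used only for illustration.

```scala
object CubeSize {
  // Number of distinct GROUP BY combinations over n dimensions: 2^n,
  // because each dimension is either grouped on or rolled up.
  def cuboids(dimensions: Int): BigInt = BigInt(2).pow(dimensions)

  // Upper bound on summary cells in the full cube: each dimension takes one
  // of its distinct values or "all", so multiply (cardinality + 1) together.
  def maxCells(cardinalities: Seq[Long]): BigInt =
    cardinalities.map(c => BigInt(c) + 1).product
}
```

For example, `CubeSize.cuboids(10)` is 1024 combinations for only ten dimensions, and `CubeSize.maxCells(Seq(100L, 50L, 365L))` is already about 1.9 million cells for three modest dimensions.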

Page 10: Olap with Spark and Cassandra

When in doubt, use brute force
- Ken Thompson

Page 11: Olap with Spark and Cassandra
Page 12: Olap with Spark and Cassandra

CASSANDRA
Horizontally scalable
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
No fear of number of rows or documents
Best of breed storage technology, huge community
BUT: Simple queries only

Page 13: Olap with Spark and Cassandra

APACHE SPARK
Horizontally scalable, in-memory queries
Functional Scala transforms - map, filter, groupBy, sort etc.
SQL, machine learning, streaming, graph, R, many more plugins all on ONE platform - feed your SQL results to a logistic regression, easy!
THE Hottest big data platform, huge community, leaving Hadoop in the dust
Developers love it
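
The functional transform style can be sketched with plain Scala collections, which use the same operation names the slide lists. This runs on a local Seq rather than a Spark RDD so it needs no cluster; the Incident type and its sample rows are hypothetical.

```scala
object TransformSketch {
  // Hypothetical record type; field names are illustrative only.
  final case class Incident(category: String, block: String, count: Int)

  val incidents: Seq[Incident] = Seq(
    Incident("Burglary", "19xx Hurston", 10),
    Incident("Theft", "55xx Floatilla Ave", 5),
    Incident("Theft", "19xx Hurston", 2)
  )

  // filter -> groupBy -> map -> sort, written against a local Seq.
  def totalsByCategory(rows: Seq[Incident], minCount: Int): Seq[(String, Int)] =
    rows
      .filter(_.count >= minCount)
      .groupBy(_.category)
      .map { case (category, group) => (category, group.map(_.count).sum) }
      .toSeq
      .sortBy { case (_, total) => -total } // largest total first
}
```

Each step returns a new immutable collection, which is what makes such pipelines easy to test and compose.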

Page 14: Olap with Spark and Cassandra

SPARK PROVIDES THE MISSING FAST, DEEP ANALYTICS PIECE OF CASSANDRA!

Page 15: Olap with Spark and Cassandra

INTEGRATING SPARK AND CASSANDRA
Scala solutions:

Datastax integration: https://github.com/datastax/cassandra-driver-spark (CQL-based)
Calliope

Page 16: Olap with Spark and Cassandra

A bit more work:

Use traditional Cassandra client with RDDs
Use an existing InputFormat, like CqlPagedInputFormat

Page 17: Olap with Spark and Cassandra

EXAMPLE CUSTOM INTEGRATION USING ASTYANAX

val cassRDD = sc.parallelize(rowkeys).
                 flatMap { rowkey =>
                   columnFamily.get(rowkey).execute().asScala
                 }

Page 18: Olap with Spark and Cassandra

A SPARK AND CASSANDRA OLAP ARCHITECTURE

Page 19: Olap with Spark and Cassandra

SEPARATE STORAGE AND QUERY LAYERS
Combine best of breed storage and query platforms
Take full advantage of evolution of each
Storage handles replication for availability
Query can replicate data for scaling read concurrency - independent!

Page 20: Olap with Spark and Cassandra

SCALE NODES, NOT DEVELOPER TIME!!

Page 21: Olap with Spark and Cassandra

KEEPING IT SIMPLE
Maximize row scan speed
Columnar representation for efficiency
Compressed bitmap indexes for fast algebra
Functional transforms for easy memoization, testing, concurrency, composition
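
The "fast algebra" of bitmap indexes can be sketched as follows: keep one bitmap per distinct value, then answer AND / OR filters with set intersection and union. This sketch uses the standard library's uncompressed BitSet in place of a compressed bitmap format, and the column names and values are hypothetical.

```scala
import scala.collection.immutable.BitSet

object BitmapIndexSketch {
  // One bitmap per distinct value: bit i is set when row i holds that value.
  def buildIndex(column: IndexedSeq[String]): Map[String, BitSet] =
    column.zipWithIndex
      .groupBy { case (value, _) => value }
      .map { case (value, hits) => value -> BitSet(hits.map(_._2): _*) }

  // Hypothetical sample columns, four rows each.
  val category = Vector("Theft", "Burglary", "Theft", "Arson")
  val district = Vector("North", "North", "South", "North")

  val byCategory = buildIndex(category)
  val byDistrict = buildIndex(district)

  // WHERE category = 'Theft' AND district = 'North' -> bitmap intersection
  val theftInNorth: BitSet = byCategory("Theft") & byDistrict("North")

  // WHERE category = 'Theft' OR category = 'Arson' -> bitmap union
  val theftOrArson: BitSet = byCategory("Theft") | byCategory("Arson")
}
```

The resulting bitmaps hold the matching row numbers, so only those rows need to be read from the data columns.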

Page 22: Olap with Spark and Cassandra

SPARK AS CASSANDRA'S CACHE

Page 23: Olap with Spark and Cassandra

EVEN BETTER: TACHYON OFF-HEAP CACHING

Page 24: Olap with Spark and Cassandra

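INITIAL ATTEMPTS
val rows = Seq(
  Seq("Burglary", "19xx Hurston", 10),
  Seq("Theft", "55xx Floatilla Ave", 5)
)

// Group rows by the first column, then sum the third column in each group.
sc.parallelize(rows)
  .map { values => (values(0), values) }
  .groupByKey
  .mapValues(_.map(_(2).asInstanceOf[Int]).sum)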

Page 25: Olap with Spark and Cassandra

No existing generic query engine for Spark when we started (Shark was in infancy, had no indexes, etc.), so we built our own
For every row, need to extract out needed columns
Ability to select arbitrary columns means using Seq[Any], no type safety
Boxing makes integer aggregation very expensive and memory inefficient
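
The boxing cost can be illustrated by summing the same values two ways. This is a minimal sketch, not the engine described in the talk; the object and method names are hypothetical.

```scala
object BoxingSketch {
  // Row-oriented: every cell is typed Any, so each Int is stored as a boxed
  // java.lang.Integer and must be cast (unboxed) on every access.
  val rows: Seq[Seq[Any]] = Seq(
    Seq("Burglary", "19xx Hurston", 10),
    Seq("Theft", "55xx Floatilla Ave", 5)
  )

  def sumFromRows(rows: Seq[Seq[Any]], col: Int): Long =
    rows.foldLeft(0L)((acc, row) => acc + row(col).asInstanceOf[Int])

  // Column-oriented: the same values as a primitive array (a JVM int[]),
  // with no per-value object and no cast.
  val counts: Array[Int] = Array(10, 5)

  def sumFromColumn(column: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < column.length) { total += column(i); i += 1 }
    total
  }
}
```

Both functions return the same total; the difference is that the row version allocates and dereferences an object per value and fails only at runtime if a cell has the wrong type.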

Page 26: Olap with Spark and Cassandra

COLUMNAR STORAGE AND QUERYING

Page 27: Olap with Spark and Cassandra

The traditional row-based data storage approach is dead
- Michael Stonebraker

Page 28: Olap with Spark and Cassandra

TRADITIONAL ROW-BASED STORAGE
Same layout in memory and on disk:

Name     Age
Barak    46
Hillary  66

Each row is stored contiguously. All columns in row 2 come after row 1.

Page 29: Olap with Spark and Cassandra

COLUMNAR STORAGE (MEMORY)
Name column

0  1
0  1

Dictionary: {0: "Barak", 1: "Hillary"}

Age column

0   1
46  66
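
The in-memory layout above can be sketched in a few lines. DictColumn and encode are hypothetical names; codes are assigned in order of first appearance, which reproduces the dictionary shown on the slide.

```scala
object DictionaryEncoding {
  // Encoded column: small integer codes plus the dictionary that maps them back.
  final case class DictColumn(codes: Array[Int], dictionary: Vector[String]) {
    def decode(row: Int): String = dictionary(codes(row))
  }

  // Assign codes in order of first appearance.
  def encode(values: Seq[String]): DictColumn = {
    val dictionary = values.distinct.toVector
    val codeOf = dictionary.zipWithIndex.toMap
    DictColumn(values.map(codeOf).toArray, dictionary)
  }

  val names: DictColumn = encode(Seq("Barak", "Hillary")) // codes 0, 1
  val ages: Array[Int] = Array(46, 66)                    // plain primitive column
}
```

With many rows and few distinct names, the codes array stays small while each string is stored once.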

Page 30: Olap with Spark and Cassandra

COLUMNAR STORAGE (CASSANDRA)
Review: each physical row in Cassandra (e.g. a "partition key") stores its columns together on disk.

Schema CF

Rowkey  Type
Name    StringDict
Age     Int

Data CF

Rowkey  0   1
Name    0   1
Age     46  66

Page 31: Olap with Spark and Cassandra

ADVANTAGES OF COLUMNAR STORAGE
Compression
  Dictionary compression - HUGE savings for low-cardinality string columns
  RLE

Reduce I/O
  Only columns needed for query are loaded from disk

Can keep strong types in memory, avoid boxing
Batch multiple rows in one cell for efficiency
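
Run-length encoding (RLE) works well on columns with long runs of repeated values, such as dictionary codes of a sorted or low-cardinality column. A minimal sketch with hypothetical function names:

```scala
object RunLengthEncoding {
  // Collapse consecutive repeats into (value, runLength) pairs.
  def encode(values: Seq[Int]): Vector[(Int, Int)] =
    values.foldLeft(Vector.empty[(Int, Int)]) { (runs, v) =>
      runs.lastOption match {
        case Some((value, length)) if value == v =>
          runs.updated(runs.length - 1, (value, length + 1)) // extend current run
        case _ =>
          runs :+ ((v, 1))                                   // start a new run
      }
    }

  // Expand the runs back into the original sequence.
  def decode(runs: Seq[(Int, Int)]): Vector[Int] =
    runs.flatMap { case (value, length) => Seq.fill(length)(value) }.toVector
}
```

Encoding `Seq(0, 0, 0, 1, 1, 0)` gives three runs instead of six values, and decoding restores the original column.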

Page 32: Olap with Spark and Cassandra

ADVANTAGES OF COLUMNAR QUERYING
Cache locality for aggregating column of data
Take advantage of CPU/GPU vector instructions for ints / doubles
Avoid row-ifying until last possible moment
Easy to derive computed columns
Use vector data / linear math libraries

Page 33: Olap with Spark and Cassandra

COLUMNAR QUERY ENGINE VS ROW-BASED IN SCALA

Custom RDD of column-oriented blocks of data
Uses ~10x less heap
10-100x faster for group by's on a single node
Scan speed in excess of 150M rows/sec/core for integer aggregations
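
A rough idea of what a column-oriented group by looks like on one block of data, assuming a dictionary-encoded key column and a primitive Int measure column. This is an illustrative sketch, not the custom engine from the talk, and it makes no claim about the speeds quoted above; all names are hypothetical.

```scala
object ColumnarGroupBy {
  // SELECT key, SUM(measure) GROUP BY key, where `codes` is a dictionary-encoded
  // key column and `measure` is a primitive Int column of the same length.
  def sumByCode(codes: Array[Int], measure: Array[Int], dictSize: Int): Array[Long] = {
    require(codes.length == measure.length, "columns must have equal length")
    val sums = new Array[Long](dictSize) // one accumulator per dictionary code
    var i = 0
    while (i < codes.length) {
      sums(codes(i)) += measure(i)
      i += 1
    }
    sums
  }

  // Row-ify only at the end: attach the dictionary strings to the results.
  def label(sums: Array[Long], dictionary: IndexedSeq[String]): Map[String, Long] =
    dictionary.zip(sums).toMap
}
```

The inner loop touches two contiguous primitive arrays and an accumulator array, with no hashing of strings and no boxed values.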

Page 34: Olap with Spark and Cassandra

SO, GREAT, OLAP WITH CASSANDRA AND SPARK. NOW WHAT?

Page 35: Olap with Spark and Cassandra
Page 36: Olap with Spark and Cassandra

DATASTAX: CASSANDRA SPARK INTEGRATION
Datastax Enterprise now comes with HA Spark
  HA master, that is.

cassandra-driver-spark

Page 37: Olap with Spark and Cassandra

SPARK SQL
Appeared with Spark 1.0
In-memory columnar store
Can read from Parquet now; Cassandra integration coming
Querying is not column-based (yet)
No indexes
Write custom functions in Scala .... take that Hive UDFs!!
Integrates well with MLBase, Scala/Java/Python

Page 38: Olap with Spark and Cassandra

WORK STILL NEEDED
Indexes
Columnar querying for fast aggregation
Efficient reading from columnar storage formats

Page 39: Olap with Spark and Cassandra

GETTING TO A BILLION ROWS / SEC
Benchmarked at 20 million rows/sec, GROUP BY on two columns, aggregating two more columns. Per core.
50 cores needed for parallel localized grouping throughput of 1 billion rows/sec
~5-10 additional cores budget for distributed exchange and grouping of locally aggregated groups, depending on result size and network topology

Above is a custom solution, NOT Spark SQL.

Look for integration with Spark SQL for a proper solution
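
The core budget is plain division: 1 billion rows/sec at 20 million rows/sec per core is 50 cores, and the 5-10 extra cores bring the total to 55-60. The sketch below shows that arithmetic together with the second phase, where locally aggregated groups are merged after the exchange. Names are hypothetical, and a single string key stands in for the two GROUP BY columns.

```scala
object BillionRowBudget {
  // Cores needed for the local grouping phase, rounded up.
  def scanCores(targetRowsPerSec: Long, rowsPerSecPerCore: Long): Long =
    (targetRowsPerSec + rowsPerSecPerCore - 1) / rowsPerSecPerCore

  // Phase 1 output: one small map of partial sums per core or partition.
  type Partial = Map[String, Long]

  // Phase 2: merge the locally aggregated groups after the exchange.
  def mergePartials(partials: Seq[Partial]): Partial =
    partials.foldLeft(Map.empty[String, Long]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (key, sum)) =>
        acc.updated(key, acc.getOrElse(key, 0L) + sum)
      }
    }
}
```

Only the partial maps cross the network, which is why the exchange cost depends on the result size and not on the number of rows scanned.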

Page 40: Olap with Spark and Cassandra

LESSONS
Extremely fast distributed querying for these use cases

  Data doesn't change much (and only bulk changes)
  Analytical queries for subset of columns
  Focused on numerical aggregations
  Small numbers of group bys, limited network interchange of data

Spark a bit rough around edges, but evolving fast
Concurrent queries is a frontier with Spark. Use additional Spark contexts.

Page 41: Olap with Spark and Cassandra

THANK YOU!

Page 42: Olap with Spark and Cassandra

SOME COLUMNAR ALTERNATIVES

MonetDB and Infobright - true columnar stores (storage + querying)
cstore_fdw for Postgres - columnar storage only
VoltDB - in-memory distributed columnar database (but need to recompile for DDL changes)
Google BigQuery - columnar cloud database, Dremel based
Amazon Redshift