ImpalaToGo use case

By ImpalaToGo team

http://impala2go.info


You need ImpalaToGo if ...

- You have more than a hundred gigabytes of data in the cloud.
- You want to slice and dice this dataset and look for anomalies.
- You cannot predict queries in advance. You just need brute force to query raw data.

Elastic solution required

It is hardly profitable to do big data analytics using a non-elastic setup. Slicing and dicing 1 TB of data interactively requires dozens of dedicated servers.

The gain from elasticity

A 50-server cluster with a scan rate of about 40 GB/sec costs $12,000 a month (m3.2xlarge reserved instances) or $28 per hour. By running the cluster for only 1-2 hours a day, when required, you save about $10K a month.
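
The arithmetic behind this claim can be sketched as follows. The dollar figures and scan rate are the ones quoted on the slide; the 30-day month and the function names are assumptions made for illustration.

```python
# Figures quoted on the slide; DAYS_PER_MONTH is an assumption.
MONTHLY_RESERVED_COST = 12_000   # USD per month for 50 reserved m3.2xlarge servers
HOURLY_CLUSTER_COST = 28         # USD per hour for the whole cluster
SCAN_RATE_GB_PER_SEC = 40        # aggregate scan rate of the cluster
DAYS_PER_MONTH = 30              # assumption


def monthly_hourly_cost(hours_per_day):
    """Monthly cost when the cluster runs only a few hours a day."""
    return HOURLY_CLUSTER_COST * hours_per_day * DAYS_PER_MONTH


def monthly_saving(hours_per_day):
    """Saving compared with keeping the reserved cluster all month."""
    return MONTHLY_RESERVED_COST - monthly_hourly_cost(hours_per_day)


def full_scan_seconds(dataset_gb):
    """Time for one brute-force pass over the dataset at the quoted scan rate."""
    return dataset_gb / SCAN_RATE_GB_PER_SEC


for hours in (1, 2):
    print(hours, monthly_hourly_cost(hours), monthly_saving(hours))
print(full_scan_seconds(1000))
```

Under these assumptions, 2 hours a day costs $1,680 a month and saves $10,320, which matches the "about $10K" figure, and one full scan of 1 TB takes about 25 seconds.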

What is an elastic database?

- Easy to spawn and resize a cluster, in a matter of minutes.
- Works efficiently with cloud data storage. We do not want ETL per session.

Cloud storage dilemma

On one hand, object storage like S3 is perfect for accessing data: there are no issues with size or with accessibility from other machines. On the other hand, object store access is slow.

ImpalaToGo introduction

ImpalaToGo is an MPP (massively parallel processing) database built on top of Cloudera Impala. It removes the need for local HDFS, replacing it with S3 (or another remote DFS) and using local drives for caching.

Architecture

[Diagram: CSV, Parquet, Avro files in S3 bucket; ImpalaToGo cluster; caching layer on local SSD drives]

S3 + open format = no ETL

You produce a file in one of the supported file types and put it into an S3 bucket. CSV is the easiest to create. The formats are open and usable by other frameworks such as Spark.
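
As a minimal sketch of the "produce a file" step, the snippet below serializes a few rows to CSV text with Python's standard `csv` module. The sample rows and the `rows_to_csv` helper are hypothetical. The header line is left out on the assumption that column names come from the table definition, not from the file.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize an iterable of row tuples to CSV text, one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, for illustration only.
rows = [(1, "alice", 9.5), (2, "bob", 12.0)]
csv_text = rows_to_csv(rows)
print(csv_text, end="")

# Uploading is a separate step, for example with the AWS CLI
# (file, bucket and key names are hypothetical):
#   aws s3 cp events.csv s3://my-bucket/events/events.csv
```

No conversion into a database-specific format happens anywhere in this flow, which is the sense in which there is no ETL.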

Local drive = best cache

ImpalaToGo uses local SSD drives for the cache.

- Local SSDs keep the hot data set.
- Space is not wasted on replication; it is just a cache.
- SSD is fast enough to keep the CPU busy.
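
The idea of a local cache in front of slow remote storage can be illustrated with a small read-through cache. This is a sketch of the general technique, not ImpalaToGo's implementation: the class name, the least-recently-used eviction policy, and the `fetch_remote` callable standing in for an S3 read are all assumptions.

```python
import os
import tempfile
from collections import OrderedDict


class ReadThroughCache:
    """Keeps recently used remote objects as files in a local directory."""

    def __init__(self, fetch_remote, cache_dir, capacity_bytes):
        self.fetch_remote = fetch_remote   # callable: key -> bytes (stands in for an S3 read)
        self.cache_dir = cache_dir
        self.capacity_bytes = capacity_bytes
        self.sizes = OrderedDict()         # key -> size in bytes, least recently used first
        self.used_bytes = 0

    def _path(self, key):
        # Simplistic key-to-filename mapping; enough for a sketch.
        return os.path.join(self.cache_dir, key.replace("/", "_"))

    def read(self, key):
        if key in self.sizes:              # hit: serve from the local drive
            self.sizes.move_to_end(key)
            with open(self._path(key), "rb") as f:
                return f.read()
        data = self.fetch_remote(key)      # miss: slow remote read
        self._store(key, data)
        return data

    def _store(self, key, data):
        if len(data) > self.capacity_bytes:
            return                         # too large to cache; served uncached
        while self.used_bytes + len(data) > self.capacity_bytes:
            old_key, old_size = self.sizes.popitem(last=False)
            os.remove(self._path(old_key)) # safe to drop: the remote copy remains
            self.used_bytes -= old_size
        with open(self._path(key), "wb") as f:
            f.write(data)
        self.sizes[key] = len(data)
        self.used_bytes += len(data)


# Hypothetical stand-in for a bucket, for illustration only.
remote = {"a.csv": b"x" * 60, "b.csv": b"y" * 60}
fetched = []


def fetch(key):
    fetched.append(key)
    return remote[key]


with tempfile.TemporaryDirectory() as cache_dir:
    cache = ReadThroughCache(fetch, cache_dir, capacity_bytes=100)
    cache.read("a.csv")   # miss: fetched from the remote store
    cache.read("a.csv")   # hit: served from the local file
    cache.read("b.csv")   # miss: evicts a.csv to make room
print(fetched)
```

Because evicted entries are simply deleted and re-fetched on demand, no space goes to replication and losing the cache loses no data.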

No storage = elasticity

Since the ImpalaToGo cluster only caches data from S3, there is no possibility of data loss. Further, it is easy to resize.

- Adding a node takes 1-2 minutes. Most of this time is spent waiting for the instance to start.
- Removing a node is instant.

Why do we need to resize?

It is almost impossible to predict how long an ad-hoc query will take. Different queries on the same data can easily differ by 10-100x in computation and memory requirements.

Competition

Main competitors are:
- Commercial MPP databases like Vertica, ParAccel, etc.
- Redshift
- Hadoop in the form of CDH, EMR
- SparkSQL, Presto, Hive
- Snowflake

Commercial MPP

- They store data in a proprietary format, so there is an ETL process.
- They have their own storage layer, so they are not elastic.
- They may be more efficient than the Impala engine on some queries.

Amazon Redshift

It is an efficient columnar database deployed and managed by Amazon. In many cases it is faster than ImpalaToGo.

Main drawbacks compared with ImpalaToGo:
- Locked in to Amazon
- Requires hours to days to resize
- No UDF support

Hadoop CDH & EMR

Today, you can deploy a Hadoop cluster and manually cache data from S3, or wait each time for S3 access. Once Impala can work efficiently with S3, this will become a viable option, but it:
- requires Hadoop skills
- is less elastic, because of HDFS

SparkSQL, Hive, Presto, etc.

SparkSQL, Hive, Presto, and Drill are JVM based, so they cannot match native engines like Impala, Vertica, etc. on raw speed.
- Slower than ImpalaToGo
- Hard to utilize big heaps

Snowflake

Snowflake is very similar to ImpalaToGo in terms of architecture. Both store columnar data in S3, and both run elastic clusters.
- Snowflake is proprietary software.
- Data is stored in a proprietary format.

Have more questions?

Please write to David [email protected]

To try, visit http://impala2go.info
