ImpalaToGo use case

ImpalaToGo use case, by the ImpalaToGo team, http://impala2go.info

Transcript of ImpalaToGo use case

Page 1: ImpalaToGo use case

ImpalaToGo use case

By ImpalaToGo team

http://impala2go.info

Page 2: ImpalaToGo use case

ImpalaToGo is required if ...

- You have more than a hundred gigabytes of data in the cloud.
- You want to slice and dice this dataset and look for anomalies.
- You cannot predict queries in advance. You just need brute force to query raw data.

Page 3: ImpalaToGo use case

Elastic solution required

It is hardly profitable to do big data analytics using a non-elastic setup. Slicing and dicing 1 TB of data interactively requires dozens of dedicated servers.

Page 4: ImpalaToGo use case

The gain from elasticity

A 50-server cluster with a scan rate of about 40 GB/sec costs $12,000 a month (m3.2xlarge reserved instances) or $28 per hour. By running the cluster for only 1-2 hours a day, when required, you save about $10K a month.
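The saving quoted above can be checked with a few lines of arithmetic. This is only a sketch: the two prices come from the slide, while the 30-day month and the function name are assumptions made for illustration.

```python
def monthly_cost_when_elastic(hourly_rate_usd, hours_per_day, days_per_month=30):
    """Cost of running the cluster only for the given hours each day.

    days_per_month=30 is an assumption, not a figure from the slides.
    """
    return hourly_rate_usd * hours_per_day * days_per_month


ALWAYS_ON_MONTHLY_USD = 12_000  # from the slide: 50 reserved m3.2xlarge
CLUSTER_HOURLY_USD = 28         # from the slide: whole cluster, per hour

for hours in (1, 2):
    elastic = monthly_cost_when_elastic(CLUSTER_HOURLY_USD, hours)
    saving = ALWAYS_ON_MONTHLY_USD - elastic
    print(f"{hours} h/day: ${elastic:,} per month, saving ${saving:,}")
```

At 2 hours a day the elastic cluster costs $1,680 a month, which leaves a saving a little above $10K, consistent with the slide.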

Page 5: ImpalaToGo use case

What is an elastic database?

- Easy to spawn and resize the cluster, in a matter of minutes.
- Efficient work with cloud data storage. We do not want ETL per session.

Page 6: ImpalaToGo use case

Cloud storage dilemma

On one hand, object storage like S3 is perfect for accessing data: no issues with size or accessibility from other machines. On the other hand, object store access is slow.

Page 7: ImpalaToGo use case

ImpalaToGo introduction

ImpalaToGo is an MPP (massively parallel processing) database built on top of Cloudera Impala. ImpalaToGo removes the need for local HDFS, replacing it with S3 (or another remote DFS) and using local drives for caching.

Page 8: ImpalaToGo use case

Architecture

CSV, Parquet, Avro files in S3 bucket

ImpalaToGo cluster

Caching layer on local SSD drives

Page 9: ImpalaToGo use case

S3 + open format = no ETL

You produce a file in one of the supported file types and put it into an S3 bucket. CSV is the easiest to create. The formats are open and usable by other frameworks such as Spark.

CSV, Parquet, Avro files in S3 bucket
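As an illustration of how little preparation this takes, here is a minimal sketch that serializes rows to CSV and puts them into a bucket. The function names are hypothetical; `s3_client` is assumed to be a boto3 S3 client (only its `put_object` call is used), and no header row is written on the assumption that column names are declared in the table definition rather than in the file.

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize an iterable of row tuples to UTF-8 CSV bytes (no header)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def upload_csv(s3_client, bucket, key, rows):
    """Put the rows into the bucket as one CSV object; returns bytes written.

    s3_client is assumed to behave like boto3.client("s3").
    """
    body = rows_to_csv_bytes(rows)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)


# Hypothetical usage (bucket and key names are placeholders):
#   import boto3
#   upload_csv(boto3.client("s3"), "my-bucket", "events/part-0001.csv",
#              [(1, "login"), (2, "logout")])
```

Because the object is plain CSV, any other tool that reads from the same bucket can use it without conversion.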

Page 10: ImpalaToGo use case

Local drive = best cache

ImpalaToGo uses local SSD drives for the cache. Local SSDs keep the hot data set. Space is not wasted on replication: it is just a cache. SSDs are fast enough to keep the CPU busy.

Caching layer on local SSD drives
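The idea of a local cache in front of slow remote storage can be sketched as a read-through cache. This is not ImpalaToGo's implementation: the class name is hypothetical, an in-memory dictionary stands in for the SSD, and least-recently-used eviction is an assumed policy chosen only to make the sketch concrete.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keeps recently used objects locally, evicting the least recently used."""

    def __init__(self, fetch_remote, capacity_bytes):
        self.fetch_remote = fetch_remote      # slow path, e.g. a remote GET
        self.capacity_bytes = capacity_bytes  # local space budget
        self.used_bytes = 0
        self.entries = OrderedDict()          # key -> bytes, oldest first

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        data = self.fetch_remote(key)         # cache miss: go to remote store
        self._insert(key, data)
        return data

    def _insert(self, key, data):
        if len(data) > self.capacity_bytes:
            return                            # too big to cache; serve uncached
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self.entries.popitem(last=False)  # drop oldest entry
            self.used_bytes -= len(evicted)
        self.entries[key] = data
        self.used_bytes += len(data)
```

Losing the cache contents costs only a re-fetch from the remote store, which is why no space needs to be spent on replication.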

Page 11: ImpalaToGo use case

No storage = elasticity

Since the ImpalaToGo cluster only caches data from S3, there is no possibility of data loss. Furthermore, it is easy to resize. Adding a node takes 1-2 minutes; most of this time is spent waiting for the instance to start. Removing a node is instant.

ImpalaToGo cluster

Page 12: ImpalaToGo use case

Why do we need to resize?

It is almost impossible to predict how much time an ad-hoc query will take. Different queries on the same data can easily differ by 10-100x in computation and memory requirements.
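A rough sizing estimate shows how quickly the required cluster size moves with the response time you want. The sketch below takes the reference figures from the cost slide (50 servers, about 40 GB/sec) and assumes scan throughput scales linearly with node count; both the function name and the linear-scaling assumption are illustrative, not a statement about how ImpalaToGo behaves.

```python
import math


def nodes_needed(data_gb, target_seconds, ref_nodes=50, ref_scan_gb_per_sec=40):
    """Nodes required to scan data_gb within target_seconds.

    Assumes throughput scales linearly from the reference cluster
    (ref_nodes servers scanning ref_scan_gb_per_sec in total).
    """
    return math.ceil(data_gb * ref_nodes / (target_seconds * ref_scan_gb_per_sec))


# Scanning 1 TB (taken as 1000 GB) under different response-time targets.
for seconds in (5, 25, 60):
    print(f"{seconds:>3} s target: {nodes_needed(1000, seconds)} nodes")
```

Under these assumptions the same 1 TB scan needs anywhere from a couple of dozen to several hundred nodes depending on the target, which is the kind of spread that makes a fixed-size cluster a poor fit.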

Page 13: ImpalaToGo use case

Competition

Main competitors are:
- Commercial MPP databases like Vertica, ParAccel, etc.
- Redshift
- Hadoop in the form of CDH, EMR
- SparkSQL, Presto, Hive
- Snowflake

Page 14: ImpalaToGo use case

Commercial MPP

They store data in a proprietary format, so there is an ETL process. They have their own storage layer, so they are not elastic. They may be more efficient than the Impala engine on some queries.

Page 15: ImpalaToGo use case

Amazon Redshift

It is an efficient columnar database deployed and managed by Amazon. In many cases it is faster than ImpalaToGo.

Main drawbacks compared with ImpalaToGo:
- Locked in to Amazon
- Requires hours to days to resize
- No UDF support

Page 16: ImpalaToGo use case

Hadoop CDH & EMR

Today, you can deploy a Hadoop cluster and manually cache data from S3, or wait each time for S3 access. Once Impala has the ability to work efficiently with S3, this will become a viable option, but:
- Requires Hadoop skills
- Less elastic, because of HDFS

Page 17: ImpalaToGo use case

SparkSQL, Hive, Presto etc.

SparkSQL, Hive, Presto, and Drill are JVM-based, so they cannot match native engines like Impala, Vertica, etc. on raw speed.
- Slower than ImpalaToGo
- Hard to utilize big heaps

Page 18: ImpalaToGo use case

Snowflake

Snowflake is very similar to ImpalaToGo in terms of architecture. Both store columnar data in S3, and both run elastic clusters.
- Snowflake is proprietary software
- Data is stored in a proprietary format

Page 19: ImpalaToGo use case

Have more questions?

Please write to David [email protected]

To try it, visit http://impala2go.info