ImpalaToGo use case
Uploaded by david-groozman

ImpalaToGo Use case
By ImpalaToGo team
http://impala2go.info

ImpalaToGo is required if ...
- You have more than a hundred gigabytes of data in the cloud.
- You want to slice and dice this dataset and look for anomalies.
- You cannot predict queries in advance. You just need brute force to query raw data.

Elastic solution required
It is hardly profitable to do big data analytics on a non-elastic setup. Slicing and dicing 1 TB of data interactively requires dozens of dedicated servers.

The gain from elasticity
A 50-server cluster with a scan rate of about 40 GB/sec costs $12,000 a month (m3.2xlarge reserved instances) or $28 per hour.
By running the cluster for 1-2 hours a day, only when required, you save about $10K a month.
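The saving quoted above can be checked with a few lines of arithmetic. This is only a sketch: the prices and scan rate are the slide's figures, while the 30-day month and the function names are assumptions made for illustration.

```python
# Figures quoted on the slide.
RESERVED_MONTHLY_USD = 12_000   # 50 x m3.2xlarge, reserved for the whole month
HOURLY_USD = 28                 # the same 50-server cluster, paid per hour
SCAN_RATE_GB_PER_SEC = 40       # aggregate scan rate of the cluster

# Assumption (not from the slides): a 30-day month.
DAYS_PER_MONTH = 30


def pay_per_use_monthly_cost(hours_per_day, days=DAYS_PER_MONTH):
    """Monthly cost when the cluster runs only `hours_per_day` hours each day."""
    return HOURLY_USD * hours_per_day * days


def monthly_saving(hours_per_day):
    """Saving versus keeping the reserved cluster for the whole month."""
    return RESERVED_MONTHLY_USD - pay_per_use_monthly_cost(hours_per_day)


def full_scan_seconds(dataset_gb):
    """Time to scan the whole dataset once at the quoted scan rate."""
    return dataset_gb / SCAN_RATE_GB_PER_SEC


if __name__ == "__main__":
    for hours in (1, 2):
        print(hours, "h/day:", pay_per_use_monthly_cost(hours), "USD,",
              "saving", monthly_saving(hours), "USD")
    print("1 TB full scan:", full_scan_seconds(1000), "seconds")
```

At 2 hours a day the cluster costs $1,680 a month, which leaves a saving of $10,320 and matches the "about $10K" figure. A full scan of 1 TB at 40 GB/sec takes about 25 seconds.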

What is an elastic database?
- Easy to spawn and resize a cluster, in a matter of minutes.
- Works efficiently with cloud data storage. We do not want ETL per session.

Cloud storage dilemma
On one hand, object storage like S3 is perfect for accessing data: there are no issues with size or accessibility from other machines.
On the other hand, object store access is slow.

ImpalaToGo introduction
ImpalaToGo is an MPP (massively parallel processing) database built on top of Cloudera Impala.
ImpalaToGo removes the need for local HDFS, replacing it with S3 (or another remote DFS) and using local drives for caching.

Architecture
- CSV, Parquet, Avro files in S3 bucket
- Caching layer on local SSD drives
- ImpalaToGo cluster

S3 + open format = no ETL
You produce a file in one of the supported file types and put it into an S3 bucket. CSV is the easiest to create. The formats are open and usable by other frameworks such as Spark.
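As an illustration of how little is needed to produce such a file, here is a minimal sketch using only Python's standard `csv` module. The column names and rows are hypothetical, and the upload step is left out: it would be done with whichever S3 client you already use.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and rows into CSV text, ready to be written to a file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    header = ["event_id", "user", "amount"]
    rows = [(1, "alice", 9.5), (2, "bob", 12.0)]
    text = rows_to_csv(header, rows)
    print(text, end="")
    # Uploading `text` to an S3 bucket is not shown here.
```

Because the result is plain CSV, the same file stays readable by any other tool that understands the format.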

Local drive = best cache
ImpalaToGo uses local SSD drives for the cache.
- Local SSD is used to keep the hot data set.
- Space is not wasted on replication - it is just a cache.
- SSD is fast enough to keep the CPU busy.
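The idea of keeping only the hot data set locally can be sketched as a read-through cache. The slides do not say which eviction policy ImpalaToGo uses, so the LRU policy below is an assumption. The `ReadThroughCache` class and the `fetch` callback are hypothetical names, and an in-memory dictionary stands in for the local SSD.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU read-through cache keyed by object path (illustrative only).

    `fetch` stands in for a slow remote read (for example from S3);
    the in-memory OrderedDict stands in for the local SSD.
    """

    def __init__(self, fetch, capacity_bytes):
        self._fetch = fetch
        self._capacity = capacity_bytes
        self._used = 0
        self._entries = OrderedDict()  # path -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._entries:
            # Hot data: serve locally and mark as most recently used.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cold data: go to the remote store, then keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._store(path, data)
        return data

    def _store(self, path, data):
        if len(data) > self._capacity:
            return  # larger than the whole cache: serve it without storing
        # Evict least recently used entries until the new object fits.
        while self._used + len(data) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._used -= len(evicted)
        self._entries[path] = data
        self._used += len(data)
```

Nothing in the cache is the only copy of the data, so evicting an entry or losing the whole node costs only a later re-read from the remote store.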

No storage = elasticity
Since the ImpalaToGo cluster only caches data from S3, there is no possibility of data loss. It is also easy to resize.
- Adding a node takes 1-2 minutes. Most of this time is spent waiting for the instance to start.
- Removing a node is instant.

Why do we need resize?
It is almost impossible to predict how long an ad-hoc query will take. Different queries on the same data can easily differ by 10-100x in computation and memory requirements.

Competition
Main competitors are
- Commercial MPP databases like Vertica, ParAccel, etc.
- Redshift
- Hadoop in the form of CDH, EMR
- SparkSQL, Presto, Hive
- Snowflake

Commercial MPP
- They store data in a proprietary format, so there is an ETL process.
- They have their own storage layer, so they are not elastic.
- They may be more efficient than the Impala engine on some queries.

Amazon Redshift
It is an efficient columnar database deployed and managed by Amazon. In many cases it is faster than ImpalaToGo.
Main drawbacks compared with ImpalaToGo
- Locked in to Amazon
- Requires hours to days to resize
- No UDF support

Hadoop CDH & EMR
Today, you can deploy a Hadoop cluster and manually cache data from S3, or wait each time for S3 access.
Once Impala has the ability to work efficiently with S3 this will become a viable option, but it
- requires Hadoop skills
- is less elastic, because of HDFS

SparkSQL, Hive, Presto etc.
SparkSQL, Hive, Presto and Drill are JVM-based, so they cannot match native engines like Impala, Vertica, etc. on raw speed.
- Slower than ImpalaToGo
- Hard to utilize big heaps

Snowflake
Snowflake is very similar to ImpalaToGo in terms of architecture. Both store columnar data in S3, and both run elastic clusters.
- Snowflake is proprietary software
- Data is stored in a proprietary format

Have more questions?
Please write to David [email protected]
To try it, visit http://impala2go.info