1
Enterprise DataLake Consumption Layer powered by Presto @ WalmartLabs
Ashish Tadose
Principal Engineer
2
Agenda
• Data stores @ Walmart Labs
• Motivation for Presto as Distributed Query service
• Multi-tenant Distributed Query service
• Presto deployment & auto-scaling in GCP
• Security integrations
• Overall architecture
• Monitoring
• Best practices and tuning
3
Data stores @ Walmart Labs
Access needs are varied from team to team – one solution does not fit all.
4
Motivation for Presto..
• DataLake cluster - powered by on-prem Hadoop/HDFS
• Compute storage colocation – GOOD
• Need to ingest data from all diverse sources – CHALLENGING
• Scaling out compute with growing needs – CHALLENGING
• Need to separate storage & compute / support federated query capability – PRESTO..
• Isolated clusters in private cloud powering dedicated data-marts
Data journey
5
• Simplified query access layer
• Leverage cloud elastic compute
• Better scalability & Effective cluster utilization by auto-scaling
• Performant query response times
• Security
– Authentication – LDAP
– Authorization – work with existing policies
• Handle sensitive data – encryption at rest & over the wire
• Efficient Monitoring & alerting
• Resource quotas – SLA guarantees
• Flexibility to configure query configuration per tenant
Multi-tenant Query service - requirements
6
Presto & Alluxio work well together
[Charts comparing Presto and Presto + Alluxio: small range query response time (lower is better), large scan query response time (lower is better), concurrency (higher is better)]
• Avoids unpredictable network
• Consistent query latency
• Higher throughput and better concurrency
7
• Cloud DataProc init scripts or optional image - https://cloud.google.com/dataproc/docs/tutorials/presto-dataproc
– Super easy to spawn Presto cluster
– Elevated cost due to managed services such as DataProc
– Overhead of additional Hadoop components
– Difficult to source new catalog or deploy config changes
• Alluxio – no GCP managed deployment
• Presto-admin – can be used for deployment and configuration, but not auto-scaling
• Need for lower level deployment strategy
Presto on GCP
8
• WalmartLabs internal auto-scaler Presto deployer
• Framework to deploy and auto-scale Presto cluster in GCP
• Leverages ansible & GCP deployment manager
• Auto-scaling via configurable cluster wide CPU & memory usage threshold
• Our recent changes – will be released soon to open community
– Alluxio deployment co-located with Presto workers
– Efficient configurability – suitable for multiple envs
– More auto-scaling configs
– Terraform integration – making it cloud agnostic
GCP Presto auto-deployment
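A minimal sketch of how threshold-based auto-scaling on cluster-wide CPU and memory usage could be decided. This is not the WalmartLabs deployer's code: the `ScalingPolicy` fields, the default thresholds, and the step size are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All values below are illustrative assumptions, not the deployer's defaults.
    cpu_high: float = 0.75   # scale out above this cluster-wide CPU usage
    mem_high: float = 0.80   # scale out above this cluster-wide memory usage
    cpu_low: float = 0.30    # scale in only when CPU is below this...
    mem_low: float = 0.40    # ...and memory is below this
    min_workers: int = 3
    max_workers: int = 50
    step: int = 2            # workers added or removed per decision


def desired_workers(current: int, cpu: float, mem: float, policy: ScalingPolicy) -> int:
    """Return the target worker count for cluster-wide usage given as fractions (0.0-1.0)."""
    if cpu > policy.cpu_high or mem > policy.mem_high:
        # Either resource under pressure is enough to scale out.
        target = current + policy.step
    elif cpu < policy.cpu_low and mem < policy.mem_low:
        # Scale in only when both resources are idle.
        target = current - policy.step
    else:
        target = current
    # Clamp to the configured cluster bounds.
    return max(policy.min_workers, min(policy.max_workers, target))


if __name__ == "__main__":
    print(desired_workers(10, cpu=0.90, mem=0.50, policy=ScalingPolicy()))
```

Scaling out on either signal but scaling in only when both are low is one conservative choice; a real deployer would also need cool-down periods and graceful worker shutdown, which this sketch leaves out.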
9
• Ranger plugin for Hive catalog
• Caching ranger policies
• Hive MetaStore impersonation
Presto Security integrations
10
Hive MetaStore, Alluxio integration & Views
• Automated approach to sync metadata
• Hive MetaStore event listeners
• External metastore clients
• Waggle-dance (WIP)
https://github.com/HotelsDotCom/waggle-dance
• Hive native views access
11
Presto Alluxio – overall stack
12
• Presto Event listeners
– Track latencies
– Analyze failures
– Faulty clients
– Frequently queried tables for caching
• On prem monitoring - Prometheus & Grafana
• GCP Stackdriver integration
• GCP Stackdriver Presto MBeans integration issue
Presto monitoring & archiving
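Presto event listeners themselves are Java plugins. The sketch below only illustrates the kind of offline analysis listed above, run over query-completion records that such a listener might have archived. The record keys (`state`, `wall_ms`, `tables`, `source`) are hypothetical and do not reflect the actual event schema.

```python
from collections import Counter
from statistics import median


def summarize(events):
    """Summarize archived query-completion records.

    Each record is a dict with hypothetical keys:
    'state' ('FINISHED' or 'FAILED'), 'wall_ms', 'tables', 'source'.
    """
    events = list(events)
    finished = [e for e in events if e["state"] == "FINISHED"]
    failed = [e for e in events if e["state"] == "FAILED"]

    # Count how often each table is read, as a hint for what to cache.
    table_hits = Counter(t for e in events for t in e["tables"])
    # Count failures per client source to spot faulty clients.
    failures_by_source = Counter(e["source"] for e in failed)

    return {
        "median_wall_ms": median(e["wall_ms"] for e in finished) if finished else None,
        "failures_by_source": dict(failures_by_source),
        "hot_tables": [t for t, _ in table_hits.most_common(3)],
    }


if __name__ == "__main__":
    sample = [
        {"state": "FINISHED", "wall_ms": 100, "tables": ["sales.orders"], "source": "bi-tool"},
        {"state": "FINISHED", "wall_ms": 300, "tables": ["sales.orders", "sales.items"], "source": "bi-tool"},
        {"state": "FAILED", "wall_ms": 50, "tables": ["sales.orders"], "source": "etl-job"},
    ]
    print(summarize(sample))
```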
13
• Kafka – ability to apply timestamp filters based on the Kafka message timestamp
– https://www.slideshare.net/shubhamtagra/debugging-data-pipelines-ola-by-karan-kumar
• Druid connector – based on the Druid JDBC interface, extending Presto’s BaseJdbcClient
• ClickHouse connector
• ThoughtSpot connector
• BigQuery connector
• SAP HANA connector
Presto custom connectors
14
• SLA guarantees by Presto resource groups - https://prestosql.io/docs/current/admin/resource-groups.html
• Each application group has varying query patterns
– Configurable through session properties
  • join_reordering_strategy
  • optimize_top_n_row_number
  • query_max_execution_time
– Session Property Managers - https://prestosql.io/docs/current/admin/session-property-managers.html
  • Configure sessions for resource groups, source types, client tags
Supporting Multi-tenant cluster
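A rough sketch of how per-tenant quotas and session defaults could be generated for the file-based resource group manager and session property manager described in the linked Presto docs. The tenant names, limits, and the use of client tags to route queries are assumptions for illustration, not the configuration used at WalmartLabs.

```python
import json

# Hypothetical tenants and limits, for illustration only.
TENANTS = {
    "reporting": {"memory": "50%", "concurrency": 20, "queued": 200, "max_execution_time": "1h"},
    "etl": {"memory": "40%", "concurrency": 5, "queued": 50, "max_execution_time": "8h"},
}


def resource_groups(tenants):
    """Build a resource-groups JSON document with one sub-group per tenant."""
    return {
        "rootGroups": [{
            "name": "global",
            "softMemoryLimit": "90%",
            "hardConcurrencyLimit": sum(t["concurrency"] for t in tenants.values()),
            "maxQueued": sum(t["queued"] for t in tenants.values()),
            "subGroups": [
                {
                    "name": name,
                    "softMemoryLimit": t["memory"],
                    "hardConcurrencyLimit": t["concurrency"],
                    "maxQueued": t["queued"],
                }
                for name, t in tenants.items()
            ],
        }],
        # Assumption: each tenant's clients set a client tag equal to the tenant name.
        "selectors": [
            {"clientTags": [name], "group": f"global.{name}"} for name in tenants
        ],
    }


def session_property_rules(tenants):
    """Build session property manager rules keyed on the tenant's resource group."""
    return [
        {
            # The 'group' field is a regex matched against the resource group name.
            "group": rf"global\.{name}",
            "sessionProperties": {"query_max_execution_time": t["max_execution_time"]},
        }
        for name, t in tenants.items()
    ]


if __name__ == "__main__":
    print(json.dumps(resource_groups(TENANTS), indent=2))
    print(json.dumps(session_property_rules(TENANTS), indent=2))
```

Generating both files from one tenant table keeps quotas and session defaults in sync; other properties from the slide, such as `join_reordering_strategy`, could be added to `sessionProperties` in the same way.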
15
Distributed query across Data stores
16
• ORC compression – ZLIB
– Point-to-point queries perform well with Snappy
– For large aggregations, ZLIB is better
• Enable bloom filter on frequently used columns in filters
• Enable sorting on frequently used columns (boosts query performance at the cost of higher ingestion time)
• Increase ORC stripe & stride size
– ORC files are splittable at the stripe level, so stripe size affects parallelism.
– We observed an 18%-22% increase in Presto parallelism (after setting stripe size = 128 MB and index stride = 16k)
• Enable table & column stats (most important)
– Now stats can be computed via Presto - https://prestosql.io/docs/current/sql/analyze.html
ORC storage recommendations
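The recommendations above could be applied roughly as follows. This sketch builds Hive ORC table properties and the Presto `ANALYZE` statement as strings. The table and column names are hypothetical, and "16k" is read here as 16000 rows, which is an assumption. Sorting is left out because it depends on how the table is written.

```python
def orc_table_properties(bloom_columns, stripe_mb=128, index_stride=16000, compress="ZLIB"):
    """Return Hive ORC table properties matching the slide's recommendations."""
    return {
        "orc.compress": compress,
        "orc.bloom.filter.columns": ",".join(bloom_columns),
        # orc.stripe.size is expressed in bytes.
        "orc.stripe.size": str(stripe_mb * 1024 * 1024),
        # Number of rows between row-index entries ("16k" assumed to mean 16000).
        "orc.row.index.stride": str(index_stride),
    }


def alter_table_sql(table, props):
    """Render a Hive ALTER TABLE statement that sets the given properties."""
    body = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


def analyze_sql(table):
    """Render the Presto statement that collects table and column statistics."""
    return f"ANALYZE {table}"


if __name__ == "__main__":
    # 'sales.orders' and its columns are hypothetical names.
    props = orc_table_properties(["customer_id", "order_date"])
    print(alter_table_sql("sales.orders", props))
    print(analyze_sql("sales.orders"))
```

Table properties of this kind generally apply to files written after the change, so existing data would likely need to be rewritten to benefit.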
17
THANKS!