Building a Self-Service Analytics Platform on Hadoop
Page 1
Building a Self-Service Analytics Platform on Hadoop
Avinash Ramineni
Page 2
Clairvoyant
Page 3
Clairvoyant Services
Page 4
Quick Poll
• Big Data Deployments in Prod
• Hadoop Distributions
• People use Ecosystems rather than tools
• Architecture was implemented on Cloudera
• Cloud Experience – AWS?
Page 5
Challenges
• Data in Silos
• Data acquires perspectives as it is moved
• Data availability delays
• Legacy Systems handling the Volume, Veracity and Velocity
• Extracting data from legacy systems
• Lack of Self-Service Capabilities
• Knowledge becomes tribal – instead of institutional
• Security / Compliance Requirements
Page 6
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Page 7
Without Self-Service
Page 8
Self-Service at all Levels
Ingest → Organize → Enrich → Analyze → Dashboards
Ingest → Organize → Enrich → Analyze → Insights
Page 9
Key Design Tenets
• Separation of Compute and Storage
• Independently scale compute and storage
• Data Democratization and Governance
• Bring your own Compute (BYOC)
• HA / DR
• Open Source Stack
Page 10
Separation of Compute and Storage
• Scale storage and compute independently
• Shifts bottleneck from Disk IO to Network
• Centralized Data Storage
• Data Democratization
• No data duplication
• Easier Hardware upgrade paths
• Flexible Architecture
• DR Simplified
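Concretely, the decoupling comes from pointing the clusters' storage at S3 through the S3A connector rather than local HDFS. A minimal sketch of the relevant Spark settings, assuming a hypothetical bucket name and region:

```
# spark-defaults.conf — hypothetical settings for an S3-backed cluster
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint    s3.us-west-2.amazonaws.com
# Any cluster pointed at the same bucket sees the same data,
# so storage scales (and survives failures) independently of compute.
spark.sql.warehouse.dir         s3a://datalake-example/warehouse
```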
Page 11
BYOC (Bring Your Own Cluster)
• Each department/application can bring its own Hadoop cluster
• Eliminates the need for very large clusters
• Easier to administer and maintain
• Reduces multi-tenancy issues
• Clusters can be upgraded independently
• Enables a usage-based cost model

[Diagram: Marketing, Personalization, and Main clusters, each with its own compute, sharing centralized / common S3 storage]
Page 12
Architecture
Page 13
Architecture – Data Ingestion Layer
• DB Ingestor
• Stream Ingestor
• Kafka and Spark Streaming
• File Ingestor
• FTP / SFTP / Logs
• Ingestion using Service API
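As a rough illustration of this layer's shape (the class and method names below are hypothetical, not the platform's actual API), the three ingestor types can share one interface so a Service API can dispatch on source type:

```python
from abc import ABC, abstractmethod

class Ingestor(ABC):
    """Common contract for DB, stream, and file ingestion (illustrative)."""

    @abstractmethod
    def ingest(self, source: str) -> str:
        """Pull from `source`, land the data, and return the landing path."""

class DBIngestor(Ingestor):
    def ingest(self, source: str) -> str:
        # e.g. a JDBC extract per table (sketch only)
        return f"landing/db/{source}"

class StreamIngestor(Ingestor):
    def ingest(self, source: str) -> str:
        # e.g. a Kafka topic consumed via Spark Streaming (sketch only)
        return f"landing/stream/{source}"

class FileIngestor(Ingestor):
    def ingest(self, source: str) -> str:
        # e.g. FTP/SFTP pulls or shipped logs (sketch only)
        return f"landing/file/{source}"

# A Service API endpoint could route each request to the right ingestor:
INGESTORS = {"db": DBIngestor(), "stream": StreamIngestor(), "file": FileIngestor()}
```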
Page 14
Architecture – Data Processing Layer
• Storage layer carved into logical buckets
  • Landing, Raw, Derived and Delivery
  • Schema stored with data (no guesswork)
• Platform Jobs
  • Converting text to Parquet
  • Saving streaming data as Parquet
  • Derivatives
  • Compaction
  • Standardization
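One way to picture the four logical buckets is as a deterministic path convention that every platform job shares; the bucket name and layout below are assumptions for illustration:

```python
# The four logical layers named on the slide.
LAYERS = ("landing", "raw", "derived", "delivery")

def layer_path(layer: str, source: str, dataset: str,
               bucket: str = "s3://datalake-example") -> str:
    """Build the storage path for a dataset in one of the logical layers."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{bucket}/{layer}/{source}/{dataset}"
```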
Page 15
Architecture – Data Delivery Layer
• Data Delivery
  • SQL – Spark Thrift Server / Impala
  • Tableau, SQL IDE, Applications
• Self Service
  • Derivatives
    • Represented via SQL on Delivery Layer
    • Stored in Derived Storage Layer
    • Metadata driven
  • Derived Layer Generators
    • Long-running Spark Job
    • Derivative Refresh
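Since derivatives are represented via SQL and metadata driven, each one can be described by a small metadata record from which the generator renders the SQL it runs. A hypothetical sketch (the field names and rendering are illustrative, not the platform's actual schema):

```python
def derivative_sql(meta: dict) -> str:
    """Render a CREATE TABLE ... AS SELECT from a derivative's metadata."""
    cols = ", ".join(meta["columns"])
    sql = (f"CREATE TABLE {meta['name']} AS "
           f"SELECT {cols} FROM {meta['source']}")
    if meta.get("filter"):
        sql += f" WHERE {meta['filter']}"
    return sql
```

A long-running generator job could poll a metadata store, render each derivative's SQL like this, and re-run it on the refresh schedule.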
Page 16
Key Takeaways - Cloud
• Hadoop Cloud readiness
  • Cloudera Director Limitations
  • Multi-Availability zone, regions
• Storage
  • Instance Storage
  • EBS Volumes
    • gp2 vs st1
  • S3 Eventual Consistency
Page 17
Key Takeaways - Spark Thrift Server
• Spark Thrift Server Support
• Performance Tuning
  • Concurrency
  • Partition strategy
  • Cache Tables
• Compression Codec for Parquet
  • Snappy vs gzip
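The codec choice is a one-line Spark SQL setting. Broadly, snappy produces larger files but decompresses cheaply, which tends to favor interactive SQL through the Thrift Server, while gzip compresses tighter at higher CPU cost:

```sql
-- Session-level setting for new Parquet writes
SET spark.sql.parquet.compression.codec=snappy;
-- or, for colder data where storage cost dominates:
SET spark.sql.parquet.compression.codec=gzip;
```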
Page 18
Key Takeaways - Security
• Secure by Design, Secure by Default
• Access to Data on S3
  • IAM Roles
• Sentry
  • Support for Spark
• Kerberos
  • Spark Thrift Server
• Navigator
  • Support for Spark
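Controlling S3 access with IAM roles typically means attaching a per-cluster role whose policy scopes reads and writes to that department's prefix. A hedged sketch, with placeholder bucket and prefix names:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::datalake-example"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::datalake-example/marketing/*"
    }
  ]
}
```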
Page 19
Key Takeaways - General
• Rapidly Changing Technology
  • Feature addition
  • Documentation
  • Bugs
  • Jar hell
• Small files
  • Performance Issues
  • Compaction
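The compaction fix is conceptually simple: periodically rewrite many small files as fewer large ones. On the platform this would presumably be a Spark job over Parquet; the stdlib sketch below just shows the idea on line-oriented text files (names and layout are illustrative):

```python
import glob
import os

def compact(in_dir: str, out_path: str, pattern: str = "part-*.txt") -> str:
    """Naive compaction: merge many small line-oriented files into one.

    On a real cluster a periodic Spark job plays this role, rewriting
    small Parquet files into larger ones to cut per-file overhead.
    """
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(os.path.join(in_dir, pattern))):
            with open(path) as f:
                out.write(f.read())
    return out_path
```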
Page 20
Key Takeaways - General
• Partition Strategy
  • Parquet Files
  • Balancing parallelism and throughput
  • Table Partitions
• Cluster sizing, optimization and tuning
• Integrating with Corporate infrastructure
  • Deployment practices
  • Monitoring and Alerting
  • Information Security Policies
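A common partition strategy for Parquet tables is Hive-style date directories, which lets SQL engines prune partitions instead of scanning whole tables. A minimal sketch (the base path is hypothetical):

```python
from datetime import date

def partition_path(table_base: str, dt: date) -> str:
    """Hive-style daily partition directory for a table (dt=YYYY-MM-DD)."""
    return f"{table_base}/dt={dt.isoformat()}"
```

Partition granularity is where the parallelism/throughput balance shows up: too fine (e.g. hourly on small feeds) recreates the small-files problem, too coarse limits pruning.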
Page 21
Data Security
Page 22
Questions
• Avinash Ramineni, Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni