Building A Self Service Analytics Platform on Hadoop

Transcript of Building A Self Service Analytics Platform on Hadoop

Page 1: Building A Self Service Analytics Platform on Hadoop

Building a Self Service Analytics Platform on Hadoop

Avinash Ramineni

Page 2: Building A Self Service Analytics Platform on Hadoop

Clairvoyant

Page 3: Building A Self Service Analytics Platform on Hadoop

Clairvoyant Services

Page 4: Building A Self Service Analytics Platform on Hadoop

Quick Poll

• Big Data Deployments in Prod

• Hadoop Distributions
  • People use Ecosystems rather than tools

• Architecture was implemented on Cloudera

• Cloud Experience – AWS?

Page 5: Building A Self Service Analytics Platform on Hadoop

Challenges

• Data in Silos

• Data acquires different perspectives as it is moved

• Data availability delays

• Legacy systems struggling to handle the Volume, Veracity and Velocity

• Extracting data from legacy systems

• Lack of Self-Service Capabilities

• Knowledge becomes tribal – instead of institutional

• Security / Compliance Requirements

Page 6: Building A Self Service Analytics Platform on Hadoop

Data Lake Attributes

• Data Democratization

• Data Discovery

• Data Lineage

• Self-Service capabilities

• Metadata Management

Page 7: Building A Self Service Analytics Platform on Hadoop

Without Self-Service

Page 8: Building A Self Service Analytics Platform on Hadoop

Self-Service at all Levels

[Pipeline diagram: Ingest → Organize → Enrich → Analyze → Insights / Dashboards]

Page 9: Building A Self Service Analytics Platform on Hadoop

Key Design Tenets

• Separation of Compute and Storage

• Independently scale compute and storage

• Data Democratization and Governance

• Bring your own Compute (BYOC)

• HA / DR

• Open Source Stack

Page 10: Building A Self Service Analytics Platform on Hadoop

Separation of Compute and Storage

• Scale storage and compute independently

• Shifts bottleneck from Disk IO to Network

• Centralized Data Storage

• Data Democratization

• No data duplication

• Easier Hardware upgrade paths

• Flexible Architecture

• DR Simplified
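
To make this tenet concrete, here is a minimal PySpark sketch (bucket, path, and column names are hypothetical, not from the talk): because the data lives on shared S3 rather than on cluster-local HDFS, any compute cluster granted access to the bucket can run the same read, and compute can be resized or replaced without moving data.

```python
from pyspark.sql import SparkSession

# S3A credentials are assumed to come from the cluster's IAM instance profile,
# so no keys are embedded in the job.
spark = SparkSession.builder.appName("shared-storage-read").getOrCreate()

# The same s3a:// path is readable from any cluster that is granted access;
# compute scales or upgrades independently of the data.
orders = spark.read.parquet("s3a://shared-data-lake/derived/orders/")
orders.groupBy("order_date").count().show()
```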

Page 11: Building A Self Service Analytics Platform on Hadoop

BYOC (Bring Your Own Cluster)

• Each department/application can bring its own Hadoop cluster

• Eliminates the need for very large clusters

• Easier to administer and maintain

• Reduces multi-tenancy issues

• Clusters can be upgraded independently

• Enables a usage-based cost model

[Diagram: Marketing, Personalization, and Main clusters sharing centralized / common S3 storage]
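
A minimal sketch of how BYOC clusters can share the centralized S3 storage (the database, table, and bucket names are made up for illustration, and Hive support is assumed to be enabled in the SparkSession): each cluster registers external tables over the common location, so clusters can be created, upgraded, or torn down without copying data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each department's cluster registers the shared data as an external table;
# dropping the table (or the whole cluster) leaves the S3 data untouched.
spark.sql("CREATE DATABASE IF NOT EXISTS marketing")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS marketing.click_events (
        user_id  STRING,
        page     STRING,
        event_ts TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://shared-data-lake/raw/click_events/'
""")
```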

Page 12: Building A Self Service Analytics Platform on Hadoop

Architecture

Page 13: Building A Self Service Analytics Platform on Hadoop

Architecture – Data Ingestion Layer

• DB Ingestor

• Stream Ingestor
  • Kafka and Spark Streaming

• File Ingestor
  • FTP / SFTP / Logs

• Ingestion using Service API
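
A minimal sketch of the stream ingestor mentioned above, written with Spark Structured Streaming reading from Kafka (the broker, topic, and paths are illustrative; the talk does not show the actual implementation, and the Spark Kafka source package must be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-ingestor").getOrCreate()

# Read raw events from Kafka as they arrive.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "click_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# Land micro-batches on the shared S3 storage for downstream platform jobs.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://shared-data-lake/landing/click_events/")
    .option("checkpointLocation", "s3a://shared-data-lake/checkpoints/click_events/")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```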

Page 14: Building A Self Service Analytics Platform on Hadoop

Architecture – Data Processing Layer

• Storage layer carved into logical buckets
  • Landing, Raw, Derived and Delivery
  • Schema stored with data (no guesswork)

• Platform Jobs
  • Converting text to Parquet (see the sketch below)
  • Saving streaming data as Parquet
  • Derivatives
  • Compaction
  • Standardization
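
A minimal sketch of the "text to Parquet" platform job referenced above (bucket layout, schema, and date are illustrative). Reading with an explicit schema and writing Parquet means the schema travels with the data, matching the "no guesswork" point.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

# Explicit schema for the landed CSV files (hypothetical columns).
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("created_at", TimestampType()),
])

# Landing zone -> Raw zone: convert delimited text to Parquet.
landed = spark.read.csv(
    "s3a://shared-data-lake/landing/orders/2017-05-01/",
    schema=schema,
    header=True,
)

landed.write.mode("overwrite").parquet(
    "s3a://shared-data-lake/raw/orders/ingest_date=2017-05-01/"
)
```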

Page 15: Building A Self Service Analytics Platform on Hadoop

Architecture – Data Delivery Layer

• Data Delivery
  • SQL - Spark Thrift Server / Impala
  • Tableau, SQL IDE, Applications

• Self Service
  • Derivatives
    • Represented via SQL on the Delivery Layer
    • Stored in the Derived Storage Layer
    • Metadata driven (see the sketch below)

• Derived Layer Generators
  • Long running Spark Job
  • Derivative Refresh
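
A minimal sketch of a metadata-driven derivative as described above (the metadata record, table, and paths are invented for illustration): the derivative is expressed as SQL against the delivery layer, materialized into the derived storage zone, and re-registered so a long-running job can keep it fresh for SQL clients.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A derivative definition as it might be stored in the metadata store.
derivative = {
    "name": "daily_order_totals",
    "sql": """
        SELECT order_date, SUM(amount) AS total_amount
        FROM delivery.orders
        GROUP BY order_date
    """,
    "output_path": "s3a://shared-data-lake/derived/daily_order_totals/",
}

# Refresh: run the SQL and rewrite the derived-layer output.
result = spark.sql(derivative["sql"])
result.write.mode("overwrite").parquet(derivative["output_path"])

# If this runs inside the long-lived session backing the Thrift Server,
# a (re)registered view makes the refreshed derivative queryable over SQL.
result.createOrReplaceTempView(derivative["name"])
```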

Page 16: Building A Self Service Analytics Platform on Hadoop

Key Takeaways - Cloud

• Hadoop cloud readiness
  • Cloudera Director limitations
  • Multi-availability zones and regions

• Storage
  • Instance Storage
  • EBS Volumes
    • gp2 vs st1

• S3 Eventual Consistency

Page 17: Building A Self Service Analytics Platform on Hadoop

Key Takeaways - Spark Thrift Server

• Spark Thrift Server Support
  • Performance Tuning
  • Concurrency
  • Partition strategy
  • Cache Tables

• Compression Codec for Parquet (see the sketch below)
  • Snappy vs gzip
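
Two of the tuning levers above, sketched in PySpark (the codec choice and table name are examples, not recommendations from the talk):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Snappy decompresses faster, gzip compresses smaller; choose per workload.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

# Keep a frequently queried table in memory so concurrent BI queries
# served by the Thrift Server do not re-read S3 every time.
spark.sql("CACHE TABLE delivery.orders")
```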

Page 18: Building A Self Service Analytics Platform on Hadoop

Key Takeaways - Security

• Secure by Design, Secure by Default

• Access to Data on S3
  • IAM Roles

• Sentry
  • Support for Spark

• Kerberos
  • Spark Thrift Server

• Navigator
  • Support for Spark

Page 19: Building A Self Service Analytics Platform on Hadoop

Key Takeaways - General

• Rapidly Changing Technology
  • Feature addition
  • Documentation
  • Bugs
  • Jar hell

• Small files
  • Performance Issues
  • Compaction (see the sketch below)
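
A minimal sketch of a compaction job for a partition suffering from small files (paths and the target file count are illustrative; promoting the compacted output back over the original partition is left out of the sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

src = "s3a://shared-data-lake/raw/click_events/event_date=2017-05-01/"
dst = "s3a://shared-data-lake/tmp/click_events_compacted/event_date=2017-05-01/"

# Rewrite many small Parquet files as a handful of larger ones.
df = spark.read.parquet(src)
df.coalesce(8).write.mode("overwrite").parquet(dst)
```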

Page 20: Building A Self Service Analytics Platform on Hadoop

Key Takeaways - General

• Partition Strategy (see the sketch after this list)
  • Parquet Files
  • Balancing parallelism and throughput
  • Table Partitions

• Cluster sizing, optimization and tuning

• Integrating with Corporate infrastructure
  • Deployment practices
  • Monitoring and Alerting
  • Information Security Policies
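
A minimal sketch of one possible partition strategy (column and path names are illustrative): partition directories mirror the most common query filter, and the repartition keeps each output directory to a small number of reasonably sized files instead of many tiny ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-strategy").getOrCreate()

orders = spark.read.parquet("s3a://shared-data-lake/raw/orders/")

# Group rows by the partition column so each order_date directory is written
# by a single task, producing few, larger Parquet files per partition.
(orders
    .repartition("order_date")
    .write
    .partitionBy("order_date")
    .mode("overwrite")
    .parquet("s3a://shared-data-lake/derived/orders_by_date/"))
```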

Page 21: Building A Self Service Analytics Platform on Hadoop

Data Security

Page 22: Building A Self Service Analytics Platform on Hadoop

Questions

• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni