Practical Guide to Architecting Data Lakes
Presented By Avinash Ramineni
Agenda
• About Clairvoyant
• What is a Data Lake?
• Features of a Data Lake
• Tools
• Implementation Challenges
• Questions
Clairvoyant
Clairvoyant Services
What is a Data Lake?
“A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats.”
“A data lake is a central location in which to store all your data, regardless of its source or format.”
“Is a data lake a replacement for, or complementary to, an EDW?”
“Is a data lake just a storage layer?”
“Is just having a Hadoop environment a data lake?”
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Data Lake
Self Service Analytics
Data Governance
• Data Acquisition – what, when, where of data
• Data Organization – structure, format
• Data Catalog – what data exists in the lake
• Capturing Metadata
  • Data Lineage
  • Data Quality
  • Data Profile
  • Provenance of data at file and record levels
  • Business names, descriptions
• Data Provisioning
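As an illustration of how a catalog can carry the metadata listed above (business names, descriptions, format, lineage), here is a minimal sketch. The `CatalogEntry` class, its field names, and the sample datasets are hypothetical and not taken from any of the products named in this deck; a real catalog would persist this information rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One dataset in the lake, with the metadata a catalog might capture."""
    name: str            # technical dataset name
    business_name: str   # name that business users recognize
    description: str
    location: str        # where the data lives in the lake
    data_format: str
    derived_from: List[str] = field(default_factory=list)  # direct upstream datasets


def lineage(catalog: Dict[str, CatalogEntry], name: str) -> List[str]:
    """Return every upstream dataset of `name`, nearest ancestors first."""
    seen: List[str] = []
    queue = list(catalog[name].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].derived_from)
    return seen


# Hypothetical sample entries, for illustration only.
catalog = {
    "raw_orders": CatalogEntry(
        "raw_orders", "Orders (raw)", "Untouched order extracts",
        "lake/raw/orders", "csv"),
    "clean_orders": CatalogEntry(
        "clean_orders", "Orders", "Standardized orders",
        "lake/clean/orders", "parquet", derived_from=["raw_orders"]),
    "daily_revenue": CatalogEntry(
        "daily_revenue", "Daily Revenue", "Revenue aggregated per day",
        "lake/marts/daily_revenue", "parquet", derived_from=["clean_orders"]),
}

upstream = lineage(catalog, "daily_revenue")
print(upstream)  # ['clean_orders', 'raw_orders']
```

Recording only the direct parents in `derived_from` and walking them on demand keeps each entry small while still answering "where did this data come from?" for any dataset.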
Data Lineage
Data Lake Challenges
Guidelines
• Expect structured, semi-structured, and unstructured data
  • Store metadata or a tag for the location of the schema, or for unstructured data
• Store a copy of the raw input
  • A raw "first mile" copy of the data lets us recover the business, or almost all of it
  • Replay the business if we need to
• Data Standardization – data cleansing as a workflow after ingest
• Use a format that supports your data
• Automate metadata management
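The "store a copy of raw input" and "store a metadata or tag" guidelines can be sketched roughly as follows. This is a minimal local-filesystem illustration, not a prescribed implementation: the `land_raw_copy` function, the `raw/<source>/` directory layout, the `.meta.json` sidecar convention, and the sample schema path are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def land_raw_copy(payload: bytes, source: str, lake_root: Path,
                  schema_location: Optional[str] = None) -> Path:
    """Write an untouched copy of the input plus a metadata sidecar file."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    # The raw bytes are stored exactly as received so they can be replayed later.
    data_path = target_dir / f"{digest}.bin"
    data_path.write_bytes(payload)

    # The sidecar records where the schema lives, or flags the data as unstructured.
    metadata = {
        "source": source,
        "sha256": digest,
        "size_bytes": len(payload),
        "schema_location": schema_location,
        "structured": schema_location is not None,
    }
    meta_path = target_dir / f"{digest}.meta.json"
    meta_path.write_text(json.dumps(metadata, indent=2))
    return data_path


# Usage with a throwaway directory standing in for the lake (hypothetical inputs).
lake_root = Path(tempfile.mkdtemp())
raw_path = land_raw_copy(b"id,amount\n1,9.99\n", "orders", lake_root,
                         schema_location="schemas/orders.avsc")
log_path = land_raw_copy(b"free-form log line", "app_logs", lake_root)
```

Cleansing and standardization would then run as a separate workflow that reads from `raw/`, so the first-mile copy is never modified.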
Data Lake Security
Data Security
Implementation Challenges
• Change Data Capture
  • MySQL – binlog readers
  • Oracle – Tungsten
• Updating the deltas on the data lake
• Reusable Data movement workflows
  • One workflow per table? (Generate dynamic workflows based on metadata)
  • Needs to be driven off metadata
• Schema changes on the Source end
• Streaming Data
• Partitioning Strategies on the Data Lake
  • Configure them in metadata
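One way to read "generate dynamic workflows based on metadata" is that a single generic builder produces the steps for every table from a metadata record, instead of a hand-written workflow per table. The sketch below assumes a hypothetical `TableMeta` record and made-up step names; it does not reflect the API of any particular scheduler or ingestion tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TableMeta:
    """Per-table metadata that drives workflow generation (hypothetical fields)."""
    source_db: str
    table: str
    primary_key: str
    load_type: str         # "full" or "cdc"
    partition_column: str  # partitioning strategy is configured here, not in code


def build_workflow(meta: TableMeta) -> List[Dict[str, str]]:
    """Build the ordered steps for one table purely from its metadata."""
    steps = [{
        "step": "extract",
        "mode": meta.load_type,
        "from": f"{meta.source_db}.{meta.table}",
    }]
    if meta.load_type == "cdc":
        # Change-data-capture loads need the deltas merged onto existing data.
        steps.append({"step": "merge_deltas", "key": meta.primary_key})
    steps.append({
        "step": "write",
        "to": f"lake/{meta.source_db}/{meta.table}",
        "partition_by": meta.partition_column,
    })
    return steps


# Hypothetical metadata rows; in practice these would come from a metadata store.
TABLES = [
    TableMeta("sales", "orders", "order_id", "cdc", "order_date"),
    TableMeta("sales", "countries", "country_code", "full", "region"),
]

workflows = {f"{m.source_db}.{m.table}": build_workflow(m) for m in TABLES}
for name, steps in workflows.items():
    print(name, [s["step"] for s in steps])
```

Onboarding a new table, or changing how an existing one is partitioned, then becomes a metadata change rather than a code change.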
Tools / Products
• Smart Catalogs
  • Waterline Data Inventory
  • Collibra Catalog
• Data Lake Management
  • Zaloni Bedrock
  • Informatica Intelligent Data Lake
• Data Governance and Metadata Management
  • Cloudera Navigator
  • Apache Atlas
  • Collibra Data Governance
  • Oracle BigData Catalog
Data Lake Trends
• Data Lakes on Cloud
• IoT Data Lakes
• Logical Data Lakes
  • Unified view of data that exists across data stores
• Data Discovery Portals
Questions
• Principal @ Clairvoyant
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/avinashramineni