Implementing a Data Lake with Enterprise Grade Data Governance
-
Upload
hortonworks -
Category
Software
-
view
905 -
download
2
Transcript of Implementing a Data Lake with Enterprise Grade Data Governance
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Implementing a Data Lake with Enterprise Grade Data Governance
We do Hadoop.
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Your speakers
Andrew Ahn Governance Product Manager, Hortonworks
Oliver Claude CMO at Waterline
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
HDP: Data Governance We Do Hadoop
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Enterprise Data Governance Goals
GOAL: Provide a common approach to data governance across all systems and data within the organization
• Transparent Governance standards & protocols must be clearly defined and available to all
• Reproducible Recreate the relevant data landscape at a point in time
• Auditable All relevant events and assets but be traceable with appropriate historical lineage
• Consistent Compliance practices must be consistent
ETL/DQ
BPM
Business Analytics
Visualization & Dashboards
ERP
CRM SCM
MDM
ARCHIVE
Governance Framework
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Challenges WITHIN Hadoop
• No comprehensive governance within the Hadoop stack
• Mostly disjoint as each project defines its own future and there is no common framework
• Disparate tools, such as HCatalog, Ranger and Falcon provide pieces of the overall solution
• No integration with external governance frameworks
• Difficult to get right because each project is autonomous and you need insight into traditional IT
Apa
che
Pig
Apa
che
Hiv
e
Apa
che
HB
ase
Apa
che
Acc
umul
o
Apa
che
Sol
r
Apa
che
Spa
rk
Apa
che
Sto
rm
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Governance Initiative for Hadoop
ETL/DQ
BPM
Business Analytics
Visualization & Dashboards
ERP
CRM SCM
MDM
ARCHIVE
Data Governance Initiative
Common Governance Framework
1 ° ° ° ° ° ° °
° ° ° ° ° ° ° °
° °
° °
°
°
Apa
che
Pig
A
pach
e H
ive
Apa
che
HB
ase
Apa
che
Acc
umul
o A
pach
e S
olr
Apa
che
Spa
rk
Apa
che
Sto
rm
TWO Requirements
1. Hadoop must snap in to the existing frameworks and be a good citizen
2. Hadoop must also provide governance within its own stack of technologies
A group of companies dedicated to meeting these requirements in the open
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Common Data Governance Use Cases
Financial Reporting Chain of custody, Lineage Narratives
Telco Device log management, Correlation, Analysis, and Mitigation
Retail Point of sale analysis, Price optimization
Healthcare 30 day measures reporting
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Overview We Do Hadoop
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
New Project Proposal: Apache Atlas
Apache Atlas Proposed open source project aimed at solving the Hadoop data governance challenge in the open.
Key Capabilities • Data Classification • Metadata Exchange • Centralized Auditing • Search & Lineage (Browse) • Security & Policy Engine
Apache Atlas
Knowledge Store
Audit Store
Models Type-System
Policy Rules Taxonomies
Tag Based Policies
Data Lifecycle Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Energy
PPDM
Retail
PCI PII
Other
CWM
Essen%al Timeline
Phase-‐3
• Collaboration Features • Self Service • Steward Delegation • Profiling & Pattern Analysis • Visualization
Phase-‐2
• Advance audit reporting • Advanced Policy Engine • Row / Column Masking • 3rd party Metadata exchange
1H 2015 GA
• Rest API • Centralized Taxonomy • Import / export metadata • Basic Policy Rules Engine • Real-time access control • Column Level Tagging
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview Data Classification • Import or define taxonomy business-oriented annotations for data
• Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes
• Export metadata to third-party systems
Centralized Auditing • Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities
Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locates relevant data and audit event across Data Lake quickly and accurately
• Browse visualization of data set lineage allowing users to drill-down into operational, security, and provenance related information
Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes
• Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification)
Apache Atlas
Knowledge Store
Audit Store
Models Type-System
Policy Rules Taxonomies
Tag Based Policies
Data Lifecycle Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Energy
PPDM
Retail
PCI PII
Other
CWM
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Apache Atlas Overview
Knowledge Store Knowledge store categorized with appropriate business-oriented taxonomy
• Data sets & objects • Tables / Columns
• Logical context • Source, destination
Support exchange of metadata between foundation components and third-party applications/governance tools
Leverages existing Hadoop metastores
Audit Store
Policy Engine
Data Lifecycle Management
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Knowledge Store
Models Type-System
Policy Rules Taxonomies
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
Data Lifecycle Management Leverage existing investment in Apache Falcon with a focus on:
• Provenance
• Multi-cluster replication
• Data set retention/eviction
• Late data handling
• Automation
Audit Store
Models Type-System
Policy Rules Taxonomies Policy Engine
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Data Lifecycle Management
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
Audit Store Historical repository for all governance events
• Security: Access Grant & Deny
• Operational: Data Provenance & Metrics
• Indexed and Searchable
Models Type-System
Policy Rules Taxonomies Policy Engine
Data Lifecycle Management
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Audit Store
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
Security Integration with HDP Advanced Security investments to ensure compliance.
Establish global security policies based on data classification.
Leverages Ranger plug-in architecture for policy enforcement
Audit Store
Models Type-System
Policy Rules Taxonomies Policy Engine
Data Lifecycle Management
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Security
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
Policy Engine Runtime rationalization of policies rules with respect to data asset combinations and time. Fully extensible.
• Metadata based
• Geo based rules
• Time-based rules
• Hive Column Prohibitions
• Preview: Hive Row and Column Masking
Audit Store
Models Type-System
Taxonomies
Data Lifecycle Management
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Policy Rules Policy Engine
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
RESTful interface • Extensible enterprise classification of data assets,
relationships and policies organized in a meaningful way -- aligned to business organization.
• Supports exploration via user interface
• Supports extensibility via API and CLI exposure
Audit Store
Models Type-System
Policy Rules Taxonomies Policy Engine
Data Lifecycle Management
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Coming 2h 2015
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas
Knowledge Store
Apache Atlas Overview
Enhanced Audit Store Historical repository for all governance events
• Immutable file format • Events Metadata Taggable • Advanced Reporting • Security: Access Grant & Deny
• Operational: Data Provenance & Metrics
• Indexed and Searchable
Models Type-System
Policy Rules Taxonomies Policy Engine
Data Lifecycle Management
Security
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Custom
CWM
Retail
PCI PII
Other
Audit Store
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Summary
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview Data Classification • Import or define taxonomy business-oriented annotations for data
• Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes
• Export metadata to third-party systems
Centralized Auditing • Capture security access information for every application, process, and interaction with data
• Capture the operational information for execution, steps, and activities
Search & Lineage (Browse) • Pre-defined navigation paths to explore the data classification and audit information
• Text-based search features locates relevant data and audit event across Data Lake quickly and accurately
• Browse visualization of data set lineage allowing users to drill-down into operational, security, and provenance related information
Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes
• Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification)
Apache Atlas
Knowledge Store
Audit Store
Models Type-System
Policy Rules Taxonomies
Tag Based Policies
Data Lifecycle Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Energy
PPDM
Retail
PCI PII
Other
CWM
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance Ready Certification Program
Curated group of vendor partners to provide rich & complete features Customers choose features that they want to deploy – a la carte. Low switching costs ! HDP at core to provide stability and interoperability
Discovery Tagging
Prep / Cleanse
ETL
Governance BPM
Self Service
Visual-ization
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Waterline Data improves speed to value and compliance
Data Warehouse Offload
Data Science/Analytics Sandbox
Data Lake
VALUE CREATION
COST SAVINGS
Deliver a Business-Ready
Data Lake
Accelerate Data Prep Process
Govern Data in Hadoop
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
The Modern Data Architecture
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Atlas Capabilities: Overview
Apache Atlas
Knowledge Store
Audit Store
Models Type-System
Policy Rules Taxonomies
Tag Based Policies
Data Lifecycle Management
Real Time Tag Based Access Control
REST API
Services
Search Lineage Exchange
Healthcare
HIPAA HL7
Financial
SOX Dodd-Frank
Energy
PPDM
Retail
PCI PII
Other
CWM
Rest API
Business Glossary
Automated Classification (Tagging)
Automated Lineage Discovery
Profiling and Data Quality
Schema Discovery
Change Detection and Audit • Glossary • Tags • Lineage • Models
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Visual-ization
Governance Ready Certification Program
Discovery Tagging
Prep / Cleanse
ETL
Governance BPM
Self Service
Visual-ization
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Waterline Data is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Inventory
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find and Understand
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Provision
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Governance
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
Big Data IT Architect
Deliver a Business-Ready Data Lake
Data Engineer/Data Scientist
Accelerate Data Prep Process
CDO/Data Steward
Govern Data in Hadoop
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Deliver a business-ready data lake “It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.” Joe DosSantos, EMC Big Data Practice Leader
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Deliver a business-ready data lake “It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.” Joe DosSantos, EMC Big Data Practice Leader
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Accelerate data prep process “80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.” Data Engineering Director, Financial Services
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Accelerate data prep process "Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data, which in turn can help analytic teams provision the right data for their analyses.” Tony Baer, Principal Analyst, Ovum
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Govern data in Hadoop “Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.” “Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Govern data in Hadoop “The first step to governing Big Data is to build an inventory.” Sunil Soares, Managing Partner, Information Asset
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practice approach to implement an enterprise grade data lake
6. Monitor and maintain
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
1. Create and populate landing area
1 1
• Create Landing directory structure • Set up ETL processes using
Falcon to orchestrate • Implement ETL jobs using ETL
tools (Syncsort, Talend, Informatica, etc), Hadoop tools (Sqoop, Flume, etc) or FTP
Falcon
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
2. Build inventory of data
1. Create and populate landing area
2
• Crawl the cluster • Profile files • Automatically discover technical,
business, and compliance metadata at a field level
• Create Hive tables as needed • Import lineage • Export to Atlas
2
2 Falcon
HCatalog
Atlas
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
3
3
• Import business glossary terms and export new tags and updated definitions
• Synchronize Atlas and Waterline Data Inventory
• Export metadata and lineage from Hadoop to Enterprise repository
Falcon
HCatalog
Atlas
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
4
• Use Waterline Data Inventory to find sensitive data
• Create access privileges in Ranger • Encrypt or de-identify
HCatalog
Ranger
Falcon Atlas
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
5
5
5
• Create account with Kerberos, LDAP, etc.
• Set up ACLs (leverage Ranger) • Users can browse securely through
Waterline Data Inventory
5
HCatalog
Ranger
Falcon Atlas
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Best practices in deployment landscape
6. Monitor and maintain
5. Open up to users
4. Protect sensitive data
3. Integrate with enterprise metadata repository
2. Build inventory of data
1. Create and populate landing area
• Continue profiling new or changed files and sync with Atlas
• Continue monitoring for sensitive data, use Ranger to protect
• Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business Glossary
HCatalog
Ranger
Falcon Atlas
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Find, understand and govern data in Hadoop
Discover lineage and business metadata automatically, and manage metadata
CDO/Data Steward
Automate cataloging of data assets at scale, with secure provisioning to business users
Big Data Architect
Find and understand best-suited and most trusted data without having to explore every file manually
Data Engineer/Data Scientist/Business Analyst
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Questions and Answers
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Next Steps…
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Waterline Data & Hortonworks http://hortonworks.com/partner/waterline-data Joint tutorial: bit.ly/DataLakeTutorial Modern Data Architecture Paper: go.waterlinedata.com/hw-mda
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
SAN JOSE June 9-11
BRUSSELS April 15-16
• Deep-dive technical content • 65+ sessions and 5 tracks • 1,000 attendees • Sponsorships Available • Including Pre and Post event community meetups
and BOFs • Hadoop training available
• 100+ sessions and 7 tracks • Deep-dive technical content • 5,000 attendees • Sponsorships Available • Including Pre and Post event community meetups
and BOFs • Hadoop training available
www.hadoopsummit.org
The Largest Hadoop Community Events in Europe and North America