Best Practices for Unleashing the Power of Data Lakes

Post on 16-Apr-2017

#TalendConnect

Best practices for unleashing the power of data lakes

Isabelle Nuage & Christophe Toum, Big Data Products, Talend

Self-service data lake, cafeteria style

Using sensor data collected in real time to improve gas turbine reliability and operational performance, and to extend lifetime value.

Why Do We Need a Data Lake?

“Data lakes are enterprise-wide data management platforms for analyzing disparate sources of data in its native format.” (Gartner)

Business Value

Reducing cost

• ETL offload
• EDW offload/optimization
• Data archiving

Generating new opportunities

• Customer acquisition, retention…
• Real-time engagement
• Pricing optimization
• Demand forecasting
• Risk and fraud
• Predictive maintenance
• Smart products…

But Data Lakes Bring New Challenges

The rest of us

High-end users

Complexity, poor governance and control, no reuse

Data Lake – Conceptual Architecture

Acquire / Ingest

Understand & Improve

Curate & Govern

Deliver / Self-service

SCALE

Best Practices to a Successful Data Lake

Accelerate Data Ingestion

Understand & Govern Your Data

Remove Silos

Unify Data Management

Deliver Data to a Wide Audience

• Continuously refreshed data
• Continuous data delivery and data processes

Best Practices to a Successful Data Lake

• Wide connectivity
• Batch & streaming ubiquity
• Scale with volume and variety

Pitfalls:
o Hand coding
o Fragmented tools

Best Practices to a Successful Data Lake

• Add context on data (provenance, semantics…)
• Optimize data with curation, stewardship, preparation…
• Use a collaborative process

Pitfalls:
o Authoritative governance
o Inconsistent framework
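
To illustrate what adding context on data can look like, here is a minimal sketch that wraps an ingested record with provenance metadata. The `with_provenance` helper and the field names (`source`, `ingested_at`, `checksum`) are hypothetical and are not taken from the presentation or from any Talend product.

```python
import hashlib
import json
from datetime import datetime, timezone


def with_provenance(record, source):
    """Wrap a raw record with hypothetical provenance metadata."""
    # Serialize with sorted keys so the checksum does not depend on key order.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        "data": record,
        "provenance": {
            "source": source,  # where the record came from
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "checksum": hashlib.sha256(payload).hexdigest(),  # detects later changes
        },
    }


if __name__ == "__main__":
    print(with_provenance({"turbine_id": 7, "temp_c": 512.4}, "sensor-gateway"))
```

Keeping the metadata next to the record, rather than in a separate document, is one simple way to let later curation and stewardship steps see where each record originated.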

Best Practices to a Successful Data Lake

• Pervasive DQ, masking…
• Consistent operationalization
• Single platform for all use cases & personas

Pitfalls:
o Fragmented tools
o Hand coding
o Shadow IT
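
As a rough illustration of the data quality (DQ) and masking practice on this slide, the sketch below applies one toy validation rule and two masking functions to a record. All function names, the record fields, and the salt handling are assumptions made for the example. They do not represent how any particular platform implements these features.

```python
import hashlib


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"  # not a usable email address: mask everything
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def pseudonymize(value, salt):
    """Replace a value with a salted SHA-256 digest so joins on it still work."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def is_valid_record(record):
    """Toy data quality rule: customer_id is present and email contains '@'."""
    return bool(record.get("customer_id")) and "@" in record.get("email", "")


if __name__ == "__main__":
    record = {"customer_id": "c-42", "email": "alice@example.com"}
    if is_valid_record(record):
        print(mask_email(record["email"]))
        print(pseudonymize(record["customer_id"], salt="demo-salt"))
```

Defining such rules once and reusing them in every pipeline is the opposite of the hand-coding pitfall listed above.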

Best Practices to a Successful Data Lake

• Make data accessible
• Governed self-service
• Scalable operationalization

Pitfalls:
o Unmanaged autonomy
o Self-service tools for the tech-savvy

Best Practices to a Successful Data Lake

GET READY FOR CHANGE

Ingestion Best Practices

[Diagram: data sources (Transactions, Messages & Events, Logs, Sensors); pipeline stages (Collect - Distribute, Track, Streaming, Windowing, Alert); consumers (Data Analytics & Data Science, Real-time Data Visualization, Real-time Indicators / Scorecard)]
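
The windowing and alert stages named on the ingestion slide can be sketched as follows. This is a minimal illustration in plain Python over an in-memory list, not a streaming engine, and not the demo shown in the talk. The event layout `(timestamp, key, value)`, the function name, and the threshold are assumptions.

```python
from collections import defaultdict


def tumbling_window_alerts(events, window_seconds, threshold):
    """Group (timestamp, key, value) events into fixed windows and flag high averages.

    Returns a sorted list of (window_start, key, average) for every window
    whose average value exceeds `threshold`.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, key, value in events:
        # Tumbling windows: each event belongs to exactly one window.
        window_start = int(timestamp // window_seconds) * window_seconds
        sums[(window_start, key)] += value
        counts[(window_start, key)] += 1
    alerts = []
    for bucket in sorted(sums):
        average = sums[bucket] / counts[bucket]
        if average > threshold:
            alerts.append((bucket[0], bucket[1], average))
    return alerts


if __name__ == "__main__":
    sample = [
        (0, "turbine-1", 500.0),
        (30, "turbine-1", 520.0),
        (65, "turbine-1", 610.0),
        (70, "turbine-1", 630.0),
    ]
    print(tumbling_window_alerts(sample, window_seconds=60, threshold=600.0))
```

In a real deployment the same logic would run continuously on a stream, with the engine handling late data and window state.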

NYC Taxi Data Streaming

Disclaimer

• The future features described in this presentation are under consideration by Talend and are not commitments for future products, technologies, or services.
• The roadmap is subject to change and Talend does not guarantee the features or release dates.

Roadmap 2017

Addressing the needs of large enterprises

Big Data

1st on Spark 2.0 & Data Prep on Big Data

Data Prep & Data Ingestion

Cloud Self-service

Data Stewardship & Self-service connectors

Governance

Apache Atlas

Key Takeaways

• Analyze far more data to find more opportunities for innovation and transformation
• Real-time data streaming brings increased agility
• To unleash data lakes, data governance is essential

Free Trial: Talend Big Data Sandbox

• A ready-to-run Docker environment

• A step-by-step expert guide

• Real-world scenarios using Spark, Kafka, MapReduce & NoSQL

www.talend.com/BigDataSandbox

Hit the Easy Button for Hadoop, Spark and Machine Learning

Thank You