Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1....

40
Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io Building Data Pipelines with Open Source Components and Services Heikki Nousiainen Open Source Summit Japan 2018

Transcript of Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1....

Page 1: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Building Data Pipelines with Open Source Components and Services

Heikki Nousiainen

Open Source Summit Japan 2018

Page 2: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Agenda

1. Introduction

2. Motivation

3. Challenges

4. Open Source

5. Data pipelines - Old and New

6. Build or Buy

7. Conclusions

8. Q & A

This presentation was created by Aiven Ltd - https://aiven.io.Product and vendor logos used for identification purposes only.

Page 3: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Introduction

Page 4: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Speaker

● Heikki Nousiainen

● CTO, co-founder @ Aiven, a cloud DBaaS company

● Previously: software architect - cloud

transformation, distributed systems

● Open source user and fan since 1995

@hnousiainen

Page 5: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Aiven

● Independent Database as a Service provider

● Based in Helsinki and Boston

● 8 database systems available in 70+ regions

around the world

https://aiven.io

@aiven_io

Page 6: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

Page 7: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

You’ve all heard it,

Data is your most valuable asset

“Data is the new oil”

Data is disrupting every industry

...but let’s take a look at some actual uses

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 8: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

Car-as-a-Sensor

Traffic and road condition detection &

routing

Vacant parking space locator

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 9: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

Welding Management

- Procedures

- Qualification verification

- 100% traceability

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 10: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

Home / commercial automation

Smart Locks & Entry Controls

Environment sensing and management, lighting

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 11: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Motivation

Predicting performance and proactively

preventing downtime.

Pay-per-use models with SLA.

Fuel consumption management.

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 12: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Challenges

Page 13: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Challenges

Area

● Liveliness

● Volume & Velocity

● Data/system lifespan

● Changing business requirements

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Requirements

● Low latency / Real-time eventing

○ Interactive usage

○ Environmental awareness

○ Routing decisions

● Batch

○ Analytics

○ Reporting

○ Research

Page 14: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Challenges

Area

● Liveliness

● Volume & Velocity

● Data/system lifespan

● Changing business requirements

Requirements

● Billions of messages and terabytes

of data 24/7

● 2013, 787 Dreamliner, 1TB data

per flight. 150 units / year.

● 2018, Audi Concept, 4TB data per

day per car. 2M units / year.

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 15: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Challenges

Area

● Liveliness

● Volume & Velocity

● Data/system lifespan

● Changing business requirements

Requirements

● Production systems have long

lifespans

○ Car ~15-20 years from design

to disposal

○ Sea vessel 25-30 years

● Collected and consumed data differ

● Software / hardware upgrades

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 16: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Challenges

Area

● Liveliness

● Volume & Velocity

● Data/system lifespan

● Changing business requirements

Requirements

● New services derived from data

● New sources/sinks for data

● Discontinued systems

● New experiments

● Legal landscape changes

● New/disbanded teams

● Acquisitions / integrations

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 17: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Open Source

Page 18: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Open Source

● Open Source product development pace trumps that of any private undertaking

○ A lot of data management innovation happens in Open Source

○ Open Source is quick to absorb innovation from any source

○ The pragmatic evolutionary development cycles are efficient in improving quality

● Using Open Source guarantees continued access to business critical data

○ Avoid lock-in to a single vendor

○ Even with the 10 - 20 year lifespans

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 19: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Data Pipelines - Old and New

Page 20: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Common components of a data pipeline

Typical parts of a data pipeline

● Data ingestion

● Filtering & Enrichment

● Routing

● Processing

● Querying / Visualization / Reporting

● Data warehousing

● Reprocessing capabilities

Typical requirements

● Scalability

○ Billions of messages and

terabytes of data 24/7

● Availability and redundancy

○ Across physical locations

● Latency

○ Real-time / batch

● Adaptability / Platform support

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 21: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

“Traditional” data flow model

OLTP DB

ReportDB

Metrics DB

MicroservicesBilling systemsPublic REST API

Web clients AnalyticsReporting apps

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 22: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

“Traditional” data flow model

Caches Docstore

External clouds

OLTP DB

Data WH

Metrics DB

MicroservicesBilling systemsPublic REST API

Web clients AnalyticsReporting apps

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 23: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Message-centric data flow model

Caches Docstore

External cloudsOLTP

DBData WH

Metrics DB

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

Message Bus / Router

MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 24: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Message-centric data flow model

Caches Docstore

External cloudsOLTP

DBData WH

Metrics DB

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

?MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 25: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Message-centric data flow model

Caches Docstore

External cloudsOLTP

DBData WH

Metrics DB

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 26: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Apache Kafka

Apache Kafka is an open source stream processing platform.

"The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds."

Originally developed by LinkedIn, open sourced in 2011, now a top-level Apache project. Nowadays used by e.g. New York Times, Pinterest, Zalando, Airbnb, Shopify, Spotify and many others for event streaming. [See https://kafka.apache.org/powered-by for more.]

Kafka excels as a centerpiece for event delivery, where a range of applications can produce and consume real-time event streams.

https://kafka.apache.org/

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 27: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Kafka concepts

1 2 3 4 5 6 7 8 9 10 11 12metrics_v1 #1

40 41 42 43 44 45 46 47 48 49 50metrics_v1 #0

logs_v2 #0 0 1 2 3 4 5 6 7 8 9

Devices, VMs..

Monitoring and alert systems Log aggregator Data

warehouse

Consumers and consumer groups

Producers

Logs of topics and partitions

A key abstraction in Kafka is its commit log, where each consumer maintains maintains its own position in the log. This allows clean decoupling of the producing and consuming processes.

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 28: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Kafka-centric data flow model

Caches Docstore

External cloudsOLTP

DBData WH

Metrics DB

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 29: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Kafka Connect

Framework for importing data from other systems and services - Sources - to Kafka and exporting to other services and systems - Sinks.

The framework makes it easy to create and share connectors in Open Source.

A host of connectors are available:

BigQuery, Cassandra, DynamoDB, Elasticsearch, Github, IOTHub / Azure, JDBC, JMS, Kinesis, PubSub / Google, MQTT, MySQL CDC, PostgreSQL CDC, RabbitMQ, Redshift, Redis, SalesForce, SAP, Solr, Splunk, SQS, Syslog, Twitter, Vertica

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 30: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Databases in the Pipeline

● Specialized Open Source database technologies available for different use cases

● Consider the same requirements as for the streaming platform:

- Access patterns: transactional, relational

- Scalability

- Reliability

- Adaptability / Platform support / SDKs & Libraries

- Available competencies

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 31: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Databases in the Pipeline

External clouds

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 32: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Kafka Streams

● Kafka Streams is an SDK / library for building application that process data in

real-time

● DSL for defining streams & processing steps

● Supports abstraction for stream of events, but also tables and state.

● Stateless and stateful transformations

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

KStream<String, String> textLines = builder.stream("InputLinesTopic"); KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+"))) .groupBy((key, word) -> word) .count("Counts"); wordCounts.to(Serdes.String(), Serdes.Long(), "WordsWithCountsTopic");

Page 33: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

KSQL: SQL engine for Kafka

● KSQL allows performing continuous queries and transformation using SQL syntax

● Standalone service using Kafka APIs typically running as its own cluster next to Kafka

0 1 2 3 4 5 6 7 8 9 10log_stream

KSQL

CREATE STREAM log_stream_origin(status bigint, path varchar) WITH (kafka_topic='log_stream', value_format='DELIMITED');

SELECT status, path FROM log_stream_origin WHERE status >= 400;

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 34: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

The Open Source Pipeline

External clouds

MicroservicesCustom Apps

REST App

Web clients

AnalyticsReporting apps

MQTT

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 35: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Build or Buy

Page 36: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Data management is hard to do well

● Management of stateful systems requires specialized personnel and 24/7 response capability.

● Failures are difficult to predict and can have extremely high impact on business.

● Managed clouds services allow users to stay focused on their core business without worrying about software infrastructure.

● Open Source solutions allow one to move between in-house and managed offering.

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 37: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Managed Open Source Solutions

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 38: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Conclusions

Page 39: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Summary

● A lot of hype around data, but it’s still real deal and to be taken seriously

● The challenges with data management are both technical and temporal

● Open Source is the best bet to meet the data management challenges

● Kafka as the central component of a data pipeline helps clean up messy architectures

● A host of good Open Source database solutions can help to meet the data storage

and access needs

● You can leverage a host of managed service providers or build your own capability

● With Open Source, you have the option to revisit that choice at any time

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io

Page 40: Building Data Pipelines with Open Source Components and Services · 2019-12-21 · Agenda 1. Introduction 2. Motivation 3. Challenges 4. Open Source 5. Data pipelines - Old and New

Thanks!

https://aiven.io

@aiven_io

@hnousiainen

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https.//aiven.io