Building Data Pipelines with Open Source Components and Services · 2019-12-21

Building Data Pipelines with Open Source Components | Open Source Summit Japan 2018 | https://aiven.io

Building Data Pipelines with Open Source Components and Services

Heikki Nousiainen

Open Source Summit Japan 2018

Agenda

1. Introduction

2. Motivation

3. Challenges

4. Open Source

5. Data pipelines - Old and New

6. Build or Buy

7. Conclusions

8. Q & A

This presentation was created by Aiven Ltd - https://aiven.io. Product and vendor logos used for identification purposes only.

Introduction

Speaker

● Heikki Nousiainen

● CTO, co-founder @ Aiven, a cloud DBaaS company

● Previously: software architect - cloud transformation, distributed systems

● Open source user and fan since 1995

@hnousiainen

● Independent Database as a Service provider

● Based in Helsinki and Boston

● 8 database systems available in 70+ regions around the world

https://aiven.io

@aiven_io

Motivation

You’ve all heard it,

Data is your most valuable asset

“Data is the new oil”

Data is disrupting every industry

...but let’s take a look at some actual uses

Motivation

Car-as-a-Sensor

Traffic and road condition detection & routing

Vacant parking space locator

Motivation

Welding Management

- Procedures

- Qualification verification

- 100% traceability

Motivation

Home / commercial automation

Smart Locks & Entry Controls

Environment sensing and management, lighting

Motivation

Predicting performance and proactively preventing downtime.

Pay-per-use models with SLA.

Fuel consumption management.

Challenges

● Liveliness

● Volume & Velocity

● Data/system lifespan

● Changing business requirements

Requirements

● Low latency / Real-time eventing

○ Interactive usage

○ Environmental awareness

○ Routing decisions

● Batch

○ Analytics

○ Reporting

○ Research

Challenges

● Volume & Velocity

Requirements

● Billions of messages and terabytes of data 24/7

● 2013, 787 Dreamliner, 1TB data per flight. 150 units / year.

● 2018, Audi Concept, 4TB data per day per car. 2M units / year.
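To put the car figures above in perspective, the arithmetic can be sketched as follows. This is a back-of-the-envelope calculation, not a measurement: it assumes decimal terabytes, assumes every car from one production year generates the quoted 4TB on the same day, and the class and method names are made up for illustration.

```java
public class DataVolumeSketch {
    // Assumption: decimal terabytes (10^12 bytes).
    static final long BYTES_PER_TB = 1_000_000_000_000L;
    static final long SECONDS_PER_DAY = 86_400L;

    /** Sustained rate in bytes per second for a given daily volume in TB. */
    public static double bytesPerSecond(long tbPerDay) {
        return (double) tbPerDay * BYTES_PER_TB / SECONDS_PER_DAY;
    }

    /** Total daily volume in TB when every unit produces tbPerDay. */
    public static long fleetTbPerDay(long units, long tbPerDay) {
        return units * tbPerDay;
    }

    public static void main(String[] args) {
        // One car at 4TB per day.
        System.out.printf("Per car: %.1f MB/s sustained%n", bytesPerSecond(4) / 1_000_000);
        // Two million cars (one production year) at 4TB per day each.
        System.out.printf("Fleet: %d TB per day%n", fleetTbPerDay(2_000_000, 4));
    }
}
```

Under these assumptions a single car averages roughly 46 MB/s around the clock, and one production year of cars adds up to 8,000,000 TB per day, which is why volume and velocity are treated as a challenge of their own.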

Challenges

● Data/system lifespan

Requirements

● Production systems have long lifespans

○ Car ~15-20 years from design to disposal

○ Sea vessel 25-30 years

● Collected and consumed data differ

● Software / hardware upgrades

Challenges

● Changing business requirements

Requirements

● New services derived from data

● New sources/sinks for data

● Discontinued systems

● New experiments

● Legal landscape changes

● New/disbanded teams

● Acquisitions / integrations

Open Source

● Open Source product development pace trumps that of any private undertaking

○ A lot of data management innovation happens in Open Source

○ Open Source is quick to absorb innovation from any source

○ The pragmatic evolutionary development cycles are efficient in improving quality

● Using Open Source guarantees continued access to business critical data

○ Avoid lock-in to a single vendor

○ Even with the 10 - 20 year lifespans

Data Pipelines - Old and New

Common components of a data pipeline

Typical parts of a data pipeline

● Data ingestion

● Filtering & Enrichment

● Routing

● Processing

● Querying / Visualization / Reporting

● Data warehousing

● Reprocessing capabilities

Typical requirements

● Scalability

○ Billions of messages and terabytes of data 24/7

● Availability and redundancy

○ Across physical locations

● Latency

○ Real-time / batch

● Adaptability / Platform support

“Traditional” data flow model

[Diagram components: OLTP DB, Report DB, Metrics DB, Microservices, Billing systems, Public REST API, Web clients, Analytics, Reporting apps]

“Traditional” data flow model

[Diagram components: Caches, Docstore, External clouds, OLTP DB, Data WH, Metrics DB, Microservices, Billing systems, Public REST API, Web clients, Analytics, Reporting apps]

Message-centric data flow model

[Diagram components: Caches, Docstore, External clouds, OLTP DB, Data WH, Metrics DB, Microservices, Custom Apps, REST App, Web clients, Analytics, Reporting apps, Message Bus / Router]

Apache Kafka

Apache Kafka is an open source stream processing platform.

"The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds."

Originally developed by LinkedIn, open sourced in 2011, now a top-level Apache project. Nowadays used by e.g. New York Times, Pinterest, Zalando, Airbnb, Shopify, Spotify and many others for event streaming. [See https://kafka.apache.org/powered-by for more.]

Kafka excels as a centerpiece for event delivery, where a range of applications can produce and consume real-time event streams.

https://kafka.apache.org/

Kafka concepts

[Diagram: Producers (devices, VMs, ...) write to logs of topics and partitions (metrics_v1 #0, metrics_v1 #1, logs_v2 #0, each a numbered sequence of records); consumers and consumer groups (monitoring and alert systems, log aggregator, data warehouse) read from them]

A key abstraction in Kafka is its commit log, where each consumer maintains its own position in the log. This allows clean decoupling of the producing and consuming processes.
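The idea of a shared log with per-consumer positions can be sketched in a few lines of plain Java. This is a toy in-memory model for illustration only, not the Kafka client API: the class `CommitLogSketch`, its methods, and the consumer names are all hypothetical, and real Kafka adds partitions, persistence, replication, and consumer groups on top.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only log where every consumer tracks its own read position. */
public class CommitLogSketch {
    private final List<String> records = new ArrayList<>();
    // Next offset to read, kept separately for each consumer name.
    private final Map<String, Integer> positions = new HashMap<>();

    /** Appends a record and returns the offset it was stored at. */
    public int append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Returns up to maxRecords unread records for the consumer and advances its position. */
    public List<String> poll(String consumer, int maxRecords) {
        int start = positions.getOrDefault(consumer, 0);
        int end = Math.min(records.size(), start + maxRecords);
        positions.put(consumer, end);
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        CommitLogSketch log = new CommitLogSketch();
        log.append("cpu=0.41");
        log.append("cpu=0.93");
        log.append("cpu=0.52");

        // A fast consumer reads everything; a slow one reads a single record.
        System.out.println("alerts:    " + log.poll("alerts", 10));
        System.out.println("warehouse: " + log.poll("warehouse", 1));
        // The slow consumer later resumes from its own position.
        System.out.println("warehouse: " + log.poll("warehouse", 10));
    }
}
```

The producer never needs to know who reads the log or how far each reader has progressed, which is the decoupling the slide refers to.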

Kafka-centric data flow model

[Diagram components: Caches, Docstore, External clouds, OLTP DB, Data WH, Metrics DB, REST App, Web clients]

Kafka Connect

A framework for importing data into Kafka from other systems and services (Sources) and for exporting data from Kafka to other systems and services (Sinks).

The framework makes it easy to create and share connectors in Open Source.

A host of connectors are available:

BigQuery, Cassandra, DynamoDB, Elasticsearch, Github, IOTHub / Azure, JDBC, JMS, Kinesis, PubSub / Google, MQTT, MySQL CDC, PostgreSQL CDC, RabbitMQ, Redshift, Redis, SalesForce, SAP, Solr, Splunk, SQS, Syslog, Twitter, Vertica

Databases in the Pipeline

● Specialized Open Source database technologies available for different use cases

● Consider the same requirements as for the streaming platform:

- Access patterns: transactional, relational

- Scalability

- Reliability

- Adaptability / Platform support / SDKs & Libraries

- Available competencies

Databases in the Pipeline

[Diagram components: External clouds, REST App, Web clients]

Kafka Streams

● Kafka Streams is an SDK / library for building applications that process data in real time

● DSL for defining streams & processing steps

● Supports abstractions for streams of events, as well as tables and state.

● Stateless and stateful transformations

// Read lines from the input topic
KStream<String, String> textLines = builder.stream("InputLinesTopic");
// Split lines into words, group by word, and keep a running count
KTable<String, Long> wordCounts = textLines
    .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .count("Counts");
// Write the counts to the output topic
wordCounts.to(Serdes.String(), Serdes.Long(), "WordsWithCountsTopic");
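For readers without a Kafka cluster at hand, the same split, group, and count steps can be tried on an in-memory list with the standard `java.util.stream` API. This is only an analogy under stated assumptions: the class `WordCountSketch` is hypothetical, the input is a finite list rather than an unbounded topic, and the result is computed once instead of being updated continuously as Kafka Streams would do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {
    /** Counts lowercase words across all lines; the result is sorted by word. */
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Stateless step: split each line into lowercase words.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                // Drop empty tokens that split() yields for leading punctuation.
                .filter(word -> !word.isEmpty())
                // Stateful step: group by word and count occurrences.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("Hello Kafka", "hello streams")));
    }
}
```

The split step is stateless because each line is handled on its own, while the count is stateful because it has to remember what it has seen so far; in Kafka Streams that memory lives in a state store.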

KSQL: SQL engine for Kafka

● KSQL allows performing continuous queries and transformations using SQL syntax

● A standalone service that uses the Kafka APIs, typically running as its own cluster next to Kafka

[Diagram: the log_stream topic shown as a sequence of records numbered 0 to 10]

CREATE STREAM log_stream_origin(status bigint, path varchar) WITH (kafka_topic='log_stream', value_format='DELIMITED');

SELECT status, path FROM log_stream_origin WHERE status >= 400;
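What the query above does to each record can be sketched in plain Java. This is an illustration only, not how KSQL is implemented: the class `ErrorFilterSketch` is hypothetical, it assumes each record is a comma-delimited `status,path` line as the DELIMITED format suggests, and it filters a finite list where KSQL would keep running against the stream.

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorFilterSketch {
    /** Keeps only "status,path" lines whose status is 400 or greater. */
    public static List<String> errorsOnly(List<String> delimitedLines) {
        List<String> out = new ArrayList<>();
        for (String line : delimitedLines) {
            // First field is the numeric status, the remainder is the path.
            String[] fields = line.split(",", 2);
            long status = Long.parseLong(fields[0].trim());
            if (status >= 400) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = List.of("200,/index.html", "404,/missing", "500,/api/orders");
        System.out.println(errorsOnly(records));
    }
}
```

The appeal of KSQL is that this per-record logic is expressed as a single SQL statement and no application has to be written or deployed.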

The Open Source Pipeline

[Diagram components: External clouds, REST App, Web clients]

Build or Buy

Data management is hard to do well

● Management of stateful systems requires specialized personnel and 24/7 response capability.

● Failures are difficult to predict and can have extremely high impact on business.

● Managed cloud services allow users to stay focused on their core business without worrying about software infrastructure.

● Open Source solutions allow one to move between in-house and managed offerings.

Managed Open Source Solutions

Conclusions

Summary

● There is a lot of hype around data, but it is still the real deal and should be taken seriously

● The challenges with data management are both technical and temporal

● Open Source is the best bet to meet the data management challenges

● Kafka as the central component of a data pipeline helps clean up messy architectures

● A host of good Open Source database solutions can help to meet the data storage and access needs

● You can leverage a host of managed service providers or build your own capability

● With Open Source, you have the option to revisit that choice at any time

Thanks!

https://aiven.io

@aiven_io

@hnousiainen
