The Evolution of Data Architecture

Wei-Chiu Chuang, 2017.10 @ NCKU


Who’s Wei-Chiu?

Data Value Chain

AI

Machine Learning

Data Science

Analytics

Big Data

Decision making

Insight

Automated decision making

Hype (?)


Data is the new Oil

https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource


Fastest way to transmit 5MB of data in 1956


Fast forward 60 years… transmit 100PB of data in 2016

Once upon a time, processors doubled in speed every 18 months…

“Moore’s Law” stopped 10 years ago.

CPU, RAM and disk have almost stopped improving in speed ever since.


Processor speed has been stagnant

But data is being generated at an ever-increasing speed.

Hardware improvement cannot keep up with data generation.

Multi-threaded and distributed systems are a must.


Distributed Systems are hard

Programmability

Scalability

Consistency

Availability

Partition Tolerance

Fault Tolerance


Big Data / Parallel Computing / Distributed Systems

[Figure labels: HPC, Big Data, Cloud, Distributed Systems]


Scale out


Modern Data Architecture

How do you:

transmit

collect

store

compute

Petabyte+ storage on 1000+ compute nodes?
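
A back-of-the-envelope calculation hints at why this question leads to scaling out. The sketch below assumes a sequential read rate of 100 MB/s per node, evenly spread data, and perfect parallelism; these figures are illustrative assumptions, not numbers from the talk.

```python
# Back-of-the-envelope: how long does it take to read 1 PB once?
# ASSUMPTIONS (not from the talk): each node reads 100 MB/s sequentially,
# data is spread evenly, and all nodes work fully in parallel.

PETABYTE = 10**15          # bytes (decimal petabyte)
READ_RATE = 100 * 10**6    # bytes per second per node (assumed)


def scan_seconds(total_bytes: int, nodes: int, rate: int = READ_RATE) -> float:
    """Seconds to read total_bytes when split evenly across `nodes` nodes."""
    return total_bytes / (nodes * rate)


single = scan_seconds(PETABYTE, nodes=1)
cluster = scan_seconds(PETABYTE, nodes=1000)

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 3600:.1f} hours")
```

Under these assumptions a single node needs months for one pass over the data, while 1000 nodes need a few hours. Real clusters fall short of perfect parallelism, so treat the second figure as a lower bound.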


Modern Data Center

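[Figure: data center network hierarchy. Each top-of-rack (ToR) switch serves a rack of servers (Server1 … Server10); the other tiers shown are aggregation switches (Aggr), core (Core) and AR, connecting to the Internet. Link speeds labelled 10Gbps, 10Gbps and 1Gbps.]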


GFS

Master–slave architecture

Separation of control plane and data plane

Low-cost, commodity hardware

Failures are the norm, rather than exceptions

Balance availability and network partition tolerance

[Figure: a GFS client, the GFS master and GFS chunkservers, with the example file path /foo/bar; control messages and data messages are drawn as separate flows.]
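
The separation of control plane and data plane can be illustrated with a toy model: the client asks the master only for metadata (which chunks make up a file and where the replicas live), then fetches the bytes directly from chunkservers. Everything below is a simplified sketch; the class names, the 4-byte chunk size, and the round-robin placement are hypothetical and are not the GFS API.

```python
# Toy model of the control-plane / data-plane split.
# All class and function names here are hypothetical, not the GFS API.
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes per chunk; tiny so the example is easy to follow


@dataclass
class ChunkServer:
    """Data plane: stores chunk bytes keyed by chunk handle."""
    chunks: dict = field(default_factory=dict)

    def read(self, handle: str) -> bytes:
        return self.chunks[handle]


@dataclass
class Master:
    """Control plane: maps a file path to (chunk handle, replica servers)."""
    files: dict = field(default_factory=dict)

    def locate(self, path: str) -> list:
        return self.files[path]


def write_file(master, servers, path, data, replicas=2):
    """Split data into chunks and place each chunk on `replicas` servers."""
    layout = []
    for i in range(0, len(data), CHUNK_SIZE):
        index = i // CHUNK_SIZE
        handle = f"{path}#{index}"
        start = index % len(servers)
        holders = [servers[(start + r) % len(servers)] for r in range(replicas)]
        for server in holders:
            server.chunks[handle] = data[i:i + CHUNK_SIZE]
        layout.append((handle, holders))
    master.files[path] = layout


def read_file(master, path, failed=()):
    """Ask the master for metadata only, then fetch bytes from chunkservers."""
    out = b""
    for handle, holders in master.locate(path):        # control message
        alive = [s for s in holders if all(s is not f for f in failed)]
        out += alive[0].read(handle)                    # data message
    return out


servers = [ChunkServer() for _ in range(3)]
master = Master()
write_file(master, servers, "/foo/bar", b"hello, chunks!")
print(read_file(master, "/foo/bar"))
print(read_file(master, "/foo/bar", failed=[servers[0]]))
```

Because every chunk has two replicas on different servers, the second read still succeeds with one server marked as failed, which is the point of treating failures as the norm.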


MapReduce

A very simple yet powerful distributed programming model

Shared-nothing architecture

Programmability

Data locality: ship compute to data, rather than shipping data to compute

Fault tolerance: intermediate state is stored in storage. Failed tasks can be restarted easily.

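[Figure: MapReduce execution. Input files (Split 0, Split 1, Split 2) are read by workers in the map phase, which write intermediate files; workers in the reduce phase read them and write the output files (Output 0, Output 1). The master assigns map and reduce tasks to workers.]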
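
The programming model can be sketched in a single process with the classic word-count example. This is a minimal illustration, not the Hadoop or Google API: the function names and the CRC32-based partitioner are assumptions made for the example.

```python
# Single-process sketch of the MapReduce model (word count).
# Function names are illustrative; this is not the Hadoop or Google API.
import zlib
from collections import defaultdict


def map_fn(split: str):
    """Map: emit (word, 1) for every word in one input split."""
    for word in split.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: list) -> tuple:
    """Reduce: sum all counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(splits, num_reducers=2):
    # Map phase: each split is processed independently, so a failed map
    # task could simply be re-run on the same split.
    intermediate = [list(map_fn(s)) for s in splits]

    # Shuffle: partition intermediate pairs by key so that every value
    # for a given key reaches the same reducer (CRC32 is an assumed choice).
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in intermediate:
        for key, value in pairs:
            index = zlib.crc32(key.encode()) % num_reducers
            partitions[index][key].append(value)

    # Reduce phase: one output per reducer, like Output 0 / Output 1.
    return [dict(reduce_fn(k, v) for k, v in p.items()) for p in partitions]


splits = ["the quick brown fox", "the lazy dog", "the fox"]
outputs = run_mapreduce(splits)
totals = {k: v for out in outputs for k, v in out.items()}
print(totals["the"], totals["fox"])  # 3 2
```

The user writes only `map_fn` and `reduce_fn`; splitting, shuffling, and re-running failed tasks belong to the framework, which is where the programmability comes from.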


Hadoop


GFS and MapReduce inspired Hadoop

Initially developed by Yahoo!

Released in 2006.

Used by most large enterprises

Hadoop 3.0 beta 1!


Evolution of the Hadoop Platform

[Figure: timeline of the growing stack; each year adds components on top of the previous year's stack.]

2006: Core Hadoop (HDFS, MapReduce)

2007: + Solr, Pig

2008: + HBase, ZooKeeper

2009: + Hive, Mahout

2010: + Sqoop, Avro

2011: + Flume, Bigtop, Oozie, HCatalog, Hue, YARN

2012: + Spark, Tez, Impala, Kafka, Drill

2013: + Parquet, Sentry

2014: + Knox, Flink

2015: + Kudu, RecordService, Ibis, Falcon

The stack is continually evolving and growing!


Mix and match

Resource Management: YARN, Mesos, Kubernetes

Storage: HDFS, HBase, Kudu, S3, ADLS

Compute: MapReduce, Hive, Impala, Spark, Presto, Pig, Drill, Solr, Storm

Ingest: Kafka, Flume, Beam, Samza


Open source in infra & platform


Why open source?

It’s free ($$$)

No vendor lock-in.

Faster development and faster adoption.

A new approach to foster collaboration.

Open source software is becoming the standard.


Sell open source software, really?

Water is free, but bottled water is not.

Cloudera sells the “bottle”

Cloudera’s Distribution of Hadoop.

The integration of software.

The support and services.

The management software is proprietary. The OSS is free of charge.


Market for open source software?


[Chart: revenue (million USD) of Hortonworks, Cloudera and MongoDB for FY2015, FY2016, FY2017 and FY2018 (f); vertical axis from 0 to 400.]

Open Source Business Model

• Dual licensing: MySQL

• Support + services: Red Hat, Hortonworks

• Open core: Java EE, Qt

• Software as a Service: Databricks, Amazon AWS, Microsoft Azure

• Advertising-supported: Google Chrome, Android

• Hybrid open source software: Cloudera, Confluent, MongoDB


Use Cases


“Big Data” finds many applications across many industries

IT, Healthcare, Transportation, Retail, Utilities, Telecom, Public sector, Manufacturing


Applications and Use cases

Realtime database for serving internet traffic

Internet services (Facebook messenger), Twitter, Uber, Airbnb …

Data analytics

Assist in the development of new drugs by analyzing millions of medical records

Data science / Machine learning

Fraud detection

Anti-money laundering

Cybersecurity


Fraud Detection System using Hadoop

The Cloudera Platform for IoT – Data Mgmt. Value Chain

Data Sources → Data Ingest → Data Storage & Processing → Serving, Analytics & Machine Learning

ENTERPRISE DATA HUB

Apache Kafka: Stream or batch ingestion of IoT data

Apache Sqoop: Ingestion of data from relational sources

Apache Hadoop: Storage (HDFS) & deep batch processing

Apache Kudu: Storage & serving for fast changing data

Apache HBase: NoSQL data store for real time applications

Apache Impala: MPP SQL for fast analytics

Cloudera Search: Real time search

Apache Spark: Stream & iterative processing, ML

Data sources: Connected Things / Data Sources, Other Data Sources

Security, Scalability & Easy Management

Deployment Flexibility: Datacenter, Cloud

IoT Use Case 1:

Predictive Maintenance

Predictive Maintenance on Thousands of Industrial Machines in Real Time

Challenge:

• Collect and analyze data from thousands of diverse manufacturing systems in real-time

Solution:

• iTrak application using Cloudera in the Cloud to monitor the performance of individual manufacturing systems in real-time

• Predictive maintenance: proactively identifying & fixing issues before machines break

MANUFACTURING » INDUSTRIAL IoT » PREDICTIVE MAINTENANCE » IMPROVED EFFICIENCIES

Industrial IoT – Predictive Maintenance

DATA-DRIVEN PROCESS

CASE STUDY

DATA-DRIVEN PRODUCTS

Use Case 2:

Connected Vehicles

Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime

Challenge:

• Monitor the health of 180,000+ trucks in real-time in order to minimize downtime

Solution:

• OnCommand Connection collecting telematics and geolocation data across thousands of trucks

• Identify and correct engine problems early, and increase fleet uptime

• Reduced maintenance costs to $.03 per mile from $.12-$.15 per mile

Connected Vehicles & Telematics

DATA-DRIVEN PROCESS

CASE STUDY

DATA-DRIVEN PRODUCTS

TRANSPORTATION » PREDICTIVE MAINTENANCE » TELEMETRY » LOWER TCO

Use Case 3:

Smart Cities & Smart Infrastructure

Enabling the State of Kentucky to manage snow and ice events in real time

Challenge:

• Kentucky Transportation Cabinet (KYTC) oversees the state’s transportation system, which includes 27,000 miles of highways, 230 airports and heliports, and more than three million drivers.

• Needed a more efficient approach to road management in inclement weather

Solution:

• KYTC has built a real-time weather response system that incorporates real-time data from Waze, HERE, ESRI’s GeoEvent processor, and Automatic Vehicle Locations (providing sensor data from salt trucks).

• KYTC aggregates 15-20 million records every day and processes more than a million records per second.

Data Driven Dept. of Transportation

Source: http://www.routefifty.com/2016/09/data-drives-government/131821/

2016 Data Impact Award Winner

State of Kentucky Department of Transportation

Use Case 4:

Connected Healthcare

Improve Parkinson's Disease Monitoring and Treatment through IoT

Challenge:

• Collect and analyze data from wearables (more than 300 readings per second) from thousands of patients in real-time

Solution:

• Cloudera on Intel architecture to detect patterns in patient data streaming from wearables

• Continuously monitor the patients and symptoms to understand the progression of the disease objectively

HEALTHCARE » WEARABLES » PREDICTIVE ANALYTICS » IMPROVED CARE

Connected Healthcare

DATA-DRIVEN PROCESS

CASE STUDY

DATA-DRIVEN PRODUCTS

Building a Holistic Picture of the US Securities Market From 50 Billion Daily Events

• Saving $10-20M in operational efficiencies annually

• 90-minute queries run in 10 seconds

• Supporting future market growth and a dynamic regulatory environment.

CUSTOMER 360

Using Big Data to Help Consumers Save Hundreds of Millions in Utility Bills

• Relevant insight into household energy use improves energy consciousness

• 2.7+ TWh (terawatt hours) saved to date

• Motivated consumers to save enough energy to power every household in Salt Lake City and St. Louis for a year

CUSTOMER 360

ENERGY & UTILITIES » PRODUCT INNOVATION » SERVICE IMPROVEMENT » IOT

Saving Lives by Detecting Sepsis Early Enough for Successful Treatment

• Builds a more complete picture of patients, conditions, and trends

• Has saved 100s of lives already

• Reduces hospital readmissions

• 2PB+ in multi-tenant environment supporting 100s of clients

• Secure yet explorable

HEALTHCARE » 360° CUSTOMER VIEW » PREDICTIVE ANALYTICS » IMPROVED SERVICE

Improving Pediatric Care and Outcomes

• Quantifying effect of ambient noise on children’s vital signs

• Identifying cancerous genome variants in 20 minutes (vs. days before)

• Performing fewer CT scans and higher quality surgeries

CUSTOMER 360

HEALTHCARE » MACHINE LEARNING » IOT » 360° CUSTOMER VIEW

Government Revenue Service

Increasing Customer Convenience

• Provides view of the complete taxpayer journey

• Creates ability to pre-populate tax returns for increased ease of use

• Supports move to near-real-time oversight of operations and faster response

CUSTOMER 360

GOVERNMENT » SERVICE IMPROVEMENT » PROCESS IMPROVEMENT » 360° CUSTOMER VIEW

Driving Growth and Innovation

• Combines 80+ years’ data spanning all business units and 50 states

• Expedites holistic analysis and reports by 500X

• Enables more accurate and detailed predictive models to customize offers, optimize pricing, and minimize risk

CUSTOMER 360

INSURANCE » 360° CUSTOMER VIEW » FRAUD DETECTION » PREDICTIVE ANALYTICS

Re-Platformed 1,600 Operational Databases & Systems onto a Cloudera EDH

• Business & consumer data was spread over a dozen different customer databases

• One daily ETL job (processing 1 billion customer records) used to take 24 hours

• Increased data velocity by 15x (5 times the data in 1/3 of the time)

• Now completes in 1 ½ hours

• BT now has access to the most up-to-date and centralized data for all their customers

CUSTOMER 360

TELECOMMUNICATIONS » IMPROVED SERVICE » PROCESS IMPROVEMENT » IT COST REDUCTION

Future


Hardware evolution:

Cloud

40Gbps, 100Gbps networks

GPU, TPU

Flash disk

Application-driven:

Machine learning, deep learning

Realtime data stream processing (IoT)

Future

How to scale by an order of magnitude in 5 years?

We are here today

In 10 years?


Taiwan Data Engineering Association


Taiwanese contributors to Apache


葉祐欣 謝良奇、蔡東邦 陳恩平

戴資力 莊偉赳 蔡嘉平

Apache Contributor talent development contest


Takeaway

If you only remember 3 things from this talk:

1. Data is the new oil

2. Open source is the standard

3. Think big! Remember GFS: failures are the norm rather than the exception!
