InfoSphere Streams

© 2013 IBM Corporation Nov 2013 InfoSphere Streams Tushar Kale Big Data Evangelist – Streams Architect [email protected]


Page 1: InfoSphere Streams


InfoSphere Streams

Tushar Kale, Big Data Evangelist – Streams Architect [email protected]

Page 2: InfoSphere Streams


Information Management

Agenda

Overview

Architecture

Customer Use Cases

Page 3: InfoSphere Streams


Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.

Big Data = Variety, Velocity, and Volume

Variety: Manage the complexity of multiple relational and non-relational data types and schemas

Velocity: Streaming data and large-volume data movement

Volume: Scale from terabytes to zettabytes

Page 4: InfoSphere Streams


InfoSphere Streams

• Millions of events per second
• Microsecond latency
• Traditional / non-traditional data sources
• Real-time delivery
• Powerful analytics

Example applications: algo trading, telco churn prediction, smart grid, cyber security, government / law enforcement, ICU monitoring, environment monitoring

A Platform to Run In-Motion Analytics on BIG Data

• Volume: Handles up to petabytes of data per day
• Variety: Supports traditional as well as non-traditional data (audio, video, etc.)
• Velocity: Delivers insights with microsecond latencies
• Complex analytics: Supports custom analytics written in C++/Java and warehouse analytic models
• Agility: A single instance can support multiple applications

Page 5: InfoSphere Streams


Stream Computing Illustrated

directory: "/img", filename: "farm"
directory: "/img", filename: "bird"
directory: "/opt", filename: "java"
directory: "/img", filename: "cat"

tuple

height: 640, width: 480, data:
height: 1280, width: 1024, data:
height: 640, width: 480, data:
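One way to read this illustration is as a stream of file-name tuples being filtered and transformed into a stream of image tuples. The Python sketch below mimics that flow with generators. It is only an analogue: the "/img" filter, the `load_image` helper and the `FAKE_DIMENSIONS` table are assumptions made for illustration, since the slide shows the tuples but not the logic connecting them.

```python
# Input tuples as shown on the slide.
FILES = [
    {"directory": "/img", "filename": "farm"},
    {"directory": "/img", "filename": "bird"},
    {"directory": "/opt", "filename": "java"},
    {"directory": "/img", "filename": "cat"},
]

# Hypothetical lookup standing in for decoding a real image file.
FAKE_DIMENSIONS = {"farm": (640, 480), "bird": (1280, 1024), "cat": (640, 480)}


def only_images(files):
    """Pass through tuples whose directory is /img (assumed filter condition)."""
    for t in files:
        if t["directory"] == "/img":
            yield t


def load_image(files):
    """Turn each file tuple into an image tuple with height, width and data."""
    for t in files:
        height, width = FAKE_DIMENSIONS[t["filename"]]
        yield {"height": height, "width": width, "data": b""}


# Tuples flow through the pipeline one at a time, as they arrive.
images = list(load_image(only_images(FILES)))
for image in images:
    print(image["height"], image["width"])
```

Each stage consumes tuples one at a time, so nothing has to be stored before processing starts.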

Page 6: InfoSphere Streams


What can Streams do for you?

Analyze and react to events as they are happening

Take advantage of more sources of data in “true” real time

Build models on your most up-to-the-second information that will help predict what happens next

Streams is a middleware and language for building and running analytic applications operating on data in motion
• Scale – easily handles anything from a few events per second to multiple millions of events per second
• Reaction time – possible to get actionable results in much less than a second (under 20 microseconds is possible)

Enables TRUE situational awareness

Page 7: InfoSphere Streams


BIG Data – Extending the Warehouse

[Diagram: InfoSphere Streams and InfoSphere BigInsights extending the Database & Warehouse]
• Data sources: traditional / relational and non-traditional / non-relational
• InfoSphere Streams: in-motion analytics, ultra-low-latency results
• InfoSphere BigInsights (Internet scale): data analytics, data operations & model building, results
• Database & Warehouse: at-rest data analytics, results

Page 8: InfoSphere Streams


Adaptive Analytics: Integrating Analytics on Data in Motion and Data at Rest

[Diagram: data and control flow between InfoSphere Streams and InfoSphere BigInsights, Database & Warehouse]
1. Data Ingest: data ingest, preparation, online analysis, model validation
2. Bootstrap/Enrich: data integration, data mining, machine learning, statistical modeling
3. Adaptive Analytics Model
• Visualization of real-time and historical insights

Page 9: InfoSphere Streams


Agenda

Overview

Architecture

Customer Use Cases

Page 10: InfoSphere Streams


What are key differentiating technical capabilities of Streams?

Performance and scaling:
• Operator fusing and threading
• Efficient use of cores
• Distributed execution
• Very fast data exchange

Language built for streaming applications:
• Reusable operators
• Rapid application development
• Continuous “pipeline” processing

Flexible and high-performance transport:
• Very low latency
• High data rates

Easy to extend:
• Built-in adaptors
• Users add capability with familiar C++ and Java

Use the data that gives you a competitive advantage:
• Can handle virtually any data type
• Use data that is too expensive and time sensitive for traditional approaches

Easy to manage:
• Automatic placement
• Extend applications incrementally without downtime
• Multi-user / multiple applications

Dynamic analysis:
• Programmatically change topology at runtime
• Create new subscriptions
• Create new port properties
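As a rough illustration of operator fusing: when operators run in the same process, a tuple can be handed to the next operator by a direct function call instead of crossing a transport. The sketch below is plain Python, not Streams code, and `fuse`, `parse` and `keep_positive` are hypothetical names chosen for the example.

```python
def fuse(*operators):
    """Compose operators so each tuple passes through them by function call.

    An operator returns the (possibly modified) tuple, or None to drop it.
    """
    def fused(t):
        for op in operators:
            t = op(t)
            if t is None:
                return None
        return t
    return fused


def parse(t):
    """Hypothetical operator: convert the raw string field to an integer."""
    return {**t, "value": int(t["raw"])}


def keep_positive(t):
    """Hypothetical operator: drop tuples whose value is not positive."""
    return t if t["value"] > 0 else None


pipeline = fuse(parse, keep_positive)
incoming = [{"raw": "3"}, {"raw": "-1"}, {"raw": "7"}]
results = [out for out in map(pipeline, incoming) if out is not None]
```

The two operators stay separately reusable, while the fused pipeline avoids any queue or serialization step between them.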

Page 11: InfoSphere Streams


InfoSphere Streams

• Streams Processing Language and IDE: Streams Studio, an Eclipse IDE for SPL
• Runtime Environment: highly scalable stream processing runtime
• Tools and Technology Integration: Streams Console & Monitoring, built-in stream relational analytics, adapters, toolkits

Supported on x86 hardware, Red Hat Enterprise Linux Version 5 (5.3 and up)

Front Office 3.0

Page 12: InfoSphere Streams


Terminology

Application
• Data flow graph of operator instances connected to each other via stream connections

Operator
• Reusable stream analytic
• Input ports: receive data / Output ports: produce data
• Source: no input ports / Sink: no output ports

Operator Instance
• A specific instantiation of an operator

Stream
• Continuous series of tuples, generated by an operator instance’s output port

Stream connection
• A stream connected to a specific operator instance input port

PE (Processing Element)
• A runtime process that executes a set of operator instances

Job
• An application instance running on a set of hosts

(stream<Type> A) as O1 = MySrc() {}
() as O2 = MySink(A) {}
() as O3 = MySink(A) {}

[Diagram: operator instance O1 (MySrc) produces stream A; two stream connections deliver A to operator instances O2 and O3 (both MySink)]
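The terminology can be made concrete with a small Python analogue of the three-line SPL fragment on this slide. This is a sketch under assumptions, not how the Streams runtime is implemented: the `Stream` class, its `connect` and `submit` methods, and the `{"seq": i}` tuple format are all invented for illustration.

```python
class Stream:
    """A stream: tuples from one output port, delivered to every connection."""

    def __init__(self, name):
        self.name = name
        self.connections = []

    def connect(self, input_port):
        # A stream connection: this stream attached to one input port.
        self.connections.append(input_port)

    def submit(self, tup):
        for input_port in self.connections:
            input_port(tup)


class MySink:
    """Sink operator: one input port, no output ports."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def process(self, tup):
        self.received.append(tup)


class MySrc:
    """Source operator: no input ports, one output port."""

    def __init__(self, name, out):
        self.name = name
        self.out = out

    def run(self, count):
        for i in range(count):
            self.out.submit({"seq": i})


# Three operator instances (O1, O2, O3) of two operators (MySrc, MySink).
a = Stream("A")
o1 = MySrc("O1", a)
o2 = MySink("O2")
o3 = MySink("O3")
a.connect(o2.process)
a.connect(o3.process)
o1.run(3)
```

Both sinks see every tuple of stream A, which matches the slide's picture of one stream with two stream connections.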

Page 13: InfoSphere Streams


InfoSphere Streams Programming Model

Application Programming (SPL)

Source Adapters, Sink Adapters, Operator Repository

Platform optimized compilation

Page 14: InfoSphere Streams


The Join operator is used for correlating two streams

The Functor operator is used for performing tuple-level manipulations

The Aggregate operator is used for grouping and summarization of incoming tuples

The Punctor operator is used for inserting punctuation marks in streams

The Sort operator is used for imposing an order on incoming tuples in a stream

The Barrier operator is used as a synchronization point

The Delay operator is used to “artificially” slow down a stream

The Split operator is used for dividing incoming tuples into separate streams for parallel processing

And more!

Streams Core Analytical Capabilities: Streams Built-in Relational and Utility Operators
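To give a feel for what these operators do, here is a minimal Python sketch of three of them (Functor, Aggregate and Split) written as generator-style functions. These are loose analogues, not the SPL operators themselves: the function signatures, the count-based tumbling window and the sample trade tuples are assumptions for illustration.

```python
from collections import defaultdict
from itertools import islice


def functor(stream, transform):
    """Tuple-level manipulation: apply `transform` to every tuple."""
    for t in stream:
        yield transform(t)


def aggregate(stream, window_size, key, value):
    """Sum `value` per `key` over tumbling windows of `window_size` tuples."""
    it = iter(stream)
    while True:
        window = list(islice(it, window_size))
        if not window:
            return
        sums = defaultdict(float)
        for t in window:
            sums[t[key]] += t[value]
        yield dict(sums)


def split(stream, n_outputs, route):
    """Divide tuples into `n_outputs` lists according to `route(tuple)`."""
    outputs = [[] for _ in range(n_outputs)]
    for t in stream:
        outputs[route(t) % n_outputs].append(t)
    return outputs


trades = [
    {"symbol": "IBM", "price": 10.0},
    {"symbol": "XYZ", "price": 5.0},
    {"symbol": "IBM", "price": 20.0},
    {"symbol": "XYZ", "price": 1.0},
]
doubled = functor(trades, lambda t: {**t, "price": t["price"] * 2})
totals = list(aggregate(doubled, window_size=2, key="symbol", value="price"))
by_symbol = split(trades, 2, route=lambda t: 0 if t["symbol"] == "IBM" else 1)
```

Because each function takes a stream and returns a stream, they chain in the same "pipeline" fashion the deck describes.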

Page 15: InfoSphere Streams


The ODBCSource operator is used for reading data from databases, such as DB2, IDS, Oracle

The ODBCAppend operator is used for writing data to databases, such as DB2, IDS, Oracle

The ODBCEnrich operator is used for extending streaming data based on lookups performed from database tables

The solidDBEnrich operator is used for extending streaming data based on lookups performed from in-memory database tables

The FileSource operator is used for reading data from files in formats such as csv, line, or binary

The FileSink operator is used for writing data to files in formats such as csv, line, or binary

The TCPSink / UDPSink operators are used for writing data to sockets in formats such as csv, line, or binary

The TCPSource / UDPSource operators are used for reading data from sockets in formats such as csv, line, or binary

Streams Core Adapter Capabilities: Streams Built-in Adapters and DB Toolkit
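The enrichment idea (extending streaming tuples with values looked up in a database table) can be sketched in Python with the standard-library `sqlite3` module and an in-memory table. This is an analogue only: it does not use ODBC or solidDB, and the `customers` table, its columns and the sample tuples are hypothetical.

```python
import sqlite3

# In-memory reference table (hypothetical schema and rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])


def enrich(stream, connection):
    """Extend each streaming tuple with a column looked up in the table."""
    for t in stream:
        row = connection.execute(
            "SELECT region FROM customers WHERE id = ?", (t["customer_id"],)
        ).fetchone()
        # Unknown customers get region None rather than being dropped.
        yield {**t, "region": row[0] if row else None}


calls = [{"customer_id": 1, "minutes": 3}, {"customer_id": 9, "minutes": 1}]
enriched = list(enrich(calls, db))
```

Whether to drop, pass through or flag tuples with no match is a design choice; this sketch passes them through with an empty value.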

Page 16: InfoSphere Streams


Extensibility

User-defined operators that extend the language
– A reusable, generic operator model
  • Written in general-purpose programming languages (C++/Java)

User-defined functions that extend the language

Toolkits: sets of domain-specific operators/functions
– Toolkits available as part of Streams
  • DB toolkit
  • Data mining toolkit
  • Financial toolkit
– Streams Exchange on developerWorks
  • Reusable assets and forum

Developers in two categories
– Application developers
– Toolkit developers

Page 17: InfoSphere Streams


Static vs. Dynamic Composition

Static connections
– Fully specified at application development time and do not change at run time

Dynamic connections
– Partially specified at application development time (name or properties)
– Established at run time, as new jobs come and go
  • Specifications can also be updated at run time

Dynamic application composition
– Incremental deployment of applications
– Dynamic adaptation of applications
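Dynamic connections can be pictured as a matching service: one job exports a stream with properties, another subscribes by naming the properties it needs, and connections appear or disappear as jobs come and go. The Python sketch below illustrates that matching logic only. The `Registry` class, its methods and the sample names are hypothetical and do not reflect the actual Streams mechanism or API.

```python
class Registry:
    """Matches exported streams to subscriptions as jobs come and go."""

    def __init__(self):
        self.exports = {}        # stream name -> properties
        self.subscriptions = {}  # subscriber name -> required properties

    def export(self, stream, properties):
        self.exports[stream] = dict(properties)

    def cancel(self, stream):
        self.exports.pop(stream, None)

    def subscribe(self, subscriber, required):
        # Subscribing again under the same name updates the specification.
        self.subscriptions[subscriber] = dict(required)

    def connections(self):
        """Return the (stream, subscriber) pairs that currently match."""
        return sorted(
            (stream, subscriber)
            for stream, props in self.exports.items()
            for subscriber, required in self.subscriptions.items()
            if all(props.get(k) == v for k, v in required.items())
        )


registry = Registry()
registry.subscribe("alerts_job", {"kind": "trades"})
before = registry.connections()        # no exporting job is running yet
registry.export("nyse_feed", {"kind": "trades", "venue": "NYSE"})
after_export = registry.connections()  # connection established at run time
registry.subscribe("alerts_job", {"kind": "quotes"})
after_update = registry.connections()  # updated specification no longer matches
```

Neither job names the other, which is what allows applications to be deployed incrementally.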


Page 19: InfoSphere Streams


InfoSphere Streams Runtime Architecture

[Diagram: InfoSphere Streams Runtime running on a cluster – 125 blades]
Labels in the diagram:
• Components running on management hosts
• Components running on application hosts
• Running anywhere inside the cluster: streamtool
• Streams Application Manager, Streams Resource Manager, Scheduler, Streams Web Service, Authorization and Authentication Service
• Name Service, Root Service, Partition Service
• Host Controller, Processing Element Container, Agent
• Subset of an SPL application (a collection of operators)
• Language/Optimizing Compiler, Management APIs, Admin Config / Console, Eclipse IDE and Management Tools

Page 20: InfoSphere Streams


Streams is a distributed, multi-user, multi-instance system
• Multiple instances can run at the same time
• Can run jobs from multiple users
• A security model is provided for authentication and authorization

Application management
• New jobs can be added/removed at any time
• New and existing jobs can connect to each other
• Scheduler assigns PEs to hosts based on load

Resource management
• Hosts & services configuration and state
• System & application metrics

Failure semantics
• Recovery of management services state
• PEs can be restarted or relocated upon failure
• All connections will be re-established once a PE restarts
  – All state and in-transit tuples are lost
  – Checkpointing can be used to restore operator state

InfoSphere Streams Runtime
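Load-based placement and relocation after a failure can be sketched as follows. This is a deliberately simple least-loaded policy in Python, assumed for illustration: the deck does not describe the scheduler's actual algorithm, and the `place` and `relocate` functions, the one-unit-per-PE load model and the host names are hypothetical.

```python
def place(pes, hosts, load):
    """Assign each PE to the least-loaded host (ties broken by host name)."""
    placement = {}
    for pe in pes:
        host = min(hosts, key=lambda h: (load[h], h))
        placement[pe] = host
        load[host] += 1  # assume every PE adds one unit of load
    return placement


def relocate(placement, failed_host, hosts, load):
    """Move the PEs of a failed host onto the remaining hosts."""
    survivors = [h for h in hosts if h != failed_host]
    orphans = sorted(pe for pe, h in placement.items() if h == failed_host)
    placement.update(place(orphans, survivors, load))
    return placement


hosts = ["host1", "host2", "host3"]
load = {h: 0 for h in hosts}
placement = place(["pe1", "pe2", "pe3", "pe4"], hosts, load)
placement = relocate(placement, "host1", hosts, load)
```

Relocation here only decides where a PE runs next. As the failure semantics above note, operator state and in-transit tuples are lost unless checkpointing is used.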


Page 25: InfoSphere Streams


[Diagram: five x86 hosts]

Runs on commodity hardware
• From single node to blade centers to high-performance multi-rack clusters

Adapts to changes:
• In workloads
• In resources

InfoSphere Streams Runtime – cont’d

Page 26: InfoSphere Streams


Streams Studio Eclipse IDE

Page 27: InfoSphere Streams


Streams Console – Metrics

Page 28: InfoSphere Streams


Agenda

Overview

Architecture

Customer Use Cases

Page 29: InfoSphere Streams


Streaming Analytics in Action

Stock Market
• Impact of weather on securities prices
• Analyze market data at ultra-low latencies

Fraud Prevention
• Detecting multi-party fraud
• Real-time fraud prevention

e-Science
• Space weather prediction
• Detection of transient events
• Synchrotron atomic research

Transportation
• Intelligent traffic management

Manufacturing
• Process control for microchip fabrication

Natural Systems
• Wildfire management
• Water management

Telephony
• CDR processing
• Social analysis
• Churn prediction
• Geomapping

Law Enforcement, Defense & Cyber Security
• Real-time multimodal surveillance
• Situational awareness
• Cyber security detection

Health & Life Sciences
• Neonatal ICU monitoring
• Epidemic early warning system
• Remote healthcare monitoring

Other
• Smart Grid
• Text analysis
• Who’s talking to whom?
• ERP for commodities
• FPGA acceleration

Page 30: InfoSphere Streams


Smarter Faster Cheaper CDR Processing

InfoSphere Streams xDR Hub

Key Requirements: Price/Performance and Scaling

• 6 billion CDRs per day
• Deduplication over a 7-day window
• Processing latency reduced from 12 hours to a few seconds
• 6 machines (using ½ processor capacity)
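For scale, 6 billion CDRs per day averages roughly 69,000 records per second. Deduplication over a 7-day window can be sketched in Python as below. This is an illustrative approach, not the customer's implementation: the `Deduplicator` class, the string record keys and the assumption that timestamps arrive in non-decreasing order are all introduced for the example.

```python
from collections import OrderedDict

WINDOW_SECONDS = 7 * 24 * 3600  # the 7-day deduplication window


class Deduplicator:
    """Drops CDRs whose key was already seen within the window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = OrderedDict()  # key -> time of first sighting, oldest first

    def is_new(self, key, timestamp):
        # Evict keys that have aged out (assumes non-decreasing timestamps).
        while self.seen:
            oldest_key, oldest_ts = next(iter(self.seen.items()))
            if timestamp - oldest_ts < self.window:
                break
            self.seen.popitem(last=False)
        if key in self.seen:
            return False
        self.seen[key] = timestamp
        return True


dedup = Deduplicator()
day = 24 * 3600
results = [
    dedup.is_new("cdr-1", 0),
    dedup.is_new("cdr-1", 3 * day),  # duplicate inside the window
    dedup.is_new("cdr-1", 8 * day),  # the first sighting has expired
]

# Average arrival rate implied by 6 billion CDRs per day.
per_second = 6_000_000_000 / (24 * 3600)
```

At this rate a week of keys runs into the tens of billions, so a production system would need a far more compact structure than a per-key dictionary; the sketch only shows the windowing logic.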

Page 31: InfoSphere Streams


Telco: Beyond CDR processing, building on existing insight

[Diagram: InfoSphere Streams at the center, connected to data sources, analytics modules, and Database & Warehouse]
• Sources: mobile network, customer interactions, weather, social media
• Analytics: call quality, network, campaign, location, call data, audio, churn, social, and further analytics
• Business rules

Page 32: InfoSphere Streams


Use scenario
• State-of-the-art covert surveillance system based on the Streams platform
• Acoustic signals from buried fiber-optic cables are monitored, analyzed and reported in real time for necessary action
• Currently designed to scale up to 1600 streams of raw binary data

Requirement
• Real-time processing of multi-modal signals (acoustics, video, etc.)
• Easy to expand, dynamic
• 3.5M data elements per second

Winner: 2010 IBM CTO Innovation Award

Surveillance and Physical Security: TerraEchos (Business Partner)

Page 33: InfoSphere Streams


Cyber Security Analytics

[Diagram: InfoSphere Streams running as a set of Processing Element Containers]
• Inputs: live packet capture; DNS / DHCP / Netflow sources; external C&C feeds (live DB queries)
• Analytics: botnet behavior modeling
• Outputs: botnet nodes / malware, with IP/MAC identifying suspects
• Connected systems: IT I/S Firewalls; remediation infrastructure / ticketing

Page 34: InfoSphere Streams


University of Ontario Institute of Technology (UOIT) and Sick Kids Hospital

IBM Data Baby: http://youtu.be/ZiqY7p1v950

Page 35: InfoSphere Streams


Intelligent Transportation

Multimodal data streams
• GPS
• Counts, speeds, travel times
• Public transport
• Pollution measurements
• Weather conditions

Only 4 x86 blade servers to process 250,000 GPS probes per second

[Diagram: GPS data flows into Streams, then through storage adapters to a data warehouse, and to a web server and Google Earth]
• In Streams: real-time transformation logic, real-time geo mapping, real-time speed & heading estimation, real-time aggregates & statistics
• Results: archiving of cleansed data, real-time traffic monitoring, real-time traffic information, (multimodal) travel planner
• Downstream: offline statistical analysis, interactive visualization
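The speed and heading estimation step can be illustrated with standard great-circle formulas applied to two consecutive GPS probes from the same vehicle. The Python sketch below is an assumption about how such a step might look, not the deployed logic: the `speed_and_heading` function, the probe layout (latitude, longitude, time) and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def speed_and_heading(p1, p2):
    """Estimate speed (m/s) and initial heading (degrees clockwise from north).

    Each probe is a (latitude_deg, longitude_deg, unix_time_s) triple.
    """
    lat1, lon1, t1 = p1
    lat2, lon2, t2 = p2
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)

    # Haversine great-circle distance between the two probes.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    distance = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    # Initial bearing, normalized to the range [0, 360).
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    heading = math.degrees(math.atan2(y, x)) % 360

    elapsed = t2 - t1
    speed = distance / elapsed if elapsed > 0 else 0.0
    return speed, heading


# Two probes 10 seconds apart, the second 0.001 degrees further north.
speed, heading = speed_and_heading((59.0, 18.0, 0), (59.001, 18.0, 10))
```

Spread evenly, 250,000 probes per second across 4 blades is about 62,500 probes per second per blade, so a per-probe computation this small is plausible at that rate.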

Page 36: InfoSphere Streams


THINK


Page 37: InfoSphere Streams


Questions?
