IBM InfoSphere Streams Technical Overview Julie Garrone [email protected] Christophe Menichetti...

70
IBM InfoSphere Streams Technical Overview Julie Garrone [email protected] Christophe Menichetti [email protected] © 2013 IBM Corporation June 7, 2022

Transcript of IBM InfoSphere Streams Technical Overview Julie Garrone [email protected] Christophe Menichetti...

Page 1: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

IBM InfoSphere StreamsTechnical Overview

Julie [email protected]

Christophe [email protected]

© 2013 IBM CorporationApril 10, 2023

Page 2: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Page 3: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Technical overview

Streams Processing Language

Use case

Technical overview

Streams Processing Language

Page 4: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

IBM InfoSphere Streams v3.0

A platform for real-time analytics on BIG data Volume

– Terabytes per second– Petabytes per day

Variety– All kinds of data– All kinds of analytics

Velocity– Insights in microseconds

Agility– Dynamically responsive – Rapid application development

© 2013 IBM Corporation4

Millions of events per

second

Microsecond Latency

Sensor, video, audio, text, and relational data sources

Just-in-time decisions

PowerfulAnalytics

Page 5: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Analyzes All Kinds of Data

© 2013 IBM Corporation5

Mining in Microseconds(included with Streams)Mining in Microseconds(included with Streams)

Image & Video (Open Source)

Simple & Advanced Text (included with Streams)Simple & Advanced Text (included with Streams)

Text(listen, verb),

(radio, noun)

Acoustic(IBM Research)(Open Source)

Acoustic(IBM Research)(Open Source)

Geospatial(included with Streams)

Geospatial(included with Streams)

Predictive(included with Streams)

Predictive(included with Streams)

Advanced Mathematical Models(IBM Research)

Advanced Mathematical Models(IBM Research)

Statistics(included with Streams)

Statistics(included with Streams)

population

tt asR ),(

***New***New

***New***New

***New***New

Page 6: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Categories of Problems Solved by Streams

Applications that require on-the-fly processing, filtering and analysis of streaming data– Sensors: environmental, industrial, surveillance video, GPS, …– “Data exhaust”: network/system/web server/app server log files– High-rate transaction data: financial transactions, call detail records

Criteria: two or more of the following– Messages are processed in isolation or in limited data windows– Sources include non-traditional data (spatial, imagery, text, …)– Sources vary in connection methods, data rates, and processing requirements,

presenting integration challenges– Data rates/volumes require the resources of multiple processing nodes– Analysis and response are needed with sub-millisecond latency– Data rates and volumes are too great for store-and-mine approaches

© 2013 IBM Corporation6

Page 7: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Traditional / Relational

Data Sources

Traditional / Relational

Data Sources

The Big Data Ecosystem: Interoperability is Key

© 2013 IBM Corporation7

StreamingData

TraditionalWarehouse

Analytics onData at Rest

DataWarehouse

Analytics on Structured Data

RTAP: Analytics onData in Motion

BigInsightsBigInsightsNon-Traditional /Non-RelationalData Sources

Non-Traditional /Non-RelationalData Sources

Non-Traditional / Non-Relational

Data Feeds

Non-Traditional / Non-Relational

Data Feeds

Traditional / Relational

Data Sources

Traditional / Relational

Data Sources

Internet-Scale

Data Sets

StreamsStreams

Page 8: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

EuroBuzz with France Televisions EuroBuzz with France Televisions Smarter Analytics applied to the Eurovision Song ContestSmarter Analytics applied to the Eurovision Song Contest

8

Tweets on Eurovision event

Buzz Statistics

Tweet format{"entities":{"hashtags":[{"text":“eurovision","text":“Sweeden

will win Eurovision for sure #eurovision”….

JSON Format

Data-visualisation Partner

Gauge by country

Latest tweets on Eurovision

Pictures of Eurovision/Twitter top contributor

s

Streams Overall

statistics(World & France)

Twitter Streaming API

Tweets filtered by Eurovision specific hashtags:#eurovision#eurofrancetv…

Real-time analysis

•Filters, Aggregations, Calculations on incoming tweets•2 different Streams based on hashtags

• World stream collecting tweets from all eurovision hastags

• France stream collecting tweets from the France Televisions hastag #eurofrancetv

•Analyse of 2 different languages: English and French

24 hours Data Aggregation

France 3 website

Live Eurovision

Show

Page 9: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

All blades with RHEL6.1.64

SAN

SAN

DS8700

HS22 8 cores 2.93 GHz

48GB RAMMaster Node

HS22 8 cores 2.93 GHz

48GB RAMSecondary Node

Data-visualisation Partner

CPU Total HS22 29/05/2012

0

10

20

30

40

50

60

70

80

90

100

20:3

7

20:4

0

20:4

3

20:4

6

20:4

9

20:5

2

20:5

5

20:5

8

21:0

1

21:0

4

21:0

7

21:1

0

21:1

3

21:1

6

21:1

9

21:2

2

21:2

5

21:2

8

21:3

1

21:3

4

21:3

7

21:4

0

21:4

3

21:4

6

21:4

9

21:5

2

21:5

5

21:5

8

22:0

1

22:0

4

22:0

7

22:1

0

22:1

3

22:1

6

22:1

9

22:2

2

22:2

5

22:2

8

22:3

1

22:3

4

22:3

7

22:4

0

22:4

3

22:4

6

22:4

9

22:5

2

22:5

5

22:5

8

23:0

1

23:0

4

23:0

7

23:1

0

23:1

3

23:1

6

23:1

9

User% Sys% Wait%

Max 100%

CPU Total HS22 29/05/2012

0

1

2

3

4

5

6

7

8

9

10

20:3

7

20:4

0

20:4

3

20:4

6

20:4

9

20:5

2

20:5

5

20:5

8

21:0

1

21:0

4

21:0

7

21:1

0

21:1

3

21:1

6

21:1

9

21:2

2

21:2

5

21:2

8

21:3

1

21:3

4

21:3

7

21:4

0

21:4

3

21:4

6

21:4

9

21:5

2

21:5

5

21:5

8

22:0

1

22:0

4

22:0

7

22:1

0

22:1

3

22:1

6

22:1

9

22:2

2

22:2

5

22:2

8

22:3

1

22:3

4

22:3

7

22:4

0

22:4

3

22:4

6

22:4

9

22:5

2

22:5

5

22:5

8

23:0

1

23:0

4

23:0

7

23:1

0

23:1

3

23:1

6

23:1

9

User% Sys% Wait%

Network I/O HS22 (KB/s) - 29/05/2012

-400

-300

-200

-100

0

100

200

300

400

500

600

20:3

7

20:4

0

20:4

3

20:4

6

20:4

9

20:5

2

20:5

5

20:5

8

21:0

1

21:0

4

21:0

7

21:1

0

21:1

3

21:1

6

21:1

9

21:2

2

21:2

5

21:2

8

21:3

1

21:3

4

21:3

7

21:4

0

21:4

3

21:4

6

21:4

9

21:5

2

21:5

5

21:5

8

22:0

1

22:0

4

22:0

7

22:1

0

22:1

3

22:1

6

22:1

9

22:2

2

22:2

5

22:2

8

22:3

1

22:3

4

22:3

7

22:4

0

22:4

3

22:4

6

22:4

9

22:5

2

22:5

5

22:5

8

23:0

1

23:0

4

23:0

7

23:1

0

23:1

3

23:1

6

23:1

9

Total-Read Total-Write (-ve)

EuroBuzz with France Televisions EuroBuzz with France Televisions Smarter Analytics applied to the Eurovision Song ContestSmarter Analytics applied to the Eurovision Song Contest

Network peak at the beginning of

each song

Very low CPU utilization : < 1%

-> InfoSphere Streams Efficiency

Max 10%

Infrastructure for the PoC

•2 Blades in Active/Active Mode (Nehalem Sockets) •SAN Storage to store all data during the PoC (All Tweets)

Page 10: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Use case– Neonatal infant monitoring– Predict infection in ICU 24 hours in advance

Solutions– 120 children monitored :120K msg/sec, billion msg/day– Trials expanding to include hospitals in US and China

University of Ontario Institute of Technology

SensorNetwork

Stream-based Distributed Interoperable Health care Infrastructure

Solutions (Applications)

Event Pre-processer

Analysis Framework

Page 11: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

11

Asian Telco reduces billing costs and improves customer satisfaction

Asian Telco reduces billing costs and improves customer satisfaction

Capabilities:

Stream ComputingAnalytic Accelerators

Real-time mediation and analysis of

5B CDRs per day

Data processing time reduced from 12 hrs to 1 min

Hardware cost reduced to 1/8th

Proactively address issues (e.g. dropped calls) impacting customer satisfaction.

11

Page 12: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Page 13: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

continuous ingestion Continuous ingestion Continuous analysis

How Streams Works

© 2013 IBM Corporation13

Page 14: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Achieve scale:By partitioning applications into software componentsBy distributing across stream-connected hardware hosts

Infrastructure provides services forScheduling analytics across hardware hosts, Establishing streaming connectivity

TransformTransformFilter / SampleFilter / Sample

ClassifyClassifyCorrelateCorrelate

AnnotateAnnotate

Where appropriate: Elements can be fused togetherfor lower communication latency

Continuous ingestion Continuous analysis

How Streams Works

© 2013 IBM Corporation14

Page 15: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

InfoSphere Streams Runtime

Distributed runtime, running on a cluster– No limit on cluster size

Import/export streams between jobs– Agility: dynamically reconfigure applications

Performance and health metrics – down to operator level

Jobs submitted and canceled at any time

Hosts added to and removed from cluster at any time

High Availability – Restart and relocate portions of applications– Assign parts of applications to host pools based on host tags

Application portability – Simplifies test and operations

© 2013 IBM Corporation15

Page 16: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

From Operators to Running Jobs

Streams application graph:– A directed, possibly cyclic, graph– A collection of operators– Connected by streams

Each complete application is a potentially deployable job

Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance)

An instance can include a single processing node (hardware)

Or multiple processing nodes

© 2013 IBM Corporation16

Streams instance

OP

OP

Src

Src

Sink

Sink

OP

stream

h/w node

node nodenode

nodenode node

node

Page 17: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

InfoSphere Streams Objects: Runtime View

Instance– Runtime instantiation of InfoSphere

Streams executing across one or more hosts

– Collection of components and services Processing Element (PE)

– Fundamental execution unit that is run by the Streams instance

– Can encapsulate a single operator or many “fused” operators

Job– A deployed Streams application

executing in an instance – Consists of one or more PEs

© 2013 IBM Corporation17

InstanceInstance

JobJob

NodeNode

Stream 1 PEPE PEPE

NodeNode

PEPE

Stream 1

Stream 2

Stream 3

Stream 3

Stream 4

Stream 5

operator

Page 18: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

InfoSphere Streams Objects: Development View Operator

– The fundamental building block of the Streams Processing Language

– Operators process data from streams and may produce new streams

Stream– An infinite sequence of structured tuples– Can be consumed by operators on a tuple-

by-tuple basis or through the definition of a window

Tuple– A structured list of attributes and their types.

Each tuple on a stream has the form dictated by its stream type

Stream type– Specification of the name and data type of

each attribute in the tuple Window

– A finite, sequential group of tuples– Based on count, time, attribute value,

or punctuation marks

© 2013 IBM Corporation18

directory: "/img"filename: "farm"

directory: "/img"filename: "bird"

directory: "/opt"filename: "java"

directory: "/img"filename: "cat"

Streams Application

stream

tuple

height: 640width: 480data:

height: 1280width: 1024data:

height: 640width: 480data:

operator

Page 19: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Page 20: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

What is Streams Processing Language?

Designed for stream computing– Define a streaming-data flow graph– Rich set of data types to define tuple attributes

Declarative– Operator invocations name the input and output streams– Referring to streams by name is enough to connect the graph

Procedural support– Full-featured imperative language– Custom logic in operator invocations– Expressions in attribute assignments and parameter definitions

Extensible– User-defined data types– Custom functions written in SPL or a native language (C++ or Java)– Custom operators written in SPL– User-defined operators written in a native language (C++ or Java)

© 2013 IBM Corporation21

Page 21: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Some SPL Terms

An operator represents a class of manipulations – of tuples from one or more input streams– to produce tuples on one or more output streams

A stream connects to an operator on a port – an operator defines input and output ports

An operator invocation – is a specific use of an operator – with specific assigned input and output streams– with locally specified parameters, logic, etc.

Many operators have one input port and one output port; others have – zero input ports: source adapters, e.g., TCPSource– zero output ports: sink adapters, e.g., FileSink– multiple output ports, e.g., Split– multiple input ports, e.g., Join

A composite operator is a collection of operators– An encapsulation of a subgraph of

• Primitive operators (non-composite)• Composite operators (nested)

– Similar to a macro in a procedural language

© 2013 IBM Corporation22

Aggregate

Employee Info

Aggregate

port

Salary Statistics

port

TCPSource

FileSink

compositeoperator

Split

Join

Page 22: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Composite Operators

Every graph is encoded as a composite– A composite is a graph of one or more operators– A composite may have input and output ports– Source code construct only

• Nothing to do with operator fusion (PEs) Each stream declaration in the composite

– Invokes a primitive operator or– another composite operator

An application is a main composite– No input or output ports– Data flows in and out but not on

streams within a graph– Streams may be exported to and

imported from other applicationsrunning in the same instance

© 2013 IBM Corporation2323

composite Main { graph stream … { } stream … { } . . . }

Application (logical view)Application (logical view)

Stream 1

Stream 1

Stream 2

Stream 3

Stream 3

Stream 4

Stream 5

operator

Page 23: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Composing a Flow Graph with Stream Definitions

© 2013 IBM Corporation24

composite POS_TxHandling{ graph stream<…> POS_Transactions = TCPSource() {…} stream<…> Sales = Operator1(POS_Transactions) {…} stream<…> TaxableSales = Operator2(Sales) {…} stream<…> TaxesDue = Operator3(TaxableSales) {…} () as Sink1 = TCPSink(TaxesDue) {…} stream<…> Deliveries = TCPSource() {…} stream<…> Inventory = Operator4(Sales;Deliveries) {…} stream<…> Reorders = Operator5(Inventory) {…} () as Sink2 = TCPSink(Reorders) {…}}

Composite POS_TxHandlingTaxableSalesOperator

2

TaxesDueOperator3

TCP-Sink

Deliveries Inventory ReordersOperator5

Sales

TCP-Sink

TCP-Source

Operator4

SalesOperator

1TCP-

Source

POS_Transactions

Page 24: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Page 25: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation26

Example description

Goal– Leverage information about the use of the OpenVPN infrastructure by on-line

students

Syslog server

REPORTS WITH COGNOSAND REAL-TIME REPORTS

WITH COGNOS RTM

DATAWAREHOUSE

INFOSPHERE STREAMSUsers

OpenVPN

DATAWAREHOUSEDATAWAREHOUSE

INFOSPHERE STREAMS

DATAWAREHOUSE

INFOSPHERE STREAMS

DATAWAREHOUSE

Page 26: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation27

Global diagram

Different composite for each step of the program– File source– Parsing– Aggregation– Enrichment with database– Database loading– HDFS loading

Transfers between composites with Export / Import operators

Page 27: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

FileSource Read data from files in formats such as csv, line, or binary

© 2013 IBM Corporation28

File source composite

Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started."Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Page 28: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

FileSource Read data from files in formats such as csv, line, or binary

© 2013 IBM Corporation29

Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.

File source composite

Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Page 29: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Throttle Make a stream flow at a specified rate

© 2013 IBM Corporation30

File source composite

Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…

Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.

Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …

Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …

Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Page 30: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation31

Parsing composite

Two different parsing methods

– TextExtract Run AQL queries (module or TAM) over text documents– Functor Perform tuple-level manipulations used with Tokenizer

Other operators to filter data

– Filter Remove some tuples from a stream– Split Forward tuples to output streams based on a predicate

Page 31: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation32

ContentContentTagHostNameTimestamp

Parsing composite

TextExtract

– Use AQL text analysis to extract fields of the syslog

Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started."Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Page 32: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation33

Split the flow to eliminate incomplete lines

Transfer only content part with Functor operator

Parsing composite

Filter only tuples coming from openvpn

ContentContentTagHostNameTimestamp

Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started."Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…

Page 33: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation34

Parsing composite

Functor with tokenizer– Break a string based on a delimiter character

– Structure of a sample openvpn content;<userId>;<virtualIP>;<cnxServer>;<dateTime>;<status>;<message>;<IPsource>;S:<bytesSend>;R:<

bytesReceived>

– <status> can be “Connected”, “Ended” or “Rejected”

Split the flow to eliminate “Rejected” connections

Page 34: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation35

Filter pairs that correspond to Connected / Ended

Aggregation composite

Aggregation to group logs which describe the same connection

– ;Odi_platform_demonstration_49165_1;10.30.20.253;openvpn6;Sun Jun 9 04:59:53 CEST 2013;Connected;Connection accepted;70.165.201.57;

– ;Odi_platform_demonstration_49165_1;10.30.20.253;openvpn6;Sun Jun 9 06:46:39 CEST 2013;Ended;Connection terminated, temps cnx 6412 sec;70.165.201.57;S:78203;R:370963;

Page 35: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation36

Split the flow according to the userID– “stud[0-9]*” which are the students’ connexion we want to analyze– “inst[0-9]*”– other

Aggregation composite

Functor is used to create a unique output tuple for each ended connection

Page 36: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation37

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263

Aggregation composite

Partial results

Page 37: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation38

Join Correlate two streams according to the userID and the DateStart and DateEnd of the course

Enrichment composite

ODBCSource Read data from a SQL database– Extract data about the course attended by students

Page 38: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands

stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands

stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands

© 2013 IBM Corporation39 39

Enrichment composite

Final results

Page 39: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Database loading composite

ODBCAppend Insert rows into an SQL database table

INFOSPHERE STREAMS DATAWAREHOUSE

Page 40: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

HDFS loading composant

HDFSFileSink Write data to files in Hadoop File System

Page 41: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Agenda

Infosphere Streams general description

Technical overview

Streams Processing Language

Use case

Demo

Page 42: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Toolkits and Accelerators to Speed Up Development

© 2013 IBM Corporation43

Standard ToolkitRelational Operators

Filter Sort Functor JoinPunctor Aggregate

Adapter OperatorsFileSource UDPSourceFileSink UDPSinkDirectoryScan Export TCPSource ImportTCPSink MetricsSink

Utility OperatorsCustom SplitBeacon DeDuplicateThrottle Union Delay ThreadedSplitBarrier DynamicFilterPair GateJavaOp SwitchParse FormatDecompress CharacterTransform

XML OperatorXMLParse

IBM Supported ToolkitsDatabase DataStageBig Data Data ExplorerMessaging InternetText Analytics MiningSPSS CEPTime Series GeospatialFinancial

Open-Source ToolkitsJSON HTTP/RESTOpenCV AccumuloHBase …

Big Data AcceleratorsTelco Event Data AnalyticsSocial Data Analytics

User-Defined ToolkitsExtend the language by adding user-defined operators, types,and functions

Page 43: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

© 2013 IBM Corporation44

Page 44: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

RESOURCE LINKS

Page 45: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Who Can Help?

The Big Data Tiger Team– Evangelists for your region and/or industry

The Big Data Black Belts– Skilled TCPs for your region

Product Management– Roger Rea <[email protected]>

Product Marketing– Asha Marsh <[email protected]>

Worldwide Sales– Stewart Hanna <[email protected]>

Worldwide Technical Sales– Robert Uleman <[email protected]>

Worldwide Software Services– Sandra Tucker <[email protected]>

Customer Enablement– Kevin Foster <[email protected]>

© 2013 IBM Corporation46

Page 46: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Where to get Streams

Software available via– Passport Advantage site

– PartnerWorld

– Academic Initiative

– IBM Internal

– 90-day trial code• IBM Trials and Demos portal• ibm.com -> Support & downloads

->Download -> Trials and demos

© 2013 IBM Corporation47

Page 47: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

More Streams links

Streams in the Information Management Acceleration Zone

Streams in the Media Library (recorded sessions)– Search on “streams”

The Streams Sales Kit– Software Sellers Workplace

(the old eXtreme Leverage)

– PartnerWorld

InfoSphere Streams Wiki on developerWorks– Discussion forum, Streams Exchange

© 2013 IBM Corporation48

Page 48: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

SPL DOC

Page 49: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Some SPL Terms

An operator represents a class of manipulations – of tuples from one or more input streams– to produce tuples on one or more output streams

A stream connects to an operator on a port – an operator defines input and output ports

An operator invocation – is a specific use of an operator – with specific assigned input and output streams– with locally specified parameters, logic, etc.

Many operators have one input port and one output port; others have – zero input ports: source adapters, e.g., TCPSource– zero output ports: sink adapters, e.g., FileSink– multiple output ports, e.g., Split– multiple input ports, e.g., Join

A composite operator is a collection of operators– An encapsulation of a subgraph of

• Primitive operators (non-composite)• Composite operators (nested)

– Similar to a macro in a procedural language

© 2013 IBM Corporation50

Aggregate

Employee Info

Aggregate

port

Salary Statistics

port

TCPSource

FileSink

compositeoperator

Split

Join

Page 50: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Composite Operators

Every graph is encoded as a composite– A composite is a graph of one or more operators– A composite may have input and output ports– Source code construct only

• Nothing to do with operator fusion (PEs) Each stream declaration in the composite

– Invokes a primitive operator or– another composite operator

An application is a main composite– No input or output ports– Data flows in and out but not on

streams within a graph– Streams may be exported to and

imported from other applicationsrunning in the same instance

© 2013 IBM Corporation5151

composite Main { graph stream … { } stream … { } . . . }

Application (logical view)Application (logical view)

Stream 1

Stream 1

Stream 2

Stream 3

Stream 3

Stream 4

Stream 5

operator

Page 51: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Composing a Flow Graph with Stream Definitions

© 2013 IBM Corporation52

composite POS_TxHandling{ graph stream<…> POS_Transactions = TCPSource() {…} stream<…> Sales = Operator1(POS_Transactions) {…} stream<…> TaxableSales = Operator2(Sales) {…} stream<…> TaxesDue = Operator3(TaxableSales) {…} () as Sink1 = TCPSink(TaxesDue) {…} stream<…> Deliveries = TCPSource() {…} stream<…> Inventory = Operator4(Sales;Deliveries) {…} stream<…> Reorders = Operator5(Inventory) {…} () as Sink2 = TCPSink(Reorders) {…}}

Composite POS_TxHandlingTaxableSalesOperator

2

TaxesDueOperator3

TCP-Sink

Deliveries Inventory ReordersOperator5

Sales

TCP-Sink

TCP-Source

Operator4

SalesOperator

1TCP-

Source

POS_Transactions

Page 52: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Anatomy of an Operator Invocation

Operators share a common structure– italics are sections to fill in

Reading an operator invocation– Declare a stream stream-name – With attributes from stream-type – that is produced by MyOperator – from the input(s) input-stream– MyOperator behavior defined by

logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

For the example:– Declare the stream Sale with the attribute

item, which is a raw (ASCII) string– Join the Bid and Ask streams with – sliding windows of 30 seconds on Bid,

and 50 tuples of Ask– When items are equal, and Bid price is

greater than or equal to Ask price– Output the item value on the Sale stream

© 2013 IBM Corporation53

stream<stream-type> stream-name = MyOperator(input-stream; …) { logic logic ; window windowspec ; param parameters ; output output ; config configuration ;}

Syntax:

53

ExampleExample

stream<rstring item> Sale = Join(Bid; Ask){ window Bid: sliding, time(30); Ask: sliding, count(50); param match : Bid.item == Ask.item && Bid.price >= Ask.price; output Sale: item = Bid.item;}

stream<rstring item> Sale = Join(Bid; Ask){ window Bid: sliding, time(30); Ask: sliding, count(50); param match : Bid.item == Ask.item && Bid.price >= Ask.price; output Sale: item = Bid.item;}

Page 53: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Data Types

© 2013 IBM Corporation54

(any)(any)

(composite)(composite)(primitive)(primitive)

(collection)(collection) tupletuplebooleanboolean enumenum (numeric)(numeric) timestamptimestamp (string)(string) blobblob

listlist setset mapmaprstringrstring ustringustring(integral)(integral) (floatingpoint)(floatingpoint) (complex)(complex)

(signed)(signed) (unsigned)(unsigned) (float)(float) (decimal)(decimal)

int8

int16

int32

int64

int8

int16

int32

int64

uint8

uint16

uint32

uint64

uint8

uint16

uint32

uint64

float32

float64

float128

float32

float64

float128

decimal32

decimal64

decimal128

decimal32

decimal64

decimal128

complex32

complex64

complex128

complex32

complex64

complex128

xmlxml

User-defined typesUser-defined types

type Integers = list<int32>;type MySchema = rstring s, Integers ls;

type Integers = list<int32>;type MySchema = rstring s, Integers ls;

Page 54: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Collection Types

list: array with bounds-checking [0, 17, age-1, 99]

– Random access: can access any element at any time– Ordered, base-zero indexed: first element is someList[0]

set: unordered collection {"cats", "yeasts", "plankton"}

– No duplicate element values map: key-to-value mappings {"Mon":0, "Sat":99, "Sun":-1}

– Unordered Use type constructors to specify element type

– list<type>, set<type> list<uint16>, set<rstring>

– map<key-type,value-type> map<rstring[3],int8>

Can be nested to any number of levels– map<int32, list<tuple<ustring name, int64 value>>>– {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}

Bounded collections optimize performance– list<int32>[5]: at most 5 (32-bit) integer elements– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters

© 2013 IBM Corporation55

Page 55: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Toolkits and Accelerators to Speed Up Development

© 2013 IBM Corporation56

Standard ToolkitRelational Operators

Filter Sort Functor JoinPunctor Aggregate

Adapter OperatorsFileSource UDPSourceFileSink UDPSinkDirectoryScan Export TCPSource ImportTCPSink MetricsSink

Utility OperatorsCustom SplitBeacon DeDuplicateThrottle Union Delay ThreadedSplitBarrier DynamicFilterPair GateJavaOp SwitchParse FormatDecompress CharacterTransform

XML OperatorXMLParse

IBM Supported ToolkitsDatabase DataStageBig Data Data ExplorerMessaging InternetText Analytics MiningSPSS CEPTime Series GeospatialFinancial

Open-Source ToolkitsJSON HTTP/RESTOpenCV AccumuloHBase …

Big Data AcceleratorsTelco Event Data AnalyticsSocial Data Analytics

User-Defined ToolkitsExtend the language by adding user-defined operators, types,and functions

Page 56: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Standard Toolkit: Relational Operators

Relational operators– Functor Perform tuple-level manipulations– Filter Remove some tuples from a stream– Aggregate Group and summarize incoming tuples– Sort Impose an order on incoming tuples in a stream– Join Correlate two streams– Punctor Insert window punctuation markers into a stream

© 2013 IBM Corporation57

Page 57: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Standard Toolkit: Adapter Operators

Adapter operators– FileSource Read data from files in formats such as csv, line, or binary– FileSink Write data to files in formats such as csv, line, or binary– DirectoryScan Detect files to be read as they appear in a given directory– TCPSource Read data from TCP sockets in various formats– TCPSink Write data to TCP sockets in various formats– UDPSource Read data from UDP sockets in various formats– UDPSink Write data to UDP sockets in various formats– Export Make a stream available to other jobs in the same

instance– Import Connect to streams exported by other jobs– MetricsSink Create displayable metrics from numeric expressions

© 2013 IBM Corporation58

Page 58: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Standard Toolkit: Utility Operators

Workload generation and custom logic– Beacon Emit generated values; timing and number configurable– Custom Apply arbitrary SPL logic to produce tuples

Coordination and timing– Throttle Make a stream flow at a specified rate– Delay Time-shift an entire stream relative to other streams– Gate Wait for acknowledgement from downstream operator– Switch Block or allow the flow of tuples based on control input

© 2013 IBM Corporation59

Page 59: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Standard Toolkit: Utility Operators (cont’d)

Parallel flows– Barrier Synchronize tuples from sequence-correlated streams– Pair Group tuples from multiple streams of same type– Split Forward tuples to output streams based on a predicate– ThreadedSplit Distribute tuples over output streams by availability– Union Construct an output tuple from each input tuple– DeDuplicate Suppress duplicate tuples seen within a given time period

Miscellaneous– DynamicFilter Filter tuples based on criteria that can change while it runs– JavaOp General-purpose operator for encapsulating Java code– Parse Parse blob data for use with user-defined adapters– Format Format blob data for use with user-defined adapters– Compress Compress blob data– Decompress Decompress blob data– CharacterTransform Convert blob data from one encoding to another

© 2013 IBM Corporation60

Page 60: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Streams Extensibility: Toolkits

Like packages, plugins, addons, extenders, etc. – Reusable sets of operators, types, and functions– What can be included in a toolkit?

• Primitive and composite operators• User-defined types• Native and SPL functions• Sample applications, utilities• Tools, documentation, data, etc.

– Versioning is supported– Define dependencies on other versioned assets (toolkits, Streams itself)

Base for creating cross-domain and domain-specific applications

Developed by IBM, partners, end-customers– Complete APIs and tools available– Same power as the “built-in” Standard Toolkit

© 2013 IBM Corporation6161

Page 61: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Integration: the Internet and Database Toolkits

Integration with traditional sources and consumers

Internet Toolkit– InetSource periodically retrieve data from HTTP, FTP, RSS, and files

Database Toolkit• ODBC databases: DB2, Informix, Oracle, solidDb, MySQL, SQLServer, Netezza

– ODBCAppend Insert rows into an SQL database table– ODBCEnrich Extend streaming data based on lookups in database tables– ODBCRun Perform SQL queries with parameters from input tuples– ODBCSource Read data from a SQL database– SolidDBEnrich Perform table lookups in an in-memory database – DB2SplitDB Split a stream by DB2 partition key– DB2PartitionedAppend Write data to table in specified DB2 partition– NetezzaLoad Perform high-speed loads into a Netezza database– NetezzaPrepareLoad Convert tuple to delimited string for Netezza loads

© 2013 IBM Corporation62

Page 62: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Integration: the Big Data Toolkit

Integration with IBM’s Big Data Platform

Data Explorer– DataExplorerPush Insert records into a Data Explorer index

BigInsights: Hadoop Distributed File System– HDFSDirectoryScan Like DirectoryScan, only for HDFS– HDFSFileSource Like FileSource, only for HDFS– HDFSFileSink Like FileSink, only for HDFS– HDFSSplit Write batches of data in parallel to HDFS

© 2013 IBM Corporation63

Page 63: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Integration: the DataStage Integration Toolkit

Streams real-time analytics and DataStage information integration– Perform deeper analysis on the

data as part of the information integration flow

– Get more timely results and offload some analytics load from the warehouse

Operators and tooling– Adapters to exchange data

between Streams and DataStage• DSSource• DSSink

– Tooling to generate integration code

– DataStage provides Streams connectors

© 2013 IBM Corporation64

Page 64: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Integration: the Messaging Toolkit

Integrate with IBM WebSphere MQ– Create a stream from a subscription to an MQ topic or queue– Publish a stream to an MQ series topic or queue

Operators– XMSSource Read data from an MQ queue or topic– XMSSink Send messages to applications that use WebSphere MQ

© 2013 IBM Corporation65

Page 65: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Integration and Analytics: the Text Toolkit

Derive structured information from unstructured text– Apply extractors

• Programs that encode rules for extracting information from text• Written in AQL (Annotation Query Language)• Developed against a static repository of text data• AQL files can be combined into modules (directories) • Modules can be compiled into TAM files• NOTE: Old-style AOG files not supported by BigInsights 2.0 and this toolkit

Can be used in Streams 3.0 with the Deprecated Text Toolkit Streams Studio plugins for developing AQL queries

– Same tooling as in BigInsights Operator and utility

– TextExtract Run AQL queries (module or TAM) over text documents– createtypes script

• Create stream type definitions that match the output views of the extractors. Plays a major role in accelerators

– SDA: Social Data Analytics– MDA: Machine Data Analytics

© 2013 IBM Corporation66

Page 66: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Analytics: The Geospatial Toolkit

High-performance analysis and processing of geospatial data Enables Location Based Services (LBS)

– Smarter Transportation: monitor traffic speed and density– Geofencing: detect when objects enter or leave a specified area

Geospatial data types– e.g. Point, LineString, Polygon

Geospatial functions– e.g. Distance, Map point to LineString, IsContained, etc.

© 2013 IBM Corporation67

Page 67: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Analytics: The TimeSeries Toolkit

Apply Digital Signal Processing techniques– Find patterns and anomalies in real time– Predict future values in real time

A rich set of functionality for working with time series data– Generation

• Synthesize specific wave forms (e.g., sawtooth, sine)– Preprocessing

• Preparation and conditioning (ReSample, TSWindowing)– Analysis

• Statistics, correlations, decomposition and transformation– Modeling

• Prediction, regression and tracking (e.g. Holt-Winters, GAMLearner)

© 2013 IBM Corporation68

Page 68: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Analytics: the Mining Toolkit

Enables scoring of real-time data in a Streams application– Scoring is performed against a predefined model (in PMML file)– Supports a variety of model types and scoring algorithms

Predictive Model Markup Language (PMML)– Standard for statistical and data mining models– Produced by Analytics tools: SPSS, ISAS, etc.– XML representation defined by http://www.dmg.org/

Scoring operators– Classification Assign tuple to a class and report confidence– Clustering Assign tuple to a cluster and compute score– Regression Calculate predicted value and standard deviation– Associations Identify the applicable rule and report consequent

(rule head), support, and confidence Supports dynamic replacement of the PMML model used

by an operator

© 2013 IBM Corporation69

Page 69: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

IBM InfoSphere Streams

Analytics: Streams + SPSS Real Time Scoring Service Included with SPSS, not with Streams

© 2013 IBM Corporation70

In m

emory

In memory

RepositoryRepository

ChangeNotification

S SPSS Scoring Operator

SPSS Modeler Scoring Stream

R SPSS Repository Operator

P SPSS Publish Operator

P

IBM SPSS Collaboration & Deployment Services

Model Refresh

R

FileSystem IBM SPSS Modeler Solution

Publisher

S

Page 70: IBM InfoSphere Streams Technical Overview Julie Garrone j.garrone@fr.ibm.com Christophe Menichetti Christophe.menichetti@fr.ibm.com © 2013 IBM Corporation.

Analytics: the Financial Services Toolkit

Adapters– Financial Information Exchange (FIX)– WebSphere Front Office for Financial Markets (WFO) – WebSphere MQ Low-Latency Messaging (LLM) Adapters

Types and Functions– OptionType (put, call), TxType (buy, sell), Trade, Quote, OptionQuote, Order– Coefficient of Correlation– “The Greeks” (put/call values, delta, theta, etc.)

Operators– Based on QuantLib financial analytics open source package.– Compute theoretical value of an option (European, American)

Example applications– Equities Trading – Options Trading

© 2013 IBM Corporation71