IBM InfoSphere Streams Technical Overview
Julie Garrone
Christophe Menichetti
© 2013 IBM Corporation
April 10, 2023
Agenda
InfoSphere Streams general description
Technical overview
Streams Processing Language
Use case
Demo
IBM InfoSphere Streams v3.0
A platform for real-time analytics on BIG data
Volume
– Terabytes per second
– Petabytes per day
Variety
– All kinds of data
– All kinds of analytics
Velocity
– Insights in microseconds
Agility
– Dynamically responsive
– Rapid application development
Millions of events per second
Microsecond latency
Sensor, video, audio, text, and relational data sources
Just-in-time decisions
Powerful analytics
Streams Analyzes All Kinds of Data
– Mining in Microseconds (included with Streams)
– Image & Video (Open Source)
– Simple & Advanced Text (included with Streams), e.g. (listen, verb), (radio, noun)
– Acoustic (IBM Research) (Open Source)
– Geospatial (included with Streams)
– Predictive (included with Streams)
– Advanced Mathematical Models (IBM Research)
– Statistics (included with Streams)
Categories of Problems Solved by Streams
Applications that require on-the-fly processing, filtering and analysis of streaming data
– Sensors: environmental, industrial, surveillance video, GPS, …
– “Data exhaust”: network/system/web server/app server log files
– High-rate transaction data: financial transactions, call detail records
Criteria: two or more of the following
– Messages are processed in isolation or in limited data windows
– Sources include non-traditional data (spatial, imagery, text, …)
– Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
– Data rates/volumes require the resources of multiple processing nodes
– Analysis and response are needed with sub-millisecond latency
– Data rates and volumes are too great for store-and-mine approaches
The Big Data Ecosystem: Interoperability is Key
[Diagram labels: Streaming Data; Traditional Warehouse; Data Warehouse; Analytics on Structured Data; Analytics on Data at Rest; RTAP: Analytics on Data in Motion; BigInsights; Streams; Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources; Non-Traditional / Non-Relational Data Feeds; Internet-Scale Data Sets]
EuroBuzz with France Televisions
Smarter Analytics applied to the Eurovision Song Contest
Tweets on Eurovision event
Buzz Statistics
Tweet format: {"entities":{"hashtags":[{"text":"eurovision","text":"Sweeden will win Eurovision for sure #eurovision"….
JSON Format
Data-visualisation Partner
– Gauge by country
– Latest tweets on Eurovision
– Pictures of Eurovision/Twitter top contributors
Streams Overall statistics (World & France)
Twitter Streaming API
– Tweets filtered by Eurovision-specific hashtags: #eurovision, #eurofrancetv, …
Real-time analysis
• Filters, aggregations, and calculations on incoming tweets
• 2 different streams based on hashtags
  • World stream collecting tweets from all Eurovision hashtags
  • France stream collecting tweets from the France Televisions hashtag #eurofrancetv
• Analysis of 2 different languages: English and French
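The hashtag-based routing into a World stream and a France stream can be illustrated with a small Python sketch. This is not the SPL used in the project. It assumes tweets arrive as JSON with hashtags under entities.hashtags[].text, as in the tweet format shown; the function names and the two hashtag sets are hypothetical, and the real hashtag list was longer than the two named on the slide.

```python
import json

# Hashtags named on the slide; the full list used in the project is not given.
WORLD_TAGS = {"eurovision", "eurofrancetv"}
FRANCE_TAGS = {"eurofrancetv"}


def route_tweet(raw_json):
    """Return the names of the streams ("world", "france") a tweet belongs to."""
    tweet = json.loads(raw_json)
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    tags = {h["text"].lower() for h in hashtags}
    streams = []
    if tags & WORLD_TAGS:
        streams.append("world")
    if tags & FRANCE_TAGS:
        streams.append("france")
    return streams


def count_by_stream(raw_tweets):
    """Simple aggregation: number of tweets routed to each stream."""
    counts = {"world": 0, "france": 0}
    for raw in raw_tweets:
        for name in route_tweet(raw):
            counts[name] += 1
    return counts
```

A tweet tagged only #eurovision is counted in the World stream, while one tagged #eurofrancetv is counted in both, since the World stream collects every Eurovision hashtag.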
24 hours Data Aggregation
France 3 website
Live Eurovision Show
All blades with RHEL6.1.64
SAN: DS8700
Master Node: HS22, 8 cores 2.93 GHz, 48GB RAM
Secondary Node: HS22, 8 cores 2.93 GHz, 48GB RAM
[Chart: CPU Total HS22, 29/05/2012, 20:37 to 23:19. Series: User%, Sys%, Wait%. Y-axis 0 to 100 (Max 100%)]
[Chart: CPU Total HS22, 29/05/2012, 20:37 to 23:19. Series: User%, Sys%, Wait%. Y-axis 0 to 10]
[Chart: Network I/O HS22 (KB/s), 29/05/2012, 20:37 to 23:19. Series: Total-Read, Total-Write (negative values). Y-axis -400 to 600]
EuroBuzz with France Televisions
Smarter Analytics applied to the Eurovision Song Contest
Network peak at the beginning of each song
Very low CPU utilization: < 1% (chart scale: Max 10%)
-> InfoSphere Streams efficiency
Infrastructure for the PoC
• 2 blades in Active/Active mode (Nehalem sockets)
• SAN storage to store all data during the PoC (all tweets)
University of Ontario Institute of Technology
Use case
– Neonatal infant monitoring
– Predict infection in ICU 24 hours in advance
Solutions
– 120 children monitored: 120K msg/sec, billion msg/day
– Trials expanding to include hospitals in US and China
[Diagram labels: Sensor Network; Stream-based Distributed Interoperable Health care Infrastructure; Event Pre-processor; Analysis Framework; Solutions (Applications)]
Asian Telco reduces billing costs and improves customer satisfaction
Capabilities:
– Stream Computing
– Analytic Accelerators
Real-time mediation and analysis of 5B CDRs per day
Data processing time reduced from 12 hrs to 1 min
Hardware cost reduced to 1/8th
Proactively address issues (e.g. dropped calls) impacting customer satisfaction
How Streams Works
Continuous ingestion, continuous analysis
Achieve scale:
– By partitioning applications into software components
– By distributing across stream-connected hardware hosts
Infrastructure provides services for
– Scheduling analytics across hardware hosts
– Establishing streaming connectivity
Analytic elements: Transform, Filter / Sample, Classify, Correlate, Annotate
Where appropriate: elements can be fused together for lower communication latency
InfoSphere Streams Runtime
Distributed runtime, running on a cluster
– No limit on cluster size
Import/export streams between jobs
– Agility: dynamically reconfigure applications
Performance and health metrics
– Down to operator level
Jobs submitted and canceled at any time
Hosts added to and removed from cluster at any time
High Availability
– Restart and relocate portions of applications
– Assign parts of applications to host pools based on host tags
Application portability
– Simplifies test and operations
From Operators to Running Jobs
Streams application graph:
– A directed, possibly cyclic, graph
– A collection of operators
– Connected by streams
Each complete application is a potentially deployable job
Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance)
An instance can include a single processing node (hardware) or multiple processing nodes
[Diagram: a Streams instance in which sources (Src), operators (OP), and sinks (Sink) are connected by streams and run on one or more hardware nodes]
InfoSphere Streams Objects: Runtime View
Instance
– Runtime instantiation of InfoSphere Streams executing across one or more hosts
– Collection of components and services
Processing Element (PE)
– Fundamental execution unit that is run by the Streams instance
– Can encapsulate a single operator or many “fused” operators
Job
– A deployed Streams application executing in an instance
– Consists of one or more PEs
[Diagram: an instance containing a job whose PEs run on two nodes; operators inside the PEs are connected by Stream 1 through Stream 5]
InfoSphere Streams Objects: Development View
Operator
– The fundamental building block of the Streams Processing Language
– Operators process data from streams and may produce new streams
Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window
Tuple
– A structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type
Stream type
– Specification of the name and data type of each attribute in the tuple
Window
– A finite, sequential group of tuples
– Based on count, time, attribute value, or punctuation marks
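To make the window definition concrete, here is a minimal Python sketch (Python rather than SPL, purely for illustration) of count-based windows over a sequence of tuples. The helper names tumbling_count and sliding_count are hypothetical. Windows based on time, attribute value, or punctuation are not modeled here.

```python
from collections import deque


def tumbling_count(tuples, size):
    """Yield non-overlapping windows of `size` tuples.

    A trailing partial window is never emitted in this sketch.
    """
    window = []
    for t in tuples:
        window.append(t)
        if len(window) == size:
            yield list(window)
            window.clear()


def sliding_count(tuples, size):
    """Yield the most recent `size` tuples (fewer while filling) after each arrival."""
    window = deque(maxlen=size)  # the oldest tuple is evicted automatically
    for t in tuples:
        window.append(t)
        yield list(window)
```

The difference is in eviction: the tumbling variant empties the window after each group, while the sliding variant drops only the oldest tuple, so consecutive windows overlap.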
[Diagram: a Streams application in which operators are connected by streams of tuples]
Example tuples on one stream:
directory: "/img", filename: "farm"
directory: "/img", filename: "bird"
directory: "/opt", filename: "java"
directory: "/img", filename: "cat"
Example tuples on another stream:
height: 640, width: 480, data: (not shown)
height: 1280, width: 1024, data: (not shown)
height: 640, width: 480, data: (not shown)
What is Streams Processing Language?
Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes
Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph
Procedural support
– Full-featured imperative language
– Custom logic in operator invocations
– Expressions in attribute assignments and parameter definitions
Extensible
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL
– User-defined operators written in a native language (C++ or Java)
Some SPL Terms
An operator represents a class of manipulations
– of tuples from one or more input streams
– to produce tuples on one or more output streams
A stream connects to an operator on a port
– an operator defines input and output ports
An operator invocation
– is a specific use of an operator
– with specific assigned input and output streams
– with locally specified parameters, logic, etc.
Many operators have one input port and one output port; others have
– zero input ports: source adapters, e.g., TCPSource
– zero output ports: sink adapters, e.g., FileSink
– multiple output ports, e.g., Split
– multiple input ports, e.g., Join
A composite operator is a collection of operators
– An encapsulation of a subgraph of
  • Primitive operators (non-composite)
  • Composite operators (nested)
– Similar to a macro in a procedural language
[Diagram: an Aggregate operator with an input port receiving Employee Info and an output port producing Salary Statistics; further examples show TCPSource, FileSink, Split, Join, and a composite operator]
Composite Operators
Every graph is encoded as a composite
– A composite is a graph of one or more operators
– A composite may have input and output ports
– Source code construct only
  • Nothing to do with operator fusion (PEs)
Each stream declaration in the composite
– Invokes a primitive operator or
– another composite operator
An application is a main composite
– No input or output ports
– Data flows in and out but not on streams within a graph
– Streams may be exported to and imported from other applications running in the same instance
composite Main {
  graph
    stream … { }
    stream … { }
    . . .
}
[Diagram: Application (logical view), operators connected by Stream 1 through Stream 5]
Composing a Flow Graph with Stream Definitions
composite POS_TxHandling
{
  graph
    stream<…> POS_Transactions = TCPSource() {…}
    stream<…> Sales = Operator1(POS_Transactions) {…}
    stream<…> TaxableSales = Operator2(Sales) {…}
    stream<…> TaxesDue = Operator3(TaxableSales) {…}
    () as Sink1 = TCPSink(TaxesDue) {…}
    stream<…> Deliveries = TCPSource() {…}
    stream<…> Inventory = Operator4(Sales;Deliveries) {…}
    stream<…> Reorders = Operator5(Inventory) {…}
    () as Sink2 = TCPSink(Reorders) {…}
}
[Diagram: composite POS_TxHandling. TCPSource -> POS_Transactions -> Operator1 -> Sales -> Operator2 -> TaxableSales -> Operator3 -> TaxesDue -> TCPSink; TCPSource -> Deliveries; Sales and Deliveries -> Operator4 -> Inventory -> Operator5 -> Reorders -> TCPSink]
Example description
Goal
– Leverage information about the use of the OpenVPN infrastructure by on-line students
[Diagram labels: Users; OpenVPN; Syslog server; InfoSphere Streams; Data warehouse; Reports with Cognos and real-time reports with Cognos RTM]
Global diagram
Different composite for each step of the program
– File source
– Parsing
– Aggregation
– Enrichment with database
– Database loading
– HDFS loading
Transfers between composites with Export / Import operators
File source composite
– FileSource: Read data from files in formats such as csv, line, or binary
– Throttle: Make a stream flow at a specified rate
Sample input:
Jun 9 03:46:02 psscgwy6 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Jun 9 03:46:02 psscgwy6 rsyslogd: [origin software=\"rsyslogd\" swVersion=\"4.6.2\" x-pid…
Jun 9 03:46:02 psscgwy6 rhsm-complianced: This system is missing one or more valid …
Jun 9 04:59:53 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…
Jun 9 05:12:03 openvpn5 openvpn5: ;frbonfils;10.30.16.39;openvpn5;Sun Jun 9 05:12:03 …
Jun 9 06:46:39 openvpn6 openvpn6: ;Odi_platform_demonstration_49165_1;10.30.20.253;…
Parsing composite
Two different parsing methods
– TextExtract: Run AQL queries (module or TAM) over text documents
– Functor: Perform tuple-level manipulations; used with Tokenizer
Other operators to filter data
– Filter: Remove some tuples from a stream
– Split: Forward tuples to output streams based on a predicate
Parsing composite
TextExtract
– Use AQL text analysis to extract the fields of each syslog line: Timestamp, HostName, Tag, Content
Parsing composite
– Split the flow to eliminate incomplete lines
– Transfer only the content part with the Functor operator
– Filter only tuples coming from openvpn
Parsing composite
Functor with tokenizer
– Break a string based on a delimiter character
– Structure of a sample openvpn content: ;<userId>;<virtualIP>;<cnxServer>;<dateTime>;<status>;<message>;<IPsource>;S:<bytesSend>;R:<bytesReceived>
– <status> can be “Connected”, “Ended” or “Rejected”
Split the flow to eliminate “Rejected” connections
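The tokenizing step can be illustrated with a short Python sketch (not the SPL Functor used in the application). It assumes the semicolon-delimited layout described above; the function names parse_openvpn and keep_accepted are hypothetical, and treating the S: and R: counters as optional is an assumption based on the sample log lines.

```python
FIELDS = ["userId", "virtualIP", "cnxServer", "dateTime",
          "status", "message", "IPsource"]


def parse_openvpn(content):
    """Split an openvpn content string on ';' and name the fields."""
    tokens = content.split(";")
    # The content starts with ';', so tokens[0] is empty and is skipped.
    record = dict(zip(FIELDS, tokens[1:8]))
    # Byte counters appear as "S:<n>" and "R:<n>" when present.
    for token in tokens[8:]:
        if token.startswith("S:"):
            record["bytesSend"] = int(token[2:])
        elif token.startswith("R:"):
            record["bytesReceived"] = int(token[2:])
    return record


def keep_accepted(records):
    """Drop records whose status is "Rejected"."""
    return [r for r in records if r.get("status") != "Rejected"]
```

In the application itself the rejected connections are separated with a Split operator; the list filter here only mirrors the effect on the retained flow.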
Aggregation composite
Aggregation to group logs which describe the same connection
– ;Odi_platform_demonstration_49165_1;10.30.20.253;openvpn6;Sun Jun 9 04:59:53 CEST 2013;Connected;Connection accepted;70.165.201.57;
– ;Odi_platform_demonstration_49165_1;10.30.20.253;openvpn6;Sun Jun 9 06:46:39 CEST 2013;Ended;Connection terminated, temps cnx 6412 sec;70.165.201.57;S:78203;R:370963;
Filter pairs that correspond to Connected / Ended
Aggregation composite
Split the flow according to the userID
– “stud[0-9]*”: the students’ connections, which we want to analyze
– “inst[0-9]*”
– other
Functor is used to create a unique output tuple for each ended connection
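The split by userID and the pairing of Connected and Ended events can be sketched in Python as follows. This is an illustration, not the SPL of the application: the function names are hypothetical, the pairing key (userId, virtualIP, cnxServer) is an assumption about what identifies a connection, and the output fields are a simplified subset of the real output tuple.

```python
import re

# Patterns quoted on the slide, matched at the start of the userId.
STUDENT = re.compile(r"stud[0-9]*")
INSTRUCTOR = re.compile(r"inst[0-9]*")


def classify_user(user_id):
    """Pick the output flow for a userId: "student", "instructor" or "other"."""
    if STUDENT.match(user_id):
        return "student"
    if INSTRUCTOR.match(user_id):
        return "instructor"
    return "other"


def pair_connections(events):
    """Emit one merged record for each Connected event followed by its Ended event."""
    open_connections = {}
    for event in events:
        key = (event["userId"], event["virtualIP"], event["cnxServer"])
        if event["status"] == "Connected":
            open_connections[key] = event
        elif event["status"] == "Ended" and key in open_connections:
            start = open_connections.pop(key)
            yield {
                "userId": event["userId"],
                "cnxServer": event["cnxServer"],
                "start": start["dateTime"],
                "end": event["dateTime"],
                "bytesSend": event.get("bytesSend"),
                "bytesReceived": event.get("bytesReceived"),
            }
```

An Ended event with no matching Connected event is ignored in this sketch, which corresponds to keeping only complete Connected / Ended pairs.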
Aggregation composite
Partial results
stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779
stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263
Enrichment composite
– Join: Correlate two streams according to the userID and the DateStart and DateEnd of the course
– ODBCSource: Read data from a SQL database
  • Extract data about the course attended by students
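The enrichment join can be sketched in Python as a match on userID plus a date-range check. This is an illustration only: the field names userId, date, dateStart, dateEnd and courseCode are hypothetical stand-ins for the real attributes, and the course rows are held in a plain list rather than read through ODBCSource.

```python
from datetime import date


def enrich(connections, courses):
    """Attach the course code to each connection whose user and date match a course row."""
    for conn in connections:
        for course in courses:
            same_user = conn["userId"] == course["userId"]
            in_range = course["dateStart"] <= conn["date"] <= course["dateEnd"]
            if same_user and in_range:
                # Merge the connection with the matching course attribute.
                yield {**conn, "courseCode": course["courseCode"]}
```

Connections that match no course row produce no output here, which is inner-join behavior; whether the real application keeps unmatched connections is not stated on the slide.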
Enrichment composite
Final results
stud213_8,openvpn,71.184.253.25,,10.30.9.94,openvpn3,Ended,1370804901,09-06-2013,19:08:21,09-06-2013,19:10:17,117,4723,2779,JL27,1SM00,Cross brand,STG brands
stud209_4,openvpn,89.133.212.172,,10.30.9.99,openvpn3,Ended,1370866973,10-06-2013,12:22:53,10-06-2013,12:25:37,163,5231,3263,5367,1SX2G,Cross brand,STG brands
Database loading composite
– ODBCAppend: Insert rows into an SQL database table
[Diagram: InfoSphere Streams -> Data warehouse]
HDFS loading composite
– HDFSFileSink: Write data to files in Hadoop File System
Toolkits and Accelerators to Speed Up Development
Standard Toolkit
– Relational operators: Filter, Sort, Functor, Join, Punctor, Aggregate
– Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
– Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Pair, JavaOp, Parse, Decompress, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate, Switch, Format, CharacterTransform
– XML operator: XMLParse
IBM Supported Toolkits
– Database, Big Data, Messaging, Text Analytics, SPSS, Time Series, Financial, DataStage, Data Explorer, Internet, Mining, CEP, Geospatial
Open-Source Toolkits
– JSON, OpenCV, HBase, HTTP/REST, Accumulo, …
Big Data Accelerators
– Telco Event Data Analytics
– Social Data Analytics
User-Defined Toolkits
– Extend the language by adding user-defined operators, types, and functions
RESOURCE LINKS
Who Can Help?
The Big Data Tiger Team
– Evangelists for your region and/or industry
The Big Data Black Belts
– Skilled TCPs for your region
Product Management
– Roger Rea
Product Marketing
– Asha Marsh
Worldwide Sales
– Stewart Hanna
Worldwide Technical Sales
– Robert Uleman
Worldwide Software Services
– Sandra Tucker
Customer Enablement
– Kevin Foster
Where to get Streams
Software available via
– Passport Advantage site
– PartnerWorld
– Academic Initiative
– IBM Internal
– 90-day trial code
  • IBM Trials and Demos portal
  • ibm.com -> Support & downloads -> Download -> Trials and demos
More Streams links
Streams in the Information Management Acceleration Zone
Streams in the Media Library (recorded sessions)
– Search on “streams”
The Streams Sales Kit
– Software Sellers Workplace (the old eXtreme Leverage)
– PartnerWorld
InfoSphere Streams Wiki on developerWorks
– Discussion forum, Streams Exchange
SPL DOC
Anatomy of an Operator Invocation
Operators share a common structure
– In the syntax below, stream-type, stream-name, input-stream, and the clause bodies are sections to fill in
Syntax:
stream<stream-type> stream-name = MyOperator(input-stream; …)
{
  logic  logic ;
  window windowspec ;
  param  parameters ;
  output output ;
  config configuration ;
}
Reading an operator invocation
– Declare a stream stream-name
– With attributes from stream-type
– That is produced by MyOperator
– From the input(s) input-stream
– MyOperator behavior is defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output
For the example:
– Declare the stream Sale with the attribute item, which is a raw (ASCII) string
– Join the Bid and Ask streams with sliding windows of 30 seconds on Bid, and 50 tuples of Ask
– When items are equal, and Bid price is greater than or equal to Ask price
– Output the item value on the Sale stream
Example:
stream<rstring item> Sale = Join(Bid; Ask)
{
  window Bid: sliding, time(30);
         Ask: sliding, count(50);
  param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
  output Sale: item = Bid.item;
}
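The behavior of this Join invocation can be approximated with a small Python model. It is a simplified sketch, not the operator's actual implementation: it assumes that each arriving tuple is inserted into its own window and then matched against the other port's window, that arrival times are passed in explicitly as seconds, and that Bid tuples older than 30 seconds are evicted lazily on the next arrival. The class name BidAskJoin and its methods are hypothetical.

```python
from collections import deque


class BidAskJoin:
    """Toy two-port join: Bid keeps the last 30 seconds, Ask keeps the last 50 tuples."""

    def __init__(self, bid_seconds=30.0, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                  # (arrival_time, bid) pairs
        self.asks = deque(maxlen=ask_count)  # oldest ask is evicted automatically

    def _evict_bids(self, now):
        # Drop bids that have fallen out of the time window.
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    @staticmethod
    def _match(bid, ask):
        # Same condition as the `match` parameter in the SPL example.
        return bid["item"] == ask["item"] and bid["price"] >= ask["price"]

    def on_bid(self, bid, now):
        """Insert a Bid, then match it against every Ask still in the Ask window."""
        self._evict_bids(now)
        self.bids.append((now, bid))
        return [{"item": bid["item"]} for ask in self.asks if self._match(bid, ask)]

    def on_ask(self, ask, now):
        """Insert an Ask, then match it against every Bid still in the Bid window."""
        self._evict_bids(now)
        self.asks.append(ask)
        return [{"item": bid["item"]} for _, bid in self.bids if self._match(bid, ask)]
```

Each returned dictionary plays the role of one Sale tuple carrying only the item attribute, mirroring the output clause of the example.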
Streams Data Types
(any)
– (primitive)
  • boolean
  • enum
  • (numeric)
    (integral)
      (signed): int8, int16, int32, int64
      (unsigned): uint8, uint16, uint32, uint64
    (floatingpoint)
      (float): float32, float64, float128
      (decimal): decimal32, decimal64, decimal128
    (complex): complex32, complex64, complex128
  • timestamp
  • (string): rstring, ustring
  • blob
  • xml
– (composite)
  • (collection): list, set, map
  • tuple
User-defined types
type Integers = list<int32>;
type MySchema = rstring s, Integers ls;
Collection Types
list: array with bounds-checking, e.g. [0, 17, age-1, 99]
– Random access: can access any element at any time
– Ordered, base-zero indexed: first element is someList[0]
set: unordered collection, e.g. {"cats", "yeasts", "plankton"}
– No duplicate element values
map: key-to-value mappings, e.g. {"Mon":0, "Sat":99, "Sun":-1}
– Unordered
Use type constructors to specify element type
– list<type>, set<type>, e.g. list<uint16>, set<rstring>
– map<key-type,value-type>, e.g. map<rstring[3],int8>
Can be nested to any number of levels
– map<int32, list<tuple<ustring name, int64 value>>>
– {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}
Bounded collections optimize performance
– list<int32>[5]: at most 5 (32-bit) integer elements
– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
Streams Standard Toolkit: Relational Operators
Relational operators
– Functor: Perform tuple-level manipulations
– Filter: Remove some tuples from a stream
– Aggregate: Group and summarize incoming tuples
– Sort: Impose an order on incoming tuples in a stream
– Join: Correlate two streams
– Punctor: Insert window punctuation markers into a stream
Streams Standard Toolkit: Adapter Operators
Adapter operators
– FileSource: Read data from files in formats such as csv, line, or binary
– FileSink: Write data to files in formats such as csv, line, or binary
– DirectoryScan: Detect files to be read as they appear in a given directory
– TCPSource: Read data from TCP sockets in various formats
– TCPSink: Write data to TCP sockets in various formats
– UDPSource: Read data from UDP sockets in various formats
– UDPSink: Write data to UDP sockets in various formats
– Export: Make a stream available to other jobs in the same instance
– Import: Connect to streams exported by other jobs
– MetricsSink: Create displayable metrics from numeric expressions
Streams Standard Toolkit: Utility Operators
Workload generation and custom logic
– Beacon: Emit generated values; timing and number configurable
– Custom: Apply arbitrary SPL logic to produce tuples
Coordination and timing
– Throttle: Make a stream flow at a specified rate
– Delay: Time-shift an entire stream relative to other streams
– Gate: Wait for acknowledgement from downstream operator
– Switch: Block or allow the flow of tuples based on control input
Streams Standard Toolkit: Utility Operators (cont’d)
Parallel flows
– Barrier: Synchronize tuples from sequence-correlated streams
– Pair: Group tuples from multiple streams of same type
– Split: Forward tuples to output streams based on a predicate
– ThreadedSplit: Distribute tuples over output streams by availability
– Union: Construct an output tuple from each input tuple
– DeDuplicate: Suppress duplicate tuples seen within a given time period
Miscellaneous
– DynamicFilter: Filter tuples based on criteria that can change while it runs
– JavaOp: General-purpose operator for encapsulating Java code
– Parse: Parse blob data for use with user-defined adapters
– Format: Format blob data for use with user-defined adapters
– Compress: Compress blob data
– Decompress: Decompress blob data
– CharacterTransform: Convert blob data from one encoding to another
Streams Extensibility: Toolkits
Like packages, plugins, addons, extenders, etc.
– Reusable sets of operators, types, and functions
– What can be included in a toolkit?
• Primitive and composite operators
• User-defined types
• Native and SPL functions
• Sample applications, utilities
• Tools, documentation, data, etc.
– Versioning is supported
– Define dependencies on other versioned assets (toolkits, Streams itself)
Base for creating cross-domain and domain-specific applications
Developed by IBM, partners, end-customers
– Complete APIs and tools available
– Same power as the “built-in” Standard Toolkit
Integration: the Internet and Database Toolkits
Integration with traditional sources and consumers
Internet Toolkit
– InetSource: Periodically retrieve data from HTTP, FTP, RSS, and files
Database Toolkit
• ODBC databases: DB2, Informix, Oracle, solidDB, MySQL, SQL Server, Netezza
– ODBCAppend: Insert rows into an SQL database table
– ODBCEnrich: Extend streaming data based on lookups in database tables
– ODBCRun: Perform SQL queries with parameters from input tuples
– ODBCSource: Read data from an SQL database
– SolidDBEnrich: Perform table lookups in an in-memory database
– DB2SplitDB: Split a stream by DB2 partition key
– DB2PartitionedAppend: Write data to table in specified DB2 partition
– NetezzaLoad: Perform high-speed loads into a Netezza database
– NetezzaPrepareLoad: Convert tuple to delimited string for Netezza loads
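The enrichment pattern (extending each streaming tuple with columns looked up in a database table) can be sketched in Python. This example uses the standard-library `sqlite3` module as a stand-in for an ODBC connection; the `customers` table, its columns, and the `enrich` function are hypothetical and do not reflect the ODBCEnrich operator's actual configuration.

```python
import sqlite3


def enrich(tuples, conn):
    """Add `name` and `region` to each tuple via a lookup on `customer_id`.

    Hypothetical stand-in for an enrichment operator; unmatched tuples are
    passed through with None in the added attributes.
    """
    query = "SELECT name, region FROM customers WHERE id = ?"
    for tup in tuples:
        row = conn.execute(query, (tup["customer_id"],)).fetchone()
        name, region = row if row else (None, None)
        yield {**tup, "name": name, "region": region}


# In-memory database standing in for the reference table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "EMEA"), (2, "Globex", "APAC")],
)

stream = [{"customer_id": 1, "amount": 9.5}, {"customer_id": 3, "amount": 2.0}]
enriched = list(enrich(stream, conn))
```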
Integration: the Big Data Toolkit
Integration with IBM’s Big Data Platform
Data Explorer
– DataExplorerPush: Insert records into a Data Explorer index
BigInsights: Hadoop Distributed File System
– HDFSDirectoryScan: Like DirectoryScan, only for HDFS
– HDFSFileSource: Like FileSource, only for HDFS
– HDFSFileSink: Like FileSink, only for HDFS
– HDFSSplit: Write batches of data in parallel to HDFS
Integration: the DataStage Integration Toolkit
Streams real-time analytics and DataStage information integration
– Perform deeper analysis on the data as part of the information integration flow
– Get more timely results and offload some analytics load from the warehouse
Operators and tooling
– Adapters to exchange data between Streams and DataStage
• DSSource
• DSSink
– Tooling to generate integration code
– DataStage provides Streams connectors
Integration: the Messaging Toolkit
Integrate with IBM WebSphere MQ
– Create a stream from a subscription to an MQ topic or queue
– Publish a stream to an MQ topic or queue
Operators
– XMSSource: Read data from an MQ queue or topic
– XMSSink: Send messages to applications that use WebSphere MQ
Integration and Analytics: the Text Toolkit
Derive structured information from unstructured text
– Apply extractors
• Programs that encode rules for extracting information from text
• Written in AQL (Annotation Query Language)
• Developed against a static repository of text data
• AQL files can be combined into modules (directories)
• Modules can be compiled into TAM files
• NOTE: Old-style AOG files are not supported by BigInsights 2.0 and this toolkit; they can be used in Streams 3.0 with the Deprecated Text Toolkit
Streams Studio plugins for developing AQL queries
– Same tooling as in BigInsights
Operator and utility
– TextExtract: Run AQL queries (module or TAM) over text documents
– createtypes script
• Create stream type definitions that match the output views of the extractors
Plays a major role in accelerators
– SDA: Social Data Analytics
– MDA: Machine Data Analytics
Analytics: The Geospatial Toolkit
High-performance analysis and processing of geospatial data
Enables Location Based Services (LBS)
– Smarter Transportation: monitor traffic speed and density
– Geofencing: detect when objects enter or leave a specified area
Geospatial data types
– e.g. Point, LineString, Polygon
Geospatial functions
– e.g. Distance, Map point to LineString, IsContained, etc.
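The geofencing use case above can be sketched as a containment test plus per-object state. The Python below is an illustration only: it works on planar (x, y) coordinates with a ray-casting test rather than true geodetic geometry, and the names `is_contained` and `geofence_events` are chosen for this example, not taken from the toolkit's API.

```python
def is_contained(point, polygon):
    """Ray-casting point-in-polygon test on planar coordinates.

    `polygon` is a list of (x, y) vertices; points exactly on an edge are
    not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def geofence_events(positions, polygon):
    """Yield (object_id, "enter" | "leave") when an object crosses the fence.

    `positions` is an iterable of (object_id, (x, y)); objects are assumed
    to start outside the fence.
    """
    was_inside = {}
    for object_id, point in positions:
        now_inside = is_contained(point, polygon)
        if now_inside != was_inside.get(object_id, False):
            yield (object_id, "enter" if now_inside else "leave")
        was_inside[object_id] = now_inside


fence = [(0, 0), (10, 0), (10, 10), (0, 10)]
track = [("bus", (-1, 5)), ("bus", (5, 5)), ("bus", (6, 5)), ("bus", (12, 5))]
crossings = list(geofence_events(track, fence))
```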
Analytics: The TimeSeries Toolkit
Apply Digital Signal Processing techniques
– Find patterns and anomalies in real time
– Predict future values in real time
A rich set of functionality for working with time series data
– Generation
• Synthesize specific wave forms (e.g., sawtooth, sine)
– Preprocessing
• Preparation and conditioning (ReSample, TSWindowing)
– Analysis
• Statistics, correlations, decomposition and transformation
– Modeling
• Prediction, regression and tracking (e.g. Holt-Winters, GAMLearner)
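To give a feel for the prediction step, the sketch below implements Holt's linear trend method (double exponential smoothing) in Python. This is the non-seasonal relative of Holt-Winters, chosen here for brevity; it is not the toolkit's implementation, and the function name and parameters are hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` future values with Holt's linear trend method.

    alpha smooths the level, beta smooths the trend (both in (0, 1]).
    Seasonality, which full Holt-Winters adds, is deliberately left out.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# A perfectly linear series is extrapolated along the same line.
forecast = holt_forecast([3, 5, 7, 9, 11], alpha=0.5, beta=0.5, horizon=3)
```

In a streaming setting the level and trend would be kept as operator state and updated once per incoming tuple rather than recomputed over a list.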
Analytics: the Mining Toolkit
Enables scoring of real-time data in a Streams application
– Scoring is performed against a predefined model (in PMML file)
– Supports a variety of model types and scoring algorithms
Predictive Model Markup Language (PMML)
– Standard for statistical and data mining models
– Produced by Analytics tools: SPSS, ISAS, etc.
– XML representation defined by http://www.dmg.org/
Scoring operators
– Classification: Assign tuple to a class and report confidence
– Clustering: Assign tuple to a cluster and compute score
– Regression: Calculate predicted value and standard deviation
– Associations: Identify the applicable rule and report consequent (rule head), support, and confidence
Supports dynamic replacement of the PMML model used by an operator
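The idea of scoring against a predefined model that can be swapped while the application runs can be sketched as follows. This Python example is an assumption-laden illustration: the model is a plain dictionary standing in for what would be parsed from a PMML file, the class `RegressionScorer` is hypothetical, and only the predicted value is computed (not the standard deviation).

```python
class RegressionScorer:
    """Score tuples against a linear regression model that can be replaced.

    The model format ({"intercept": ..., "coefficients": {...}}) is a
    hypothetical stand-in for a parsed PMML regression model.
    """

    def __init__(self, model):
        self.model = model

    def replace_model(self, model):
        # Later tuples are scored against the new model; no restart needed.
        self.model = model

    def score(self, tup):
        predicted = self.model["intercept"] + sum(
            coefficient * tup[name]
            for name, coefficient in self.model["coefficients"].items()
        )
        return {**tup, "predicted": predicted}


scorer = RegressionScorer({"intercept": 1.0, "coefficients": {"x": 2.0}})
first = scorer.score({"x": 3.0})
scorer.replace_model({"intercept": 0.0, "coefficients": {"x": 0.5}})
second = scorer.score({"x": 3.0})
```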
Analytics: Streams + SPSS Real Time Scoring Service
Included with SPSS, not with Streams
[Figure: IBM InfoSphere Streams with three SPSS operators: S (SPSS Scoring Operator), R (SPSS Repository Operator), and P (SPSS Publish Operator). Other labels: SPSS Modeler Scoring Stream (in memory), Repository, Change Notification, IBM SPSS Collaboration & Deployment Services, Model Refresh, File System, IBM SPSS Modeler Solution Publisher.]
Analytics: the Financial Services Toolkit
Adapters
– Financial Information Exchange (FIX)
– WebSphere Front Office for Financial Markets (WFO)
– WebSphere MQ Low-Latency Messaging (LLM) Adapters
Types and Functions
– OptionType (put, call), TxType (buy, sell), Trade, Quote, OptionQuote, Order
– Coefficient of Correlation
– “The Greeks” (put/call values, delta, theta, etc.)
Operators
– Based on QuantLib financial analytics open source package
– Compute theoretical value of an option (European, American)
Example applications
– Equities Trading
– Options Trading
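As an illustration of computing the theoretical value of a European option and one of the Greeks, the sketch below uses the closed-form Black-Scholes formula for a non-dividend-paying underlying. It is plain Python, not QuantLib and not the toolkit's operator; the function names and parameters are hypothetical, and American options (which need a numerical method) are not covered.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, t, option_type="call"):
    """Return (value, delta) of a European option under Black-Scholes.

    Assumes no dividends, constant rate and volatility, and t in years.
    """
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * t) / (vol * sqrt(t))
    d2 = d1 - vol * sqrt(t)
    discount = exp(-rate * t)
    if option_type == "call":
        value = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
        delta = norm_cdf(d1)
    elif option_type == "put":
        value = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
        delta = norm_cdf(d1) - 1.0
    else:
        raise ValueError("option_type must be 'call' or 'put'")
    return value, delta


call_value, call_delta = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0, "call")
put_value, put_delta = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0, "put")
```

The call and put values satisfy put-call parity (call minus put equals spot minus discounted strike), which is a convenient sanity check on any pricing routine.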