IBM Streams at the Hadoop User Group

Transcript of IBM Streams at the Hadoop User Group

Page 1: IBM Streams at the Hadoop User Group

© 2011 IBM Corporation

Big Data
Jerome Chailloux, Big Data Specialist
[email protected]

Page 2: IBM Streams at the Hadoop User Group

Imagine the Possibilities of Analyzing All Available Data

Real-time Traffic Flow Optimization

Fraud & risk detection

Accurate and timely threat detection

Predict and act on intent to purchase

Faster, More Comprehensive, Less Expensive

Understand and act on customer sentiment

Low-latency network analysis

Page 3: IBM Streams at the Hadoop User Group

Where is this data coming from?

Source: McKinsey & Company, May 2011

Every second of HD video generates more than 2,000 times as many bytes as required to store a single page of text.

Every day, the New York Stock Exchange captures 1 TB of trade information.

More than 30 million networked sensors, growing at a rate of more than 30% per year.

12 TB of tweets are created each day.

5 billion mobile phones were in use in 2010. Only 12% were smartphones.

What is your business doing with it?

Page 4: IBM Streams at the Hadoop User Group

Why is Big Data important?

Data AVAILABLE to an organization vs. data an organization can PROCESS: the gap between the two is missed opportunity.

Organizations are able to process less and less of the available data, so enterprises are “more blind” to new opportunities.

Page 5: IBM Streams at the Hadoop User Group

What does a Big Data platform do?

Analyze Information in Motion: streaming data analysis; large-volume data bursts and ad-hoc analysis.

Analyze a Variety of Information: novel analytics on a broad set of mixed information that could not be analyzed before.

Discover & Experiment: ad-hoc analytics, data discovery, and experimentation.

Analyze Extreme Volumes of Information: cost-efficiently process and analyze petabytes of information; manage and analyze high volumes of structured, relational data.

Manage & Plan: enforce data structure, integrity, and control to ensure consistency for repeatable queries.

Page 6: IBM Streams at the Hadoop User Group

Complementary Approaches for Different Use Cases

Traditional approach: structured, analytical, logical
– Structured, repeatable, linear
– Examples: monthly sales reports, profitability analysis, customer surveys
– Traditional sources feeding the data warehouse: internal app data, transaction data, ERP data, mainframe data, OLTP system data

New approach: creative, holistic thought, intuition
– Unstructured, exploratory, iterative
– Examples: brand sentiment, product strategy, maximum asset utilization
– New sources feeding Hadoop and Streams: web logs, social data, text data (emails), sensor data (images), RFID

Enterprise integration connects the two.

Page 7: IBM Streams at the Hadoop User Group

IBM Big Data Strategy: Move the Analytics Closer to the Data

Analytic applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics

IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Information Integration & Governance

Hadoop System, Stream Computing, Data Warehouse

New analytic applications drive the requirements for a big data platform

• Integrate and manage the full variety, velocity and volume of data

• Apply advanced analytics to information in its native form

• Visualize all available data for ad-hoc analysis

• Development environment for building new analytic applications

• Workload optimization and scheduling

• Security and Governance

Page 8: IBM Streams at the Hadoop User Group

Most Client Use Cases Combine Multiple Technologies

Pre-processing: ingest and analyze unstructured data types and convert them to structured data.

Combine structured and unstructured analysis: augment the data warehouse with additional external sources, such as social media.

Combine high-velocity and historical analysis: analyze and react to data in motion; adjust models with deep historical analysis.

Reuse structured data for exploratory analysis: experimentation and ad-hoc analysis with structured data.

Page 9: IBM Streams at the Hadoop User Group

IBM is in a lead position to exploit the Big Data opportunity

Source: “The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012,” February 2012

IBM differentiation:
• Embracing open source
• Data in Motion (Streams) and Data at Rest (Hadoop/BigInsights)
• Tight integration with other Information Management products
• Bundled, scalable analytics technology
• Hardened Apache Hadoop for enterprise readiness

Page 10: IBM Streams at the Hadoop User Group

IBM’s unique strengths in Big Data

Big Data in Real-Time: ingest, analyze, and act on massive volumes of streaming data. Faster AND more cost-effective for specific use cases (10x the volume of data on the same hardware).

Fit-for-purpose analytics: analyze a variety of data types in their native format – text, geospatial, time series, video, audio, and more.

Enterprise Class: open source enhanced for reliability, performance, and security; high-performance warehouse software and appliances; ease of use with end-user, admin, and development UIs.

Integration: integration into your Information Management architecture; pre-integrated analytic applications.

Page 11: IBM Streams at the Hadoop User Group

Stream Computing: what is it good for? Analyze all your data, all the time, just in time.

[Diagram: traditional data, sensor events, and signals flow into analysis that feeds threat-prevention systems, logging, active response, and storage/warehousing; analytic results gain more context and raise alerts.]

What if you could get IMMEDIATE insight?

What if you could analyze MORE kinds of data?

What if you could do it with exceptional performance?

Page 12: IBM Streams at the Hadoop User Group

What is Stream Processing?

Relational databases and warehouses find information stored on disk

Stream computing analyzes data before you store it

Databases find the needle in the haystack

Streams finds the needle as it’s blowing by

Page 13: IBM Streams at the Hadoop User Group

Without Streams, application developers must hand-build:
• Intensive scripting, embedded SQL
• File / storage management by hand
• Record management embedded in application code
• Data buffering and locality
• Security
• Dynamic application composition
• High availability
• Application management (checkpointing, performance optimization, monitoring, workload management, error and event handling)
• Applications tied to specific hardware and infrastructure
• Multithreading, multiprocessing
• Debugging
• Migration from development to production
• Integration of best-of-breed commercial tools
• Code reusability
• Source / target interfaces

With Streams: the Streams runtime provides your application infrastructure, and Streams provides a productive and reusable development environment.

“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language.”
– Alex Philp, TerraEchos

Page 14: IBM Streams at the Hadoop User Group

Streams

Page 15: IBM Streams at the Hadoop User Group

How Streams Works

Continuous ingestion, continuous analysis.

Achieve scale:
• By partitioning applications into software components
• By distributing across stream-connected hardware hosts

The infrastructure provides services for scheduling analytics across hardware hosts and establishing streaming connectivity.

Example operations in the flow: Transform, Filter / Sample, Classify, Correlate, Annotate.

Where appropriate, elements can be fused together for lower communication latency.
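To make the fusion point concrete, here is a minimal sketch (the operator names, attribute, and colocation tag are made up) of the SPL placement directives that ask the compiler to fuse two operators into one partition (PE):

    // Sketch only: Raw is an assumed stream<int32 x> declared elsewhere
    stream<int32 x> Filtered = Filter(Raw) {
        param filter : x > 0;
        config placement : partitionColocation("grp1");
    }

    stream<int32 x> Scored = Functor(Filtered) {
        output Scored : x = 2 * x;
        config placement : partitionColocation("grp1");  // same tag => same PE
    }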

Page 16: IBM Streams at the Hadoop User Group

Scalable Stream Processing

Streams programming model: construct a graph
– A mathematical concept: not a line, bar, or pie chart! Also called a network. Familiar: for example, a tree structure is a graph.
– Consisting of operators and the streams that connect them: the vertices (nodes) and edges of the mathematical graph. A directed graph: the edges have a direction (arrows).

Streams runtime model: distributed processes
– Single or multiple operators form a Processing Element (PE).
– Compiler and runtime services make it easy to deploy PEs on one machine, or across multiple hosts in a cluster when scaled-up processing is required.
– All links and data transport are handled by runtime services: automatically, with manual placement directives where required.
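As a minimal sketch of this model (the operator parameters are standard, but the application and file names are made up), the following SPL declares a two-operator directed graph; naming Numbers as FileSink's input is what creates the edge:

    composite TinyGraph {
        graph
            // Vertex 1: a source operator that emits one tuple per second
            stream<uint64 n> Numbers = Beacon() {
                param period : 1.0;
                output Numbers : n = IterationCount();
            }

            // Vertex 2: a sink; referring to Numbers by name draws the edge
            () as Writer = FileSink(Numbers) {
                param file : "numbers.csv";
                      format : csv;
            }
    }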

Page 17: IBM Streams at the Hadoop User Group

InfoSphere Streams Objects: Runtime View

Instance
– Runtime instantiation of InfoSphere Streams executing across one or more hosts
– Collection of components and services

Processing Element (PE)
– Fundamental execution unit that is run by the Streams instance
– Can encapsulate a single operator or many “fused” operators

Job
– A deployed Streams application executing in an instance
– Consists of one or more PEs

[Diagram: an instance contains a job whose streams (Stream 1–5) connect operators inside PEs, which are distributed across nodes.]

Page 18: IBM Streams at the Hadoop User Group

InfoSphere Streams Objects: Development View

Operator
– The fundamental building block of the Streams Processing Language
– Operators process data from streams and may produce new streams

Stream
– An infinite sequence of structured tuples
– Can be consumed by operators on a tuple-by-tuple basis or through the definition of a window

Tuple
– A structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type

Stream type
– Specification of the name and data type of each attribute in the tuple

Window
– A finite, sequential group of tuples
– Based on count, time, attribute value, or punctuation marks

[Diagram: a Streams application in which tuples such as {directory: "/img", filename: "farm"} flow along a stream into an operator that emits tuples such as {height: 640, width: 480, data: …}.]
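To illustrate the window concept, here is a sketch (the stream names and attributes are hypothetical) of a count-based tumbling window on the standard Aggregate operator:

    // Sketch: Quotes is assumed to be a stream<rstring item, float64 price>
    stream<rstring item, float64 avgPrice> Stats = Aggregate(Quotes) {
        window Quotes : tumbling, count(100);  // a new window every 100 tuples
        param  groupBy : item;
        output Stats : avgPrice = Average(price);
    }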

Page 19: IBM Streams at the Hadoop User Group

What is Streams Processing Language?

Designed for stream computing
– Define a streaming-data flow graph
– Rich set of data types to define tuple attributes

Declarative
– Operator invocations name the input and output streams
– Referring to streams by name is enough to connect the graph

Procedural support
– Full-featured C++/Java-like language
– Custom logic in operator invocations
– Expressions in attribute assignments and parameter definitions

Extensible (see the sketch below)
– User-defined data types
– Custom functions written in SPL or a native language (C++ or Java)
– Custom operators written in SPL
– User-defined operators written in C++ or Java
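As a sketch of the extensibility hooks (the names and logic are made up), a user-defined tuple type and a custom SPL function:

    // Sketch: a shared type lives in a composite's type clause
    composite TypesDemo {
        type
            // referenced elsewhere as TypesDemo.PersonT
            static PersonT = tuple<rstring name, uint32 age>;
    }

    // A custom SPL function, usable in any expression (e.g., a filter)
    boolean isAdult(uint32 age) {
        return age >= 21u;
    }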

Page 20: IBM Streams at the Hadoop User Group

Some SPL Terms

An operator represents a class of manipulations
– of tuples from one or more input streams
– to produce tuples on one or more output streams

A stream connects to an operator on a port
– an operator defines input and output ports
– for example, an Aggregate operator reads Employee Info on an input port and produces Salary Statistics on an output port

An operator invocation
– is a specific use of an operator
– with specific assigned input and output streams
– with locally specified parameters, logic, etc.

Many operators have one input port and one output port; others have
– zero input ports: source adapters, e.g., TCPSource
– zero output ports: sink adapters, e.g., FileSink
– multiple output ports, e.g., Split
– multiple input ports, e.g., Join

A composite operator is a collection of operators
– An encapsulation of a subgraph of
  • Primitive operators (non-composite)
  • Composite operators (nested)
– Similar to a macro in a procedural language

Page 21: IBM Streams at the Hadoop User Group

Composite Operators

Every graph is encoded as a composite
– A composite is a graph of one or more operators
– A composite may have input and output ports
– Source code construct only
  • Nothing to do with operator fusion (PEs)

Each stream declaration in the composite invokes
– a primitive operator, or
– another composite operator

An application is a main composite
– No input or output ports
– Data flows in and out, but not on streams within a graph
– Streams may be exported to and imported from other applications running in the same instance

    composite Main {
        graph
            stream … { }
            stream … { }
            …
    }
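A sketch (the types and names are illustrative) of a non-main composite with typed ports, plus the stream declaration that invokes it like any other operator:

    // Sketch: a reusable composite with one output port and one input port
    composite Doubler(output stream<int32 x> Out; input stream<int32 x> In) {
        graph
            stream<int32 x> Out = Functor(In) {
                output Out : x = 2 * In.x;
            }
    }

    // Invocation expands the composite's subgraph in place (Raw assumed declared)
    stream<int32 x> Doubled = Doubler(Raw) { }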

Page 22: IBM Streams at the Hadoop User Group

Anatomy of an Operator Invocation

Operators share a common structure; <> marks the sections to fill in.

Reading an operator invocation:
– Declare a stream stream-name
– with attributes from stream-type
– that is produced by MyOperator
– from the input(s) input-stream
– MyOperator behavior is defined by logic, windowspec, parameters, and configuration; output attribute assignments are specified in output

Syntax:

    stream<stream-type> stream-name = MyOperator(input-stream; …) {
        logic  logic;
        window windowspec;
        param  parameters;
        output output;
        config configuration;
    }

For the example:
– Declare the stream Sale with the attribute item, which is a raw string (rstring)
– Join the Bid and Ask streams with sliding windows of 30 seconds on Bid and 50 tuples on Ask
– Match when the items are equal and the Bid price is greater than or equal to the Ask price
– Output the item value on the Sale stream

Example:

    stream<rstring item> Sale = Join(Bid; Ask) {
        window Bid : sliding, time(30);
               Ask : sliding, count(50);
        param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
        output Sale  : item = Bid.item;
    }

Page 23: IBM Streams at the Hadoop User Group

Streams V2.0 Data Types

(any)
• (primitive)
  – boolean, enum, timestamp, blob
  – (string): rstring, ustring
  – (numeric)
    • (integral) (signed): int8, int16, int32, int64
    • (integral) (unsigned): uint8, uint16, uint32, uint64
    • (floatingpoint) (float): float32, float64, float128
    • (floatingpoint) (decimal): decimal32, decimal64, decimal128
    • (complex): complex32, complex64, complex128
• (composite)
  – (collection): list, set, map
  – tuple

Page 24: IBM Streams at the Hadoop User Group

Stream and Tuple Types

Stream type (often called “schema”)
– Definition of the structure of the data flowing through the stream

Tuple type definition
– tuple<sequence of attributes>, e.g., tuple<uint16 id, rstring name>
  • Attribute: a type and a name
  • Nesting: any attribute may be another tuple type

A stream type is a tuple type
– stream<sequence of attributes>, e.g., stream<uint16 id, rstring name>

Indirect stream type definitions
– Fully defined within the output stream declaration:

    stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) …

– Reference a tuple type:

    CallInfo = tuple<uint32 callerNum, … rstring endTime, list<uint32> mastIDs>;
    stream<CallInfo> InternationalCalls = Op(…) {…}

– Reference another stream:

    stream<uint32 callerNum, … rstring endTime, list<uint32> mastIDs> Calls = Op(…) …
    stream<Calls> RoamingCalls = Op(…) {…}

Page 25: IBM Streams at the Hadoop User Group

Collection Types

list: array with bounds-checking, e.g., [0, 17, age-1, 99]
– Random access: can access any element at any time
– Ordered, base-zero indexed: the first element is someList[0]

set: unordered collection, e.g., {"cats", "yeasts", "plankton"}
– No duplicate element values

map: key-to-value mappings, e.g., {"Mon":0, "Sat":99, "Sun":-1}
– Unordered

Use type constructors to specify the element type
– list<type>, set<type>, e.g., list<uint16>, set<rstring>
– map<key-type,value-type>, e.g., map<rstring[3],int8>

Can be nested to any number of levels
– map<int32, list<tuple<ustring name, int64 value>>>
– {1 : [{"Joe",117885}, {"Fred",923416}], 2 : [{"Max",117885}], -1 : []}

Bounded collections optimize performance
– list<int32>[5]: at most 5 (32-bit) integer elements
– Bounds also apply to strings: rstring[3] has at most 3 (8-bit) characters
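A short sketch (the function name and values are made up) exercising these collection types inside an SPL function:

    void collectionDemo() {
        list<int32> ages = [21, 34, 55];
        set<rstring> species = {"cats", "yeasts", "plankton"};
        map<rstring, int32> dayIndex = {"Mon": 0, "Sat": 99, "Sun": -1};

        int32 first = ages[0];                  // lists are base-zero indexed
        int32 members = size(species);          // size() works on any collection
        boolean hasMon = has(dayIndex, "Mon");  // key membership test
    }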

Page 26: IBM Streams at the Hadoop User Group

The Functor Operator

Transforms input tuples into output tuples
– One input port
– One or more output ports

May filter tuples
– Parameter filter: a boolean expression
– If true, emit the output tuple; if false, do not

Arbitrary attribute assignments
– Full-blown expressions, including function calls
– Drop, add, transform attributes
– Omitted attributes are auto-assigned

Custom logic supported
– logic clause; may include state
– Applies to filter and assignments

Example: Person (name, age, salary) flows into a Functor that produces Adult (name, age, login, info):

    stream<rstring name, uint32 age, uint64 salary> Person = Op(…) {}

    stream<rstring name, uint32 age, rstring login,
           tuple<boolean young, boolean rich> info>
        Adult = Functor(Person) {
            param
                filter : age >= 21u;
            output Adult :
                login = lower(name),
                info  = {young = (age < 30u),
                         rich  = (salary > 100000ul)};
        }

Page 27: IBM Streams at the Hadoop User Group

The FileSink Operator

Writes tuples to a file

Has a single input port
– No output port: data goes to a file, not a Streams stream

Selected parameters
– file
  • Mandatory
  • The base for relative paths is the data subdirectory
  • Directories must already exist
– flush
  • Flush the output buffer after a given number of tuples
– format
  • csv: comma-separated values
  • txt, line, binary, block

    () as Sink = FileSink(StreamIn) {
        param
            file   : "/tmp/people.dat";
            format : csv;
            flush  : 20u;
    }

Page 28: IBM Streams at the Hadoop User Group

Communication Between Streams Applications

Streams jobs exchange data with the outside world
– Source- and sink-type operators
– These can also be used between Streams jobs (e.g., TCPSource/TCPSink)

Streams jobs can exchange data with each other
– Within one Streams instance

Supports Dynamic Application Composition
– By name or based on properties (tags)
– One job exports a stream; another imports it

Implemented using two pseudo-operators: Export and Import

[Diagram: in Job 1, source → operator → Export; the exported stream is imported by Job 2: Import → operator → sink.]

Page 29: IBM Streams at the Hadoop User Group

Application Design – Dynamic Stream Properties

API available for toolkit development

Can add/modify/delete
– Exported stream properties
– The imported stream's subscription expression

Dynamic Job Flow Control Bus pattern
– Operators within jobs interpret control-stream tuples
– Rewire the flow of data from job to job

[Diagram: an exported control stream carries flow-control tuples to Jobs A–D; the data stream is first routed through [A,B,C], then rewired to [A,C,D].]

Page 31: IBM Streams at the Hadoop User Group

Application Design – Multi-job Design

Application / job decomposition
– Dynamic job submission + stream Import / Export

[Diagram, built up over several slides, all within Streams instance stream1:]
– Job imagefeeder: DirectoryScan → ImageSource; exports the image + file metadata stream with properties: name = "Feed", type = "Image", write = "ok"
– Job imagewriter: imports with subscription: type == "Image" && write == "ok"; a Functor adds timestamp + filename, then ImageSink and FileSink write the results
– Job greyscaler: imports with subscription: name == "Feed"; exports greyscale images with properties: name = "Grey", type = "Image", write = "ok"
– Further jobs (resizer, facial scan, alerter) subscribe to the same feeds, and additional imagefeeder/imagewriter jobs can be submitted dynamically

Page 36: IBM Streams at the Hadoop User Group

Two Styles of Export/Import

Publish and subscribe (recommended approach):
– The exporting application publishes a stream with certain properties
– The importing stream subscribes to an exported stream whose properties satisfy a specified condition

Point to point:
– The importing application names a specific stream of a specific exporting application

Dynamic publish and subscribe:
– Export properties and Import expressions can be altered during the execution of a job
– Allows dynamic data flows
– Alter the flow of data based on the data (history, trends, etc.)

    () as ImageStream = Export(ImagesIn) {
        param
            properties : { streamName = "ImageFeed",
                           dataType   = "IplImage",
                           writeImage = "true" };
    }

    stream<IplImage image, rstring filename, rstring directory> ImagesIn = Import() {
        param
            subscription : dataType == "IplImage" && writeImage == "true";
    }
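For the point-to-point style, a sketch (the application, stream, and attribute names are made up; verify the exact Export/Import parameter names against your release's documentation):

    // Sketch: export a stream under an explicit stream ID ...
    () as NamedOut = Export(Results) {
        param streamId : "Results";
    }

    // ... and import it by naming the producing application and stream
    stream<rstring msg> RemoteResults = Import() {
        param
            applicationName : "my.namespace::MainApp";
            streamId        : "Results";
    }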

Page 37: IBM Streams at the Hadoop User Group

Parallelization Patterns – Introduction

Problem statement
– A series of operations is to be performed on a piece of data (a tuple)
– How can the performance of these operations be improved?

Key question
– Reduce latency? (for a single piece of data)
– Increase throughput? (for the entire data flow)

Three possible design patterns
– Serial path (pipeline)
– Parallel operators (task parallelization)
– Parallel paths (data parallelization)

Page 38: IBM Streams at the Hadoop User Group

Parallelization Patterns – Pipeline, Task

Pipeline (serial path): A → B → C → D
– Base pattern: inherent in the graph paradigm
– Results arrive at D in time T(A) + T(B) + T(C)

Parallel operators (task parallelization): A, B, C in parallel → M → D
– Process the tuple in operators A, B, and C at the same time
– Requires a merger M (e.g., Barrier) before operator D
– Results arrive at D in time max(T(A), T(B), T(C)) + T(M)
– Use when the tuple latency requirement is < T(A) + T(B) + T(C)
– The complexity of the merger depends on the behavior of operators A, B, and C
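A quick worked example with assumed timings: if T(A) = 4 ms, T(B) = 3 ms, T(C) = 5 ms, and T(M) = 1 ms, the pipeline delivers a tuple to D after 4 + 3 + 5 = 12 ms, while the task-parallel layout delivers it after max(4, 3, 5) + 1 = 6 ms.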

Page 39: IBM Streams at the Hadoop User Group

Parallelization Patterns – Parallel Pipelines

Parallel pipelines (data parallelization): several A → B → C pipelines side by side, merging into D
– Migration step from the pipeline pattern
– Can improve throughput
  • Especially good for variable-size data / processing time

Design decisions
– Are there latency and/or throughput requirements?
– Do the operators perform filtering, feature extraction, transformation?
– Is there an execution-order requirement?
– Is there a tuple-order requirement?

Recommendation: start with a pipeline and move to parallel pipelines when possible.

Page 40: IBM Streams at the Hadoop User Group

Application Design – Multi-tier Design

N-tier design
– The number and purpose of tiers is a result of application design

Create well-defined interfaces between the tiers

Supports several overarching concepts
– Incremental development / testing
– Application / job / operator reuse
– Modular programming practices

Each tier in these examples may be made up of one or more jobs (programs)

Example tier chains:
– Transport Adaptation → Ingestion → Reduction → Processing / Analytics → Transformation → Transport Adaptation
– Transport Adaptation → Ingestion → Processing / Analytics → Transport Adaptation

Page 41: IBM Streams at the Hadoop User Group

Application Design – High Availability

HA application design pattern
– The source job exports its stream, enriched with a tuple ID
– Jobs 1 & 2 process in parallel and export the final streams
– The sink job imports both streams, discards duplicates, and alerts on missing tuples

[Diagram: a Source job fans out to parallel copies of Job 1 and Job 2 across host pools 1–4 of x86 hosts; a Sink job merges their outputs.]
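A sketch of the sink-side duplicate discard (the stream names and attributes are made up; DeDuplicate is the standard-toolkit operator listed on page 44, and its exact parameter set should be checked against your release):

    // Sketch: drop any repeated tuple ID seen within the last 60 seconds
    stream<uint64 id, rstring payload> Deduped = DeDuplicate(Merged) {
        param
            key     : id;
            timeOut : 60.0;
    }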

Page 43: IBM Stream au Hadoop User Group

© 2011 IBM Corporation

IBM InfoSphere Streams

Eclipse IDE

Streams Live Graph

Streams Debugger

Agile Development Environment

Distributed Runtime Environment

Sophisticated Analytics with Toolkits & Adapters

Clustered runtime for massive scalability

RHEL v5.x and v6.x, CentOS v6.x

x86 & Power multicore hardware

Ethernet & InfiniBand

Toolkits Database Mining Financial Standard Internet BigData

• HDFS• DataExplorer

Over 50 samples

Front Office 3.0

Advanced Text Geospatial Timeseries Messaging ... User-defined

Page 44: IBM Streams at the Hadoop User Group

Toolkits and Operators to Speed and Simplify Development

Standard Toolkit (the default operators shipped with the product)
– Relational operators: Filter, Sort, Functor, Join, Punctor, Aggregate
– Adapter operators: FileSource, FileSink, DirectoryScan, TCPSource, TCPSink, UDPSource, UDPSink, Export, Import, MetricsSink
– Utility operators: Custom, Beacon, Throttle, Delay, Barrier, Pair, JavaOp, Split, DeDuplicate, Union, ThreadedSplit, DynamicFilter, Gate

Internet Toolkit
– InetSource: HTTP, HTTPS, FTP, FTPS, RSS, file

Database Toolkit
– ODBCAppend, ODBCEnrich, ODBCSource, SolidDBEnrich, DB2SplitDB, DB2PartitionedAppend
– Supports: DB2 LUW, IDS, solidDB, Netezza, Oracle, SQL Server, MySQL

Other toolkits: Financial, Data Mining, Big Data, Text, …

User-defined toolkits
– Extend the language by adding user-defined operators and functions
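As a sketch of composing two standard-toolkit adapters (the directory path is made up; the parameters shown are the operators' usual ones):

    // Sketch: scan a directory for new files, then read each file line by line
    stream<rstring file> Files = DirectoryScan() {
        param directory : "/data/incoming";
    }

    stream<rstring line> Lines = FileSource(Files) {
        param format : line;
    }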

Page 45: IBM Streams at the Hadoop User Group

User-Defined Toolkits

Streams supports toolkits
– Reusable sets of operators and functions
– What can be included in a toolkit?
  • Primitive and composite operators
  • Native and SPL functions
  • Types
  • Tools, documentation, samples, data, etc.
– Versioning is supported
– Define dependencies on other versioned assets (toolkits, Streams)
– Create cross-domain and domain-specific accelerators

Page 46: IBM Streams at the Hadoop User Group

Page 47: IBM Streams at the Hadoop User Group

A quick peek inside: InfoSphere Streams Instance – Single Host

Management services & applications, all on one host:
– Streams Web Service (SWS)
– Streams Application Manager (SAM)
– Streams Resource Manager (SRM)
– Authorization and Authentication Service (AAS)
– Scheduler
– Name Server
– Recovery DB
– File System
– Host Controller and Processing Element Containers

Page 48: IBM Streams at the Hadoop User Group

A quick peek inside: InfoSphere Streams Instance – Multi-host, Management Services on a Separate Node

Management host:
– Streams Web Service (SWS), Streams Application Manager (SAM), Streams Resource Manager (SRM), Authorization and Authentication Service (AAS), Scheduler, Name Server, Recovery DB

Application hosts (each):
– Host Controller
– Processing Element Containers

All hosts share a common file system.

Page 49: IBM Streams at the Hadoop User Group

A quick peek inside: InfoSphere Streams Instance – Multi-host, Management Services on Multiple Hosts

Management services spread across dedicated hosts:
– Streams Web Service (SWS)
– Streams Application Manager (SAM)
– Streams Resource Manager (SRM)
– Authorization and Authentication Service (AAS)
– Scheduler
– Name Server
– Recovery DB

Application hosts (each):
– Host Controller
– Processing Element Containers

All hosts share a common file system.