Stream Computing for Data Fusion and Computational Intelligence · PDF fileStream Computing...

42
IBM Research Stream Computing for Data Fusion and Stream Computing for Data Fusion and Computational Intelligence Computational Intelligence Chitra Venkatramani Chitra Venkatramani IBM T.J. Watson Research Center IBM T.J. Watson Research Center

Transcript of Stream Computing for Data Fusion and Computational Intelligence · PDF fileStream Computing...

IBM Research

Stream Computing for Data Fusion and Stream Computing for Data Fusion and Computational IntelligenceComputational Intelligence

Chitra VenkatramaniChitra VenkatramaniIBM T.J. Watson Research CenterIBM T.J. Watson Research Center

IBM Research

Outline

� The data deluge problem

� The stream computing paradigm and Streams

�Related Research Activities

IBM Research

The world is getting more instrumented and interconneted…

Volume In 2009, American drones sent back 24 years worth of video. This year, mankind will create 1,200 exabytes of data. Soon, the codified information base of the world is expected to double every 11 hours.

Variety 80% of new data growth is unstructured content, generated largely by email, with increasing contribution by documents, images, and video and audio

VelocityAn average company with 1,000 employees spends $5.3 million a year to find information stored on its servers. Data is available from static as well as continuous sources

3

IBM Research

The Need for New Intelligence is Everywhere…

Stock market• Impact of weather on

securities prices• Analyze market data at

ultra-low latencies

Fraud prevention• Detecting multi-party fraud• Real time fraud prevention

Radio Astronomy• Detection of transient events

Health & Life Sciences• Neonatal ICU monitoring• Epidemic early warning system• Remote healthcare monitoring

Transportation• Intelligent traffic

management

Law Enforcement• Real-time multimodal surveillance

Manufacturing• Process control for

microchip fabrication

Natural Systems• Seismic monitoring

• Wildfire management• Water management

IBM Research

The Need for New Intelligence brings New Challenges

•High Volume of data: faster than a database can handle

•Complex Analytics:: correlation from multiple sources and/or signals; video, audio or other non-relational data types

•Time Sensitive: low-latency responses for real-time decision making

IBM Research

Application Requirements

� Streaming Data

� Correlation

– Connect the dots

– Reinforce existing knowledge

– Detect anamolies

� Hypothesis-driven

– Morphable Applications

– Feedback and control

� Complex analytics

– Structured and unstructured data

– Resource Adaptive

� High performance

– Throughput, Latency

– Scalable to keep up with the data-rates

– Leverage advances in computation and communication

IBM Research

7

Stream Computing Illustrated

Continuous Ingestion Continuous Complex Analysis

IBM Research

System S : Commercialized as InfoSphere Streams

Streams Runtime

StreamsApplication

Database/data warehouse

…minimizing time to react

…extracting and organizing

information and intelligence

…processing data as it is

continuously generated

Streams applications are composed

of analytic operators…

� A stream computing software platform to enable better analysis ofstructured and unstructured data for faster, more informed, anddifferentiated decision making

IBM Research

9

Application Model

Transform

Filter Classify

Video News

CaptionExtraction

TopicFiltration

Speech Recognition

EarningsRelatedNews

Analysis

CaptionExtraction

TopicFiltration

Speech Recognition

Earnings

RelatedNews

AnalysisVideo News

Caption

Extraction

TopicFiltration

Speech extraction

EarningsRelated

NewsAnalysis

Earnings

MovingAverage

Calculation

HurricaneWeather

DataExtractionWeather Data

SEC Edgar

10 Q Earnings

Extraction

VWAPCalculation

NYSEDynamic P/E Ratio

Calculation

Hurricane

Impact

Join P/E with

Aggregate Impact

HurricaneIndustryImpact

Trade

Decision

Correlate

Annotate

HurricaneRisk

Encoder

EarningsNewsJoin

Data Rate Analysis Complexity

IBM Research

10

Surveillance

‘Applications

Smart Traffic

Smarter Telecom

Neonatal Care

IBM Research

“Who’s Talking to Whom” Application

� Framework: VoIP Network

– PSTN Networks and gateways

– K sniffers

PSTN N

etw

ork

Gateway

GatewayPS

TN

Netw

ork

B����A

A����B

Distillery System

Approach• Pair streams • Detect speakers of interest• Join asynchronous events• Denoise

P1

P2

� Workload

– 679 speakers

– 2,000+ conversations

– 300+ GSM concurrent streams

• Tough speech corpus• Very low bit rate compression• Noisy VAD output

– 15,000+ GSM frames per second

IBM Research

Incremental Audio Analysis

……

Processing complexity

Volumetric

• Packet size• Packet inter-arrival rate

IBM Research

Volumetric

• Packet size• Packet inter-arrival rate

Packet fields

• Energy• Pitch

……

Processing complexity

Incremental Audio Analysis

IBM Research

Volumetric

• Packet size• Packet inter-arrival rate

Packet fields

• Energy• Pitch

Compressed-Domain

• Speech Cepstrum

Cepstrum

f(.)f(.)

f(.)

……

Processing complexity

Incremental Audio Analysis

IBM Research

Volumetric

• Packet size• Packet inter-arrival rate

Packet fields

• Energy• Pitch

Raw Domain

• Speech Cepstrum

raw Cepstrum

Compressed-Domain

• Speech Cepstrum

Cepstrum

f(.)f(.)

f(.)

……

Processing complexity

Incremental Audio Analysis

IBM Research

Analysis Stages

Stream A

Stream B

Stream C

Stream D

Speaker Detection

Olivier Mihalis

Ching-Yung Upendra

talks to

talks to

Deepak

Conversation Pairing

A B

C D

talks to

talks to

E

After denoising

- Just-in-time

- Features: Volumetrics

- Very high accuracy

- Very low complexity

- Robust to noise

- Just-in-time

- Features: GSM domain

- High accuracy

- Moderate complexity

- Robust to noise

- Social network

- Fusion technique

- Iterative method

Denoising & Social Network Analysis

IBM Research

Smarter Telco Services

� Low-latency data processing

– Mediation, summarization, monitoring, preprocessing

� Real-time services

– Context based advertizing

– Real-time campaign management

� Online data analysis and learning

– Churn prediction, model building

Customer Requirements

� Processing of CDR data using Streams

– Extend to other data types

� Offline data exploration (Warehouse, Hadoop, SNAzzy, TABI)

– Tight integration for automated interactions

� Model scoring and incremental model learning on Streams

Solution

IBM Research

Current Architecture

Business logic

• CDR collection• CDR Filtering • CDR stitching

Data miningWarehouseSNAzzy, TABI

Hours to days!

Data warehouse

Business Intelligence

Telc

o D

ata

Cloud

Support

data

IBM Research

Stream Processing Architecture

Stream Processing

CDR collectionCDR Filtering CDR stitching

Data

summaries

Data MiningScoring Engine

�� ��

Online Learning with Offline Analytics

Business Intelligence

Real-time dashboards

Data warehouse

Mediation

Real-time Services

Online Monitoring

Telc

o D

ata

Co

nte

xt

Weather

GPS Location, Transactions,

Personal Health Monitor etc.

Cloud

Support

data

Data miningWarehouseSNAzzy, TABI

Data warehouse

Cloud

IBM Research

Efficient Traffic Management

� Multimodal Data Streams– GPS

– Cell-phones (location tracking),

– Public Transport (bus, docking),

– Pollution measurements,

– Weather Conditions (including road conditions)

– Optical traffic flow detectors,

– Travel time data based on plate recognition,

– Induction loop detector data,

– Accidents in network as they are being recorded,

– Road closures (road work, etc),

– Still pictures from road cameras.

� Real Time Traffic Monitoring

� Real Time Traffic Information

� (Multimodal) Travel Planner

GPSData

Streams

Real Time Transformation

Logic

Real Time Geo

Mapping

Real Time Speed & Heading

Estimation

Real Time Aggregates & Statistics

DataWarehouseWeb

Server

GoogleEarth

Offlinestatisticalanalysis

Interactivevisualization

Storageadapters

Only 4 x86 Blade servers to process

250,000250,000250,000250,000 GPS probes per second, maps of 630,000 line per second, maps of 630,000 line per second, maps of 630,000 line per second, maps of 630,000 line segments

IBM Research

Real Time Traffic Monitoring

IBM Research

Real Time Traffic Information

IBM Research

ITS Application Flow-graph (125k GPS/second)

GPS

source

ISO 8601 to

timestamp

Time of day, day

of week month of

year

Filter taxis and

invalid values

GeoMatch

GpsTrack

Clean Sort Adapt

Travel

times

JoinQuery

5min

Aggr.

KML

Color

map

DSP

Inter-quartiles

(IQ) week-ends

IQ

IQ

Title

Current

Title

IBM Research

2424

Electricity Grid in U.S. Penetrated By SpiesWall Street Journal, April 08, 2009

� Cyberspies have penetrated the U.S. electrical grid and left behind software programs that could be used to disrupt the system

� The growing reliance of utilities on Internet-based communication has increased the vulnerability of control systems to spies and hackers

� It is nearly impossible to know who is attacking because of the difficulty in tracking true identities in cyberspace

Vast Spy System Loots Computers

in 103 CountriesThe New York Times, March 28, 2009

Ghostnet Toolset —Back Door at the Click of a Button

Biggest Botnets January 2009: (Wikipedia et al)

� Cutwail: 175,000 infected machines, Type: HTTP Encrypted; Purpose: Spam, Malware: Trojan/Rootkit

� Rustock: 130,000 active members per 24 hour period; Type: HTTP Encrypted; Purpose: Spam; Malware: Trojan/Rootkit

� Donbot: Size: 125,000 active members per 24 hour period, Type: Custom TCP, Purpose: Spam, Download; Malware: Trojan

� Ozdok, Xarvester, Grum, Gheg, Cimbot, Waledac, …

Biggest Botnets January 2009: (Wikipedia et al)

� Cutwail: 175,000 infected machines, Type: HTTP Encrypted; Purpose: Spam, Malware: Trojan/Rootkit

� Rustock: 130,000 active members per 24 hour period; Type: HTTP Encrypted; Purpose: Spam; Malware: Trojan/Rootkit

� Donbot: Size: 125,000 active members per 24 hour period, Type: Custom TCP, Purpose: Spam, Download; Malware: Trojan

� Ozdok, Xarvester, Grum, Gheg, Cimbot, Waledac, …

Cyber-Security – Botnet Detection

IBM Research

25

Independent

Reports

Current Solutions: Limitations

IDS

Firewall

IPS/ADS

Sensors

DNS

LiveData

SecurityEvents

StaticConfigurationFiles

ISS/VSOC

ArcSight

…Partial Malware Signatureprevents safe detection

Partial Malware Signatureprevents safe detection

Missing: Botnet traffic profiles and behavioral anomaly detection

Real-time Correlation benefitso Evolving botnet detection

o Near real-time detection

o Characterizing traffic anomalies tominimize false positives

IBM Research

26

Temporal Anomalies,

Event / Destination

Correlations,Partial periodicity

Analytics ArchitectureIDS

Annotations

IBM SPSS / Warehouse

Firewall

IPS/ADS

Sensors

DNS

LiveData

ID & NAC

App/DB

Logs

Unsupervised Supervised

SecurityEvents

Self-t

uning

Feedba

ck

Manual T

uning

True Positives/

False Positives

• Feature extraction• Data representation• Raw data preprocessing• Aggregation, filtering

Derived Streams

•Behavioral Profiling- Incremental model learning

•Behavioral Anomaly Detection- Identification of suspects

• Detection of known signatures

PortScan

Synflood

HTTPScan

SQLInjection

Temporal RuleDiscovery

FeatureDiscovery

DiscriminativeSignatureExtraction

IBM Research

Neonatal Care

• Multiple devices are attached to the baby or humidicrib

• Medical devices output via serial port in a range of formats

• Indicative readings are recorded on paper every 30 or 60 minutes

• Correlation across multiple sources and episodic conditions make detection of early indicators difficult

http://preemie.info/cms/modules/news/

IBM Research

A 3-Way Collaboration

System and Analytics

IBM Research

IBM Research

The Data Baby Commercial

� Video

IBM Research

30

System Components

Language and

Compiler

Runtime

Environment

Tools and Technology Integration

Scalable stream processing runtime

Streamsight,Built-in Stream Relational Analytics,

AdaptersToolkits

Streams StudioEclipse IDE for SPL

)

Front Office 3.0Front Office 3.0

IBM Research

Language and Compiler

� Streams Processing Language

� Parallelization constructs

� Reuable Composite operators

� Incremental application composition

� Resource hints

� Compiler Framework

IBM Research

32

Compiler Framework

� Operator Fusion

– Fine-grained operators

– From small parts, make larger ones that fit

� Code generation

– Generates code to match the underlying runtime environment

• Number of cores

• Interconnect characteristics

• Architecture-specific instructions

– Driven by automatic profiling

– Compiler-based optimization

– Driven by incremental learning of application characteristics

Lo

gic

al a

pp

vie

wP

hys

ica

l ap

p v

iew

IBM Research

Runtime

� Distributed, Scalable

� Dynamic Application Composition

� Continuous and adaptive resource management

� Fault-tolerance

� Leverage advances in computation and

communication

– Multi-core

– 10GigE, Infiniband

IBM Research

SPADEsource

TCP-IP / EthernetPhysical Network

Streams Data Fabric

Runtime

X86 blade X86 blade X86 blade X86 blade X86 blade

PE PE

Sink

Source

Source PE

PE

PE PE

Sink

PE

Sink

PE

PE

PE

PE

PE

PE

PE

PE

Connections

Source

Sink

PE

SPADE

compilerStreams App.

manager

IBM Research

Tooling

� Application Development and Debugging

� Visualization

� Toolkits

– Stream-Relational

– Data mining scoring

– Time Series

– Graph Mining

– Financial

� Automated Application Composition

IBM Research

Analytics

� Early discard

� Incremental Information Exposure

� Incremental Analytics

� Online learning

� Resource-adaptive Analytics

IBM Research

Time to Action

SOURCES

WAREHOUSE/

REPORTSAd-hoc Queries

DATA INTEGRATIONOPERATIONAL DATA STORES

DATAMARTS

Buss Process & Event Mgmt

OperationalReports

Dashboards Planning

Scorecarding

Analytical Modeling & Information

Stream computing in the Data Management Eco-system

Reduces Time to ActionWidens the apertureReduces infrastructure costs

More context

Analytical Modeling & Information

Feedback and control

IBM Research

Current Research

� Cross-platform Integration

– High-speed Streaming Ingest into Hadoop/Databases/file-systems

– Workload aware cross-platform Scheduler

� Dynamic Resource Management

– Adaptive with dynamic operator fusion

– Streams on cloud, leveraging elastic resource model

IBM Research

Science of Analytics

� Addressing the ‘decision overload’ problem

– What analyses to perform and when to perform them?

– What methodologies, algorithms and combination of analytics to use?

– How to execute the analysis processes?

� Research to automate much of the analysis process in a principled, scientific way

– Knowledge representation and management

– Models, algorithms for data exploration

– Analytics Engineering and management

� Benefits

– Analysts focus on strategic, higher-order thinking

– Developers create new analytic algorithms and tools

IBM Research

Selected PublicationsR&D Magazine 2010 Top 100 Innovations Award Winner

Books

� C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.

Applications

� Deepak Turaga, and others. Design Principles for Developing Stream Processing Applications. Software - Practice and Experience Journal, Wiley SP&E. To Appear

� Alain Biem, Eric Bouillet, Hanhua Feng, Anand Ranganathan, Anton Riabov, Olivier Verscheure, Haris N. Koutsopoulos, Carlos Moran: IBM infosphere streams for scalable, real-time, intelligent transportation services. SIGMOD Conference 2010: 1093-1104

� Blount, M.; Ebling, M.R.; Eklund, J.M.; James, A.G.; McGregor, C.; Percival, N.; Smith, K.P.; Sow, D., “Real-Time Analysis for Intensive Care: Development and Deployment of the Artemis Analytic System”, IEEE Engineering in Medicine and Biology Magazine, March-April 2010, Vol 29, Issue 2.

� Olivier Verscheure, Michail Vlachos, Aris Anagnostopoulos, Pascal Frossard, Eric Bouillet, Philip S. Yu: Finding "Who Is Talking to Whom" in VoIP Networks via Progressive Stream Clustering. ICDM 2006: 667-677

� H. Tseng, O. Verscheure, D. S. Turaga and U. Chaudhari, "Optimal quantization for adapted GMM-based speaker verification," IEEE Interspeech 2007.

� D. S. Turaga, O. Verscheure, J. Wong, L. Amini, G. Yocum, E. Begle, B. Pfeifer, "Online FDC Control Limit Tuning with Yield Prediction using Incremental Decision Tree Learning," AEC/APC Symposium, 2007

� Ching-Yung Lin, Olivier Verscheure, Lisa Amini: Semantic Routing and Filtering for Large-Scale Video Streams Monitoring. ICME 2005: 1408-1411

� Deepak S. Turaga, Brian Foo, Olivier Verscheure, Rong Yan: Configuring topologies of distributed semantic concept classifiers for continuous multimedia stream processing. ACM Multimedia 2008: 289-298

� F. Fu, D. S. Turaga, O. Verscheure, M. Van der Schaar and L. Amini, “Configuring Competing Classifier Chains in Distributed Stream Mining Systems," IEEE J. Selected Topics in Signal Processing.

IBM Research

PublicationsLanguage, Runtime

� Robert Soulé, and others A Universal Calculus for Stream Processing Languages, European Symposium on Programming, ESOP, 2010.

� Rohit Khandekar and others. COLA: Optimizing Stream Processing Applications Via Graph Partitioning. ACM/IFIP/USENIX International Middleware Conference, Middleware, 2009.

� Gabriela Jacques-Silva and others. Language Level Checkpointing Support for Stream Processing Applications. International Conference on Dependable Systems and Networks. IEEE/IFIP DSN 2009.

� Xiaolan J. Zhang, and others. Implementing a High-Volume, Low-Latency Market Data Processing System on Commodity Hardware using IBM Middleware. Workshop on High Performance Computational Finance at SC09, Nov, 2009.

� Buğra Gedik, Henrique Andrade, and Kun-Lung Wu. A Code Generation Approach to Optimizing High-Performance Distributed Data Stream Processing. International Conference on Information and Knowledge Management, ACM CIKM, 2009.

� Scott Schneider, Henrique Andrade, Buğra Gedik, Alian Biem, and Kun-Lung Wu. Elastic Scaling of Data Parallel Operators in Stream Processing. International Parallel and Distributed Processing Symposium. IEEE IPDPS 2009.

� Shicong Meng, Srinivas R. Kashyap, Chitra Venkatramani, Ling Liu: REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems. ICDCS 2009

� N. Bansal, R. Bhagwan, N. Jain, Y. Park, D. S. Turaga, C. Venkatramani, "Towards Optimal Operator Placement in Partial-Fault Tolerant Applications", IEEE Infocom 2008, April, Phoenix, AZ

Tooling

� Wim De Pauw, Henrique Andrade, Lisa Amini: Streamsight: a visualization tool for large-scale streaming applications. SOFTVIS 2008: 125-134

� Buğra Gedik, Henrique Andrade, Andy Frenkiel, Wim De Pauw, Michael Pfiefer, Paul Allen, Norman Cohen, and Kun-Lung Wu. Debugging Tools and Strategies for Distributed Stream Processing Applications. Software - Practice and Experience Journal, Wiley SP&E. Volume 39 Issue 16, 2009.

� Eric Bouillet, Mark Feblowitz, Zhen Liu, Anand Ranganathan, Anton Riabov: A tag-based approach for the design and composition of information processing applications. OOPSLA 2008: 585-602

IBM Research

Thank You!