Blueprints for the analysis of social media


2015 © Trivadis


Blueprints for the analysis of social media

Java Lounge Zurich, May 2015

Guido Schmutz

Trivadis AG

May 2015 Blueprints for the analysis of social media


Guido Schmutz

§  Working for Trivadis for more than 18 years
§  Oracle ACE Director for Fusion Middleware and SOA
§  Co-author of several books
§  Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
§  Member of the Trivadis Architecture Board
§  Technology Manager @ Trivadis
§  More than 25 years of software development experience
§  Contact: [email protected]
§  Blog: http://guidoschmutz.wordpress.com
§  Twitter: gschmutz


A little story of a “real-life” customer situation

The traditional system interacts with its clients and does its work

Implemented using legacy technologies (e.g. PL/SQL)

New requirement:

•  Offer a notification service to notify customers when goods are shipped

•  Customers subscribe and are informed over different channels

•  The existing technology doesn’t fit

[Diagram: Logistic System on an Oracle DB with Logic (PL/SQL); Sensors, Mobile Apps and Rich (Web) Client Apps interact with it; events: schedule, sort, ship, delivery]

A little story of a “real-life” customer situation

Rule engine implemented in Java and invoked from an OSB message flow

Notification system is informed via a queue

Higher latency introduced (good enough in this case)

Events are “owned” by the traditional application (as well as the channels they are transported over), so other systems have to integrate with it in order to get the information!

Oracle Service Bus was already in place

[Diagram: the Logistic System (Oracle DB, Logic (PL/SQL), events: schedule, sort, ship, delivery) publishes ship and delivery events via AQ to the Oracle Service Bus; a Filter with a Rule Engine (Java) passes delivery events via JMS to the Notification system (Logic (Java)), which sends SMS and Email]

A little story of a “real-life” customer situation

[Diagram: same architecture as on the previous slide]

Treat events as first-class citizens

Events belong to the “enterprise” and not an individual system => Catalog of Events similar to Catalog of Services/APIs !!

Event (stream) processing can be introduced, which in turn reduces latency!

Stream/Event Processing?

Infrastructure for continuous data processing

Computational model can be as general as MapReduce but with the ability to produce low-latency results

Data collected continuously is naturally processed continuously

aka. Event Processing / Complex Event Processing (CEP)
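
To make the contrast with batch processing concrete, here is a minimal Python sketch that is not tied to any product. The function name `running_average` and the sample values are made up for illustration. It emits an updated result after every event instead of waiting for a complete data set:

```python
from typing import Iterable, Iterator


def running_average(events: Iterable[float]) -> Iterator[float]:
    """Yield the average of all events seen so far, one result per event."""
    total = 0.0
    count = 0
    for value in events:      # events may come from an unbounded source
        total += value
        count += 1
        yield total / count   # a result is available right after each event


# Hypothetical sample stream; a real event stream would never end.
results = list(running_average([10.0, 20.0, 30.0]))
print(results)  # [10.0, 15.0, 20.0]
```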


Agenda

1.  Designing Stream/Event Processing Solutions

2.  Implementing the Enterprise Event Bus (Unified Log)

3.  Implementing Stream Processing

4.  Unified Log (Event) Processing Architecture in Action


How to design a Stream Processing System?

It usually starts very simple … just one data pipeline

[Diagram: Event Stream → Collector → event → Consumer]

New Event Stream sources are added …

[Diagram: n Event Streams, each with its own Collector, all sending events to a single Consumer]

New Processors are interested in the events …

[Diagram: n Event Streams, each with its own Collector, now sending events to two Consumers]

… and the solution becomes the problem

[Diagram: n Event Streams, each with its own Collector, sending events point-to-point to n Consumers]


… and the solution becomes the problem

[Diagram: the event streams New Customers, Operational Logs, Click Stream and Meter Readings are read by a CDC Collector, Log Collector, Click Stream Collector and Sensor Collector, which send events point-to-point to the consumers Hadoop/Data Warehouse, Recommendation System, Log Search and Fraud Detection]

Decouple event streams from consumers

[Diagram: the Event Stream Sources (New Customers, Operational Logs, Click Stream, Meter Readings) with their collectors (CDC Collector, Log Collector, Click Stream Collector, Sensor Collector) publish to a central Enterprise Event Bus, the “Unified Log”; the Event Stream Processors (Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection) consume from it]

Remember the Enterprise Service Bus (ESB)?

What is the idea of a Unified Log?

Unified Log – What is it?

By Unified Log, we do not mean this …

137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809

… but this

•  a structured log (records are numbered from 0 in the order they are written)

•  a.k.a. commit log or journal

[Diagram: a log with records numbered 0 to 11; record 0 is the 1st record, the next record is written after record 11]

Central Unified Log for (real-time) subscription

Take all the organization’s data (events) and put it into a central log for subscription

Properties of the Unified Log:
•  Unified: “Enterprise”, single deployment
•  Append-Only: events are appended, no update in place => immutable
•  Ordered: each event has an offset, which is unique within a shard
•  Fast: should be able to handle thousands of messages / sec
•  Distributed: lives on a cluster of machines
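
The append-only and ordered properties can be sketched in a few lines of Python. This is an illustrative in-memory model of a single shard only. The class name `UnifiedLog` and its methods are assumptions made for this sketch and do not belong to any product; the "fast" and "distributed" properties are deliberately left out:

```python
from typing import Any, List, Tuple


class UnifiedLog:
    """Illustrative in-memory, append-only log (a single shard)."""

    def __init__(self) -> None:
        self._records: List[Any] = []

    def append(self, event: Any) -> int:
        """Append an event and return its offset; existing records never change."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> List[Tuple[int, Any]]:
        """Return (offset, event) pairs starting at the given offset."""
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = UnifiedLog()
for n in range(12):            # a collector writes offsets 0..11
    log.append(f"event-{n}")

# Each consumer keeps its own position, so consumers do not affect each other.
batch_a = log.read(6)          # consumer system A is at offset 6
batch_b = log.read(10)         # consumer system B is at offset 10
print([offset for offset, _ in batch_a])  # [6, 7, 8, 9, 10, 11]
print([offset for offset, _ in batch_b])  # [10, 11]
```

Because reading never removes anything from the log, a new consumer can be added later and start from offset 0 without any change to the collector.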

[Diagram: a Collector writes to the end of the log (records 0 to 11); Consumer System A reads at time = 6 and Consumer System B reads at time = 10]

Unified Log / Event Processing Architecture

Stream processing allows for computing feeds off other feeds

Derived feeds are no different from the original feeds they are computed from

Single deployment of the “Unified Log”, but logically different feeds
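
As a rough illustration of a derived feed, the following Python sketch computes a "meter by minute" feed from a raw meter readings feed. The record layout, the sample values and the function name `aggregate_by_minute` are assumptions made for this example:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Reading = Tuple[str, str, float]   # (meter_id, time, value)

# Raw feed with made-up values; time is "HH:MM:SS".
raw_meter_readings: List[Reading] = [
    ("m1", "11:58:10", 2.0),
    ("m1", "11:58:40", 3.0),
    ("m2", "11:58:55", 1.5),
    ("m1", "11:59:05", 4.0),
]


def aggregate_by_minute(feed: List[Reading]) -> List[Reading]:
    """Derive a feed that holds the sum of readings per meter and minute."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for meter_id, time, value in feed:
        minute = time[:5]                      # "11:58:10" -> "11:58"
        totals[(meter_id, minute)] += value
    return [(meter, minute, total) for (meter, minute), total in sorted(totals.items())]


meter_by_minute = aggregate_by_minute(raw_meter_readings)
print(meter_by_minute)
# [('m1', '11:58', 5.0), ('m1', '11:59', 4.0), ('m2', '11:58', 1.5)]
```

The derived feed has the same shape as the raw feed, so it could be written back to the log and consumed by further processors like any other feed.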

[Diagram: a Meter Readings Collector writes the Raw Meter Readings feed; an Enrich / Transform step and Aggregate by Minute steps derive the feeds Meter & Customer, Meter by Minute, Meter by Customer by Minute and Customer Aggregate by Minute; Persist steps store Raw Meter Readings and Meter by Minute]

Agenda

1.  Designing Stream Processing Solutions

2.  Implementing the Enterprise Event Bus (Unified Log)

3.  Implementing Stream Processing

4.  Unified Log (Event) Processing Architecture in Action


Apache Kafka - Overview

•  A distributed publish-subscribe messaging system

•  Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)

•  Initially developed at LinkedIn, now an Apache project

•  Does not follow JMS Standards and does not use JMS API

•  Kafka maintains feeds of messages in topics
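
The idea of a topic split into partitions can be modelled with a small Python sketch. This is a toy model and not the Kafka client API: the class `Topic`, its `send` method and the CRC32-based choice of partition are assumptions made for illustration.

```python
import zlib
from typing import List, Tuple


class Topic:
    """Toy model of a topic: a fixed number of append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def send(self, key: str, message: str) -> Tuple[int, int]:
        """Append the message to the partition chosen by the key.

        Returns (partition, offset). Hashing the key with CRC32 is an
        assumption of this sketch, not Kafka's actual partitioner.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1


topic = Topic("tweets", num_partitions=3)
first = topic.send("user-1", "hello")
second = topic.send("user-1", "world")

# Same key -> same partition; offsets grow by one within that partition.
print(first[0] == second[0], second[1] - first[1])  # True 1
```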

[Diagram: Producers write to a Kafka Cluster, Consumers read from it]

[Diagram: Anatomy of a topic: Partition 0 (offsets 0 to 12), Partition 1 (offsets 0 to 9) and Partition 2 (offsets 0 to 12), ordered from old to new; writes are appended at the new end]

http://kafka.apache.org/

Apache Kafka - Performance

Kafka at LinkedIn
•  10+ billion writes per day
•  172k messages per second (average)
•  55+ billion messages per day to real-time consumers

Up to 2 million writes/sec on 3 cheap machines
§  Using 3 producers on 3 different machines

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

Apache Kafka - Partition offsets

Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset

•  Consumers track their pointers via (offset, partition, topic) tuples
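
How a consumer group might keep its pointers can be sketched as follows. This is a simplified Python model, not the Kafka consumer API: the dictionaries `partitions` and `committed` and the function `poll` are hypothetical names, and the messages are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical topic with two partitions; the list index is the offset.
partitions: Dict[Tuple[str, int], List[str]] = {
    ("tweets", 0): ["a0", "a1", "a2"],
    ("tweets", 1): ["b0", "b1"],
}

# Pointers of consumer group C1: next offset to read per (topic, partition).
committed: Dict[Tuple[str, int], int] = {("tweets", 0): 1, ("tweets", 1): 0}


def poll(topic: str, partition: int) -> List[str]:
    """Read everything from the committed offset on, then advance the pointer."""
    key = (topic, partition)
    start = committed.get(key, 0)
    messages = partitions[key][start:]
    committed[key] = start + len(messages)
    return messages


first_poll = poll("tweets", 0)
second_poll = poll("tweets", 0)
other_partition = poll("tweets", 1)
print(first_poll)       # ['a1', 'a2']
print(second_poll)      # [] (nothing new since the last poll)
print(other_partition)  # ['b0', 'b1']
```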

[Diagram: Consumer group C1]

Unified Log Alternatives

•  Amazon Kinesis (http://aws.amazon.com/kinesis/)
•  Confluent (http://confluent.io/)
•  Redis Pub/Sub (http://redis.io/topics/pubsub)
•  Kestrel (http://robey.github.io/kestrel/)
•  ZeroMQ (http://zeromq.org/)
•  RabbitMQ (http://www.rabbitmq.com/)
•  Oracle GoldenGate (http://bit.ly/g-gate)
•  JMS compliant server
   •  Apache ActiveMQ (http://activemq.apache.org/)
   •  Weblogic JMS (http://www.oracle.com/technetwork/middleware/weblogic/overview/index.html)
   •  IBM Websphere MQ (http://www-03.ibm.com/software/products/de/ibm-mq)
   •  …

Apache Storm

A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
•  Highly distributed real-time computation system
•  Provides general primitives to do real-time computation
•  Simplifies working with queues & workers
•  Scalable and fault-tolerant

Originated at BackType, which was acquired by Twitter in 2011

Open-sourced in late 2011

Entered the Apache Incubator in September 2013, Apache top-level project since September 2014

https://storm.apache.org/

Apache Storm – Core concepts

Tuple
•  Immutable set of key/value pairs

Stream
•  An unbounded sequence of tuples that can be processed in parallel by Storm

Topology
•  Wires data and functions via a DAG (directed acyclic graph)
•  Executes on many machines, similar to an MR job in Hadoop

Spout
•  Source of data streams (tuples)
•  Can be run in “reliable” and “unreliable” mode

Bolt
•  Consumes 1+ streams and produces new streams
•  Complex operations often require multiple steps and thus multiple bolts
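
These concepts can be mimicked with plain Python generators. The sketch below does not use the Storm API: the function names, the sample tweets and the stop word list are made up, and tuples are modelled as ordinary dictionaries, which unlike Storm tuples are not immutable.

```python
from typing import Dict, Iterable, Iterator

StreamTuple = Dict[str, str]   # a tuple modelled as key/value pairs


def tweet_spout() -> Iterator[StreamTuple]:
    """Spout: source of the stream (here a fixed, made-up list of tweets)."""
    for text in ["Storm is fast", "Kafka and Storm"]:
        yield {"tweet": text}


def split_bolt(stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """Bolt: consumes tweets and emits one tuple per word."""
    for item in stream:
        for word in item["tweet"].lower().split():
            yield {"word": word}


def stopword_bolt(stream: Iterable[StreamTuple]) -> Iterator[StreamTuple]:
    """Bolt: drops stop words and passes all other tuples on unchanged."""
    stopwords = {"is", "and"}
    for item in stream:
        if item["word"] not in stopwords:
            yield item


# The "topology": wiring the spout and the bolts into a (linear) DAG.
words = [t["word"] for t in stopword_bolt(split_bolt(tweet_spout()))]
print(words)  # ['storm', 'fast', 'kafka', 'storm']
```

Splitting the work into two small bolts instead of one large one mirrors the last point above: each step stays simple and could be parallelized on its own.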

[Diagram: a topology with two Spouts (one labelled “Source of Stream B”) and four Bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing; tuples (T) flow along the streams]

Stream Processing Alternatives

•  Apache Samza (http://samza.incubator.apache.org)
•  Apache S4 (http://incubator.apache.org/s4/)
•  Apache Spark Streaming (http://spark.apache.org/streaming/)
•  Google MillWheel (http://research.google.com/pubs/pub41378.html)
•  Akka Streams (http://akka.io)
•  Complex Event Processing
   §  Esper (http://esper.codehaus.org/)
   §  WSO2 Complex Event Processor (http://wso2.com/products/complex-event-processor/)
   §  Oracle Event Processing (http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html)
   §  TIBCO BusinessEvents & TIBCO StreamBase (http://www.tibco.com/products/event-processing/complex-event-processing)
   §  IBM InfoSphere (http://www-01.ibm.com/software/data/infosphere/)
   §  Microsoft StreamInsight (http://msdn.microsoft.com/de-ch/sqlserver/ee476990.aspx)
   §  …

Agenda

1.  Designing Stream Processing Solutions

2.  Implementing the Enterprise Event Bus (Unified Log)

3.  Implementing Stream Processing

4.  Unified Log (Event) Processing Architecture in Action


Unified Log Processing Architecture in Trivadis CRA

[Diagram: Sensor Layer: a Tweets Consumer reads the Twitter Filter Stream. Distribution Layer (Kafka): feeds Tweets, Filtered Tweets, Words, Count by Minute and Social Graph. Speed Layer (Storm): processing steps Filter and Unify, Persist Tweet, Split Text, Remove Stopwords, Count over Time and Persist Graph. Stores: Cassandra, Elasticsearch, Titan]

Unified Log Processing Architecture in Trivadis CRA

[Diagram: same architecture as on the previous slide, with one Storm Topology shown in detail: a Kafka Spout reads from Kafka and feeds parallel Splitter bolts (shuffle grouping), which feed parallel Word Remover bolts (fields grouping), which write back to Kafka]

Storm Topology

[Diagram: the Twitter Spout emits tweets such as “Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca”, “… #barca” and “… #fcb”; Shuffle Grouping distributes them across three Sentence Splitter bolts, which emit the words bayern, fcb, juve, real, barca, barca]

Storm Topology

[Diagram: as before, plus two Word Counter bolts; Fields Grouping routes the words real, juve, barca, barca, bayern and fcb from the Sentence Splitters to the Word Counters]

Storm Topology

[Diagram: as before; the Word Counters execute one INCR per incoming word (INCR barca twice, INCR real, INCR juve, INCR bayern, INCR fcb), which gives real = 1, juve = 1, bayern = 1, fcb = 1 and barca = 2]

Storm Topology

[Diagram: as before, plus a Persist bolt; every 30 sec the Word Counters send their counts (real = 1, juve = 1, bayern = 1, barca = 2, fcb = 1) via Global Grouping to the single Persist bolt, which issues INCR real 1, INCR juve 1, INCR barca 2, INCR bayern 1 and INCR fcb 1]
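
The three groupings of this word count can be imitated in plain Python. The sketch below does not use the Storm API and makes several simplifying assumptions: shuffle grouping is replaced by round-robin (Storm distributes randomly), fields grouping uses a CRC32 hash, the stop word list is made up, and the 30-second flush interval is left out.

```python
import itertools
import re
import zlib
from collections import Counter
from typing import List

NUM_SPLITTERS = 3   # parallel Sentence Splitter tasks
NUM_COUNTERS = 2    # parallel Word Counter tasks
STOPWORDS = {"who", "will", "win", "or"}   # made-up stop word list

tweets = ["Who will win: Barca, Real, Juve or Bayern?", "#barca", "#fcb"]

# Shuffle grouping: any splitter may get any tweet (round-robin here).
splitter_input: List[List[str]] = [[] for _ in range(NUM_SPLITTERS)]
for task, tweet in zip(itertools.cycle(range(NUM_SPLITTERS)), tweets):
    splitter_input[task].append(tweet)

# Each splitter emits lower-cased words without stop words.
words = [
    word
    for task_tweets in splitter_input
    for tweet in task_tweets
    for word in re.findall(r"[a-z]+", tweet.lower())
    if word not in STOPWORDS
]

# Fields grouping on the word: the same word always reaches the same counter,
# so each counter can keep its counts locally (the INCR step in the diagram).
counters: List[Counter] = [Counter() for _ in range(NUM_COUNTERS)]
for word in words:
    task = zlib.crc32(word.encode("utf-8")) % NUM_COUNTERS
    counters[task][word] += 1

# Global grouping: all partial counts end up in one persisting task.
persisted: Counter = Counter()
for partial in counters:
    persisted.update(partial)

print(sorted(persisted.items()))
# [('barca', 2), ('bayern', 1), ('fcb', 1), ('juve', 1), ('real', 1)]
```

Fields grouping is what makes the local counters correct: if “barca” could reach both counters, each would hold only part of its count.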

Storm UI

[Screenshot: Storm UI]

Elasticsearch

Open source search & analytics engine
•  Structured & unstructured data
•  Real-time analytics
•  Percolator index
•  Analytics capabilities (facets)
•  REST based
•  Schema-free
•  Distributed
•  Lightweight

Built on top of Apache Lucene

Kibana dashboards

https://www.elastic.co/


Cassandra

•  Originally developed at Facebook

•  Open source distributed database management system

•  Professional-grade support from a company called DataStax

•  Main features
   §  Real-Time
   §  Highly Distributed
   §  Support for Multiple Data Centers
   §  Highly Scalable
   §  No Single Point of Failure
   §  Fault Tolerant
   §  Tunable Consistency
   §  Cassandra Query Language (CQL)

http://www.datastax.com/

Table TWEET_COUNT

Each row is identified by Sensor and Bucket and holds one count per (key, at) pair:

Sensor AFG10, Bucket MINUTE-2014/03/05
   key    IBM      IBM     IBM     …  Oracle  Oracle  …
   at     11:59    11:58   11:57   …  11:59   11:58   …
   count  10       4       6       …  2       4       …

Sensor AFG10, Bucket HOUR-2014/03
   key    IBM      IBM     IBM     …  Oracle  Oracle  …
   at     5T11     5T10    5T09    …  5T11    5T10    …
   count  148      108     111     …  29      41      …

Sensor AFG10, Bucket DAY-2014
   key    IBM      IBM     IBM     …  Oracle  Oracle  …
   at     5T       3T      2T      …  5T      4T      …
   count  10100.2  9892.2  8987.4  …  879.8   912.3   …

Sensor GXK11, Bucket MINUTE-2014/03/05
   key    NoSQL    NoSQL   NoSQL   …  Hadoop  Hadoop  …
   at     11:59    11:03   11:04   …  11:01   11:02   …
   count  5        9       12      …  2       1       …

Growth of a MINUTE row: 24 h * 60 min * n keys = n * 1,440 columns
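
The bucket names and the growth estimate can be reproduced with a short Python sketch. The function names are made up, and the format strings are only inferred from the bucket names shown in the table (MINUTE-2014/03/05, HOUR-2014/03, DAY-2014):

```python
from datetime import datetime
from typing import Dict


def bucket_keys(ts: datetime) -> Dict[str, str]:
    """Return the MINUTE, HOUR and DAY bucket names a count at `ts` falls into."""
    return {
        "MINUTE": ts.strftime("MINUTE-%Y/%m/%d"),  # one row per sensor and day
        "HOUR": ts.strftime("HOUR-%Y/%m"),         # one row per sensor and month
        "DAY": ts.strftime("DAY-%Y"),              # one row per sensor and year
    }


def minute_columns_per_row(n_keys: int) -> int:
    """Upper bound on columns in one MINUTE row: 24 h * 60 min * n keys."""
    return 24 * 60 * n_keys


print(bucket_keys(datetime(2014, 3, 5, 11, 59)))
# {'MINUTE': 'MINUTE-2014/03/05', 'HOUR': 'HOUR-2014/03', 'DAY': 'DAY-2014'}
print(minute_columns_per_row(100))  # 144000
```

Bucketing by day, month and year keeps each row bounded: a MINUTE row stops growing once the day is over, at most 1,440 columns per key.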

Titan Graph Database

Optimized to work against billions of nodes and edges

Works with several different distributed DBs

•  Apache Cassandra

•  Apache HBase

•  Oracle BerkeleyDB

Supports concurrent users doing complex graph traversals

Integration with the TinkerPop stack

Supports integration with search technologies such as Lucene and Elasticsearch

http://thinkaurelius.github.io/titan/

Property Graph

Node / Vertex
•  can have zero or more edges connected to it

Edge
•  connects two nodes
•  may be directed or undirected

[Diagram: property graph with the vertices User [id, name], Post [id, message, time] and Term [name, type] and the edges author, follow, uses, retweet and mention]
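
A property graph like this one can be modelled with a few Python data classes. This is a minimal sketch and not the Titan or Blueprints API. The class names, the sample data and the edge directions (for example User → Post for author) are assumptions, since the transcript does not preserve them:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    label: str
    properties: Dict[str, Any]


@dataclass
class Edge:
    label: str
    source: Vertex
    target: Vertex


@dataclass
class PropertyGraph:
    vertices: List[Vertex] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, label: str, **properties: Any) -> Vertex:
        vertex = Vertex(label, properties)
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, label: str, source: Vertex, target: Vertex) -> Edge:
        edge = Edge(label, source, target)
        self.edges.append(edge)
        return edge

    def out(self, vertex: Vertex, label: str) -> List[Vertex]:
        """Follow the outgoing edges of a vertex that carry the given label."""
        return [e.target for e in self.edges if e.source is vertex and e.label == label]


graph = PropertyGraph()
alice = graph.add_vertex("User", id=1, name="alice")
bob = graph.add_vertex("User", id=2, name="bob")
post = graph.add_vertex("Post", id=10, message="Go #barca", time="2015-05-01T12:00")
term = graph.add_vertex("Term", name="barca", type="hashtag")

graph.add_edge("follow", bob, alice)    # bob follows alice
graph.add_edge("author", alice, post)   # assumed direction: User -> Post
graph.add_edge("uses", post, term)      # the post uses the term
graph.add_edge("mention", post, bob)    # the post mentions bob

print([v.properties["name"] for v in graph.out(post, "uses")])  # ['barca']
```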

Titan Graph Database

Titan can integrate with distributed architectures in a few different ways

Remote Server

•  Connects remotely to the cluster

•  Can scale size as far as the cluster can

•  Native Titan API

•  Possible processing bottleneck

Remote Server with Rexster

•  Put Rexster in front to allow RESTful access

•  Connects remotely to the cluster

•  Can scale size as far as the cluster can

•  Possible processing bottleneck

Embedded

•  TitanDB and Rexster run on each node in the cluster

•  Can run in the same JVM

•  Considerable performance/stability improvement

TinkerPop Stack

Different components, all built on each other

Provides abstraction

Blueprints underpins the stack, making it all DB agnostic

Blueprints implementations

•  Neo4j

•  Oracle NoSQL

•  Titan

•  FluxGraph

•  Foundation DB

•  MongoDB

•  …

TinkerPop3 is on its way …

http://tinkerpop.incubator.apache.org/

Tinkerpop - Gremlin

Graph traversal scripting language

https://github.com/tinkerpop/gremlin/wiki

Tinkerpop - Rexster

Provides REST and binary protocols

Flexible extension model (e.g. ad-hoc Gremlin queries)

Server-side stored procedures (Gremlin)

Browser-based interface (Dog House)

Command-line tool for interacting with API

SPARQL plugin to work against Sail graphs (OpenRDF)

https://github.com/tinkerpop/rexster/wiki

KeyLines - Visualizing Graphs

Toolkit for visualizing graphs

Compatible with any modern browser

•  HTML 5 or Flash (fall-back)

Compatible with any graph database

Powerful visualization features

Built-in social network analysis

http://keylines.com


Questions and answers...
