Real-Time Integration Between MongoDB and SQL Databases


1

Distributed, fault-tolerant, transactional

Real-Time Integration: MongoDB and SQL Databases

Eugene Dvorkin, Architect, WebMD

2

WebMD: A lot of data; a lot of traffic

~900 million page views a month

~100 million unique visitors a month

3

How We Use MongoDB

User Activity

4

Why Move Data to RDBMS?

Preserve existing investment in BI and data warehouse

To use an analytical database such as Vertica

To use SQL

5

Why Move Data In Real-time?

Batch process is slow

No ad-hoc queries

No real-time reports

6

Challenge in moving data

Transform documents to a relational structure

Insert into the RDBMS at a high rate

7

Challenge in moving data

Scale easily as data volume and velocity increase

8

Our Solution to move data in Real-time: Storm

Storm – an open-source, distributed real-time computation system

Developed by Nathan Marz at BackType, later acquired by Twitter

9

Hadoop (batch) vs. Storm (real-time)

Our Solution to move data in Real-time: Storm

10

Why STORM?

JVM-based framework

Guaranteed data processing

Supports development in multiple languages

Scalable and transactional

11

Overview of Storm cluster

Master node (Nimbus)

Cluster coordination (ZooKeeper)

Worker nodes (Supervisors) run worker processes

12

Storm Abstractions

Tuples, Streams, Spouts, Bolts and Topologies

13

Tuples

Ordered list of elements

Example: ("ns:events", "email:edvorkin@gmail.com")
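In Storm's Java API, a component declares the fields of its tuples once and then emits matching values. A minimal sketch (field names are illustrative):

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// In declareOutputFields(): name the fields this component emits
declarer.declare(new Fields("ns", "email"));

// In nextTuple()/execute(): emit values in the declared order
collector.emit(new Values("ns:events", "email:edvorkin@gmail.com"));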

14

Stream

Unbounded sequence of tuples

Example: Stream of messages from message queue

15

Spout

Read from a stream of data – queues, web logs, API calls, MongoDB oplog

Emit documents as tuples

Source of Streams

16

Bolts

Process tuples and create new streams

17

Bolts

Apply functions/transforms

Calculate and aggregate data (word count!)

Access DBs, APIs, etc.

Filter data

Map/Reduce

Process tuples and create new streams
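As an illustration of the word-count case above, a minimal counting bolt in Storm's Java API could look like this (a sketch, not the presenter's code):

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountBolt extends BaseBasicBolt {
    // In-memory count per word (per bolt task)
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        // Every input tuple yields an updated (word, count) tuple downstream
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}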

18

Topology

19

Topology

Storm transforms and moves the data

20

MongoDB

How To Read All Incoming Data from MongoDB?

21

MongoDB

How To Read All Incoming Data from MongoDB?

Use MongoDB OpLog

22

What is OpLog?

The replication mechanism in MongoDB

It is a capped collection

23

Spout: reading from OpLog

Located in the local database, in the oplog.rs collection

24

Spout: reading from OpLog

Operations (op): i = insert, u = update, d = delete

25

Spout: reading from OpLog

Namespace (ns): database and collection name – the equivalent of a table name

26

Spout: reading from OpLog

Data object (o): the document being inserted or the change being applied
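Put together, one oplog entry for an insert looks roughly like this (values are illustrative):

{
  "ts" : Timestamp(1369571919, 1),       // when the operation was applied
  "op" : "i",                            // i = insert, u = update, d = delete
  "ns" : "test.people",                  // database.collection
  "o"  : { "_id" : 1, "name" : "John" }  // the inserted document
}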

27

Sharded cluster

28

Automatic discovery of sharded cluster

29

Example: Shard vs Replica set discovery

30

Example: Shard discovery
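A sharded cluster has no single oplog, so the spout must first find every shard; the shard registry lives in the config database. A minimal sketch with the legacy Java driver (host names are illustrative):

import java.net.UnknownHostException;

import com.mongodb.DB;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public static void discoverShards() throws UnknownHostException {
    // Connect through a mongos and read the shard registry
    MongoClient mongos = new MongoClient("mongos-host", 27017);
    DB config = mongos.getDB("config");
    DBCursor shards = config.getCollection("shards").find();
    while (shards.hasNext()) {
        DBObject shard = shards.next();
        // For a replica-set shard, host looks like "rs0/host1:27017,host2:27017"
        String host = (String) shard.get("host");
        // ...open one oplog-tailing cursor per shard here...
    }
}

If config.shards is empty, the target is a plain replica set whose single oplog can be tailed directly – the shard vs. replica-set distinction above.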

31

Spout: Reading data from OpLog

How to Read data continuously from OpLog?

32

Spout: Reading data from OpLog

How to Read data continuously from OpLog?

Use Tailable Cursor

33

Example: Tailable cursor – like tail -f
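With the legacy Java driver, "like tail -f" means opening the cursor with the TAILABLE and AWAITDATA options, so the loop blocks for new entries instead of terminating. A minimal sketch:

import com.mongodb.Bytes;
import com.mongodb.DB;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public static void tailOplog() throws Exception {
    MongoClient client = new MongoClient("localhost", 27017);
    DB local = client.getDB("local");

    // TAILABLE: the cursor stays open on the capped collection
    // AWAITDATA: the server waits briefly for new data instead of returning empty
    DBCursor cursor = local.getCollection("oplog.rs").find()
            .addOption(Bytes.QUERYOPTION_TAILABLE)
            .addOption(Bytes.QUERYOPTION_AWAITDATA);

    while (cursor.hasNext()) {
        DBObject entry = cursor.next(); // next oplog entry, in insertion order
        // ...emit the entry as a tuple...
    }
}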

34

Manage timestamps

Use the ts field (the timestamp in each oplog entry) to track which records have been processed

If the system restarts, resume from the last recorded ts
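In code, resuming means filtering the oplog on ts greater than the last saved value. A sketch, where loadSavedTimestamp() is a hypothetical helper around whatever store tracks progress (such as the opslog_progress collection used in the topology below):

import org.bson.types.BSONTimestamp;

import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// lastProcessedTs is persisted after each fully processed entry
BSONTimestamp lastProcessedTs = loadSavedTimestamp(); // hypothetical helper

// Resume the tailable cursor just past the recorded position
DBObject query = new BasicDBObject("ts", new BasicDBObject("$gt", lastProcessedTs));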

35

Spout: reading from OpLog

36

SPOUT – Code Example
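A minimal sketch of such a spout, wiring the tailable oplog cursor into nextTuple(). The MongoOpLogSpout name comes from the topology definition below; openTailableOplogCursor() stands in for the cursor-opening code sketched earlier:

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import com.mongodb.DBCursor;
import com.mongodb.DBObject;

public class MongoOpLogSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private transient DBCursor cursor; // tailable cursor on local.oplog.rs

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.cursor = openTailableOplogCursor(); // hypothetical helper, as sketched earlier
    }

    @Override
    public void nextTuple() {
        // hasNext() on a tailable AWAITDATA cursor blocks briefly for new data
        if (cursor.hasNext()) {
            DBObject entry = cursor.next();
            // Emit the raw oplog entry; downstream bolts parse and flatten it
            collector.emit(new Values(entry));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("document"));
    }

    private DBCursor openTailableOplogCursor() {
        // ...connect to mongod and open the TAILABLE + AWAITDATA cursor...
        throw new UnsupportedOperationException("sketch only");
    }
}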

37

TOPOLOGY

38

Working With Embedded Arrays

Array represents One-to-Many relationship in RDBMS

39

Example: Working with embedded arrays
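The source document is the John Backus example used on the following slides: one person document with an embedded awards array.

{ _id: 1,
  name: { first: 'John', last: 'Backus' },
  birth: 'Dec 03, 1924',
  awards: [
    { award: 'National Medal of Science', year: 1975, by: 'National Science Foundation' },
    { award: 'Turing Award', year: 1977, by: 'ACM' }
  ]
}

Each array element is emitted as its own tuple on a separate stream – one row per award in the person_awards table.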

40

Example: Working with embedded arrays

{ _id: 1, ns: "person_awards", o: { award: 'National Medal of Science', year: 1975, by: 'National Science Foundation' } }

{ _id: 1, ns: "person_awards", o: { award: 'Turing Award', year: 1977, by: 'ACM' } }

41

Example: Working with embedded arrays

public void execute(Tuple tuple) {
    .........
    if (field instanceof BasicDBList) {
        BasicDBObject arrayElement = processArray(field);
        ......
        outputCollector.emit("documents", tuple, arrayElement);

42

Parse documents with Bolt

43

{"ns": "people", "op":"i", o : { _id: 1, name: { first: 'John', last: 'Backus' }, birth: 'Dec 03, 1924’}

["ns": "people", "op":"i", [“id”:1, "name_first": "John", "name_last":"Backus", "birth": "DEc 03, 1924" ]]

Parse documents with Bolt

44

@Override
public void execute(Tuple tuple) {
    ......
    final BasicDBObject oplogObject =
        (BasicDBObject) tuple.getValueByField("document");
    final BasicDBObject document =
        (BasicDBObject) oplogObject.get("o");
    ......
    outputValues.add(flattenDocument(document));
    outputCollector.emit(tuple, outputValues);

Parse documents with Bolt

45

Write to SQL with SQLWriter Bolt

46

Write to SQL with SQLWriter Bolt

["ns": "people", "op":"i", [“id”:1, "name_first": "John", "name_last":"Backus", "birth": "Dec 03, 1924" ]

]insert into people (_id,name_first,name_last,birth) values

(1,'John','Backus','Dec 03,1924') ,

insert into people_awards

(_id,awards_award,awards_award,awards_by) values (1,'Turing

Award',1977,'ACM'),

insert into people_awards

(_id,awards_award,awards_award,awards_by) values (1,'National

Medal of Science',1975,'National Science Foundation')

47

@Override
public void prepare(.....) {
    ....
    Class.forName("com.vertica.jdbc.Driver");
    con = DriverManager.getConnection(dBUrl, username, password);
}

@Override
public void execute(Tuple tuple) {
    String insertStatement = createInsertStatement(tuple);
    try {
        Statement stmt = con.createStatement();
        stmt.execute(insertStatement);
        stmt.close();

Write to SQL with SQLWriter Bolt

48

Topology Definition

TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt()).shuffleGrouping(arrayExtractorId, documentsStreamId);
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword)).shuffleGrouping(mongoDocParserId);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());


51

Topology Definition

TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5).shuffleGrouping(spoutId);
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt()).shuffleGrouping(arrayExtractorId, documentsStreamId);
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword)).shuffleGrouping(mongoDocParserId);

StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());

52

Lesson learned

By leveraging the MongoDB oplog (or another capped collection), a tailable cursor, and the Storm framework, you can build a fast, scalable, real-time data processing pipeline.

55

Questions

Eugene Dvorkin, Architect, WebMD
edvorkin@webmd.net
Twitter: @edvorkin
LinkedIn: eugenedvorkin


58

Next Sessions at 2:50

5th Floor:

West Side Ballroom 3&4: Data Modeling Examples from the Real World

West Side Ballroom 1&2: Growing Up MongoDB

Juilliard Complex: Business Track: MetLife Leapfrogs Insurance Industry with MongoDB-Powered Big Data Application

Lyceum Complex: Ask the Experts: MongoDB Monitoring and Backup Service Session

7th Floor:

Empire Complex: How We Fixed Our MongoDB Problems

SoHo Complex: High Performance, High Scale MongoDB on AWS: A Hands On Guide