Fluentd + MongoDB + Spark = Awesome Sauce

Nishant Sahay, Sr. Architect, Wipro Limited

Bhavani Ananth, Tech Manager, Wipro Limited

Page 1: Fluentd + MongoDB + Spark = Awesome Sauce

Page 2: Fluentd + MongoDB + Spark = Awesome Sauce

Wipro – Open Source Practice: Vision & Mission

Vision

“Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity.”

Mission

“Wipro’s Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers’ needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro’s comprehensive suite of strategic and technology services will be delivered with passion and precision.”

Page 3: Fluentd + MongoDB + Spark = Awesome Sauce

Wipro – Open Source Practice Offerings

Advisory

Enterprise-wide adoption strategies

Best fit analysis & recommendation

Business Case Advisory

Governance

Technical Consulting

Support

Application and Infrastructure

Dev Ops Architecture,

Development Open Source

Community

Productized Services

Legacy Migration Services

Greenfield Development

Open Source Stack Setup

Open App

Cross Industry Solutions and Process Stacks

Page 4: Fluentd + MongoDB + Spark = Awesome Sauce

Connected Warehouse Platform

OMS

Sales Orders [Real-Time] Almost Real-Time

ERP/HOST

Purchase Orders Master Data [Scheduled]

TMS WMS

Direct to Customer

LMS IOT WCS

Route Plan / Carrier Tracking Associate Performance PUT/PICK Status

Connected Warehouse Platform Webservices Publisher Queues Subscriber Queues Integration Mapping

FTP (Flat file/Xml)

Facility Inventory & Orders Alerts & Notification Equipment Monitor

Automation Enabler

Performance Tracker Operations Dashboards

Warehouse KPIs

Master Data Transaction Data

CSC SCP Warehouse Mobility & Dashboards Carrier Vendor

Warehouses Equipment Retailer Supplier

Page 5: Fluentd + MongoDB + Spark = Awesome Sauce

ANALYTICS & PREDICTION

The Awesome Sauce

Page 6: Fluentd + MongoDB + Spark = Awesome Sauce

Clickstream Analytics

User Behavior Analysis

Product Affinity

Website Resource Allocation

Prediction & recommendation

Page 7: Fluentd + MongoDB + Spark = Awesome Sauce

PREDICTION & RECOMMENDATION

Prediction Using Machine Learning

Content Recommendation

Conversion Prediction

Visitor Segmentation

Demand Forecasting

Page 8: Fluentd + MongoDB + Spark = Awesome Sauce

LOGS – Sauce Raw Material

Page 9: Fluentd + MongoDB + Spark = Awesome Sauce

Logs, Logs Everywhere!

SysLog

Application

Server Logs

Social Media Feeds

Packet Data

Clickstream Data

Sensor Data

CDR

Custom App (C, Ruby, Python)

Payment Data

Device Logs

Web Access Logs

Database Logs

Page 10: Fluentd + MongoDB + Spark = Awesome Sauce

What can be done with logs?

Real time monitoring

Root cause analysis

Anomaly Detection and Predictive Monitoring

Debugging

Troubleshooting/Support

Page 11: Fluentd + MongoDB + Spark = Awesome Sauce

Challenges with Log Analytics

No standard log formats

Multiple logging frameworks

Logs highly decentralized

Limited real time visualization capability

Scalability Issues

Normalizing and correlating logs from disparate sources

Page 12: Fluentd + MongoDB + Spark = Awesome Sauce

What can be done with logs – Business PoV?

Input Data Analytics

User Interactions /Behavior

End user Experience/Improvements

Page 13: Fluentd + MongoDB + Spark = Awesome Sauce

Awesome Foursome – The Ingredients

Page 14: Fluentd + MongoDB + Spark = Awesome Sauce

FLUENTD – The Ingredients

Page 15: Fluentd + MongoDB + Spark = Awesome Sauce

Why Fluentd

Unified Logging

Simple and Flexible

Proven

Minimal Resources

Reliable

Open Source

Community

Page 16: Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd Plugin Architecture

Input (udp, tcp, http, tail)

Parser (regexp, apache2)

Filter (grep, enrich, delete, mask)

Buffer

Format

Output (out_mongo)
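The plugin stages on this slide can be pictured as a chain of small functions. The sketch below is a toy model in plain Python, not Fluentd code: the regular expression is a simplified stand-in for an Apache access-log parser, and `BufferedOutput`, `keep`, `mask` and `run_pipeline` are hypothetical names chosen for illustration.

```python
import re
from typing import Iterable, Optional

# Simplified access-log pattern (assumption: not Fluentd's real apache2 regexp).
APACHE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
)

def parse(line: str) -> Optional[dict]:
    """Parser stage: turn a raw line into a structured record, or None."""
    match = APACHE_RE.match(line)
    return match.groupdict() if match else None

def keep(record: dict) -> bool:
    """Filter stage (grep-like): keep only 5xx responses."""
    return record["code"].startswith("5")

def mask(record: dict) -> dict:
    """Filter stage (mask-like): hide the client address."""
    return {**record, "host": "***"}

class BufferedOutput:
    """Buffer + Output stages: collect records and flush them in chunks."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.buffer = []
        self.flushed = []  # stands in for the destination, e.g. MongoDB

    def emit(self, record: dict) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flushed.append(list(self.buffer))
            self.buffer.clear()

def run_pipeline(lines: Iterable[str], output: BufferedOutput) -> None:
    """Input stage is the iterable of lines; the rest runs per record."""
    for line in lines:
        record = parse(line)
        if record is not None and keep(record):
            output.emit(mask(record))
    output.flush()  # flush whatever is left at the end
```

Unparseable lines are silently dropped here to keep the sketch short; a real pipeline would usually route them somewhere visible.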

Page 17: Fluentd + MongoDB + Spark = Awesome Sauce

HA Fluentd topology

• “At Most once” and “At Least once” transfers

Log Forwarders: Fluentd on Node1, Node2 and Node3, each reading a local Log File

PUSH

Log Aggregators: Fluentd (Active), Fluentd (Backup)

PUSH

Destination: MongoDB, Amazon S3
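The difference between "at most once" and "at least once" transfers comes down to whether the sender waits for an acknowledgement and retries. A minimal sketch in plain Python follows; `FlakyAggregator` and its scripted outcomes are hypothetical, made up purely to show the two behaviours. In Fluentd's forward output the acknowledged mode is enabled with the `require_ack_response` option.

```python
class FlakyAggregator:
    """Aggregator behind an unreliable link with scripted outcomes.

    Each send consumes one outcome: "ok" (stored and acked),
    "lost" (never arrives) or "ack_lost" (stored, but the ack is lost).
    """

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.received = []

    def send(self, record) -> bool:
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "lost":
            return False
        self.received.append(record)
        return outcome == "ok"

def send_at_most_once(aggregator, records) -> None:
    """Fire and forget: no retry, so a record may be lost but never duplicated."""
    for record in records:
        aggregator.send(record)

def send_at_least_once(aggregator, records, max_retries: int = 5) -> None:
    """Retry until acked: nothing is lost, but duplicates are possible."""
    for record in records:
        for _ in range(max_retries + 1):
            if aggregator.send(record):
                break
        else:
            raise RuntimeError(f"gave up on {record!r}")
```

With at-least-once delivery the destination has to tolerate duplicates, for example by writing with an idempotent key.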

Page 18: Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd – Failure Scenarios

Forwarder goes down

Aggregator goes down
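Both failure scenarios can be sketched with a forwarder that buffers locally and tries the active aggregator before the backup. This is a toy model under stated assumptions, not Fluentd's implementation: `Aggregator` and `Forwarder` are hypothetical classes, and the buffer lives in memory rather than on disk.

```python
class Aggregator:
    """A log aggregator that can be switched up or down."""

    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.received = []

    def receive(self, record) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(record)

class Forwarder:
    """Buffers records and delivers them to the first reachable aggregator."""

    def __init__(self, active: Aggregator, backup: Aggregator):
        self.targets = [active, backup]  # tried in this order
        self.buffer = []

    def emit(self, record) -> None:
        self.buffer.append(record)
        self.flush()

    def flush(self) -> None:
        # Deliver in order; stop at the first record nobody accepts
        # and keep it (and everything after it) for a later retry.
        while self.buffer:
            if not self._deliver(self.buffer[0]):
                return
            self.buffer.pop(0)

    def _deliver(self, record) -> bool:
        for target in self.targets:
            try:
                target.receive(record)
                return True
            except ConnectionError:
                continue
        return False
```

If the forwarder process itself dies, only a buffer that survives restarts (a file buffer, say) protects the records it had not yet sent; the in-memory list above would not.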

Page 19: Fluentd + MongoDB + Spark = Awesome Sauce

KAFKA – The Ingredients

Page 20: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka – distributed streaming platform

Diagram: a Kafka Cluster at the center, connected to Producers (Apps), Consumers (Apps), Connectors (DBs) and Stream Processors (Apps)

Publish-Subscribe streams of records

Store streams of records in fault tolerant way

Process streams of records

Page 21: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka – Terms

Producer

Consumer

Consumer Group

Topic

Partition

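Diagram: a Producer writes to Topics. Each topic is split into partitions (Partition-1, Partition-2, Partition-3), and each partition is an ordered sequence of offsets (0, 1, 2, ...). Brokers host the partitions (p1, p2, p3; R1, R2, R3), and Consumer Groups (C1, C2, C2) read from them.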
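How these terms fit together can be shown in a few lines: a record's key picks the partition, and within a consumer group each partition is read by exactly one consumer. The sketch is illustrative only. Kafka's default partitioner hashes keys with murmur2; `zlib.crc32` is swapped in here as a stand-in, and `partition_for` and `assign` are hypothetical helper names.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition (crc32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign(partitions, consumers):
    """Round-robin partitions over the consumers of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

Because the same key always maps to the same partition, all events for one key (one user or one order, for instance) stay in order relative to each other.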

Page 22: Fluentd + MongoDB + Spark = Awesome Sauce

Why Kafka

Ideal unified platform to handle real time data feeds

Has high throughput to support high volume event streams such as log aggregation

Deals well with high volume data loads from offline systems

Fault tolerant and scalable

Delivers the low latency expected of traditional messaging systems

Page 23: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka – decouples data pipelines

Producers

Broker

Consumers

Producers Producers Producers

Kafka

Consumer Consumer Consumer

Page 24: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka – Guarantees

Messages sent by a producer to a given topic partition are appended in the order they are sent

A consumer instance sees messages in the order they are stored in the log

A topic with replication factor N can tolerate up to N-1 broker failures without losing committed messages
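The replication guarantee can be checked with a toy model: a partition copied to N brokers stays readable, in order, until the last copy is gone. This is a deliberate simplification (every live replica receives every write, with no in-sync-replica tracking or leader election protocol), and `ReplicatedPartition` is a hypothetical class written for this sketch.

```python
class ReplicatedPartition:
    """One partition replicated across a fixed set of brokers."""

    def __init__(self, brokers):
        self.replicas = {broker: [] for broker in brokers}
        self.alive = set(brokers)

    def append(self, record) -> None:
        # Simplification: a write is committed once every live replica has it.
        for broker in self.alive:
            self.replicas[broker].append(record)

    def fail(self, broker) -> None:
        self.alive.discard(broker)

    def read(self):
        if not self.alive:
            raise RuntimeError("no replica left")
        leader = sorted(self.alive)[0]  # any surviving replica can lead
        return list(self.replicas[leader])
```

With three brokers, two can fail and the committed messages are still served in their original order; the third failure is the one that makes the partition unavailable.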

Page 25: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka – Replication

Diagram: two Producers write to Topic1. Each partition (Topic1-part1, Topic1-part2) has one Leader and two Follower replicas, spread across Broker1, Broker2, Broker3 and Broker4; every broker keeps its replicas in local Logs.

Page 26: Fluentd + MongoDB + Spark = Awesome Sauce

Zookeeper

• Zookeeper enables highly reliable distributed coordination

• Kafka bundles a single-node ZooKeeper instance

• Metadata includes – broker addresses, message offsets

Diagram: Producers and Consumers exchange messages with the Kafka Cluster; Producers, Consumers and the Kafka Cluster each exchange metadata with Zookeeper

Page 27: Fluentd + MongoDB + Spark = Awesome Sauce

Kafka Persistence - File System

http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg

Sequential file I/O is very fast

Uses the OS page cache for data storage

Batching of messages speeds up disk operations, network transfers and in-memory iterations.

Page 28: Fluentd + MongoDB + Spark = Awesome Sauce

Batch Processing

One of the big drivers for efficiency

Producers accumulate data in memory and send larger batches in a single request

Cap the size of a batch in bytes - batch.size

Wait no longer than a fixed latency bound - linger.ms

Trades a small amount of latency for better throughput
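The interplay of the two settings can be sketched as a small accumulator: a batch goes out when it is full or when its oldest message has waited long enough. In Kafka's producer configuration `batch.size` is measured in bytes. The class below is a toy model, not the Kafka client: `BatchingProducer` and its parameters are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class BatchingProducer:
    """Accumulates messages and sends them in batches.

    batch_size_bytes plays the role of batch.size,
    linger_ms the role of linger.ms.
    """

    def __init__(self, send, batch_size_bytes: int, linger_ms: int):
        self.send = send                      # callable taking a list of messages
        self.batch_size_bytes = batch_size_bytes
        self.linger_ms = linger_ms
        self.batch = []
        self.batch_bytes = 0
        self.first_ts = None                  # arrival time of the oldest message

    def produce(self, message: bytes, now_ms: int) -> None:
        # If the new message would overflow the batch, send what we have first.
        if self.batch and self.batch_bytes + len(message) > self.batch_size_bytes:
            self.flush()
        if not self.batch:
            self.first_ts = now_ms
        self.batch.append(message)
        self.batch_bytes += len(message)
        self.poll(now_ms)

    def poll(self, now_ms: int) -> None:
        """Send the batch if it is full or has lingered long enough."""
        if not self.batch:
            return
        full = self.batch_bytes >= self.batch_size_bytes
        expired = now_ms - self.first_ts >= self.linger_ms
        if full or expired:
            self.flush()

    def flush(self) -> None:
        if self.batch:
            self.send(list(self.batch))
            self.batch = []
            self.batch_bytes = 0
            self.first_ts = None
```

A larger `linger_ms` gives the producer more time to fill a batch, which is exactly the latency-for-throughput trade described on this slide.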

Page 29: Fluentd + MongoDB + Spark = Awesome Sauce

Log Compaction

Per-record retention, rather than the coarser-grained time-based retention
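Per-record retention means the log keeps the most recent value for every key instead of dropping everything older than some age. A minimal sketch, assuming records are `(key, value)` pairs and a `None` value marks a deletion; `compact` is a hypothetical helper, and real Kafka keeps such delete markers around for a configurable period before removing them.

```python
def compact(log):
    """Keep only the latest record per key, preserving original offsets.

    `log` is a list of (key, value) pairs; the list index is the offset.
    Keys whose latest value is None (a delete marker) are dropped entirely.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset           # later records overwrite earlier ones

    compacted = []
    for offset in sorted(latest_offset.values()):
        key, value = log[offset]
        if value is not None:
            compacted.append((offset, key, value))
    return compacted
```

Offsets are kept as they were, so a consumer's position stays valid after compaction; it simply sees gaps where older versions used to be.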

Page 30: Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd Kafka Integration

• Kafka-Fluentd Consumer

• Fluentd Kafka plugin

Kafka Ecosystem

Log Forwarders: Fluentd, Fluentd, Fluentd

PUSH

Kafka Clusters

PULL

Consumers: Fluentd, Fluentd, Fluentd

PUSH

Destination: MongoDB, Amazon S3

Page 31: Fluentd + MongoDB + Spark = Awesome Sauce

Advantage - Fluentd-Kafka

Backpressure - Pull versus Push

Reliable, flexible data pipeline
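The backpressure point is easiest to see side by side: a pushed-to destination with limited room has to reject records during a burst, while a pulling consumer reads from a retained log at its own pace. The sketch is a toy model; `PushSink`, `Log` and `PullConsumer` are hypothetical classes, not Fluentd or Kafka APIs.

```python
from collections import deque

class PushSink:
    """A destination with a small inbox; pushes beyond capacity are rejected."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.inbox = deque()
        self.rejected = 0

    def push(self, record) -> bool:
        if len(self.inbox) >= self.capacity:
            self.rejected += 1
            return False
        self.inbox.append(record)
        return True

class Log:
    """Kafka-like log: producers append, consumers read from an offset."""

    def __init__(self):
        self.records = []

    def append(self, record) -> None:
        self.records.append(record)

    def poll(self, offset: int, max_records: int):
        return self.records[offset:offset + max_records]

class PullConsumer:
    """Pulls at most max_records per step and tracks its own offset."""

    def __init__(self, log: Log, max_records: int):
        self.log = log
        self.max_records = max_records
        self.offset = 0
        self.processed = []

    def step(self) -> int:
        batch = self.log.poll(self.offset, self.max_records)
        self.processed.extend(batch)
        self.offset += len(batch)
        return len(batch)
```

In the pull model a burst only makes the consumer lag behind for a while; nothing is rejected, because the log holds the records until the consumer asks for them.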

Page 32: Fluentd + MongoDB + Spark = Awesome Sauce

Connected Warehouse – Kafka Cluster Architecture

Kafka Cluster: Data Center – 1 (Active), Data Center – 2 (Active)

Kafka Broker – 1: Topic – 1, Partition – 0..n

Kafka Broker – 2: Topic – 1, Partition – n+1, n+n

Zookeeper Ensemble: ZK – 1 (Leader), ZK – 2 (Follower)

Fluentd-Kafka Plugin

Page 33: Fluentd + MongoDB + Spark = Awesome Sauce

MONGODB – The Ingredients

Page 34: Fluentd + MongoDB + Spark = Awesome Sauce

Why MongoDB

Cross-platform, document-oriented NoSQL database

Simple and Flexible Data Model

Field Level Indexing

Built In Query Capabilities

High Performance

Page 35: Fluentd + MongoDB + Spark = Awesome Sauce

System Architecture With Shards

Diagram: Data Sources reach the cluster through three mongos routers, which consult the Config Server; the data nodes are drawn as five replica sets, each with one Primary and two Secondaries.

Page 36: Fluentd + MongoDB + Spark = Awesome Sauce

MongoDB For Analytics

Denormalization with support of Embedded Documents

Text Search Queries

Connectors for almost every kind of data source

Aggregation Framework

Range Queries, Key value queries
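As an illustration of embedded documents and the Aggregation Framework, here is what a small clickstream query might look like. The documents and field names are made up for this sketch. The `pipeline` list uses MongoDB's aggregation syntax and, with PyMongo, would be passed to `collection.aggregate(pipeline)`; the Python function underneath only mimics its result so the logic can be followed without a database.

```python
from collections import Counter

# Hypothetical clickstream events; "user" is an embedded document,
# so no join is needed to get at the visitor's details.
events = [
    {"action": "view", "product": "A", "user": {"id": 1, "country": "IN"}},
    {"action": "view", "product": "A", "user": {"id": 2, "country": "US"}},
    {"action": "view", "product": "B", "user": {"id": 1, "country": "IN"}},
    {"action": "buy", "product": "A", "user": {"id": 2, "country": "US"}},
]

# Views per product, most viewed first, in MongoDB's aggregation syntax.
pipeline = [
    {"$match": {"action": "view"}},
    {"$group": {"_id": "$product", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
]

def views_per_product(docs):
    """Pure-Python equivalent of the pipeline above, for illustration only."""
    counts = Counter(doc["product"] for doc in docs if doc["action"] == "view")
    return [{"_id": product, "views": n} for product, n in counts.most_common()]
```

Running the aggregation inside MongoDB keeps the raw events on the server and returns only the summarized rows.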

Page 37: Fluentd + MongoDB + Spark = Awesome Sauce

SPARK – The Ingredients

Page 38: Fluentd + MongoDB + Spark = Awesome Sauce

Spark – Logical Architecture

Apache Spark

Spark SQL, Spark Streaming, MLlib, GraphX

Scala, Java, Python, R

Spark – MongoDB Connector

Page 39: Fluentd + MongoDB + Spark = Awesome Sauce

Putting It All Together – Click Stream + Inventory Mgmt

Collection

Processing

Ingestion

Data Sync

Micro-Service

Page 40: Fluentd + MongoDB + Spark = Awesome Sauce

QUESTIONS & ANSWERS

Page 41: Fluentd + MongoDB + Spark = Awesome Sauce

Thank you

Page 42: Fluentd + MongoDB + Spark = Awesome Sauce

www.modsummit.com

www.developersummit.com