Lessons Learned fromfiles.meetup.com/17453062/BDX2015 - Haggai Shachar... · Lessons Learned from...


Page 1: Lessons Learned fromfiles.meetup.com/17453062/BDX2015 - Haggai Shachar... · Lessons Learned from ... Ad-hoc and self service BI Data Repositories DSPT, Analytics, RT Aggregation
Page 2:

Lessons Learned from Building a Big Data Technology Stack

Haggai Shachar Director, Data Services [email protected]

Page 3:

{
  name: "Haggai Shachar",
  work: [
    { employer: "LivePerson", title: "Director, Data Services" },
    { employer: "NuConomy", title: "Co-Founder, CTO" },
    { employer: "Israeli Intelligence Corps", title: "n/a" }
  ],
  likes: [ "data", "machine learning", "cycling", "diving" ],
  wife: "Orit",
  kids: [ { gender: "female", age: -0.2, name: undefined } ],
  todos: [ "buy a stroller" ]
}

Hello World!

Page 4:

LivePerson("you do something with chat, right?")

1990s: Click-to-Chat. User initiated
2000: Proactive. Based on Real-Time Behavior
2010: Predictive Intelligence. Real-time Prediction, Multichannel
Today: Engage everywhere. Web, Social, Native Apps, SMS, Email

Page 5:

40 TB raw data
22M interactions
2B visits

* monthly figures

Page 6:

LivePerson Data stack

▪ Serving layer (Data Producers): Monitoring, Engagement systems
▪ Middleware using Kafka: Batch Track, (near) Real Time Track
▪ CEP using Storm: Real Time computation, Real Time data aggregation
▪ Data Repositories: DSPT, Analytics, RT Aggregation
▪ Rich Business Intelligence: Pre-defined dashboards, Drill down to the record level, Ad-hoc and self service BI

[Diagram labels: LiveEngage Console, LiveEngage backoffice, MONITORING, CHAT/VOICE system, APACHE KAFKA (Batch track, Real-Time track), STORM, COMPLEX EVENT PROCESSING, PERPETUAL STORE, RT REPOSITORIES, ANALYTICAL DB, BUSINESS INTELLIGENCE]

Page 7:

Forget data, let's talk cars: what's the ultimate vehicle?

Page 8:

What's the ultimate vehicle?

Page 9:

What's the ultimate vehicle?

Page 10:

Lessons Learned

1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write Optimized vs Read Optimized Models
5. Freshness vs Correctness

Page 11:

LL#1 Choosing the right tool

Since the beginning of mankind until ~2004
2004 to today

Page 12:

LL#1 Choosing the right tool

1. Problem fit
2. Scaling fit
3. Query language (SQL is not going anywhere)
4. Aggregation framework
5. By Key R/W throughput
6. Community

             Scaling   Query Language   Aggregation framework   By Key throughput   Community
Hadoop       Great     MR, Hive         Robust but slow         n/a                 Huge
Cassandra    Great     CQL, Thrift      Sucks                   Awesome             Big
MySQL        Medium    SQL              Ok                      Ok                  Huge
Vertica      Good      SQL, R           Awesome                 Ok                  Small
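
One way to read the comparison on this slide is as a lookup: start from the requirement that matters most and see which tools survive. The sketch below only encodes the ratings from the slide; the `shortlist` helper and the key names are hypothetical and not part of the talk.

```python
# Ratings copied from the comparison table on this slide.
# The key names and the helper below are illustrative assumptions.
TOOLS = {
    "Hadoop": {"scaling": "Great", "query": "MR, Hive",
               "aggregation": "Robust but slow", "by_key": "n/a", "community": "Huge"},
    "Cassandra": {"scaling": "Great", "query": "CQL, Thrift",
                  "aggregation": "Sucks", "by_key": "Awesome", "community": "Big"},
    "MySQL": {"scaling": "Medium", "query": "SQL",
              "aggregation": "Ok", "by_key": "Ok", "community": "Huge"},
    "Vertica": {"scaling": "Good", "query": "SQL, R",
                "aggregation": "Awesome", "by_key": "Ok", "community": "Small"},
}


def shortlist(criterion, accepted):
    """Return the tools whose rating for `criterion` is in `accepted`, sorted by name."""
    return sorted(
        name for name, ratings in TOOLS.items()
        if ratings[criterion] in accepted
    )


# A by-key serving store versus an analytical aggregation engine.
by_key_candidates = shortlist("by_key", {"Awesome"})
aggregation_candidates = shortlist("aggregation", {"Awesome", "Robust but slow"})
```

No single row wins every column, which is the point of the lesson: the requirement picks the tool, not the other way around.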

Page 13:

LL#2 Organization-wide data model

▪ 150 developers
▪ 20 scrum teams
▪ 50 services
▪ 3 floors
▪ 4 development languages (Java, Scala, Python, JavaScript)
▪ 3-5 deployments a week
▪ Marketing terms keep on changing

Tower of Babel by Pieter Bruegel the Elder OR Jacob's Ladder by William Blake

Page 14:

LL#2 Organization-wide data model

Apache Avro to the rescue
▪ A schema-based serialization/deserialization framework
▪ Strong Hadoop integration & efficient storage
▪ Backward & forward compatibility
▪ Rich data structures (primitives, records, maps, arrays, enums)
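
To make the backward and forward compatibility point concrete, here is a minimal sketch of the idea behind Avro schema resolution for record fields. It does not use the Avro library: the `resolve` function is a simplified stand-in that works on already-decoded records, and the `ChatEvent` schema and its field names are hypothetical. Only the JSON schema syntax follows Avro's format.

```python
import json

# Hypothetical event schemas in Avro's JSON schema syntax. V2 adds a field
# with a default, which is what keeps old and new code compatible.
SCHEMA_V1 = json.loads("""
{"type": "record", "name": "ChatEvent", "fields": [
  {"name": "visitor_id", "type": "string"},
  {"name": "timestamp",  "type": "long"}
]}
""")

SCHEMA_V2 = json.loads("""
{"type": "record", "name": "ChatEvent", "fields": [
  {"name": "visitor_id", "type": "string"},
  {"name": "timestamp",  "type": "long"},
  {"name": "channel",    "type": "string", "default": "web"}
]}
""")


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Simplified version of the resolution rules for record fields: fields the
    reader does not know are dropped, fields missing from the record take the
    reader's default, and a missing field without a default is an error.
    """
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return out


old_record = {"visitor_id": "v-1", "timestamp": 1420070400}
new_record = {"visitor_id": "v-2", "timestamp": 1420070401, "channel": "sms"}

# Backward compatibility: a V2 reader consumes data written with V1.
upgraded = resolve(old_record, SCHEMA_V2)
# Forward compatibility: a V1 reader consumes data written with V2.
downgraded = resolve(new_record, SCHEMA_V1)
```

With many teams deploying several times a week, this is what lets producers and consumers upgrade their schema versions at different times.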

Page 15:

LL#2 Organization-wide data model

                       Protobuf                    Thrift                Avro
Created                2001 (open-sourced 2008)    2007                  2009
Creator / Maintainer   Google / Google             Facebook / Apache     Doug Cutting / Apache
Hadoop support         No                          No                    Yes
Used by                Google                      Facebook, Cassandra   Hadoop, LivePerson
Lang support           Good                        Great                 Good

Page 16:

LL#3 Producers / consumers decoupling

▪ Flexibility of development / deployment
▪ One publisher, multiple subscribers

[Diagram: PRODUCER -> MULTIPLE CONSUMERS]

Page 17:

LL#3 Decouple producers from consumers

▪ Predicting the exact future architecture and project needs is hard
▪ Use a middleware layer to simplify the interface between producers and consumers
▪ Happily extend and modify each of the tiers independently

[Diagram: Producer, Producer, Producer -> middleware -> Hadoop, Storm, External]

Page 18:

Apache Kafka

▪ Distributed pub-sub system
▪ Developed at LinkedIn, maintained by Apache
▪ Very high throughput (~300K messages/sec)
▪ Horizontally scalable
▪ Multiple subscribers for topics

Page 19:

Apache Kafka

Message queues (ActiveMQ, TIBCO)
• Low throughput
• Secondary indexes
• Tuned for low latency

Log aggregators (Flume, Scribe)
• Focus on HDFS
• Push model
• No rewindable consumption

[Diagram: KAFKA positioned between message queues and log aggregators]
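
The properties these slides attribute to Kafka (multiple subscribers per topic, consumers that pull rather than being pushed to, and rewindable consumption) all follow from one idea: the middleware keeps an append-only log, and each consumer tracks its own position in it. The sketch below is a toy in-memory model of that idea, not the Kafka client API; the `Topic` and `Consumer` classes are hypothetical.

```python
class Topic:
    """Append-only in-memory log; a toy stand-in for one topic partition."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        # Producers only append; they know nothing about the consumers.
        self._log.append(message)
        return len(self._log) - 1  # offset of the new message

    def read(self, offset, max_messages):
        return self._log[offset:offset + max_messages]


class Consumer:
    """Pulls from a topic and tracks its own position (offset)."""

    def __init__(self, topic):
        self._topic = topic
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._topic.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        # Rewind (or skip ahead) without affecting any other consumer.
        self.offset = offset


events = Topic()
for i in range(3):
    events.publish(f"event-{i}")

batch_loader = Consumer(events)  # e.g. feeds the batch track
realtime = Consumer(events)      # e.g. feeds the real-time track

first_pass = realtime.poll()
batch_pass = batch_loader.poll(max_messages=2)

realtime.seek(0)                 # replay from the beginning
replayed = realtime.poll()
```

Because reading does not remove messages, adding a new consumer or reprocessing old data requires no change on the producer side.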

Page 20:

LL#4 Write Optimized vs Read Optimized

Writers like
▪ Write fast

Readers like
▪ Pre-defined aggregations
▪ Denormalized dimensions
▪ Data duplication
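
As a rough illustration of the two models, the sketch below keeps the write path to a bare append and builds a separate read model that is pre-aggregated and denormalized. The event fields, the `ACCOUNTS` dimension table, and the function names are hypothetical.

```python
from collections import defaultdict

# Write-optimized side: raw events are appended as-is, with no joins,
# indexes, or aggregation on the write path.
raw_events = []


def write(event):
    raw_events.append(event)


# Hypothetical dimension table; its values are copied (denormalized)
# into the read model below.
ACCOUNTS = {"a1": "Acme", "a2": "Globex"}


def build_read_model(events):
    """Pre-aggregate chats per (account name, day) for dashboard queries."""
    chats_per_day = defaultdict(int)
    for event in events:
        key = (ACCOUNTS[event["account_id"]], event["day"])
        chats_per_day[key] += 1
    return dict(chats_per_day)


write({"account_id": "a1", "day": "2015-01-01"})
write({"account_id": "a1", "day": "2015-01-01"})
write({"account_id": "a2", "day": "2015-01-02"})

read_model = build_read_model(raw_events)
```

The read model duplicates data that already exists in the raw events, which is the trade the slide describes: readers pay in storage to avoid joins and aggregation at query time.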

Page 21:

LL#5: Freshness vs Correctness

Not all data needs are created equal

Freshness
▪ High freshness is the key
▪ Minor inaccuracy is acceptable
▪ Fire & forget or eventually consistent
▪ NoSQL

Correctness
▪ It's all about accuracy
▪ Billable data
▪ Batch oriented
▪ Transactional
▪ RDBMS
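
The trade-off can be sketched with two counters fed by the same events: one updated immediately on a fire-and-forget basis, and one recomputed exactly from durable raw data. Everything below is illustrative; the names are hypothetical and the loss of one update in a thousand is a made-up figure chosen only to show drift.

```python
raw_store = []                 # durable raw events, source of truth for batch
realtime = {"chats": 0}        # fresh, but possibly inexact


def ingest(event, realtime_delivery_ok=True):
    """Record an event on both tracks.

    The raw store always receives the event. The real-time update is fire
    and forget: a failed delivery is not retried, so the counter can drift.
    """
    raw_store.append(event)
    if realtime_delivery_ok:
        realtime["chats"] += 1


def batch_recount():
    """Exact count recomputed from raw data, available only after the batch runs."""
    return len(raw_store)


for i in range(1000):
    # Simulate a single lost real-time update (assumed loss, for illustration).
    ingest({"id": i}, realtime_delivery_ok=(i != 500))

exact = batch_recount()
accuracy = realtime["chats"] / exact
```

A dashboard can show the real-time figure right away, while anything billable waits for the batch result.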

Page 22:

LL#4 Write Optimized vs Read Optimized / LL#5: Freshness vs Correctness

[Diagram: MONITORING and CHAT/VOICE system feed APACHE KAFKA (300K events/sec), which splits into a Batch track and a Real-Time track]

▪ Real-Time track: STORM (CEP) -> RT REPOSITORIES. Real Time counters, accuracy 99.9%, ~300ms
▪ Batch track: PERPETUAL STORE -> ANALYTICAL DB. Raw data & aggregations, accuracy 100%, ~2h
▪ Read optimized

Page 23:

So, what did we have?

1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write Optimized vs Read Optimized Models
5. Freshness vs Correctness

Page 24:

I’m Data

We do cool stuff, come work with us! [email protected] 054-7000814