Growing into a proactive Data Platform
Yaar Reuveni & Nir Hedvat
Becoming a Proactive Data Platform
Yaar Reuveni
• 6 years at LivePerson
  • 1 year in Reporting & BI
  • 3 years in Data Platform
  • 2 years as Data Platform team lead
• I love to travel
• And
Nir Hedvat
• Software Engineer, B.Sc.
• 3 years as a C++ developer at IBM Rational Rhapsody™
• 1.5 years at LivePerson
• Cloud and parallel computing enthusiast
• Love math and powerlifting
Agenda
• Our Scale & Operation
• Evolution in becoming proactive
  i. Hope & Low awareness
  ii. Storming & Troubleshooting
  iii. Fortifying
  iv. Internalization & Comprehension
  v. Being Proactive
• Showcases
• Implementation
Our Scale
• 2 M daily chats
• 100 M daily monitored visitor sessions
• 20 B events per day
• 2 TB raw data per day
• 2 PB total in Hadoop clusters
• Hundreds of producers × event types × consumers
LivePerson technology stack
Stage 1: Hope & Low awareness
We built it and it's awesome

(diagram: online and offline producers write events, via local files, through DSPT jobs into raw data)
* DSPT - Data Single Point of Truth
Stage 1: Hope & Low awareness
We've got customers

(diagram: raw data feeding dashboards, data science apps, reporting, data access and ad-hoc queries)
Stage 2: Storming & Troubleshooting
You've got NOC & SCS on speed dial

Issues arise:
• Data loss
• Data delays
• Partial data out of frame
• Missing/faulty calculations for consumers
• One producer does not send for over a week
Stage 2: Storming & Troubleshooting
You've got NOC & SCS on speed dial

Common issue types and generators:
• Hadoop ops
• Production ops
• Events schema
• New data producers
• High rate of new features (LE2.0)
• Data stuck in pipeline
• Bugs
Stage 3: Fortifying
Every interruption derives a new protection

• Monitors on jobs, failures, success rate
• Monitors on service status
• Simple data freshness checks, e.g. measure the newest event
• Measure latency of specific parts of the pipeline
Stage 4: Internalization & Comprehension
Auditing requirements

• Measure principles:
  – Loss
    • How much?
    • Which customer?
    • What type?
    • Where in the pipeline?
  – Freshness
    • Percentiles
    • Trends
  – Statistics
    • Event type count
    • Events per LP customer
    • Trends
Stage 4: Internalization & Comprehension
Auditing architecture

(diagram: producers emit events plus audit control events; the Audit Loader writes control events into the Audit DB; the Audit Aggregator and the Freshness job also feed the Audit DB)
Stage 4: Internalization & Comprehension
Mechanism

1. Enrich events with audit metadata (each event carries its data, a common header and an audit header)
2. Send a control event (an audit aggregation, with common and audit headers) every x minutes
Stage 4: Internalization & Comprehension
Mechanism

(diagram: in the old data flow each event carries only data and a common header; in the audited data flow each event also carries an audit header, and periodic control events carry the audit aggregation)
Stage 4: Internalization & Comprehension
How to measure loss?

• Tag all events going through our API with an auditing header:
  <host_name>:<bulk_id>:<sequence_id>
  Where:
  • host_name - the logical identification of the producer server
  • bulk_id - an arbitrary unique number that identifies a bulk (changes every X minutes)
  • sequence_id - an auto-incremented persistent number used to identify missing bulks
• Every X minutes send an audit control event:
  {eventType: AuditControlEvent,
   Bulks: [{bulk_id: "srv-xyz:111:97", data_tier: "shark producer", total_count: 785},
           {bulk_id: "srv-xyz:112:98", data_tier: "shark producer", total_count: 1715}]}
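The tagging scheme above can be sketched as follows. This is an illustrative sketch, not LivePerson's actual producer code: `AuditTagger`, its method names, and the bulk-id format are assumptions based on the header description.

```python
import uuid

class AuditTagger:
    """Tags outgoing events with <host_name>:<bulk_id>:<sequence_id> (sketch)."""

    def __init__(self, host_name):
        self.host_name = host_name
        self.bulk_id = uuid.uuid4().hex[:6]  # arbitrary unique id per bulk
        self.sequence_id = 0                 # persisted in a real producer
        self.count = 0                       # events seen in the current bulk

    def tag(self, event):
        # Enrich the event with the audit header before sending it.
        event["audit"] = f"{self.host_name}:{self.bulk_id}:{self.sequence_id}"
        self.count += 1
        return event

    def roll_bulk(self, data_tier):
        """Called every X minutes: emit a control event and start a new bulk."""
        control = {
            "eventType": "AuditControlEvent",
            "Bulks": [{
                "bulk_id": f"{self.host_name}:{self.bulk_id}:{self.sequence_id}",
                "data_tier": data_tier,
                "total_count": self.count,
            }],
        }
        self.bulk_id = uuid.uuid4().hex[:6]
        self.sequence_id += 1
        self.count = 0
        return control
```

Because the sequence_id is auto-incremented, a consumer that sees sequence 97 and then 99 knows bulk 98's control event went missing.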
Stage 4: Internalization & Comprehension
What's next?

• Immediate gain: enables researching loss directly on the raw data
Next:
• Count events per auditing bulk
• Load into a DB for dashboarding:

| Audit metadata    | Data tier | Insertion time | Events count |
|-------------------|-----------|----------------|--------------|
| srv-xyz:1a2b3c:25 | Producer  | 08:34          | 750          |
| srv-xyz:1a2b3c:25 | HDFS      | 09:05          | 405          |
| srv-xyz:1a2b3c:25 | HDFS      | 10:13          | 250          |

In this example, suppose we look at the table after 11:34 and treat anything that takes more than 3 hours as loss. Server srv-xyz created 750 events in bulk 1a2b3c, but only 405 + 250 = 655 arrived within 3 hours, so we detect a loss of 95 events from this server.
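The loss check the example walks through might look like this; the row shape and the `detect_loss` name are assumptions for illustration (hours are represented as fractional numbers for simplicity).

```python
from collections import defaultdict

def detect_loss(rows, window_hours=3):
    """rows: (audit_metadata, data_tier, insertion_hour, count) tuples.
    Returns {audit_metadata: lost_count} for bulks with a shortfall."""
    produced = {}               # producer-reported count per bulk
    arrived = defaultdict(int)  # events that reached later tiers in time
    for meta, tier, hour, count in rows:
        if tier == "Producer":
            produced[meta] = (hour, count)
    for meta, tier, hour, count in rows:
        if tier != "Producer" and meta in produced:
            start, _ = produced[meta]
            if hour - start <= window_hours:
                arrived[meta] += count
    return {meta: total - arrived[meta]
            for meta, (start, total) in produced.items()
            if arrived[meta] < total}
```

Running it on the table above (08:34 → 8.57, etc.) reports a loss of 95 events for bulk srv-xyz:1a2b3c:25, matching the worked example.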
Stage 4: Internalization & Comprehension
How to measure freshness?

• Run incrementally on the raw data
• Group events by:
  – Total
  – Event type
  – LP customer
• Per event, calculate insertion time - creation time
• Per group:
  – Total count
  – Min, max & average
  – Count into time buckets (0-30; 30-60; 60-120; 120-∞)
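A minimal sketch of the per-group statistics, assuming latencies (insertion time minus creation time) are already computed in minutes; the function name is illustrative.

```python
def freshness_stats(latencies_minutes):
    """Compute count, min, max, avg and the 0-30/30-60/60-120/120+ buckets."""
    buckets = {"0-30": 0, "30-60": 0, "60-120": 0, "120+": 0}
    for lat in latencies_minutes:
        if lat < 30:
            buckets["0-30"] += 1
        elif lat < 60:
            buckets["30-60"] += 1
        elif lat < 120:
            buckets["60-120"] += 1
        else:
            buckets["120+"] += 1
    return {
        "count": len(latencies_minutes),
        "min": min(latencies_minutes),
        "max": max(latencies_minutes),
        "avg": sum(latencies_minutes) / len(latencies_minutes),
        "buckets": buckets,
    }
```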
Stage 5: Being Proactive
Tools:
• Loss dashboard
• Loss detailed dashboard
• Loss trends
• Freshness
• Data statistics
Showcase I
Bug in a new producer
Showcase II
Deployment issue

• Constant loss
• Only in one farm
• Depends on traffic
• Only a specific producer type, from all of its nodes
Showcase III
Consumer job issues

• Our auditing detected a loss in Alpha
• Data stuck in a job failure dir
• Functional monitoring missed it
• We streamed the stuck data
Showcase IV
Producer issues

• Offline producer gets stuck
• Functional monitoring misses it
Implementation
Auditing architecture

(diagram: producers emit events plus audit control events; the Audit Loader writes control events into the Audit DB; the Audit Aggregator and the Freshness job also feed the Audit DB)
Implementation
Auditing architecture (recap)
Implementation
Audit Loader

• Storm topology
• Loads audit events from Kafka into MySQL

| Bulk    | Tier | TS    | Count |
|---------|------|-------|-------|
| xyz:123 | WRPA | 08:34 | 750   |
| xyz:123 | DSPT | 09:05 | 405   |
| xyz:123 | DSPT | 10:13 | 250   |
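What the loader does with a control event can be sketched as flattening its bulk entries into rows shaped like the Audit DB table above. The function name and timestamp handling are illustrative; the real loader is a Storm topology writing to MySQL.

```python
def control_event_to_rows(control_event, ts):
    """Flatten an AuditControlEvent into (bulk, tier, ts, count) DB rows."""
    assert control_event["eventType"] == "AuditControlEvent"
    return [(b["bulk_id"], b["data_tier"], ts, b["total_count"])
            for b in control_event["Bulks"]]
```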
Implementation
Auditing architecture (recap)
Implementation
Audit Aggregator

• Loads data from HDFS
• Aggregates events according to audit metadata
• Saves aggregated audit data to MySQL
• Spark implementation
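The aggregation step can be sketched as counting raw events per audit header, so each bulk's arrival count lands next to the producer-reported count in the DB. This is a plain-Python stand-in for the Spark job, with assumed field names.

```python
from collections import Counter

def aggregate_audits(events, tier):
    """Count events per audit metadata and tag them with the data tier."""
    counts = Counter(e["audit"] for e in events)
    return [(meta, tier, n) for meta, n in sorted(counts.items())]
```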
Audit Aggregator job
First Generation

(diagram: read data from HDFS, aggregate per audit bulk (∑ per bulk), collect & save the counts to the DB, and store the consumed offset in ZooKeeper)
Audit Aggregator job
Overcoming Pitfalls

• Our jobs work incrementally or manually
• Offset management by ZooKeeper
• Failing during the saving stage leads to a lost offset
• Saving data and offset on the same stream
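Saving the data and the offset on the same stream can be sketched as a single transaction: a crash can then never persist one without the other. SQLite stands in here for MySQL, and the table and column names are hypothetical.

```python
import sqlite3

def save_batch(conn, rows, offset):
    """Persist aggregated audit rows and the consumed offset atomically."""
    with conn:  # one transaction: commits both writes or neither
        conn.executemany(
            "INSERT INTO audit (bulk, tier, ts, cnt) VALUES (?, ?, ?, ?)", rows)
        conn.execute("UPDATE offsets SET value = ? WHERE id = 1", (offset,))
```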
Audit Aggregator job
Revised Design

(diagram: read data from HDFS, aggregate per audit bulk, then collect & save the counts and the offset together to the DB)

| Bulk    | Tier | TS    | Count |
|---------|------|-------|-------|
| xyz:123 | WRPA | 08:34 | 750   |
| xyz:123 | DSPT | 09:05 | 405   |
| xyz:123 | DSPT | 10:13 | 250   |
Audit Aggregator job
Why Spark

• Precedent - Spark Streaming for online auditing
• We see our future with Spark
• Cluster utilization
• Performance
  – In-memory computation
  – Supports multiple shuffles
  – Unified data processing: batch/streaming
Implementation
Auditing architecture (recap)
Implementation
Data Freshness

• End-to-end latency assessment
• Freshness per criteria
• Output - various stats
Freshness job
Design

(diagram: map events from HDFS into groups - total, LP customer, event type - then reduce each group to min, max, avg and bucket counts)
Freshness job
Mechanism

• Driver
  – Collects LP events from HDFS
• Map
  – Computes freshness latencies
  – Segments events per criteria by generating a composite key
• Reduce
  – Computes count, min, max, avg and buckets
  – Writes stats to HDFS
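The map and reduce steps above can be sketched as follows. The composite key segments each event three ways (total, per LP customer, per event type); field names are assumptions, and the time buckets are omitted for brevity.

```python
from collections import defaultdict

def freshness_map(event):
    """Emit (composite_key, latency) pairs for each segmentation criterion."""
    lat = event["insertion_time"] - event["creation_time"]
    yield ("total",), lat
    yield ("customer", event["customer"]), lat
    yield ("type", event["event_type"]), lat

def freshness_reduce(pairs):
    """Group latencies by composite key and compute count/min/max/avg."""
    groups = defaultdict(list)
    for key, lat in pairs:
        groups[key].append(lat)
    return {key: {"count": len(v), "min": min(v), "max": max(v),
                  "avg": sum(v) / len(v)}
            for key, v in groups.items()}
```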
Freshness job
Output usage
Hadoop Platform
Overcoming Pitfalls

• Our data model is built over Avro
• Avro comes with schema evolution
• Avro data is stored along with its schema
• High model-modification rate
• LOBs' schema changes are synchronized producer → consumer
Hadoop Platform
Overcoming Pitfalls

• An MR/Spark job must be recompiled against each schema revision when using SpecificRecord
• Using GenericRecord removes the burden of recompiling each time the schema changes
Implementation
Auditing architecture (recap)
THANK YOU!
We are hiring
YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev
Slideshare.net/LivePersonDev