Insight Data Engineering - Week 4 - Andrew Mo
TRACKING LIVE WIKIPEDIA [email protected]
Insight Data Engineering Week 4 - January 2015
WIKIWATCH . ANDREWMO . COM
MOTIVATION
• Raw dumps of Wikipedia data are available for analysis on a monthly basis, but…
What about changes between these intervals?
• Data Collection: Live edits for Wikimedia projects are broadcast to nearly 882 IRC channels
• Goals: Collect, filter, format, transform and produce information from live edit data
[Diagram labels: "data", "now", "data"; "why wait?"; "what happened?"]
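The "collect, filter, format" goal can be sketched as a small parser for one broadcast line. This is a minimal sketch, not the project's code: it assumes each IRC message follows the layout `[[Title]] FLAGS URL * USER * (+/-BYTES) COMMENT`, is wrapped in mIRC colour codes, and uses the flag letters N (new page), M (minor edit) and B (bot). The names `parse_edit`, `CONTROL_CODES` and `EDIT_LINE` are hypothetical.

```python
import re

# mIRC colour and formatting control codes (assumed to be present in the raw feed).
CONTROL_CODES = re.compile(r"\x03\d{0,2}(?:,\d{1,2})?|[\x02\x0f\x16\x1d\x1f]")

# Assumed layout: [[Title]] FLAGS URL * USER * (+/-BYTES) COMMENT
EDIT_LINE = re.compile(
    r"^\[\[(?P<title>.+?)\]\]\s+(?P<flags>[!A-Za-z]*)\s+(?P<url>\S*)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s*(?P<comment>.*)$"
)


def parse_edit(raw_line):
    """Return a dict describing one edit, or None if the line does not match."""
    clean = CONTROL_CODES.sub("", raw_line).strip()
    match = EDIT_LINE.match(clean)
    if match is None:
        return None  # filter step: drop anything that is not an edit line
    flags = match.group("flags")
    return {
        "title": match.group("title"),
        "url": match.group("url"),
        "user": match.group("user"),
        "delta": int(match.group("delta")),  # signed size change in bytes
        "comment": match.group("comment"),
        "is_new": "N" in flags,    # assumed flag meanings
        "is_minor": "M" in flags,
        "is_bot": "B" in flags,
    }
```

Lines that do not match (joins, notices, log events in another layout) come back as `None`, so the caller can filter them out before any metrics are computed.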
STREAM PROCESSING
Multiple Topologies (10 sec, 1 min, 10 min, 1 hr)
Multiple Metrics (events, size, new pages, topics, users)
Python + Storm (Pyleus)
MySQL
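The window and metric combinations above can be illustrated with a plain-Python, in-memory stand-in. This is a sketch under stated assumptions, not the Pyleus topologies from the talk: it assumes tumbling (non-overlapping) windows, events shaped as dicts with `ts` (epoch seconds), `delta`, `is_new` and `user` keys, and "size" measured as absolute bytes changed. `WINDOWS`, `window_start` and `aggregate` are hypothetical names.

```python
from collections import defaultdict

# Window name -> length in seconds, matching the four intervals on the slide.
WINDOWS = {"10s": 10, "1m": 60, "10m": 600, "1h": 3600}


def window_start(ts, size):
    """Align a timestamp to the start of its tumbling window."""
    return ts - (ts % size)


def aggregate(events):
    """Roll events up into per-window metrics, keyed by (window name, window start)."""
    totals = defaultdict(
        lambda: {"events": 0, "bytes": 0, "new_pages": 0, "users": set()}
    )
    for ev in events:
        for name, size in WINDOWS.items():
            bucket = totals[(name, window_start(ev["ts"], size))]
            bucket["events"] += 1
            bucket["bytes"] += abs(ev["delta"])  # assumed definition of "size"
            bucket["new_pages"] += 1 if ev["is_new"] else 0
            bucket["users"].add(ev["user"])
    return totals
```

In a streaming deployment each window length would more likely run as its own topology that flushes a bucket when the interval ends; the single loop here only shows how one event contributes to every window.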
API ACCESS
Time Series Summary Metrics for Multiple Windows
New Pages
Detailed User Activity
Detailed Topic Activity
Top Topics, Top Users, Top Bots, etc
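A "Top Users" or "Top Bots" style result can be sketched as a simple ranking over parsed events. This is a minimal illustration, not the actual API: the event keys (`user`, `title`, `is_bot`) and the function `top_n` are assumptions, and the real service presumably reads precomputed rows from MySQL instead of ranking in memory.

```python
from collections import Counter


def top_n(events, key, n=3, bots=None):
    """Rank values of `key` by event count.

    bots=None counts every event, bots=True only bot edits,
    bots=False only non-bot edits.
    """
    counts = Counter(
        ev[key] for ev in events if bots is None or ev["is_bot"] == bots
    )
    return counts.most_common(n)
```

For example, `top_n(events, "title")` would back a top-topics query, while `top_n(events, "user", bots=True)` would back a top-bots query.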
Thanks
Apache Software Foundation
Wikimedia Foundation
Insight Data Science
LinkedIn (Kafka)
Twitter (Storm)
Yelp (Pyleus)
ABOUT MO
A Project Manager that Writes Code!
Worked at: RAND Corporation, Booz Allen Hamilton
Studied at: Pardee RAND Graduate School, UC San Diego - Electrical Engineering
Alphabet Soup
PMP, PMI-ACP, CISSP, ISSEP, CSEP, CSEP-ACQ
[email protected]
GitHub: https://github.com/moandcompany
LinkedIn: http://linkedin.com/in/andrewmo
DEDUPE IN THE BATCH LAYER
• Parse all files and send to a Pig batch process
• All log files go into a directory consumed by Pig (‘LOAD’)
• Apply ‘DISTINCT’ criteria to all rows in LOAD before computing metrics
[Diagram: three logBot instances on the #en channel, with HDFS]
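The dedupe step above is done in Pig in the talk; the same idea can be sketched in Python for illustration. This is a minimal sketch that assumes each log row is a hashable tuple such as `(timestamp, channel, raw_line)` and that redundant logBot instances write identical rows for the same edit. The function name `distinct` and the row layout are hypothetical.

```python
def distinct(rows):
    """Yield each row once, dropping exact duplicates.

    Mirrors the effect of applying DISTINCT to all loaded rows
    before metrics are computed. Rows must be hashable (e.g. tuples).
    """
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row
```

Unlike Pig's DISTINCT, which makes no ordering guarantee, this version keeps rows in first-seen order. It also holds every distinct row in memory, so it illustrates the logic and is not a replacement for the batch job.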
VELOCITY AND OUR NEXT SPRINT
Sprint 1 (MVP Development)
18 Jan - 31 Jan 2015
Address the need + Simplify
API-query elicitation and discovery
Novel feature focus: Realtime
Maximize common-code (Python)
Sprint 2 (MVP Validation)
Context Enrichment Features
API enhancement
Batch Integration
NoSQL Optimization
Preempt Technical Debt - Refactoring
Velocity Chart
TECHNOLOGY TO EVALUATE
• Presto
• Luigi
• Samza
• Hive + Tez
• Kafka on YARN (KOYA)
• Kafka Security (Authentication)
• Spark + Spark Streaming (1.2+ Python)