Building a Data Pipeline from Scratch - Joe Crobak

Joe Crobak (@joecrobak). Tuesday, June 24, 2014. Axium Lyceum, New York, NY

Source: http://www.hakkalabs.co/articles/building-data-pipeline-scratch

Transcript of Building a Data Pipeline from Scratch - Joe Crobak

  • BUILDING A DATA PIPELINE From Scratch 1 Joe Crobak (@joecrobak). Tuesday, June 24, 2014. Axium Lyceum, New York, NY
  • INTRODUCTION 2 Software Engineer @ Project Florida. Previously: Foursquare, Adconion Media Group, Joost
  • OVERVIEW 3 Why do we care? Defining "Data Pipeline". Events. System Architecture
  • 4 DATA PIPELINES ARE EVERYWHERE
  • RECOMMENDATIONS 5 http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  • RECOMMENDATIONS 6 Clicks Views Recommendations http://blog.linkedin.com/2010/05/12/linkedin-pymk/
  • AD NETWORKS 7
  • AD NETWORKS 8 Clicks, Impressions, User Ad Profile
  • SEARCH 9 http://lucene.apache.org/solr/
  • SEARCH 10 Search Rankings Page Rank http://www.jevans.com/pubnetmap.html
  • A / B TESTING 11 https://flic.kr/p/4ieVGa
  • A / B TESTING 12 https://flic.kr/p/4ieVGa A conversions, B conversions, Experiment Analysis
  • DATA WAREHOUSING 13 http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/
  • DATA WAREHOUSING 14 http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/ key metrics user events Data Warehouse
  • 15 WHAT IS A DATA PIPELINE?
  • DATA PIPELINE 16 A Data Pipeline is a unified system for capturing events for analysis and building products.
  • DATA PIPELINE 17 (diagram) click data, user events, web visits, email sends → Extract Transform Load (ETL) → Data Warehouse → Product Features, Ad Hoc analysis, Counting, Machine Learning
  • DATA PIPELINE 18 A Data Pipeline is a unified system for capturing events for analysis and building products.
  • 19 EVENTS
  • EVENTS 20 Each of these actions can be thought of as an event.
  • COARSE-GRAINED EVENTS 21 Events are captured as a by-product and stored in text logs, used primarily for debugging and secondarily for analysis.
  • COARSE-GRAINED EVENTS 22 127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969 (IP Address, Timestamp, Action, Status)
  • COARSE-GRAINED EVENTS 23 Implicit tracking, i.e. a page-load event is a proxy for one or more other events. e.g. the event GET /newsfeed corresponds to: App Load (but only if this is the first time loaded this session); Timeline load; user is in group A of an A/B test. These implementation details have to be known at analysis time.
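The access-log line on slide 22 illustrates how coarse-grained events end up being recovered at analysis time: someone has to parse the text log. A stdlib-only sketch (the field names are illustrative, not from the talk):

```python
import re

# Pattern for a common-log-style line: IP, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

def parse_log_line(line):
    """Extract the coarse-grained event fields from one access-log line."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

event = parse_log_line(
    '127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969'
)
```

Every consumer of the log has to maintain a parser like this, which is one of the costs the talk attributes to coarse-grained, debug-oriented logging.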
  • FINE-GRAINED EVENTS 24 Record events like: app opened, auto refresh, user pull-down refresh. Rather than: GET /newsfeed
  • FINE-GRAINED EVENTS 25 Annotate events with contextual information like: the view the user was on, which button was clicked
  • FINE-GRAINED EVENTS 26 Decouple logging and analysis. Create events for everything!
  • FINE-GRAINED EVENTS 27 A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks: harder to change schemas, inefficient, require writing parsers.
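A fine-grained event as the slides describe it, explicit name plus contextual annotations, might look like the following JSON record (the field names here are illustrative, not from the talk). It also shows the schema-less drawback: nothing enforces the field names or types.

```python
import json

# A fine-grained event: an explicit event name plus contextual
# annotations, rather than inferring everything from "GET /newsfeed".
event = {
    "event": "newsfeed_refreshed",
    "trigger": "user_pull_down",      # vs. "auto_refresh"
    "view": "home_timeline",          # which view the user was on
    "ab_test_group": "A",
    "timestamp": "2014-06-17T01:53:16Z",
}

# JSON is easy to emit, but schema-less: every consumer must know the
# field names, and nothing stops a producer from silently changing them.
line = json.dumps(event)
decoded = json.loads(line)
```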
  • SCHEMA 28 Used to describe data, providing a contract about fields and their types. Two schemas are compatible if you can read data written in schema 1 with schema 2.
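The compatibility rule on slide 28 can be sketched in a few lines. This is a toy model of Avro-style schema resolution, not the real Avro library: a reader schema that adds a field with a default can still read records written with the older schema.

```python
# Toy schemas: field name -> default value (None = required, no default).
WRITER_SCHEMA = {"fields": {"event": None, "user_id": None}}
READER_SCHEMA = {"fields": {"event": None, "user_id": None, "view": "unknown"}}

def read_record(record, reader_schema):
    """Fill in reader-schema fields, using defaults for fields the
    writer schema did not have."""
    return {
        name: record.get(name, default)
        for name, default in reader_schema["fields"].items()
    }

old_record = {"event": "app_opened", "user_id": 42}  # written with schema 1
resolved = read_record(old_record, READER_SCHEMA)    # read with schema 2
```

Defaults are what make the two schemas compatible; a new required field without a default would break readers of old data.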
  • SCHEMA 29 Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.
  • SCHEMA 30 https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter
  • SCHEMA 31 client:page:section:component:element:action e.g.: iphone:home:mentions:tweet:button:click. Count iPhone users clicking from the home page: iphone:home:*:*:*:click. Count home clicks on buttons or avatars: *:home:*:*:{button,avatar}:click
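The wildcard queries on slide 31 amount to segment-wise matching over the six-part event names. A minimal sketch (my implementation, not Twitter's), where each pattern segment is a literal, `*`, or a `{a,b}` alternative set:

```python
def matches(pattern, event_name):
    """Match a colon-delimited event name against a pattern of the
    same arity, segment by segment."""
    p_parts = pattern.split(":")
    e_parts = event_name.split(":")
    if len(p_parts) != len(e_parts):
        return False
    for pat, seg in zip(p_parts, e_parts):
        if pat == "*":
            continue                                  # wildcard segment
        if pat.startswith("{") and pat.endswith("}"):
            if seg not in pat[1:-1].split(","):       # {button,avatar} set
                return False
        elif pat != seg:                              # literal segment
            return False
    return True

name = "iphone:home:mentions:tweet:button:click"
```

With this, both example queries from the slide match the example event name, while e.g. `android:home:*:*:*:click` does not.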
  • 32 KEY COMPONENTS
  • EVENT FRAMEWORK 33 For easily generating events from your applications
  • BIG MESSAGE BUS 35 Horizontally scalable. Redundant. APIs / easy to integrate
  • BIG MESSAGE BUS 36 Scribe (Facebook), Apache Chukwa, Apache Flume, Apache Kafka*. Horizontally scalable. Redundant. APIs / easy to integrate. (* My recommendation)
  • DATA PERSISTENCE 37 For storing your events in files for batch processing
  • DATA PERSISTENCE 38 Kite Software Development Kit (http://kitesdk.org/). Spring for Apache Hadoop (http://projects.spring.io/spring-hadoop/)
  • WORKFLOW MANAGEMENT 39 For coordinating the tasks in your data pipeline
  • WORKFLOW MANAGEMENT 40 Or your own system, written in your language of choice.
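At its core, coordinating the tasks in a data pipeline means running them in dependency order. A stdlib-only sketch using `graphlib` (Python 3.9+); tools like Luigi add scheduling, retries, and idempotency on top (the task names here are illustrative):

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on
dag = {
    "ingest_events": set(),
    "dump_production_db": set(),
    "build_warehouse": {"ingest_events", "dump_production_db"},
    "compute_metrics": {"build_warehouse"},
}

# static_order() yields tasks so that every task runs after its inputs.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    print("running", task)
```

A real workflow manager also tracks which tasks already succeeded, so re-running the pipeline only redoes the missing work, which is why the deck calls out idempotency.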
  • SERIALIZATION FRAMEWORK 41 Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
  • SERIALIZATION FRAMEWORK 42 Apache Avro*, Apache Thrift, Protocol Buffers (Google). (* My recommendation)
  • BATCH PROCESSING AND AD HOC ANALYSIS 43 Apache Hadoop (MapReduce) Apache Hive (or other SQL-on-Hadoop) Apache Spark
  • SYSTEM OVERVIEW 44 (diagram) Application (logging framework, data serialization) → Message Bus → Persistent Storage → Data Warehouse → Ad hoc Analysis, Product. Data flow coordinated by a workflow engine; Production DB dumps feed in as well.
  • SYSTEM OVERVIEW (OPINIONATED) 45 (diagram) Same architecture with opinionated picks: Apache Avro for data serialization, Apache Kafka as the Message Bus, Luigi as the workflow engine. Application → Message Bus → Persistent Storage → Data Warehouse → Ad hoc Analysis, Product; Production DB dumps feed in as well.
  • NEXT STEPS 46 This architecture opens up a lot of possibilities: Near-real-time computation (Apache Storm, Apache Samza (incubating), Apache Spark Streaming). Sharing information between services asynchronously, e.g. to augment user profile information. Cross-datacenter replication. Columnar storage.
  • LAMBDA ARCHITECTURE 47 Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing. Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
  • LAMBDA ARCHITECTURE 48 http://lambda-architecture.net/
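The batch-as-source-of-truth idea can be sketched as a query-time merge: serve the last batch result plus a real-time delta covering events that arrived since the batch ran. A toy illustration (the counts and names are made up):

```python
# Batch layer output: authoritative counts, recomputed periodically
# (e.g. by a Hadoop job). Source of truth.
batch_counts = {"app_opened": 10_000, "button_click": 2_500}

# Speed layer output: counts for events seen since the last batch run
# (e.g. maintained by a Storm topology). Discarded after each batch.
realtime_counts = {"app_opened": 37, "signup": 4}

def query(event_name):
    """Merge batch and real-time views at query time."""
    return batch_counts.get(event_name, 0) + realtime_counts.get(event_name, 0)
```

Because the batch layer periodically recomputes everything from the raw events, any errors in the speed layer are transient, which is the appeal of the design.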
  • SUMMARY 49 Data Pipelines are everywhere. Useful to think of data as events. A unified data pipeline is very powerful. Plethora of open-source tools to build a data pipeline.
  • FURTHER READING 50 The Unified Logging Infrastructure for Data Analytics at Twitter. The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn). Big Data by Nathan Marz and James Warren. Implementing Microservice Architectures.
  • THANK YOU 51 Questions? ! Shameless plug: www.hadoopweekly.com
  • 52 EXTRA SLIDES
  • WHY KAFKA? 53 https://kafka.apache.org/documentation.html#design Pull model works well. Easy to configure and deploy. Good JVM support. Well-integrated with the LinkedIn stack.
  • WHY LUIGI? 54 Scripting language (you'll end up writing scripts anyway). Simplicity (low learning curve). Idempotency. Easy to deploy.
  • WHY AVRO? 55 Self-describing files. Integrated with nearly everything in the ecosystem. CLI tools for dumping to JSON, CSV.