Real Time Data Analytics with MongoDB and Fluentd at Wish

Post on 08-Sep-2014


Analytics @ Wish

Powered by Fluentd & MongoDB

Hi

I’m Adam.

Wish ♥︎ MongoDB

• Primary database since 2011

• 67x mongod

• AWS → bare metal (SSDs ftw!)

What’s Wish?

• Mobile eCommerce

• 30M+ users worldwide

• Top 10 iOS & Android

Experiment

’cause otherwise you’re just guessing…

Hypothesis

“Billing Zip” is confusing outside America

Data

Compare checkout conversions for international Android users

Conclusion

~7% boost in mobile sales
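A minimal sketch of how such a comparison might be computed, assuming each experiment bucket is summarized by checkout sessions and completed purchases. The function names and the counts are hypothetical and are not figures from the talk.

```python
def conversion_rate(purchases: int, checkouts: int) -> float:
    """Fraction of checkout sessions that ended in a purchase."""
    if checkouts == 0:
        raise ValueError("no checkout sessions in this bucket")
    return purchases / checkouts


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative change of the variant over control (0.05 means +5%)."""
    if control_rate == 0:
        raise ValueError("control rate is zero; lift is undefined")
    return (variant_rate - control_rate) / control_rate


# Hypothetical counts for international Android users, for illustration only.
control = conversion_rate(purchases=400, checkouts=10_000)
variant = conversion_rate(purchases=430, checkouts=10_000)
lift = relative_lift(control, variant)
print(f"control={control:.2%} variant={variant:.2%} lift={lift:.1%}")
```

A real report would also need a significance test before calling the result a win; that is left out of this sketch.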

Goal

Frictionless analytics to everyone

{“solution”: [“logging”, “aggregation”, “analysis”, “serving”]}


Request Logs = Source of Truth

{'contest_impressions’:'53060fbd34067e4d6cee70f4,535ad13a7360465e2ca799f8,528b714df689996fdb574800,525976a71c23882ab3b73ecb,5285df6db5baba737f459037,5208ae7d3deaf74a6cc65da4,5209e5c31c238861a1ab91cc,5285df6db5baba735f459061,51f7778f3ba3770a514a5431,527be1fc227d210d2bcdeac5,532fcfe3796f6832713b5c3a,527be203227d210dd5cdeaac,52d3ef2806ea960dde85cb97,527bc781227d210d8acdea47,527bc793227d210d4fcdea48, 5208ad653deaf74a4bc65d41,5208acdd1c238846f9ab9028,5182fc1273c67621e507591b,5311ae6c796f68283f8f86c3, 52de2bf4ab980a2d00da786a,5208a9c53deaf74a75c65c6b,52eca45a717951350382e4be,52d3ef73bb5aa51ccf866c01, 533d6fae5aefb0427771f346,5285df6db5baba734d45901b,51c27d8d5ffe8f0b0b9b0359,52d0e002a30fb227725b6e06, 52f71bd89f5ef741d8f34698,52d3ef71bb5aa53135866d76, 5308bc467360464265101ed9,52d3ef27bb5aa5024d866c09, 52c399d60599170e49fd866e,5209be541c23886177ab91db,5208b15e1c2388615fab91b7', '_country_code': u'CA', '_lang': u'en', '_fb_uid': 500406911, '_device_id': None, '_uid': '4eb346049b120f09f60007c0', '_tid': 2, '_host': 'adam.corp.contextlogic.com', '_last_id': u'cc3aa96b2b3c45bca11009edc049f2f6', '_experiment_tags': ['mobile_commerce_home_v4_female_ignore', 'mobile_large_cart_cell_ignore', 'hannibal_cohort_firsttime_buyer_ignore', 'localize_product_names__fr_ignore', 'mobile_cart_guarantee_view_ignore', 'mobile_related_tags_v2_ignore', 'shipping_price_us_ignore', 'stripe_settle_on_ship_control', 'related_super_feed_iphone_show-v4', 'mobile_commerce_home_v3_male_i18n_show', 'braintree_settle_on_ship_control', 'mobile_show_tabbed_billing_page_i18n_ignore', 'mobile_new_guarantee_text_ios_ignore', 'mobile_use_category_signup_flow_i18n_ignore', 'male_curated_first_ipad_ignore', 'mobile_commerce_home_v4_female_i18n_ignore', 'commerce_product_page_show', 'mobile_use_category_signup_flow_v3_ignore', 'mobile_save_for_price_us_female_relaunch_2_ignore', 'web_stripe_checkout_ignore', 'mobile_show_tabbed_billing_page_us_ignore', 'stripe_checkout_show', 
'shipping_price_i18n_fixed-price-promo', 'chukou1_pilot_experiment_ignore', 'mobile_implicit_ratings_v1_show', 'feed_commerce_2_control', 'mobile_commerce_home_v3_male_ignore', 'swap_out_male_feed_show-weight-deep', 'related_super_feed_ipad_ignore', 'female_curated_first_iphone_ignore', 'mobile_psuedo_localized_currency_show', 'hannibal_cohort_repeat_buyer_ignore', 'web_boleto_checkout_ignore', 'exploration_v2_control', 'female_curated_first_android_ignore', 'male_curated_first_android_ignore', 'related_super_feed_android_show-v4', 'curated_feed_female_shopping_ignore', 'mobile_localized_currency_control', 'male_curated_first_iphone_ignore', 'mobile_show_required_shipping_fields_ignore', 'mobile_ct2_variable_shipping_price_showcountry', 'mobile_c2c_ignore', 'localize_product_names__es_ignore', 'related_products_v2_control', 'female_curated_first_ipad_ignore', 'mobile_categories_v1_ignore', 'related_super_feed_show', 'mobile_baby_category_signup_flow_ignore', 'mobile_checkout_offer_v2_control', 'mobile_minimum_notification_interval_ignore', 'mobile_show_tabbed_feed_existing_user_ignore', 'mobile_cart_fake_only_x_left_show', 'late_shipment_apology_v2_ignore', 'mobile_show_tabbed_feed_new_user_ignore'], '_app_type': 0, 'impression_feed_category': None, '_client': 'web', '_refer_url': None, 'sort': 'recommended', '_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', '_arguments': {}, '_currency': 'CAD', '_protocol': 'http', 'offset': 0, '_method': 'GET', 'count': 40, '_locale': 'en', '_timestamp': 1401996333, '_bsid': '979b5fbcad4f4fdbb1477ae7ba8ed123', '_is_cached': False, '_version': None, '_response_status': 200, 'filter': 'all', '_response_time': 0.2887430191040039, '_uri': '/', '_remote_ip': None, '_is_user_pending': False, '_id': '1e6135e3d2eb4214afdbd99456d71183'}

A feed request…

{'products_shown': '...', 'feed_category': null, 'sort': 'recommended', 'filter': 'all', 'offset': 0, 'count': 40,
 '_uid': '4eb34609ff60007c0', '_client': 'web', '_country_code': 'CA',
 '_id': '1e6135e3d9456d7183', '_last_id': 'cc39edc49f2f6', '_experiment_tags': [...],
 '_uri': '/', '_refer_url': null, '_arguments': {}, '_method': 'GET', '_locale': 'en', '_response_status': 200}
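A rough sketch of reading the experiment bucket out of a request log record like the one above. It assumes each entry in `_experiment_tags` has the form `<experiment>_<bucket>`, which is inferred from the tag names in the log and not stated in the talk. The helper name and the bucket list passed in are hypothetical.

```python
from typing import Iterable, Optional


def experiment_bucket(record: dict, experiment: str,
                      buckets: Iterable[str]) -> Optional[str]:
    """Return the bucket this request was in for `experiment`, if any.

    Tags are matched exactly as '<experiment>_<bucket>'. A prefix match would
    be ambiguous here: 'related_super_feed' is also a prefix of
    'related_super_feed_iphone_show-v4'.
    """
    tags = set(record.get("_experiment_tags") or [])
    for bucket in buckets:
        if f"{experiment}_{bucket}" in tags:
            return bucket
    return None


# A trimmed record using tag names that appear in the request log above.
record = {
    "_country_code": "CA",
    "_client": "web",
    "_experiment_tags": [
        "stripe_settle_on_ship_control",
        "related_super_feed_show",
        "related_super_feed_iphone_show-v4",
    ],
}

print(experiment_bucket(record, "related_super_feed", ["control", "show", "ignore"]))
```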

One problem

Searching all requests ever is slow

Transaction Log

{'txn_id': '5390c295e9b9bbe68b2', 'user_id': '4eb346049b9f60007c0',
 'total': 18.0, 'shipping': 2.0,
 'items': [{'product_id': '537b42379b9e3f55f', 'qty': 1, 'price': 16.0}]}
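A small worked check on the transaction record above: the line items (1 × 16.0) plus shipping (2.0) add up to the total (18.0). The rule that `total` equals items plus shipping is an assumption that happens to fit this one record. The helper names are hypothetical.

```python
from decimal import Decimal

txn = {
    'txn_id': '5390c295e9b9bbe68b2', 'user_id': '4eb346049b9f60007c0',
    'total': 18.0, 'shipping': 2.0,
    'items': [{'product_id': '537b42379b9e3f55f', 'qty': 1, 'price': 16.0}],
}


def items_subtotal(txn: dict) -> Decimal:
    """Sum of qty * price over all line items, using Decimal to avoid float drift."""
    return sum((Decimal(str(item['price'])) * item['qty'] for item in txn['items']),
               Decimal('0'))


def is_consistent(txn: dict) -> bool:
    """True when items plus shipping equals the recorded total (assumed rule)."""
    expected = items_subtotal(txn) + Decimal(str(txn['shipping']))
    return expected == Decimal(str(txn['total']))


print(items_subtotal(txn), is_consistent(txn))
```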

{“solution”: [“logging”, “aggregation”, “analysis”, “serving”]}

Centralize Logs

• Synchronously?

• Fire & forget?

• fluentd!
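To make the trade-off in the list above concrete, here is a minimal sketch of the buffered hand-off pattern: the request thread enqueues an event and returns at once, while a background worker ships batches to a sink. This is not Fluentd's implementation. The class name, `batch_size`, and the sink callable are hypothetical, and a local Fluentd agent plays a comparable buffering role outside the application process.

```python
import queue
import threading
from typing import Callable, List


class BufferedLogger:
    """Queue events in memory and flush them in batches from a worker thread."""

    def __init__(self, sink: Callable[[List[dict]], None], batch_size: int = 100):
        self._sink = sink
        self._batch_size = batch_size
        self._queue: queue.Queue = queue.Queue()
        self._stop = object()  # sentinel that tells the worker to finish
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def emit(self, tag: str, record: dict) -> None:
        """Called on the request path; only enqueues, never waits on the sink."""
        self._queue.put({"tag": tag, "record": record})

    def close(self) -> None:
        """Flush whatever is still buffered and stop the worker."""
        self._queue.put(self._stop)
        self._worker.join()

    def _run(self) -> None:
        batch: List[dict] = []
        while True:
            item = self._queue.get()
            if item is self._stop:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)


received: List[dict] = []
logger = BufferedLogger(sink=received.extend, batch_size=2)
for offset in (0, 40, 80):
    logger.emit("web.request", {"_uri": "/", "offset": offset})
logger.close()
print(len(received))
```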

Architecture

App server (Wish + fluentd) → Aggregation servers (fluentd) → Hadoop/Hive

{“solution”: [“logging”, “aggregation”, “analysis”, “serving”]}

Hadoop & Hive

• Great for log analysis

• Arbitrary queries

• No schema design constraints

Hadoop & Hive

• Running a Hadoop cluster sucks

– Treasure Data’s managed Hive solution rocks!

{“solution”: [“logging”, “aggregation”, “analysis”, “serving”]}

MongoDB!

• Analysis results → MongoDB

• Store all combinations

– Unsexy, but fast

– 2 TB total

Schema

{"_id": ObjectId(…), "click_id": 2, "source_page_id": 1000, "count": 20171, "timestamp": 20140601,
 "gender": "Male", "client": "Android", "country": "CA", "experiment_tag": "zip_help_text-show"}

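A sketch of what "store all combinations" could look like: roll raw click events up into one count per day and per combination of dimension values, producing rows shaped like the schema above. The field names come from the slide. The grouping logic and the sample events are assumptions, and in the talk this step runs in Hive, not Python.

```python
from collections import Counter
from typing import Iterable, List

DIMENSIONS = ("gender", "client", "country", "experiment_tag")


def aggregate_clicks(events: Iterable[dict]) -> List[dict]:
    """Count events per (timestamp, click_id, source_page_id, *DIMENSIONS)."""
    counts: Counter = Counter()
    for event in events:
        key = (event["timestamp"], event["click_id"], event["source_page_id"])
        key += tuple(event[dim] for dim in DIMENSIONS)
        counts[key] += 1

    rows = []
    for key, count in sorted(counts.items()):
        timestamp, click_id, source_page_id, *dims = key
        row = {"click_id": click_id, "source_page_id": source_page_id,
               "count": count, "timestamp": timestamp}
        row.update(zip(DIMENSIONS, dims))
        rows.append(row)
    return rows


# Hypothetical raw events, for illustration only.
base = {"timestamp": 20140601, "click_id": 2, "source_page_id": 1000,
        "gender": "Male", "client": "Android", "country": "CA",
        "experiment_tag": "zip_help_text-show"}
events = [base, dict(base), dict(base, country="US")]
rows = aggregate_clicks(events)
print(rows)
```

Because every combination is precomputed, serving a graph becomes an exact-match lookup on these fields, which is the "unsexy, but fast" trade-off.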
Let’s Review

MongoDB

Logs (app servers)

Fluentd

Hadoop/Hive

Tools

Who doesn’t love nifty graphs?

Dashy

• Graphing dashboard

Perimeter

• A/B test reports

– Summary tables, detailed CSVs

– See trade-offs

Analytics = faster iteration

More growth, more revenue

Analytics = faster iteration

Powered by Fluentd & MongoDB

Happy Analyzing!

adam@wish.com

{“subtitle”:”Why Fluentd?”}

http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

Acquire Data (or so you think)

WUT!? Invalid UTF8?

Fix the encoding issue…

Yell at the engineers

Some columns are missing!?

Run the script… DIVISION BY ZERO!!!

Hmm…

Logging.priority => :not_super_high

Analytics.priority => :very_high

Analytics.needs? :logs => true

{“subtitle”: ”Overview”, “has_code”: true, “has_example”: true}

127.0.0.1 - - [05/Feb/2012:17:11:55 +0000] "GET / HTTP/1.1" 200 140 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.5 Safari/535.19"

{ "host": "127.0.0.1", "user": "-", "method": "GET", "path": "/", "code": "200", "size": "140", "referer": "-", "agent": "Mozilla/5.0 (Windows…"}

Parse as JSON!

?

["05/Feb/2012:17:11:55", "web.access", { "host": "127.0.0.1", "user": "-", "method": "GET", "path": "/", "code": "200", "size": "140", "referer": "-", "agent": "Mozilla/5.0 (Windows…"}]
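A minimal sketch of the parsing step shown above, turning one Apache combined-format line into a [time, tag, record] triple. The regular expression and the function name are illustrative assumptions, not Fluentd's own parser. Fluentd represents event time as a timestamp, whereas this sketch keeps the string form used on the slide.

```python
import re
from typing import Optional

# Apache combined log format; field names follow the record on the slide.
APACHE_COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

LINE = ('127.0.0.1 - - [05/Feb/2012:17:11:55 +0000] "GET / HTTP/1.1" 200 140 "-" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 '
        '(KHTML, like Gecko) Chrome/18.0.1025.5 Safari/535.19"')


def to_event(line: str, tag: str) -> Optional[list]:
    """Return [time, tag, record], or None when the line does not match."""
    match = APACHE_COMBINED.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Drop the UTC offset so the time string matches the slide's form.
    time = record.pop("time").split(" ")[0]
    return [time, tag, record]


event = to_event(LINE, "web.access")
print(event)
```

Returning None on a malformed line, instead of raising, lets the caller decide whether to drop or quarantine bad input.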

?

web.mongodb

web.file

web.hdfs

web.s3

web.mysql

<source>
  type tail
  path /var/log/apache/access.log
  tag web.access
  format apache2
</source>

Apache log → Fluentd

<source>
  type tail
  path /var/log/apache/access.log
  tag web.access
  format apache2
</source>

<match web.access>
  type mongo
  user kiyoto
  password heartbleed
  database web
  collection access
  # … host, port, etc.
</match>

Apache log → Fluentd → MongoDB

<match web.access>
  type copy
  <store>
    type mongo
    user kiyoto
    password heartbleed
    database web
    collection access
    # … host, port, etc.
  </store>
  <store>
    type s3
    # … aws secret, bucket, etc.
  </store>
</match>

Apache log → Fluentd → MongoDB + S3

{“subtitle”: ”scalability”}

• Automate monitoring!

• App and System metrics

• JSON everywhere

• 2000+ nodes

• ~1B events/day

• Forwarder-Aggregator

{“subtitle”: ”Demo”, “need”: “Demo Karma”}

<source>
  type mongostat
  uri "172.17.0.2"
</source>

<match mongostat.*.*>
  type mongo
  user kiyoto
  password heartbleed
  database web
  collection access
  # … host, port, etc.
</match>
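A simplified sketch of how a match pattern such as `mongostat.*.*` selects events by tag. In Fluentd, `*` matches exactly one dot-separated tag part. This sketch covers only that rule and leaves out `**` and `{a,b}` patterns. The function name and the example tags are hypothetical; the tags the mongostat input actually emits are not shown in the talk.

```python
def tag_matches(pattern: str, tag: str) -> bool:
    """True when every tag part equals the pattern part or the pattern part is '*'."""
    pattern_parts = pattern.split(".")
    tag_parts = tag.split(".")
    if len(pattern_parts) != len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))


print(tag_matches("mongostat.*.*", "mongostat.host1.stats"))  # three parts
print(tag_matches("mongostat.*.*", "mongostat.host1"))        # only two parts
```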

MongoDB → Fluentd → MongoDB

Build your own *MS!

{ “install”: “gem install fluentd”, “website”: “www.fluentd.org”, “github” : “fluent/fluentd”, “twitter”: “@fluentd”}