Building Web Analytics on Hadoop at CBS Interactive
Michael Sun
Big Data Workshop, Boston 03/10/2012
Brands and Websites of CBS Interactive (Samples)
[Brand logo collage, grouped by category: Games & Movies; Tech, Biz & News; Sports; Entertainment; Music]
CBSi Scale
• Top 20 global web property
• 235M worldwide monthly unique users
• Hadoop cluster size:
– Current workers: 40 nodes (260 TB)
– This month: add 24 nodes, 800 TB total
– Next quarter: ~80 nodes, ~1 PB
• DW peak processing: >500M events/day globally, doubling next quarter (ad logs)
1 - Source: comScore, March 2011
Web Analytics Processing
• Collect web logs for web metrics analysis
– Web logs track clicks, page views, downloads, streaming video events, ad events, etc.
• Provide internal metrics for website monitoring
• A/B testing
• Biller apps, external reporting
• Ad event tracking to support sales
• Provide data services
– Support marketing by providing data for data mining
– User-centric datastore (stay tuned)
– Optimize user experience
A sample raw web log record, and the schema used to parse it:
2105595680218152960 125.83.8.253 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002-e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 - 1
schemas.append(Schema((  # schemas[0]
    SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
    SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
    SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
    SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
    SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
    SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'),
    SchemaField('http_status', 'int', nullable=True, signed=True),
    SchemaField('bytes_sent', 'int', nullable=True, signed=True),
    SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2),
)))
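Schema and SchemaField belong to CBSi's own framework (introduced below as Lumberjack), which was not public at the time of this talk. As a rough illustration of what such a declaration enforces, here is a deliberately simplified, hypothetical stand-in for the range and null handling; the names and behavior are assumptions, not the real API.

class SchemaField(object):
    # Hypothetical, simplified stand-in for the framework's field class.
    def __init__(self, name, ftype, nullable=True, maxlen=None,
                 on_range_error='raise'):
        self.name, self.ftype = name, ftype
        self.nullable, self.maxlen = nullable, maxlen
        self.on_range_error = on_range_error

    def coerce(self, value):
        # Null handling: non-nullable fields must carry a value.
        if value is None or value == '':
            if not self.nullable:
                raise ValueError('%s may not be null' % self.name)
            return None
        if self.ftype == 'int':
            return int(value)
        # Range handling: clip over-long strings instead of failing
        # the row when on_range_error is 'truncate'.
        if self.maxlen is not None and len(value) > self.maxlen:
            if self.on_range_error == 'truncate':
                return value[:self.maxlen]
            raise ValueError('%s exceeds %d chars' % (self.name, self.maxlen))
        return value

# Usage: an over-long cookie is truncated rather than rejected.
cookie = SchemaField('cookie', 'string', nullable=True, maxlen=100,
                     on_range_error='truncate')
print(cookie.coerce('clgf=Cg+5E02cT/eWAAAAo0Y'))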
Modernize the Platform
• The web log processing on a proprietary platform ran into its limits
– The code base was 10 years old
– The vendor no longer supported the version we used
– Not fault-tolerant
– Upgrading to the newer version was not cost-effective
• Data volume is increasing all the time
– 300+ web sites
– Video tracking is growing the fastest
– New business initiatives to support
• Use open-source systems as much as possible
Hadoop to the Rescue / Research
• Open source: a scalable data-processing framework based on MapReduce
• Processes petabytes of data using the Hadoop Distributed File System (HDFS)
– High throughput
– Fault-tolerant
• Distributed computing model
– Based on the functional programming model
− MapReduce (M|S|R; a toy example follows this list)
• Execution engine
– Used as a cluster for ETL
– Collect data (distributed harvester)
– Analyze data (M/R, streaming + scripting + R, Pig/Hive)
– Archive data (distributed archive)
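To make the streaming flavor of MapReduce concrete, here is a toy mapper/reducer pair in Python; the record layout (tab-separated, site id in a fixed column) is invented for illustration and is not CBSi's format.

import sys

def mapper():
    # Map: emit (site_id, 1) for every event record on stdin.
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 3:             # skip malformed records
            print('%s\t1' % fields[3])  # hypothetical site-id column

def reducer():
    # Reduce: the shuffle/sort phase groups keys, so a running
    # total per key is enough.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

if __name__ == '__main__':
    # 'map' or 'reduce' chosen by the single command-line argument.
    mapper() if sys.argv[1] == 'map' else reducer()

Locally, `cat events.tsv | python job.py map | sort | python job.py reduce` emulates the same M|S|R flow that Hadoop runs at scale.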
The Plan
• Build web log collection (codename Fido)
– Apache web logs piped to cronolog
– Hourly M/R collector job to:
− Gzip hourly log files & checksum
− Scp from web servers to Hadoop datanodes
− Put on HDFS
• Build a Python ETL framework (codename Lumberjack)
– Based on stdin/stdout streaming, one process/one thread
– Can run stand-alone or on Hadoop (a sketch of this style follows the list)
– Pipeline
– Filter
– Schema
• Build web log processing with Lumberjack
– Parse
– Sessionize
– Lookup
– Format data / load to DB
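Lumberjack itself is only outlined in the talk, so the following is a guess at the flavor of its Pipeline/Filter style rather than its real API: small composable stages that read records from stdin and write them to stdout, so the same script runs stand-alone or as a Hadoop Streaming mapper. The delimiter and field positions are assumptions.

import sys

class ParseFilter(object):
    # Split a raw log line into fields; a plain space delimiter is
    # an assumption made for this sketch.
    def __call__(self, records):
        for line in records:
            yield line.rstrip('\n').split(' ')

class StatusFilter(object):
    # Keep only successful requests; the status column position is
    # hypothetical (it loosely matches the sample record above).
    def __call__(self, records):
        for rec in records:
            if len(rec) > 9 and rec[9] == '200':
                yield rec

def pipeline(stages, source):
    # Chain filters lazily, one record at a time, in a single thread.
    for stage in stages:
        source = stage(source)
    return source

if __name__ == '__main__':
    # stdin -> parse -> filter -> stdout: the same script works
    # stand-alone or as a Hadoop Streaming mapper.
    for rec in pipeline([ParseFilter(), StatusFilter()], sys.stdin):
        sys.stdout.write('\t'.join(rec) + '\n')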
[Architecture diagram, "Web Analytics": Apache logs from the sites are distributed by Fido onto HDFS in the Hadoop cluster; Python-ETL, MapReduce, and Hive process them together with external data sources and CMS systems; results are loaded into the DW database and feed web metrics, billers, data mining, and Clickmap.]
Web Log Processing by Hadoop Streaming and Python-ETL
• Parsing web logs
– IAB filtering and checking
– Parsing user agents by regex
– IP range lookup
– Look up product key, etc.
• Sessionization (a sketch follows this list)
– Prepare-sessionize
– Sessionize
– Filter-unpack
• Process huge dimensions (URL/page title)
• Load facts
– Format load data / load data to DB
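The deck does not show the sessionize step itself; below is a minimal sketch of the standard approach, assuming events arrive grouped by cookie and sorted by timestamp (exactly what Hadoop's shuffle/sort can deliver to a reducer) and assuming a 30-minute inactivity timeout.

import sys

SESSION_GAP = 30 * 60  # assumed inactivity timeout, in seconds

def sessionize(events):
    # events: (cookie, timestamp) pairs already grouped by cookie and
    # sorted by time, as Hadoop's shuffle/sort delivers to a reducer.
    last_cookie, last_ts, session_id = None, None, 0
    for cookie, ts in events:
        # A new visitor or a long silence starts a new session.
        if cookie != last_cookie or ts - last_ts > SESSION_GAP:
            session_id += 1
        yield cookie, ts, session_id
        last_cookie, last_ts = cookie, ts

if __name__ == '__main__':
    pairs = (line.rstrip('\n').split('\t') for line in sys.stdin)
    for cookie, ts, sid in sessionize((c, int(t)) for c, t in pairs):
        print('%s\t%d\t%d' % (cookie, ts, sid))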
Benefits to Ops
• Cut processing time enough to meet the SLA, saving 6 hours
• Running 2 years in production without any big issues
• Withstood the test of 50%/year data volume growth
• Architecture designed so new processing logic is easy to add
• Robust and fault-tolerant
– Five dead datanodes, jobs still ran OK
– Upgraded the JVM on a few datanodes while jobs were running
– Reprocessed old data while processing the current day's data
Conclusions I – Create a Tool Appropriate to the Job if Existing Ones Don't Have What You Want
• The Python ETL framework and Hadoop Streaming together can do complex, high-volume ETL work
• Python ETL framework
– Home-grown, under review for open-source release
– Rich functionality via Python
– Extensible
– NLS support
• Put on top of another platform, e.g. Hadoop, which provides (a launch sketch follows this list):
– Distribution/parallelism
– Sorting
– Aggregation
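For context on how the two layers combine: a pipeline like the one above is typically submitted through the Hadoop Streaming jar, letting Hadoop supply the distribution and sorting while the Python stages do the record-level work. Every path and script name below is a placeholder, not CBSi's actual layout.

import subprocess

# Submit the Python stages as a Hadoop Streaming job.
cmd = [
    'hadoop', 'jar', '/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar',
    '-input', '/weblogs/2012/03/07/',
    '-output', '/sessions/2012/03/07/',
    '-mapper', 'parse.py',        # record-level work stays in Python
    '-reducer', 'sessionize.py',  # Hadoop supplies grouping and sort
    '-file', 'parse.py',
    '-file', 'sessionize.py',
]
subprocess.check_call(cmd)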
Conclusions II – Power and Flexibility for Processing Big Data
• Hadoop: scale and computing horsepower
– Robustness
– Fault tolerance
– Scalability
– Significant reduction of processing time to meet the SLA
– Cost-effective
− Commodity hardware
− Free software
• Currently:
– Building multi-tenant Hadoop clusters using the Fair Scheduler