BigData @ comScore

BigData @ comScore Michael Brown, CTO, comScore, Inc. March 25th, 2011

Transcript of BigData @ comScore

Page 1: BigData @ comScore

BigData @ comScore

Michael Brown, CTO, comScore, Inc. March 25th, 2011

Page 2: BigData @ comScore

comScore is a Global Leader in Measuring the Digital World

NASDAQ SCOR

Clients 1600+ worldwide

Employees 1,000+

Headquarters Reston, VA

Global Coverage 170+ countries under measurement; 43 markets reported

Local Presence 30+ locations in 21 countries

2© comScore, Inc. Proprietary.


V0910

Page 3: BigData @ comScore

Broad Client Base and Deep Expertise Across Key Industries

Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology


Page 4: BigData @ comScore

The Trusted Source for Digital Intelligence Across Vertical Markets

4 out of the top 4 WIRELESS CARRIERS

9 out of the top 10 INVESTMENT BANKS

9 out of the top 10

9 out of the top 10 INTERNET SERVICE PROVIDERS

9 out of the top 10 AUTO INSURERS

47 out of the top 50 ONLINE PROPERTIES

45 out of the top 50 ADVERTISING AGENCIES

9 out of the top 10 MAJOR MEDIA COMPANIES

9 out of the top 10 PHARMACEUTICAL COMPANIES

9 out of the top 10 CONSUMER FINANCE COMPANIES

9 out of the top 10 CPG COMPANIES

Page 5: BigData @ comScore

comScore History of Leadership and Innovation

1st:

To measure the search market

To measure video streaming

To provide behavioral ad effectiveness

To meter mobile user behavior

To unify census + panel measurement

To build and project from 2 million+ longitudinal panel

To monitor and report e-commerce data

To deliver a worldwide Internet audience measurement

Global Shaper Company 2010

Page 6: BigData @ comScore

Average Records Captured per Day (2005-2009)

[Chart: average records captured per day, 2005-2009; y-axis from 0 to 1,800,000,000 records]

Page 7: BigData @ comScore

Launching the 3rd Generation

• In 2009, in the midst of the recession, comScore decided to build and release its 3rd Generation Product – Unified Digital Measurement (UDM or Hybrid)

• Technology Goals

– Ramp up data collection

– Deploy new methodologies for data processing and analysis

– Be able to scale the environment linearly to support growth

– Have yesterday's data available today

• And one more thing … do it in 4 months or less.

Page 8: BigData @ comScore

Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration

Global PERSON Measurement

Global MACHINE Measurement


PAGE TAGS

PANEL

Unified Digital Measurement (UDM) Patent-Pending Methodology

Adopted by 88% of Top U.S. Media Properties

Page 9: BigData @ comScore

How Does the Hybrid Process Work?

Collect Traffic from PCs and devices

Clean Traffic – remove non-human, bots, apply edit rules


Apply comScore URL Dictionary

Total Traffic Filtered Traffic

Page 10: BigData @ comScore

URL Dictionary (CFD): Advertising Industry “Currency”

• Intelligent grouping of Properties with 7+ levels of detail

– Property (e.g., Yahoo! Properties, Microsoft Sites)

– Media Title (e.g., Yahoo!, MSN)


– Channel (e.g., Yahoo! Search, MSN Homepages)

– Subchannel (e.g., Yahoo! Image Search, MSNBC)

– Group/Subgroup (e.g., Yahoo! Calendar, Today)
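A minimal sketch of how a pattern-based dictionary lookup of this kind might work, assuming that the most specific matching pattern wins. The entity names come from the slide; the host patterns, the `Entity` type, and the `classify` function are hypothetical and are not comScore's actual dictionary format.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class Entity(NamedTuple):
    """One dictionary entry, from coarsest to finest level (hypothetical type)."""
    property_name: str
    media_title: str
    channel: Optional[str] = None
    subchannel: Optional[str] = None


# (host suffix, path prefix, entity). Patterns are invented for illustration.
PATTERNS = [
    ("images.search.yahoo.com", "/",
     Entity("Yahoo! Properties", "Yahoo!", "Yahoo! Search", "Yahoo! Image Search")),
    ("search.yahoo.com", "/",
     Entity("Yahoo! Properties", "Yahoo!", "Yahoo! Search")),
    ("yahoo.com", "/",
     Entity("Yahoo! Properties", "Yahoo!")),
]


def classify(url: str) -> Optional[Entity]:
    """Return the most specific matching entity, or None if no pattern covers the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    best, best_score = None, -1
    for suffix, prefix, entity in PATTERNS:
        host_matches = host == suffix or host.endswith("." + suffix)
        if host_matches and path.startswith(prefix):
            # Longer patterns are treated as more specific.
            score = len(suffix) + len(prefix)
            if score > best_score:
                best, best_score = entity, score
    return best


if __name__ == "__main__":
    print(classify("http://images.search.yahoo.com/search?p=cats"))
    print(classify("http://example.org/"))
```

URLs that match no pattern return `None`, which corresponds to the share of pages not covered by the dictionary.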

Page 11: BigData @ comScore

URL Dictionary (CFD) Coverage Statistics

11MM Unique Domains Average/Month in 2010

• Over 80% of pages viewed came from the top 131K domains in 2010 vs. 392K in 2009

• 2,360K patterns in January 2011 represent 85% of all pages

• 1,254K syndicated entities in January 2010

• 41K patterns added/month in 2010.

Page 12: BigData @ comScore

Worldwide UDM™ Penetration

Percentage of Machines Included in UDM Measurement (July 2010 Penetration Data)

Europe: Austria 80%, Belgium 85%, Switzerland 84%, Germany 84%, Denmark 82%, Spain 90%, Finland 85%, France 91%, Ireland 91%, Italy 80%, Netherlands 88%, Norway 84%, Portugal 86%, Sweden 85%, United Kingdom 90%

Asia Pacific: Australia 91%, Hong Kong 88%, India 84%, Japan 73%, Malaysia 87%, New Zealand 88%, Singapore 91%

North America: Canada 94%, United States 91%

Latin America: Argentina 94%, Brazil 92%, Chile 94%, Colombia 95%, Mexico 93%, Puerto Rico 92%

Middle East & Africa: Israel 93%, South Africa 73%

Page 13: BigData @ comScore

Worldwide Tags per Day

[Chart: number of records per day, Jul 2009 to Feb 2011; y-axis from 0 to 25,000,000,000; series: Beacon Records, Panel Records]

Page 14: BigData @ comScore

Monthly Totals

[Chart: number of records per month, Jul 2009 to Feb 2011; y-axis from 0 to 600,000,000,000; series: Beacon Records, Panel Records]

Page 15: BigData @ comScore

High Level Data Flow

Panel

ETL


Census

ETL

Delivery

Page 16: BigData @ comScore

Enterprise Data Warehouse: Sybase IQ 15.2 Multiplex

• EDW currently comprises 20 servers running Windows 2003 R2 x64

– Currently 220 Intel CPUs

– Dedicated EDW technical team of 3 DBAs and 1 Administrator

– Ability to grow compute capacity and storage capacity independently

• EDW data repository housed on both EMC VMAX and CLARiiON

– 4 EDW instances (2 in Virginia and 2 in Illinois)

– One EDW instance is 147TB usable (approx. 200TB of raw data)

– Production EDW Drive Layout:

416 x 1TB SATA, RAID6, 14+2

42 x 600GB 15K, RAID1

8 x 400GB Flash, RAID5, 7+1

• Current Capacity and Performance Metrics

– 1,835,412,793,799 Rows loaded

– 140TB in 14,168 tables

– Capable of Loading 56 Billion rows per hour

Page 17: BigData @ comScore

Subsystem

• System designed using multiple subsystems

• Easily take out and replace different components as demands change

• In some cases, such as first-stage tag processing, moved from a single server to a cluster of servers within a few months

• Periodically redesign different subsystems to support increased processing demands

• Many systems are on their third generation of technology

Page 18: BigData @ comScore

Homegrown Distributed Processing

2002 – comScore distributed processing framework

– Reduced core aggregation from 48 hours to 7 hours

– Reduced final product creation from 24 hours to 2 hours

Scalability Wall

Open Source Hadoop framework

Page 19: BigData @ comScore

GreenPlum

• GreenPlum MPP

– 80 Node Cluster: 1 Master; 6 ETL; 72 Workers

– Using Dell R510 with 12 x 600GB 15K RAID, 64GB RAM, 24 cores (HT)

– Support analytic end users with access to record level data, through a SQL interface

– Ability to load over 400 billion rows in 8 hours

– Hourly data loading in place


– Allow the analysts to mine the data for the business uses

– Use for quick analysis of raw event data and for the ideation and creation of new products

Page 20: BigData @ comScore

Hadoop

• Hadoop

– Dev - 6x Dell 2950 w/ 6 1TB

– Prod - 10x Dell R710 w/ 6 600GB

– Prod in 2 weeks – 10x Dell R710 w/ 6 600GB & 20x Dell R510 w/ 12 2TB

– Moving large processing jobs that are constrained by our current framework to Hadoop. Some large analytical runs currently take over 40 hours on 32 servers, and we are re-engineering them to reduce processing time.

– We have found that the Fair Scheduler works well for our job loads

– We use a “homegrown” workflow system (BORG) that manages tasks inside and outside Hadoop.

Page 21: BigData @ comScore

Sharding

• Sharding divides work across multiple systems using different mechanisms

• Shard data as far upstream as possible

• Breaking data into multiple chunks early in processing makes it possible to add compute capacity downstream to accommodate large volume increases in data ingest
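As an illustration of one possible mechanism, here is a minimal sketch of hash-based sharding applied at ingest. The slide does not say which mechanisms comScore uses; the `cookie_id` field, the function names, and the choice of CRC32 as the hash are assumptions made for this example.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash (CRC32 chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shard_records(records, num_shards, key_field="cookie_id"):
    """Split records into shards so all records with the same key land together."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_for(record[key_field], num_shards)].append(record)
    return shards


if __name__ == "__main__":
    events = [
        {"cookie_id": "abc", "url": "http://a.example/"},
        {"cookie_id": "xyz", "url": "http://b.example/"},
        {"cookie_id": "abc", "url": "http://c.example/"},
    ]
    for shard_id, recs in sorted(shard_records(events, num_shards=4).items()):
        print(shard_id, recs)
```

Because every record for a given key lands in the same shard, each downstream worker can aggregate its shard independently of the others.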

Page 22: BigData @ comScore

Sorting

• We use DMExpress from Syncsort across hundreds of servers; this allows for efficient data processing

• We sort input data based on a column in advance

• To calculate uniques, check whether the current value differs from the prior value and, if so, increment a counter

• We now have aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server
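The uniques calculation described above can be sketched as a single pass over pre-sorted input. This is an illustration of the technique only, not DMExpress code; the function name is hypothetical.

```python
def count_uniques_sorted(sorted_values):
    """Count distinct values in an already-sorted iterable using constant memory."""
    count = 0
    previous = object()  # sentinel that never equals a real value
    for value in sorted_values:
        if value != previous:
            # The value changed from the prior row, so it is a new unique.
            count += 1
            previous = value
    return count


if __name__ == "__main__":
    cookies = sorted(["c2", "c1", "c2", "c3", "c1"])
    print(count_uniques_sorted(cookies))
```

Because the input is sorted, only the previous value has to be remembered, so memory use does not grow with the number of distinct values.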

Page 23: BigData @ comScore

Compression w/Sorting

• Compress Log Files when processing large volumes of log data

• Several advantages to Sorting Data First:

– Reduces the size of the data

– Improves application performance

• Examples:

– 1 Hour of our data (313 GB raw, 815 million rows)

– Standard compression of time ordered data is 93GB (30% of original)

– Standard compression on a 2 key sorted set is 56GB (18% of original)

– For one day it saves 800GB

– For one month it saves 25 TB

– For 90 days it saves 75TB
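The effect can be reproduced at small scale. The sketch below builds a synthetic log (the record layout, field sizes, and counts are assumptions, not comScore's data) and compares zlib-compressed sizes for time-ordered data and for the same records sorted on two keys.

```python
import random
import zlib


def make_log(num_records=50_000, num_cookies=5_000, num_urls=50, seed=0):
    """Build synthetic (timestamp, cookie, url) records in time order."""
    rng = random.Random(seed)
    cookies = [f"{rng.getrandbits(128):032x}" for _ in range(num_cookies)]
    urls = [f"http://site{i}.example/page" for i in range(num_urls)]
    return [
        (1_300_000_000 + t, rng.choice(cookies), rng.choice(urls))
        for t in range(num_records)
    ]


def compressed_size(records):
    """Serialize records as tab-separated lines and return the zlib-compressed size."""
    text = "".join(f"{cookie}\t{url}\t{ts}\n" for ts, cookie, url in records)
    return len(zlib.compress(text.encode("ascii")))


if __name__ == "__main__":
    time_ordered = make_log()
    # Sort on two keys (cookie, then URL) so similar rows sit next to each other.
    two_key_sorted = sorted(time_ordered, key=lambda r: (r[1], r[2]))
    print("time ordered:", compressed_size(time_ordered), "bytes")
    print("2-key sorted:", compressed_size(two_key_sorted), "bytes")
```

Sorting places rows with identical leading fields next to each other, so the compressor finds longer repeated runs. The exact ratio depends on the data and will not match the figures on the slide.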

Page 24: BigData @ comScore

Big data makes you think differently

• Question: How many distinct cookies over 3 months?

• Data: 3 monthly tables with distinct cookies, indexed

• Size: 10B records per table

• Platform: Sybase IQ

• Attempt: SELECT COUNT(cookies) over a UNION of the 3 monthly tables

– The UNION operator removes duplicates (DISTINCT)

• Result: FAIL. Out of temp space. Out of luck.

– Failed after 30 minutes.

• Why? UNION performs a SELECT and then a DISTINCT (sorting 30B rows)

Page 25: BigData @ comScore

Rethink the problem!

• INNER joins are cheaper

• No sort, they use existing indexes

• Remember set theory? Of course you do!

• Let months be {A, B, C}

[Venn diagram of sets A, B, C]

• INNER join on only 2 tables of data at a time

• 2-month intersections took 2 hours each and were less taxing on memory

• Used intersection of intermediate (indexed!) results… 5 mins

|A ∪ B ∪ C| = |A| + |B| + |C| – |A ∩ B| – |A ∩ C| – |C ∩ B| + |A ∩ B ∩ C|

A ∩ B ∩ C = (A ∩ B) ∩ (A ∩ C) ∩ (C ∩ B)

Total query time: 6.5 hours
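The inclusion-exclusion identity can be checked on small sets. In this sketch, Python sets stand in for the indexed monthly tables and set intersection stands in for the INNER joins; the function name is hypothetical.

```python
def distinct_across_three(a: set, b: set, c: set) -> int:
    """Count |A ∪ B ∪ C| using only pairwise intersections, never a full union."""
    ab = a & b  # stands in for the INNER join of months A and B
    ac = a & c
    cb = c & b
    # The triple intersection is built from the intermediate pairwise results.
    abc = ab & ac & cb
    return len(a) + len(b) + len(c) - len(ab) - len(ac) - len(cb) + len(abc)


if __name__ == "__main__":
    month_a = {"c1", "c2", "c3"}
    month_b = {"c2", "c3", "c4"}
    month_c = {"c3", "c4", "c5"}
    total = distinct_across_three(month_a, month_b, month_c)
    assert total == len(month_a | month_b | month_c)
    print(total)
```

The final assertion compares the result against a direct union, which is affordable at this scale but is the operation that ran out of temp space on 30B rows.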

Page 26: BigData @ comScore

TCO with Large Cluster Systems

• Examine replication factor and disk configuration for systems with replication built into the framework to support redundancy and concurrency

• Example:

Hadoop cluster that supports 108TB of base compressed data

Hypothetical Configurations:

– Replication Factor of 3

R710 (6x drives, JBOD); requires 162 servers

R510 (12x drives, JBOD); requires 68 servers

– Replication Factor of 2

R710 (6x drives, RAID 5); requires 129 servers

R510 (12x drives, RAID 5); requires 54 servers
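The arithmetic behind these configurations can be sketched as follows. The slide does not state drive sizes or usable capacity per server, so the per-server figures below are back-derived from the slide's server counts rather than taken from hardware specifications; the function names are hypothetical.

```python
import math


def total_storage_tb(base_tb: float, replication_factor: int) -> float:
    """Storage needed once every block is stored replication_factor times."""
    return base_tb * replication_factor


def implied_tb_per_server(base_tb: float, replication_factor: int, servers: int) -> float:
    """Usable capacity per server implied by a given server count."""
    return total_storage_tb(base_tb, replication_factor) / servers


def servers_needed(base_tb: float, replication_factor: int, usable_tb_per_server: float) -> int:
    """Servers required for a given usable capacity per server (an assumed input)."""
    return math.ceil(total_storage_tb(base_tb, replication_factor) / usable_tb_per_server)


if __name__ == "__main__":
    base = 108  # TB of base compressed data, from the slide
    for label, rf, servers in [
        ("R710, JBOD, RF 3", 3, 162),
        ("R510, JBOD, RF 3", 3, 68),
        ("R710, RAID 5, RF 2", 2, 129),
        ("R510, RAID 5, RF 2", 2, 54),
    ]:
        per_server = implied_tb_per_server(base, rf, servers)
        print(f"{label}: {total_storage_tb(base, rf)} TB total, {per_server:.2f} TB per server")
```

Dropping the replication factor from 3 to 2 reduces total storage from 324 TB to 216 TB, which is the main reason the server counts fall even though RAID 5 gives up some capacity to parity.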

Page 27: BigData @ comScore

Useful Factoids

Colorful, bite-sized graphical representations of the best discoveries we unearth.

Visit www.comscoredatamine.com or follow @datagems for the latest gems.

Page 28: BigData @ comScore

Thank You!

Michael Brown, CTO, comScore, Inc.

[email protected]
