Hemispheres of Data
FOX Audience Network
Brian Dolan, Director of Research Analytics
What is FOX Audience Network?
• Formerly a division of FOX Interactive with sister company MySpace, we are now an independent ad network
• Exclusive consumer of MySpace profile data
• Owner of two massive data stores:
– ~500 TB Hadoop instance containing MySpace user data
– ~250 TB (1 PB with redundancy) of ad serving events in a Greenplum data warehouse
FAN's Data Challenge
• 3-5 Billion ad serving events captured today, not including hundreds of millions of dimension
• Updating 30-50 million user profiles today
• Training over 2,000 sophisticated mathematical models weekly against multi-TB data sets
Data Character Varies Dramatically
• User Data
– Very sparse
– Intermittent
– Unstructured and user generated
– Untrustworthy
– Enormous
• Advertiser Data
– Dense
– Current
– Defined business dimensions
– Verified
– Enormous
Not Separate, Isolated
• Hadoop
– I love Horror Movies!
– I need a cell phone
– It's Miley!
• Greenplum
– What is responding to my ad?
– How much revenue did I generate today?
– Is this campaign fatigued?
Platform Tasks Also Differ Dramatically
• User Data
– Long strings parsed with regexp routines
– No more than three passes through the data
– Unreliable data feed where dimensions change weekly
– Complicated APIs
• Advertiser Data
– Hundreds of 1st Normal Form dimension tables
– Self-joining a routine task
– Views and temporary tables
– Reporting needs
– User management
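As a rough illustration of the user-data side, the sketch below tags free-text strings with regular expressions in a single pass. It is not FAN's actual routine: the tag names, the patterns, and the helper functions `tag_interests` and `count_interests` are hypothetical, and the sample strings are borrowed from the Hadoop examples earlier in the deck.

```python
import re
from collections import Counter

# Hypothetical interest patterns; tags and keywords are illustrative only.
INTEREST_PATTERNS = {
    "horror_movies": re.compile(r"\bhorror\s+movies?\b", re.IGNORECASE),
    "cell_phone": re.compile(r"\bcell\s*phones?\b", re.IGNORECASE),
    "miley": re.compile(r"\bmiley\b", re.IGNORECASE),
}


def tag_interests(text):
    """Return the sorted list of tags whose pattern matches the text."""
    return sorted(
        tag for tag, pattern in INTEREST_PATTERNS.items() if pattern.search(text)
    )


def count_interests(profiles):
    """Make one pass over an iterable of strings, counting tag hits."""
    counts = Counter()
    for text in profiles:
        counts.update(tag_interests(text))
    return counts


sample = [
    "I love Horror Movies!",
    "I need a cell phone",
    "It's Miley!",
    "nothing relevant here",
]
interest_counts = count_interests(sample)
```

Because `count_interests` consumes an iterable once, the same logic could sit inside a streaming or map-style job, which keeps the number of passes over the raw data low.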
Communicating
• Now
– Flat files passed between the systems (BAD!)
• Soon
– Hive to provide better structured output from Hadoop
– Greenplum to release HDFS reader/writer
• Message bus methods can feed both systems all the data, but transfer will always be necessary
• Don't drink the Kool-Aid
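One way to see why flat-file handoff is fragile is the minimal sketch below, which writes records to a tab-delimited file with a fixed column list and reads them back. The column names and the helpers `to_flat_file` and `from_flat_file` are assumptions for illustration, not the format FAN actually exchanged.

```python
import csv
import io

# Hypothetical column list agreed between the two systems.
COLUMNS = ["user_id", "interest", "score"]


def to_flat_file(records, columns=COLUMNS):
    """Serialize dict records to tab-delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # fields outside the agreed columns are dropped
        restval="",             # missing fields become empty strings
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()


def from_flat_file(text):
    """Parse the tab-delimited text back into dicts; every value is a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


records = [
    {"user_id": "u1", "interest": "horror_movies", "score": 0.9, "new_dim": "x"},
    {"user_id": "u2", "interest": "cell_phone"},
]
flat = to_flat_file(records)
rows = from_flat_file(flat)
```

In this sketch the unexpected `new_dim` field is silently dropped and the numeric score comes back as a string, which is the kind of loss a weekly-changing feed makes likely.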