Demand, Media, and Search Analytics at AOL

October 4, 2011
Sean Timm, Twitter: @timmsc
Demand, Media, and Search Analytics


Presented at Hadoop-DC on October 4, 2011.

Transcript of Demand, Media, and Search Analytics at AOL

Page 1: Demand, Media, and Search Analytics at AOL

October 4, 2011

Sean Timm
Twitter: @timmsc

Demand, Media, and Search Analytics

Page 2: Demand, Media, and Search Analytics at AOL

Introduction
• Who am I?
• What do we use Hadoop for?
• Our best practices
• Lessons learned
• Example applications: related searches and seasonality


Page 3: Demand, Media, and Search Analytics at AOL


History
• Originated in Search Backend in 2007
• Create data-driven products for search.aol.com from search logs
• No Netezza experience, decided to try Hadoop
• Took 3 weeks to write a simple aggregation
• With Apache Pig 0.3: 2 days
• First product, related searches, launched in 2008
• Search breaking trends product led to further demand work
• Now Pig 0.8.1 and Hadoop 0.20.2

Page 4: Demand, Media, and Search Analytics at AOL


Data

Hourly search.aol.com logs
• 5 M log lines of data per hour
• Logs include searches, clicks, and other data
• 70% of queries we only see once

Hourly Wikipedia page view data
• Public data set: http://dammit.lt/wikistats
• 7 M pages viewed per hour
• 2.7 M English pages per hour

Beacon logs
• Page view and click logs for AOL Huffington Post Media, Patch, and other AOL properties
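The "70% of queries we only see once" figure can be read as the share of distinct queries that occur exactly once in a log sample. A minimal sketch of that calculation follows; the function name and the normalization step (trim and lowercase) are assumptions for illustration, not details from the talk.

```python
from collections import Counter


def singleton_share(queries):
    """Return the fraction of distinct queries that occur exactly once."""
    # Hypothetical normalization: collapse case and surrounding whitespace.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)
```

On real logs the same count would be a group-and-count in Pig; the Python version only shows the arithmetic.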

Page 5: Demand, Media, and Search Analytics at AOL


We like Pig!
• Hourly, daily, and monthly search and click aggregation
• Related searches
• Auto complete dictionary
• Mining spelling correction click through
• Temporal pattern analysis
• Classifying adult queries and URLs
• Categorizing queries
• Identifying queries in the form of a question or superlative
• Identifying breaking trends in AOL Search and Wikipedia page views
• Identify queries of local interest
• Clustering queries using click graph, temporal distance, Carrot2, k-means
• AOL HPMG stats and trends for page views, authors, tags, etc.

Page 6: Demand, Media, and Search Analytics at AOL


Pig Process in General
Script run time ranges from under 2 minutes to over 2 hours

Ad hoc…wild west

Complex shell scripts
1. Load/copy/backup data
2. Launch multiple Pig scripts, some in parallel, some with serial dependencies
3. Check for errors: e-mail and halt
4. Load data into MySQL, Vertica, or Solr
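The talk describes shell scripts for this flow. As a rough sketch of the same control logic in Python: stages run one after another, the commands inside a stage run in parallel, and the first failing stage triggers a notification and a halt. The names `run_stage`, `run_pipeline`, `runner`, and `notify` are hypothetical; `notify` stands in for the e-mail step.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_stage(commands, runner=subprocess.call):
    """Run one stage's commands in parallel; return the commands that failed."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        codes = list(pool.map(runner, commands))
    return [cmd for cmd, code in zip(commands, codes) if code != 0]


def run_pipeline(stages, runner=subprocess.call, notify=print):
    """Run stages serially; notify and halt at the first stage with a failure."""
    for number, commands in enumerate(stages, start=1):
        failed = run_stage(commands, runner)
        if failed:
            notify("stage %d failed: %r" % (number, failed))
            return False
    return True
```

A call might look like `run_pipeline([[["pig", "-f", "hourly.pig"]], [["pig", "-f", "related.pig"], ["pig", "-f", "trends.pig"]]])`, where the script names are made up. Passing `runner` and `notify` as parameters keeps the control flow testable without a cluster.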

Page 7: Demand, Media, and Search Analytics at AOL


Getting data out of Hadoop
First approach: special StoreFunc to write directly to MySQL/Solr
• Network: required the master to be on the same network as the cluster
• Speculative optimization: data would be written more than once, increasing contention as well as doing unnecessary writes
• Replication: writes hit the master in parallel, but replication is serial and was slow (MySQL)
• Timeouts: occasionally a task failed and restarted (Solr)

Page 8: Demand, Media, and Search Analytics at AOL


Getting Data out of Hadoop
MySQL/Vertica Now
• Write data to HDFS
• Copy from HDFS to local file system using CLI
• Load into database: LOAD DATA LOCAL INFILE from mysql client
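The MySQL steps above can be sketched as two command lines built in Python. The slide does not say which HDFS command is used for the copy; `hadoop fs -getmerge` is an assumption here, chosen because it concatenates a job's part files into one local file. The function name and all paths, database, and table names are hypothetical.

```python
def export_commands(hdfs_dir, local_file, database, table):
    """Build the two CLI commands: HDFS -> local file, local file -> MySQL."""
    # Assumption: -getmerge is one way to do the "copy using CLI" step.
    copy_cmd = ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]
    load_sql = "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s" % (local_file, table)
    # --local-infile=1 lets the mysql client read the file from local disk.
    load_cmd = ["mysql", "--local-infile=1", "-e", load_sql, database]
    return copy_cmd, load_cmd
```

If the Pig job stores its output with the default PigStorage, the file is tab-delimited, which matches the default field terminator of LOAD DATA, so no FIELDS clause would be needed in that case.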

Solr Now
• Custom StoreFunc writes Solr XML to HDFS
• Starting with Pig 0.7 fields are named using the Pig schema
• Copy from HDFS to local file system using CLI
• Load into Solr using remote streaming
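For readers unfamiliar with the Solr XML update format, here is a small sketch of the kind of document such a StoreFunc would emit. The real StoreFunc is a Java class that is not shown in the talk; this Python function (`to_solr_add_xml`, a hypothetical name) only illustrates the `<add><doc><field name="...">` shape, with field names taken from the record keys much as the slide says they are taken from the Pig schema.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(rows):
    """Serialize dicts of field name -> value as a Solr <add> update message."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")
```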

Page 9: Demand, Media, and Search Analytics at AOL


UDFs
• Use Piggy Bank and builtins when possible
• 89 custom UDFs packaged in a single jar
• Most are simple
  • Validate a URL, URL decode a string, calculate a hash value, date math, etc.
• Some are complex
  • Spell check/correct, LOESS regression, Carrot2 clustering, FFT, Euclidean distance, etc.
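To give a sense of how small the simple UDFs are, here is the logic of three of the listed functions in plain Python. These are not the UDFs from the talk (those are packaged in a jar and not shown); the function names and the exact validation rule are assumptions.

```python
import math
from urllib.parse import unquote_plus, urlparse


def url_decode(text):
    """Decode percent-escapes and '+' signs, as in a form-encoded query string."""
    return unquote_plus(text)


def is_valid_url(text):
    """Loose check (an assumption): http(s) scheme and a non-empty host."""
    parts = urlparse(text)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```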

Page 10: Demand, Media, and Search Analytics at AOL


Lessons learned
• Many small categorization scripts: better to use a single larger one
• Set priority on large time-sensitive jobs that fight for resources with other jobs
• Fair scheduler
• Tuning the cluster for maps or reduces
• Don't write copious debug output
• Use an appropriate number of reducers (PARALLEL)

Page 11: Demand, Media, and Search Analytics at AOL

Related Searches

Group by Query

Page 12: Demand, Media, and Search Analytics at AOL

Challenges
• Adult terms
• Misspellings
• Breadth of suggestions
• Coverage
• Timeliness of suggestions

Page 13: Demand, Media, and Search Analytics at AOL

Process Flow
• Filter and clean data
  • Block adult terms, long queries, non-alpha, second+ pages, operators, URL-like queries, search spam
  • Lowercase
• Join to get query-related query groups
• Contextual spell correct within group
• Cluster related queries and pick the best from each group
• Load into Solr
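The "filter and clean" step above can be sketched as a single function that returns either a normalized query or nothing. Every rule here is an illustrative assumption: the word limit, the operator and URL patterns, and the blocked-term list are not from the talk. The second+ page and search spam filters need log fields beyond the query string and are left out.

```python
import re

MAX_WORDS = 8  # hypothetical cutoff for "long queries"
OPERATOR = re.compile(r'\b(?:site|inurl|intitle):|["+]')
URL_LIKE = re.compile(r"^(?:https?://|www\.)|\.(?:com|net|org)\b")
ALPHA = re.compile(r"[a-z]")


def clean_query(raw, blocked_terms=frozenset()):
    """Return the lowercased query, or None if any filter rejects it."""
    query = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    words = query.split(" ")
    if not query or len(words) > MAX_WORDS:
        return None
    if not ALPHA.search(query):  # non-alpha queries
        return None
    if OPERATOR.search(query) or URL_LIKE.search(query):
        return None
    if any(word in blocked_terms for word in words):  # e.g. adult terms
        return None
    return query
```

Rejecting early, before the join, keeps the later grouping and clustering steps from spending work on queries that would never be shown as suggestions.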

Page 14: Demand, Media, and Search Analytics at AOL

Related Searches Graph
[Figure: graph for the query “The Eagles” with nodes labeled The band, NFL, Boston College, Hotel California, and Tribute]

Page 15: Demand, Media, and Search Analytics at AOL

Classification
• Supervised learning
• Provide categorized set of queries and/or URLs
• Calculate a score based on the edge weights
• If the score exceeds a specified threshold, the query or URL is tagged with the category
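The slide does not give the scoring formula. One plausible reading, sketched below, scores a query by the share of its click-graph edge weight that goes to URLs already labeled with the category, then compares that score to a threshold. The function names, the weight-share formula, and the default threshold of 0.5 are all assumptions.

```python
def category_score(edges, labeled):
    """Share of a node's total edge weight that points at labeled neighbors.

    edges: dict of neighbor -> edge weight (e.g. clicked URL -> click count)
    labeled: set of neighbors already known to belong to the category
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    hit = sum(weight for node, weight in edges.items() if node in labeled)
    return hit / total


def tag(edges, labeled, threshold=0.5):
    """Tag the node with the category when its score exceeds the threshold."""
    return category_score(edges, labeled) > threshold
```

The same function works in the other direction, scoring a URL from the labeled queries that lead to it, which fits the "queries and/or URLs" wording on the slide.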

Page 16: Demand, Media, and Search Analytics at AOL

Applications Outside of Search
• Author/citation bipartite graph
• Social network graphs
• User/Page view graphs

Page 17: Demand, Media, and Search Analytics at AOL

Temporal traffic correlation of Wikipedia Page Views
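The slide does not say which correlation measure was used. As one common choice, the sketch below computes the Pearson correlation between two page-view time series of equal length (for example, hourly counts for two Wikipedia pages). The function name and the handling of flat series are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is flat."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and the same length")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```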


Page 18: Demand, Media, and Search Analytics at AOL

Tomato Seasonality
May: planting tomatoes, tomato cages, types of tomatoes
June: pruning tomato plants
July: tomato diseases, tomato blight, tomato worm
August: tomato recipes, tomato soup, tomato sauce, tomato salsa
September: sun dried tomatoes, canning and freezing tomatoes
October: green tomato recipes
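A month-by-month list like this one can be produced by assigning each query to the month where it peaks. The sketch below picks the month with the highest share of total search volume; normalizing by monthly volume is an assumption, as is the function name, and the talk does not describe the actual method.

```python
def peak_month(monthly_counts):
    """Return the month (1-12) where a query's share of traffic is highest.

    monthly_counts: dict of month -> (query count, total searches that month)
    """
    shares = {m: q / t for m, (q, t) in monthly_counts.items() if t > 0}
    if not shares:
        return None
    # Iterating months in sorted order makes ties resolve to the earliest month.
    return max(sorted(shares), key=shares.get)
```

Using a share instead of a raw count keeps a query from appearing seasonal only because overall search traffic rose that month.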
