Ranking the Web with Spark

Ranking the Web with Spark Apache Big Data Europe 2016 [email protected] @sylvinus

Transcript of Ranking the Web with Spark

Page 1: Ranking the Web with Spark

Ranking the Web with Spark

Apache Big Data Europe 2016

[email protected] @sylvinus

Page 2: Ranking the Web with Spark

/usr/bin/whoami

• Jamendo (Founder & CTO, 2004-2011)

• TEDxParis (Co-founder, 2009-2012)

• dotConferences (Founder, 2012-)

• Pricing Assistant (Co-founder & CTO, 2012-)

Page 3: Ranking the Web with Spark

transparency

reproducibility

Page 4: Ranking the Web with Spark
Page 5: Ranking the Web with Spark

https://uidemo.commonsearch.org

Page 6: Ranking the Web with Spark

https://explain.commonsearch.org/?q=python&g=en

Page 7: Ranking the Web with Spark

Ranking

Page 8: Ranking the Web with Spark

Disclaimer: IANASRE (I Am Not A Search Relevance Engineer)

Page 9: Ranking the Web with Spark

What's in a score

score = fn( doc, query, language, user, time )

Page 10: Ranking the Web with Spark

What's in a score

score = fn( doc, query )

Page 11: Ranking the Web with Spark

What's in a score

score = fn( static_score, dynamic_score ( query ))
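A minimal sketch of what such a function could look like. The slide does not say how the two parts are combined; the weighted linear blend, the `final_score` name and the `static_weight` parameter are assumptions made for illustration only.

```python
def final_score(static_score: float, dynamic_score: float,
                static_weight: float = 0.3) -> float:
    """Blend a query-independent score with a query-dependent one.

    Hypothetical linear combination; both inputs are assumed to be
    normalized to the same range (for example 0..1).
    """
    if not 0.0 <= static_weight <= 1.0:
        raise ValueError("static_weight must be between 0 and 1")
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


# The static score can be computed offline; only the dynamic part
# depends on the query and has to be computed at search time.
print(final_score(static_score=0.8, dynamic_score=0.4))
```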

Page 12: Ranking the Web with Spark

Static score

Page 13: Ranking the Web with Spark

Static features

• Scopes:

• Page: URL depth, markup stats, ...

• Domain: Age, page count, blacklists, ...

• WebGraph: PageRank, ...
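As an illustration of a page-scope feature, URL depth can be sketched as the number of non-empty path segments. The `url_depth` helper is hypothetical and this definition of depth is an assumption; the slides do not define it.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (assumed definition of depth)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


for url in ("https://example.com/", "https://example.com/docs/api/index.html"):
    print(url, url_depth(url))
```

A static feature like this needs only the document itself, so it can be computed once at indexing time.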

Page 14: Ranking the Web with Spark

The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

http://infolab.stanford.edu/~backrub/google.html

Crawler

Indexer

Database

Searcher

Ranker

Page 15: Ranking the Web with Spark

Dynamic score

Page 16: Ranking the Web with Spark

Dynamic features

• Text match: TF-IDF, BM25, proximity, topic, ...

• Query-level: number of words, popularity, ...

• Usage: clicks, dwell time, reformulations, ...

• Time
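To make the text-match family concrete, here is a small BM25 sketch. It uses whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)); all of these are assumptions for illustration, not the settings used by Common Search or Elasticsearch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["python programming language", "monty python comedy", "spark big data"]
print(bm25_scores("python spark", docs))
```

Rare terms get a higher IDF, so the document matching "spark" (one document) outranks those matching only "python" (two documents).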

Page 17: Ranking the Web with Spark

Scoring function

Page 18: Ranking the Web with Spark

Users

Database: Elasticsearch

Indexer: Python, Spark

Data sources: Common Crawl, Alexa top 1M, ...

words, static score

query

top 10 docs, final scores

Offline

Online

Searcher: Go

Page 19: Ranking the Web with Spark
Page 20: Ranking the Web with Spark

https://explain.commonsearch.org/?q=python&g=en

Page 21: Ranking the Web with Spark

Issues with this architecture

• Static & dynamic scoring are in different codebases

• No control over result diversity

• Hard to optimize

• Very dependent on Elasticsearch

Page 22: Ranking the Web with Spark

Rescoring

Page 23: Ranking the Web with Spark

Users

Database

Indexer

words, static score, features

query

Searcher

top 1k docs, features

Rescorer

final 10 docs
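The rescoring step in this diagram can be sketched as follows: the searcher returns a large candidate set with features, and a separate rescorer computes the final ordering. The `rescore` function, the feature names and the linear weights are hypothetical; the slides do not specify the rescoring model.

```python
def rescore(candidates, weights, top_n=10):
    """Re-rank candidates with a weighted sum of their features.

    candidates: list of (doc_id, {feature_name: value}) pairs,
                e.g. the top 1k docs returned by the searcher.
    weights:    {feature_name: weight}; unknown features count as 0.
    """
    scored = [
        (doc_id, sum(weights.get(name, 0.0) * value for name, value in features.items()))
        for doc_id, features in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


candidates = [
    ("doc-a", {"bm25": 1.0, "pagerank": 0.1}),
    ("doc-b", {"bm25": 0.5, "pagerank": 0.9}),
    ("doc-c", {"bm25": 0.2, "pagerank": 0.2}),
]
print(rescore(candidates, {"bm25": 1.0, "pagerank": 1.0}, top_n=2))
```

Because the rescorer sees the whole candidate set at once, it is also the natural place for list-level concerns such as result diversity.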

Page 24: Ranking the Web with Spark

Issues with rescoring

• Latency

• Pagination

• Harder to explain

Page 25: Ranking the Web with Spark

Learning to rank

Page 26: Ranking the Web with Spark

LTR Model

• Features

• Training dataset

• Evaluation: NDCG, ERR, ...

• Algorithms: AdaRank, ListNet, LambdaMART, ...

• Learning with Spark!
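For the evaluation bullet, NDCG can be sketched in a few lines. This uses the common exponential gain 2^rel - 1 with a log2 position discount; other gain definitions exist, and the relevance labels below are made up for illustration.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of the results in the order the ranker returned them.
print(ndcg([3, 2, 1, 0], k=4))  # already ideal ordering
print(ndcg([0, 1, 2, 3], k=4))  # worst ordering of the same documents
```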

Page 27: Ranking the Web with Spark

The right questions

• What do users expect?

• What features?

• How to evaluate and fine-tune in the real world?

Page 28: Ranking the Web with Spark

PageRank with Spark

Page 29: Ranking the Web with Spark
Page 30: Ranking the Web with Spark

http://commoncrawl.org

Page 31: Ranking the Web with Spark

https://github.com/commonsearch/cosr-back

Page 32: Ranking the Web with Spark

Common Search Pipeline

Doc sources: Common Crawl, WARC files, URLs ...

Filter plugins

Document parsing

Output plugins

Data output: Database, file, HDFS, S3, ...

Page 33: Ranking the Web with Spark

Most popular Wikipedia pages

Page 34: Ranking the Web with Spark

Dumping the web graph
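A rough idea of what dumping the graph involves: extract the outgoing links of each parsed page and emit (source, destination) edges. The sketch below uses only the Python standard library and works at the domain level; both the `domain_edges` helper and the choice of domain-level edges are assumptions, not a description of the cosr-back plugins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_edges(page_url, html):
    """Return the set of (source_domain, target_domain) edges for one page."""
    parser = LinkExtractor()
    parser.feed(html)
    source = urlparse(page_url).netloc
    edges = set()
    for href in parser.hrefs:
        target = urlparse(urljoin(page_url, href)).netloc
        # Skip internal links and links without a host (mailto:, fragments).
        if target and target != source:
            edges.add((source, target))
    return edges


html = '<a href="/about">About</a> <a href="http://b.example/">Elsewhere</a>'
print(domain_edges("http://a.example/index.html", html))
```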

Page 35: Ranking the Web with Spark

Naive pyspark PageRank
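The slide's code is not part of this transcript. As a stand-in, here is a naive PageRank by power iteration in plain Python so that it runs without a cluster; in pyspark, the inner loop would typically become a join of links with ranks, a flatMap emitting contributions, and a reduceByKey summing them. The damping factor 0.85, the fixed iteration count and the uniform redistribution of dangling-node rank are assumptions.

```python
from collections import defaultdict


def pagerank(edges, iterations=20, damping=0.85):
    """Naive PageRank over a list of (source, target) edges."""
    out_links = defaultdict(set)
    nodes = set()
    for source, target in edges:
        out_links[source].add(target)
        nodes.update((source, target))

    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        contribs = defaultdict(float)
        dangling = 0.0
        for node in nodes:
            targets = out_links.get(node)
            if targets:
                # Each page splits its rank evenly among its out-links.
                share = ranks[node] / len(targets)
                for target in targets:
                    contribs[target] += share
            else:
                # Pages without out-links spread their rank over everyone.
                dangling += ranks[node]
        ranks = {
            node: (1 - damping) / n + damping * (contribs[node] + dangling / n)
            for node in nodes
        }
    return ranks


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
for node, rank in sorted(pagerank(edges).items()):
    print(node, round(rank, 4))
```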

Page 36: Ranking the Web with Spark

GraphFrames

Page 37: Ranking the Web with Spark

SparkSQL PageRank

Page 38: Ranking the Web with Spark

SparkSQL PageRank

https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
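To show how one PageRank iteration can be phrased as a SQL query, the sketch below uses the standard-library `sqlite3` module as a stand-in for Spark SQL. It is not the code from the linked `pagerank.py`; the table and column names are hypothetical, and it uses the simple unnormalized update (1 - d) + d * sum(score / out-degree) without dangling-node handling.

```python
import sqlite3


def sql_pagerank(edges, iterations=20, damping=0.85):
    """PageRank where each iteration is a single SQL join + aggregation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO nodes SELECT src FROM edges UNION SELECT dst FROM edges")
    conn.execute("CREATE TABLE outdeg (id TEXT PRIMARY KEY, deg INTEGER)")
    conn.execute("INSERT INTO outdeg SELECT src, COUNT(*) FROM edges GROUP BY src")
    conn.execute("CREATE TABLE ranks (id TEXT PRIMARY KEY, score REAL)")
    conn.execute("INSERT INTO ranks SELECT id, 1.0 FROM nodes")
    conn.execute("CREATE TABLE new_ranks (id TEXT PRIMARY KEY, score REAL)")

    for _ in range(iterations):
        conn.execute("DELETE FROM new_ranks")
        # For every node, sum the contributions of the pages linking to it.
        conn.execute(
            """
            INSERT INTO new_ranks
            SELECT n.id, ? + ? * COALESCE(SUM(r.score / o.deg), 0.0)
            FROM nodes n
            LEFT JOIN edges e ON e.dst = n.id
            LEFT JOIN ranks r ON r.id = e.src
            LEFT JOIN outdeg o ON o.id = e.src
            GROUP BY n.id
            """,
            (1 - damping, damping),
        )
        conn.execute("DELETE FROM ranks")
        conn.execute("INSERT INTO ranks SELECT id, score FROM new_ranks")

    result = dict(conn.execute("SELECT id, score FROM ranks"))
    conn.close()
    return result


print(sql_pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]))
```

Expressing the iteration as joins and aggregations leaves the execution plan to the SQL engine instead of hand-written map and reduce steps.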

Page 39: Ranking the Web with Spark

Tests

http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

Page 40: Ranking the Web with Spark

https://about.commonsearch.org/developer/get-started

Page 41: Ranking the Web with Spark
Page 42: Ranking the Web with Spark

Top 10

Page 43: Ranking the Web with Spark
Page 44: Ranking the Web with Spark

Spam

Page 45: Ranking the Web with Spark
Page 46: Ranking the Web with Spark

Spamdexing

• Keyword stuffing, hidden text

• Scraper sites, Mirrors

• Link farms

• Splogs, Comment spam

• Domaining

• Cloaking

• Bombing

Page 47: Ranking the Web with Spark

Questions?

https://about.commonsearch.org/contributing

https://github.com/commonsearch

[email protected]

slack.commonsearch.org