TECHNOLOGIES LARGE-SCALE SEARCH

LARGE-SCALE SEARCH TECHNOLOGIES

Why is building an internet search engine so hard?

Over 150 trillion webpages. Only 1 of every 3,000 webpages is indexed.

It is practically impossible to index all the words in every webpage

One cannot build a competitive search without huge amount of users’ data

It’s too expensive to compute search results on the fly

State of the art technologies

State of the art technologies wouldn’t work to power large-scale web search engines

Scalability Noise

Scalability challenges

The core algorithm of an inverted index scheme is based on a list intersection

This is a computationally inefficient operation, which is impractical when it comes to the web and its billions of webpages.

Scalability

Noise reduction challenges

Textual measures: Word intersections, word distances, word positions, td-idf, BM25 etc.

No matter which textually-based approach you take, it might work well on one query, but poorly on the other.

The conclusion

There is no point trying to index the parts of a webpage that are unlikely

to appear in prospective users’ search queries

Query log

Suppose that for each webpage you had a list of all the search queries that ended with a user clicking on this webpage (say from Google)

Any part of a webpage that does interact with any of the search queries leading to it is completely irrelevant for the retrieval process

Hence, you could focus on only indexing the search queries.

Example – Cyberpunk 2077

Smaller index size

Less noise with more accurate results

User queries

Predict first, index later

Indexing only the prospective user queries

Indexing problem is turned into a prediction problem

Data collection

Long tail queries No in-depth understanding

Not independent Expansive infrastructures

Without massive data collection are we doomed to fail?

Data prediction

Use AI to predict the prospective user's query

What about recall and precision?

This solves the indexing problem

The next challenge is to maximize the recall and precision

It’s all about the context

What is the release date of Cyberpunk 2077 to PS4 and Xbox One?

Not an NLP problem

release date Cyberpunk 2077 PS4 Xbox One

What will you choose?

Cyberpunk 2077 PS4 console

Contextual Graph

The power of context

Taxonomy vs. Contextuality

Shoes Sporting Goods

Running Basketball

Air Max

Cortez AdidasLebron James

StaticDynamic

Contextual graph

NikeAir

MaxSporting Goods

0.01%≈

Contextual Taxonomical

Contextual graph

Running Basketball

Sporting Goods

Air Max

Cortez

Lebron James

Adidas

0%≈0%≈

2% 10% 8%

25% 20%

90% 80%

Graph statistics

Input: 5 billion webpages

10 billion entities

300 billion edges

10 million entities added daily

Cities Books Shoes

Restaurants Cars Songs

Quotes Company Sites

Challenges

Data collection

What to extract

How to connect

Scalable architecture

Solution

Neuroscience instead of NLP

Brain inspired auto associative networks and statistical pattern recognition

Technology

Applications to Search

Possible query forms

Greater Toronto Area

Possible query forms

Query augmentation

Contextual GraphIt’s all about the context

TECHNOLOGIES LARGE-SCALE SEARCH

Documents

Transcript of TECHNOLOGIES LARGE-SCALE SEARCH

Scale Model Technologies - Home

SOURCE SEARCH TECHNOLOGIES, LLC, Plaintiff.

Search and Access Technologies for Large Scale Web Archives

CS688: Web-Scale Image Search Keypoint Localization

Search Technologies€¦ · Search Technologies Pupils should be taught to: •Use search technologies effectively, appreciate how results are selected and ranked and be discerning

Efficient Large Scale Maximum Inner Product Search

Nutch - web-scale search engine toolkit

Hadoop-scale Search with Apache Solr

Innovative Remediation Technologies: Field-Scale ... · PDF fileInnovative Remediation Technologies: Field-Scale Demonstration Pr ojects in ... by Environmental Management Support,

SEO Search Engine Optimization Spirit Technologies

HathiTrust Large Scale Search: Scalability meets Usability

Overcoming Variations in Nanometer-Scale Technologies

USPTO Patent Search by Einfolge Technologies

Towards Large-Scale Multi-Modal Image Search

Summon: True Web-Scale Discovery. Leveraging new technologies to empower library search Ben McLeish 14 th November 2011 The worlds first Web-Scale Discovery.

Zenith Technologies - Your search is over

Very Large-Scale Neighborhood Search Thorsten Krenek.

Validity & Invalidity Search by Einfolge Technologies

Spyglass: Fast, Scalable Metadata Search for Large-Scale ... · USENIX Association 7th USENIX Conference on File and Storage Technologies 153 Spyglass: Fast, Scalable Metadata Search

Semantic Search Technologies for Video Archives