The Infocious Web Search Engine:
Improving Web Searching Through Linguistic Analysis
Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2)
(1) Infocious Inc., {ntoulas, gerald}@infocious.com
(2) University of California, Los Angeles, {ntoulas, cho}@cs.ucla.edu
April 21, 2023 WWW 2005 Chiba Japan
Motivation
Current Web search engines identify relevant pages based on keyword matching
Example: jaguar
Jaguar Cars Official worldwide web site of Jaguar Cars.
www.jaguar.com/
Motivation
Is keyword matching enough?
Natural languages are inherently ambiguous
Example: jaguar
• The car brand?
• Apple Mac OS X 10.2?
• The animal?
• Chemical software?
• …
The Infocious Web Search Engine
Uses Language Analysis techniques to:
• Resolve ambiguities inside Web pages
• Rank the Web pages based on the coherence (quality) of the text
• Help users organize the results in intuitive ways through categorization
• Provide suggestions for query refinement
What is different about Infocious?
• Search engines today do not apply Language Analysis to the extent that Infocious does
• It is not simply a matter of applying existing algorithms: they need optimizations for Web scale
• Offers features made possible only through Language Analysis
• Makes Language Analysis features intuitive (yet powerful) for the user
Architecture
Crawler
• Follows links to discover Web pages
• Refreshes changed pages using sampling [VLDB’02]
• Can download pages from the Hidden Web [JCDL’05]
Architecture
Linguistic Processing
• Resolves language ambiguities [COLING’02]
• Annotates Web pages
• Extracts concepts
• Extracts named entities
• Operates at crawl speed
Linguistic Processing: Disambiguation
Part-of-speech (POS) tagging
Example: house plants
Done probabilistically: given a sentence S and a set of tags T, find
$T_{\text{best}}(S) = \arg\max_{T} P(T \mid S)$
• ... most house plants are hybrids of plant species
  most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun
• ... garden built to house our most valuable plants ...
  garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun
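The arg max over tag sequences can be illustrated with a toy tagger. This is a minimal sketch, not the Infocious tagger: it assumes a bigram hidden Markov model, so that P(T | S) is proportional to the product of P(tag | previous tag) and P(word | tag), and every probability below is invented by hand for a three-word vocabulary. It only shows how the same word "house" can receive different tags depending on context.

```python
from itertools import product

TAGS = ["Adj", "Noun", "Verb", "Inf"]

# Hypothetical probabilities, chosen by hand for illustration only.
EMIT = {  # P(word | tag)
    "Adj": {"most": 1.0},
    "Inf": {"to": 1.0},
    "Noun": {"house": 0.5, "plants": 0.5},
    "Verb": {"house": 0.6, "plants": 0.4},
}
TRANS = {  # P(tag | previous tag); "<s>" marks the sentence start
    "<s>": {"Adj": 0.4, "Inf": 0.3, "Noun": 0.2, "Verb": 0.1},
    "Adj": {"Noun": 0.9, "Verb": 0.1},
    "Inf": {"Verb": 0.9, "Noun": 0.1},
    "Noun": {"Noun": 0.5, "Verb": 0.5},
    "Verb": {"Noun": 0.8, "Verb": 0.2},
}


def joint_probability(words, tags):
    """P(T, S) under the toy bigram model; proportional to P(T | S)."""
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= TRANS[prev].get(tag, 0.0) * EMIT[tag].get(word, 0.0)
        prev = tag
    return prob


def best_tags(words):
    """Return the tag sequence that maximizes P(T, S) by exhaustive search."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: joint_probability(words, tags))


print(best_tags(["most", "house", "plants"]))  # "house" tagged as Noun
print(best_tags(["to", "house", "plants"]))    # "house" tagged as Verb
```

Exhaustive enumeration grows exponentially with sentence length; a tagger that has to operate at crawl speed would need dynamic programming (for example the Viterbi algorithm) or a similar optimization to find the same maximum.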
Linguistic Processing: Disambiguation
• POS information stored inside the index
• User can manually specify POS at query time (or click on examples)
Query: N:house N:plants
GreenPatio.Com – Tips for buying house plants.
Why keep natural indoor plants.... Tips for buying house plants. Care for indoor plants....
www.greenpatio.com/tips.shtml
Low Light Plants for the House
Is a common name for plants in the species Dieffenbachia.... As with most house plants …
www.plantsgalore.com/articles/houseplants/houseplants-low-light-plantfacts.htm
Linguistic Processing: Disambiguation
Query: V:house N:plants
Over Wintering Bonsai
… One method is to build a cold frame to house your plants in the winter. ...
www.evergreengardenworks.com/overwint.htm
Keeping Your Sunroom Cozy
… And if you want to house a hot tub or plants, think about enclosing the …
doityourself.com/sunroom/sunroomcozy.htm
• POS information stored inside the index
• User can manually specify POS at query time (or click on examples)
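One way to picture "POS information stored inside the index" is an inverted index whose keys include the tag. The sketch below is an assumption about the general idea, not the Infocious index format: the class name `PosIndex`, the document ids, and the tag labels are all hypothetical. Queries use the `N:word` / `V:word` syntax from the slides and are combined with AND-semantics.

```python
from collections import defaultdict


class PosIndex:
    """Toy inverted index keyed by (word, POS) and by word alone."""

    def __init__(self):
        self.by_word_pos = defaultdict(set)
        self.by_word = defaultdict(set)

    def add(self, doc_id, tagged_tokens):
        # tagged_tokens: iterable of (word, tag) pairs produced by a tagger
        for word, tag in tagged_tokens:
            word = word.lower()
            self.by_word_pos[(word, tag)].add(doc_id)
            self.by_word[word].add(doc_id)

    def search(self, query):
        """AND together all terms; 'N:house' restricts 'house' to tag N."""
        result = None
        for term in query.split():
            tag, sep, word = term.partition(":")
            if sep:
                postings = self.by_word_pos.get((word.lower(), tag), set())
            else:
                postings = self.by_word.get(term.lower(), set())
            result = set(postings) if result is None else result & postings
        return result if result is not None else set()


index = PosIndex()
index.add("greenpatio", [("tips", "N"), ("for", "Prep"), ("buying", "V"),
                         ("house", "N"), ("plants", "N")])
index.add("overwinter", [("frame", "N"), ("to", "Inf"), ("house", "V"),
                         ("your", "Pron"), ("plants", "N")])

print(index.search("N:house N:plants"))  # only the "house plants" page
print(index.search("V:house N:plants"))  # only the "to house your plants" page
```

Keeping a second, untagged posting list lets plain keyword queries such as `house plants` keep working when the user does not specify a part of speech.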
Linguistic Processing: Disambiguation
Word-sense disambiguation
Previous example: jaguar
Approach: through Web page categorization
• Use the categories of DMOZ (~600,000)
• Given a set of categories C and a page d, find $\arg\max_{c \in C} P(c \mid d)$
• In Infocious a page may belong to multiple categories
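The search for the most probable category of a page can be sketched with a small multinomial Naive Bayes classifier. This is only an illustration: the deck states that Infocious uses an improved classification technique, and the training snippets, the add-one smoothing, and the 0.3 threshold used to allow multiple categories are all assumptions made for this example.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, labeled_docs):
        # labeled_docs: list of (category, text) pairs
        self.word_counts = {}
        self.doc_counts = Counter()
        for category, text in labeled_docs:
            self.doc_counts[category] += 1
            counts = self.word_counts.setdefault(category, Counter())
            counts.update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())

    def posteriors(self, text):
        """Return {category: P(category | text)}, normalized to sum to 1."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[category] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            log_scores[category] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def categories(self, text, threshold=0.3):
        """All categories whose posterior reaches the threshold, best first."""
        ranked = sorted(self.posteriors(text).items(),
                        key=lambda item: item[1], reverse=True)
        return [(c, p) for c, p in ranked if p >= threshold]


# Hypothetical training snippets; a real system would train on DMOZ pages.
model = NaiveBayes([
    ("Recreation/Autos", "jaguar car engine dealer"),
    ("Recreation/Autos", "car engine speed"),
    ("Computers", "jaguar mac os apple"),
    ("Computers", "apple software os"),
])

print(model.categories("jaguar engine"))  # a single confident category
print(model.categories("jaguar"))         # ambiguous: both categories survive
```

Returning every category above a threshold, rather than only the single arg max, is one simple way to let a page belong to multiple categories.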
Categorization
The category of a result is highlighted onMouseOver()
Allow users to restrict search within a category:
jaguar cat:Computers
Can also be done by clicking on a category
Jaguar Cars
Official worldwide site of jaguar cars
www.jaguar.com/
Category: Recreation/Autos
Apple Mac OS X
The Apple Mac OS Product page
www.apple.com/macosx/
Category: Computers
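The category restriction in a query such as `jaguar cat:Computers` can be sketched as a small parsing and filtering step. The page records, their text, and the rule that a category also matches its subcategories are assumptions made for this example; only the `cat:` syntax and the two category names come from the slides.

```python
def parse_query(query):
    """Split a query into lowercase keywords and an optional cat: filter."""
    terms, category = [], None
    for token in query.split():
        if token.startswith("cat:") and len(token) > len("cat:"):
            category = token[len("cat:"):]
        else:
            terms.append(token.lower())
    return terms, category


def in_category(page_categories, wanted):
    # Assumption: "Recreation" also matches "Recreation/Autos".
    return any(c == wanted or c.startswith(wanted + "/")
               for c in page_categories)


def search(pages, query):
    """AND-semantics keyword match, optionally restricted to one category."""
    terms, category = parse_query(query)
    hits = []
    for page in pages:
        words = page["text"].lower().split()
        if not all(term in words for term in terms):
            continue
        if category is not None and not in_category(page["categories"], category):
            continue
        hits.append(page["url"])
    return hits


# Hypothetical page records for illustration.
PAGES = [
    {"url": "www.jaguar.com/",
     "text": "Official worldwide site of Jaguar cars",
     "categories": ["Recreation/Autos"]},
    {"url": "www.apple.com/macosx/",
     "text": "Apple Mac OS X 10.2 Jaguar",
     "categories": ["Computers"]},
]

print(search(PAGES, "jaguar"))
print(search(PAGES, "jaguar cat:Computers"))
```

Clicking on a category in the interface would then amount to appending the corresponding `cat:` token to the current query.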
Linguistic Processing: Concept Extraction
More accurate phrase identification:
• Example: "In the profession of cooking oil is an important ingredient" (the adjacent words "cooking oil" do not form a phrase here)
Identify concepts through a set of rules (pre-specified or automatically learned)
Example rule: VerbPhrase-PrepPhrase-NounPhrase
• lightly tossed with salad dressing
• tossed with oil and vinegar dressing
• tossed immediately with blue-cheese dressing
Reduced to concept: tossed with dressing
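The reduction of the three "tossed with ... dressing" phrases to one concept can be sketched as a rule over POS-tagged tokens. This is a hypothetical rule written for this example, not one of the Infocious rules: it keeps the last verb before the preposition, the preposition itself, and the last noun after it (a common head-noun heuristic for English noun phrases). The tag names and the hand-tagged inputs are assumptions.

```python
def reduce_to_concept(tagged_phrase):
    """Reduce a VerbPhrase-PrepPhrase-NounPhrase match to 'verb prep noun'.

    tagged_phrase: list of (word, tag) pairs. Returns None if the phrase
    does not contain a verb, a preposition and a following noun.
    """
    tags = [tag for _, tag in tagged_phrase]
    if "Prep" not in tags:
        return None
    split = tags.index("Prep")
    verbs = [w for w, t in tagged_phrase[:split] if t == "Verb"]
    nouns = [w for w, t in tagged_phrase[split + 1:] if t == "Noun"]
    if not verbs or not nouns:
        return None
    preposition = tagged_phrase[split][0]
    return " ".join([verbs[-1], preposition, nouns[-1]])


PHRASES = [
    [("lightly", "Adv"), ("tossed", "Verb"), ("with", "Prep"),
     ("salad", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("with", "Prep"), ("oil", "Noun"),
     ("and", "Conj"), ("vinegar", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("immediately", "Adv"), ("with", "Prep"),
     ("blue-cheese", "Adj"), ("dressing", "Noun")],
]

for phrase in PHRASES:
    print(reduce_to_concept(phrase))
```

Because the rule works on tags rather than on adjacent words, modifiers such as "lightly", "immediately" or "oil and vinegar" drop out and all three variants map to the same index entry.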
Answering a query
• Default is AND-semantics
• Query disambiguation (e.g. in the query "train a pet", Infocious knows that "train" has to be a verb)
• Ranking takes into account a variety of factors:
  • Presence of keywords, proximity
  • Title, URL, formatting, font size, coloring etc.
  • Popularity of a page measured by in/out links
  • TextQuality
Architecture
TextQuality
• Summarize probabilities from Linguistic Processing into one metric
• Promote coherent text
• Demote incoherent text
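The slides do not say how the probabilities are summarized. One plausible choice, shown here purely as an assumption, is the geometric mean of the per-token probabilities assigned by the disambiguation step, which is length-normalized so that long and short pages are comparable. The function name `text_quality` and the sample probabilities are hypothetical.

```python
import math


def text_quality(token_probs):
    """Geometric mean of per-token disambiguation probabilities (0.0 if empty)."""
    if not token_probs:
        return 0.0
    floor = 1e-12  # avoid log(0) for tokens the model considers impossible
    log_sum = sum(math.log(max(p, floor)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


# Hypothetical probabilities: fluent prose is tagged with high confidence,
# keyword-stuffed text is not.
coherent = [0.95, 0.90, 0.92, 0.97]
keyword_spam = [0.40, 0.30, 0.50, 0.35]

print(text_quality(coherent) > text_quality(keyword_spam))
```

A score like this could then be one of the ranking factors, promoting pages whose sentences the language model finds coherent and demoting pages made of repeated keywords.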
TextQuality (disabled)
Promotes well-written pages (preferable from the user perspective)
Results with TextQuality DISABLED:
Britney Spears Pictures – britney spears pictures … picture of britney spears, hot pictures of britney spears …
britney-spears-pictures.hotyoungstars.com/nude/
Hot Britney Spears Pics - hot britney spears pics,... britney spears, new hot pics of britney spears,...
hot-britney-spears-pics.hotyoungstars.com/nude/
Britney Spears Photos – britney spears photos … spears, britney spears nude photos, nude photos of …
britney-spears-photos.hotyoungstars.com/nude/
TextQuality (enabled)
Promotes well-written pages (preferable from the user perspective)
Results with TextQuality ENABLED:
Is Britney Spears over the edge?
Is Britney Spears over the edge? … Britney Spears is a singer …
azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm
IMPERSONATORS – BRITNEY SPEARS
Is Proud to Present! Contact: Gary Shortall Back…
www.impersonators.com/brittany/brit.html
Britney Spears’ Coke Habit
Britney Spears’ Coke Habit Destroys Her…
www.emptyv.org/britney_spears.htm
Other Language Analysis-Enhanced Features
• Key phrases: Present a list of the salient concepts within the results
• Related topics: Concepts related to the present query
• Hone your search: Suggestion of more specific queries
• Spell checking
• Personalization: I like Sports but not Politics
Evaluation of Categorization
Using Naïve Bayes classifiers for illustration: Language Analysis improves accuracy
Infocious actually employs an improved classification technique (76% accuracy)
We used four different flavors of NB on 100,000 Web pages:
• C1: Words
• C2: Words + POS tags
• C3: Words + extracted concepts
• C4: Words + POS + extracted concepts
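The four classifier flavors differ only in the features they see. The sketch below shows one way such feature sets could be built from a linguistically annotated page; the `word/TAG` and `concept=` encodings and the function names are assumptions for illustration, since the slides do not describe how POS tags and concepts were fed to the classifiers.

```python
# (use_pos, use_concepts) for each flavor named in the slides
FLAVORS = {
    "C1": (False, False),
    "C2": (True, False),
    "C3": (False, True),
    "C4": (True, True),
}


def features(tagged_tokens, concepts, use_pos=False, use_concepts=False):
    """Build a bag-of-features list from an annotated page."""
    feats = [word.lower() for word, _ in tagged_tokens]
    if use_pos:
        feats += [f"{word.lower()}/{tag}" for word, tag in tagged_tokens]
    if use_concepts:
        feats += ["concept=" + concept for concept in concepts]
    return feats


def flavor_features(flavor, tagged_tokens, concepts):
    use_pos, use_concepts = FLAVORS[flavor]
    return features(tagged_tokens, concepts, use_pos, use_concepts)


tagged = [("house", "Noun"), ("plants", "Noun")]
concepts = ["house plants"]
for name in FLAVORS:
    print(name, flavor_features(name, tagged, concepts))
```

Any standard Naive Bayes implementation can then be trained on these feature lists in place of plain words.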
Evaluation of Categorization
[Figure: Accuracy of NB classifiers. Accuracy on the y-axis (60% to 68%), classifiers C1 to C4 on the x-axis.]
C1: Words only; C2: Words + POS tags; C3: Words + extracted concepts; C4: Words + POS + extracted concepts
Result: 3% accuracy increase, 8% error reduction
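The two figures are related by simple arithmetic. As a worked check, assuming that the 3% is an absolute gain in accuracy (percentage points), that the 8% is the relative reduction in error rate, and that both are rounded, write $a_0$ for the baseline accuracy, $\Delta a$ for the gain and $r$ for the relative error reduction:

```latex
\[
r = \frac{\Delta a}{1 - a_{0}}
\quad\Longrightarrow\quad
1 - a_{0} = \frac{\Delta a}{r} = \frac{0.03}{0.08} = 0.375,
\qquad a_{0} = 0.625 .
\]
```

Under these assumptions the baseline classifier would sit at roughly 62.5% accuracy and the best one at roughly 65.5%, which is consistent with the 60% to 68% range of the chart axis. The exact values are not stated in the slides, so these numbers are only an inference.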
User Interface
Conclusion
Infocious uses language analysis to improve Web search:
• Resolves language ambiguities
• Incorporates text coherence in the ranking
• Provides query suggestions and refinements
• Organizes information intuitively through categorization
Related Work
• Web search engines: Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, …
• Enterprise search: Autonomy, Inquira, Inxight, iPhrase, …
• Answer engines: START@MIT, BrainBoost, …
Ongoing work
Increase index size (currently ~1 billion pages) through surface & hidden Web-crawls
Apply our Language Analysis algorithms to additional languages
Leverage our Language-annotated repository for additional features (e.g. summarization, machine translation,…)
Investigate how to use Language Analysis to improve relevance in advertisements
Thank you!
You can check out our Search Engine at:
www.infocious.com