Characteristics of Web Searching

Post on 22-Feb-2016

Intelligent Meta-Search and Clustering Technology

http://tamas.nlm.nih.gov/metasearch/ http://toxseek.nlm.nih.gov

Tamas Doszkocs, Ph.D., Computer Scientist

National Library of Medicine doszkocs@nlm.nih.gov

Characteristics of Web Searching

• Content is created by diverse organizations and individuals

• Information on the Web is inherently heterogeneous

• Content is distributed across multiple servers, locations, formats and languages, and is aimed at diverse audiences and purposes

(In its April 2005 survey, Netcraft received responses from 62,286,451 web sites)

• The “Open Web” of billions of static Web pages is indexed and searched via multiple search engines and directories

Problems in Web Searching

• Even the largest of the current search engines index only a fraction of all Web pages (as of August 2005, the Wayback Machine of the Internet Archive had indexed 40 billion pages, Google about 8.1 billion and Yahoo about 20.8 billion)

• The not so “Hidden Web” of content databases (e.g. PubMed, Web of Science) is estimated to be thousands of times larger than the Open Web.

• Both the Open Web and the Hidden Web are characterized by problems of information coverage, quality, overload, relevancy, currency and completeness, as well as inherent language ambiguity and incompatible user interfaces

Meta-Searching

• Meta-Search Engines may simultaneously search multiple Open Web and Hidden Web sites in order to increase content coverage, precision, relevance and/or search efficiency and effectiveness.
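The simultaneous search described above can be sketched as a concurrent fan-out of one query to several sources. This is a minimal illustration only: the two source functions, their names and the sample query are hypothetical stand-ins, and a real meta-search engine would issue network requests to each site instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real back ends (assumptions for illustration).
# A real meta-search engine would send an HTTP request to each source here.
def search_open_web(query):
    return [f"open-web result for {query}"]

def search_hidden_web(query):
    return [f"database record for {query}"]

def broadcast_search(query, sources):
    """Send one query to every source at the same time; collect results per source."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Submit all searches first so they run concurrently.
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        # Then wait for each one and keep its results under the source name.
        return {name: fut.result() for name, fut in futures.items()}

SOURCES = {"open_web": search_open_web, "hidden_web": search_hidden_web}
results = broadcast_search("benzene toxicity", SOURCES)
```

Keeping the results grouped by source leaves merging, ranking and clustering as separate later steps.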

Overlap Among 3 Major Search Engines

http://missingpieces.dogpile.com/whitepaper.pdf

http://comparesearchengines.dogpile.com/OverlapAnalysis.pdf

Overlap Among Ask Jeeves, Google, MSN and Yahoo

Google Isn’t Everything!

http://www.forbes.com/business/free_forbes/2005/0815/056.html?partner=yahoomag
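As a rough illustration of what an overlap analysis measures, the sketch below compares the result lists of several engines for one query and reports which URLs all engines share and which only a single engine returned. The engine names and URLs are invented toy data, not figures from the studies linked above, and the function is not the methodology those studies used.

```python
def overlap_report(results_by_engine):
    """Return (URLs returned by every engine, URLs unique to each engine)."""
    sets = {name: set(urls) for name, urls in results_by_engine.items()}
    shared_by_all = set.intersection(*sets.values())
    unique = {}
    for name, urls in sets.items():
        # Everything any *other* engine returned.
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = urls - others
    return shared_by_all, unique

# Invented example data (assumption for illustration only).
toy_results = {
    "engine_a": ["a.example/1", "b.example/2", "c.example/3"],
    "engine_b": ["b.example/2", "c.example/3", "d.example/4"],
    "engine_c": ["c.example/3", "e.example/5"],
}
shared, unique = overlap_report(toy_results)
```

The more URLs fall into the "unique" sets, the more coverage a searcher gains by querying several engines rather than one.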

Generations of Meta-Search Engines

• First Generation: “Broadcast” or “Federated” search

– List of results

• Second Generation: Merging and Ranking

– Increased coverage

• Third Generation: Result Clustering

– Focused drill-down

– Dynamic Query Mods

• Next Generation: Semantic and Pragmatic Intelligence

– tamas.nlm.nih.gov/metasearch/

– toxseek.nlm.nih.gov

– http://bestmeta.com
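The "Merging and Ranking" step can be illustrated with reciprocal rank fusion, a common way to combine ranked lists from several engines. The slides do not say which merging method any particular meta-search engine uses, so this technique is swapped in as an example; the constant k = 60 and the sample lists are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists: each URL scores the sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Invented example lists (assumption): two engines, best result first.
engine_1 = ["x", "y", "z"]
engine_2 = ["y", "w", "x"]
merged = reciprocal_rank_fusion([engine_1, engine_2])
```

Results that several engines rank highly rise to the top, while results found by only one engine are still kept, which is where the increased coverage comes from.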

Moving Targets: Nine Search Engines Compared

By Ben Patterson (May 9, 2005)

http://reviews.cnet.com/4520-10572_7-6219242-2.html?tag=txt

Moving Targets and the need for Automatic Change Detection and Monitoring, and for Integrating New Capabilities
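One simple way to approach automatic change detection is to store a fingerprint of each source's search page and compare it with a freshly fetched copy. The sketch below assumes the pages have already been downloaded as text; the engine names and HTML snippets are hypothetical, and this is not a description of how any particular meta-search engine monitors its sources.

```python
import hashlib

def fingerprint(page_text):
    """Hash the page after collapsing whitespace, so trivial reformatting is ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def detect_changes(previous, current_pages):
    """Return (names whose fingerprint differs from the stored one, new fingerprints)."""
    changed = []
    updated = {}
    for name, text in current_pages.items():
        digest = fingerprint(text)
        updated[name] = digest
        if previous.get(name) != digest:
            changed.append(name)
    return changed, updated

# Hypothetical snapshots of two engines' search forms (assumption).
baseline = {"engine_a": "<form action='/search'>", "engine_b": "<form action='/find'>"}
_, stored = detect_changes({}, baseline)

# Later fetch: engine_a only gained whitespace, engine_b changed its form target.
later = {"engine_a": "<form  action='/search'>", "engine_b": "<form action='/query'>"}
changed, _ = detect_changes(stored, later)
```

A source flagged as changed would then need its query or result-parsing logic reviewed before it is trusted again.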

The ToxSeek Meta-Search and Clustering Project

• Goals:

– Integrate best-practice Information Retrieval and Natural Language Processing techniques with AI heuristics to create an advanced general-purpose meta-search, result clustering and knowledge discovery tool

– Apply ToxSeek to efficiently access diverse biomedical and environmental health information resources

– Create specialized applications for accessing quality information sources on HIV/AIDS, consumer health, homeland security, public health law, library research and other topics

ToxSeek Features

• Integrates multiple spellcheckers and sophisticated lexical, morphologic, syntactic and semantic resources

• Merges and ranks the results from heterogeneous information sources

• Employs an efficient Natural Language Phrase Parser and AI heuristics to automatically identify Key Concepts and their Associations in queries and retrieved documents

• Uses the automatically identified Key Concepts and Associations to create topical Result Clusters

• Supports focused multi-concept drill-down, dynamic query refinement, multi-media and limited question answering
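To make the idea of concept-based Result Clusters and multi-concept drill-down concrete, here is a deliberately simplified sketch. ToxSeek identifies Key Concepts automatically with a phrase parser and AI heuristics, which this example does not attempt to reproduce; it uses a fixed, hypothetical concept list and plain substring matching on invented result titles.

```python
from collections import defaultdict

# Hypothetical concept vocabulary (assumption). ToxSeek derives its concepts
# automatically; this fixed list only stands in for that step.
CONCEPTS = ["lead poisoning", "drinking water", "children"]

def cluster_by_concept(titles, concepts=CONCEPTS):
    """Group result titles under every concept phrase they mention."""
    clusters = defaultdict(list)
    for title in titles:
        lowered = title.lower()
        for concept in concepts:
            if concept in lowered:
                clusters[concept].append(title)
    return dict(clusters)

def drill_down(clusters, *concepts):
    """Return titles that appear in every one of the chosen clusters."""
    selected = [set(clusters.get(c, [])) for c in concepts]
    return sorted(set.intersection(*selected)) if selected else []

# Invented result titles (assumption for illustration only).
titles = [
    "Lead poisoning in children",
    "Lead poisoning from drinking water",
    "Drinking water standards",
]
clusters = cluster_by_concept(titles)
focused = drill_down(clusters, "lead poisoning", "drinking water")
```

A result can belong to several clusters at once, which is what allows a drill-down on more than one concept to narrow the list.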

ToxSeek Implementation

• Production applications and research prototypes have been implemented for meta-searching diverse content on:

– Toxicology and Environmental Health

– Consumer Health

– Library Catalogs and Proprietary Databases

– HIV/AIDS

– BioDefense

– Homeland Security

• “Shift Happens…”

– http://library.nps.navy.mil/home/staff/gmarlatt/HSDL%20ALI%20April%202005%20%20final%20rev%207%20april.ppt

ToxSeek Web Search Query: “terrorism”

ToxSeek Query: “police state”

Win the Search Engine Wars with Intelligent Meta-Search and Clustering Technology