Searching the Hidden Web

• What is the hidden Web
• Two approaches to searching the hidden Web
  ◦ Browsing a Yahoo!-like Web directory
  ◦ Crawling the hidden Web
• Conclusion


What is the Hidden Web

• The surface Web
  ◦ reachable via hyperlinks
• The hidden Web
  ◦ no static hyperlink points to the page
  ◦ accessed via a query interface
  ◦ pages are generated dynamically based on the query


Size of the Hidden Web

• About 500 times larger than the surface Web
  ◦ Surface Web: 1 billion pages
  ◦ Hidden Web: over 550 billion pages
• The sixty largest deep Web sites together are about 40 times larger than the surface Web.

Quality of the Hidden Web

Name                                                                    Web Size (GBs)
National Climatic Data Center (NOAA)                                    366,000
National Oceanographic (combined with Geophysical) Data Center (NOAA)   32,940
(name missing)                                                          4,300
US PTO - Trademarks + Patents                                           2,440
Informedia (Carnegie Mellon Univ.)                                      1,830
UC Berkeley Digital Library Project                                     766
US Census                                                               610
NCI CancerNet Database                                                  488
(name missing)                                                          461
IBM Patent Center                                                       345
NASA Image Exchange                                                     337

Two approaches to searching the hidden Web

• Browsing a Yahoo!-like Web directory
• Crawling the hidden Web

Browsing a Yahoo!-like Web directory

• Manually populate a Yahoo!-like directory
• Classify collections of text databases into categories and subcategories
• Pros
  ◦ Intuitive
  ◦ Easy to use
• Cons
  ◦ Labor intensive
  ◦ The Yahoo! Directory contains 200, 0000 categories, and millions of databases are searchable online
  ◦ Accurate classification is not an easy task

Crawling the hidden Web

• Main challenge in searching the hidden Web
  ◦ How to automatically generate meaningful queries as input to the query interface
• The query generation problem
  ◦ Assume that a Web site contains a set of pages S.
  ◦ Each issued query q_i returns a subset S_i of S.
  ◦ The task is to select a set of queries that returns the maximum number of unique pages in the database at minimum cost.
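The query generation problem stated above resembles a budgeted set-cover problem. The following is a minimal sketch of a greedy heuristic for it, under an idealized assumption that the result set of every candidate query is known in advance (a real crawler only learns a result set after issuing the query). The function name `greedy_select`, the per-query `costs`, and the `budget` parameter are illustrative choices, not part of the original slides.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(
    results: Dict[str, Set[int]],  # query -> ids of the pages it returns (assumed known)
    costs: Dict[str, float],       # query -> cost of issuing it (hypothetical units)
    budget: float,                 # total cost we are willing to spend
) -> Tuple[List[str], Set[int]]:
    """Repeatedly pick the affordable query with the most new pages per unit cost."""
    remaining = dict(results)
    covered: Set[int] = set()
    chosen: List[str] = []
    spent = 0.0

    while remaining:
        best, best_ratio = None, 0.0
        for query, pages in remaining.items():
            new_pages = len(pages - covered)
            if new_pages == 0 or spent + costs[query] > budget:
                continue  # nothing new, or too expensive
            ratio = new_pages / costs[query]
            if ratio > best_ratio:
                best, best_ratio = query, ratio
        if best is None:
            break  # no affordable query adds any new page
        chosen.append(best)
        covered |= remaining.pop(best)
        spent += costs[best]

    return chosen, covered
```

Each round favors the query that adds the most not-yet-seen pages for its cost, so a query that only returns pages already downloaded is never selected.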

• Random: select the query at random from a list of keywords (e.g. a random word from an English dictionary).
• Generic frequency: select the most frequent keywords from a generic document corpus.
• Adaptive: select promising keywords from the documents already downloaded by previously issued queries.
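The three policies can be contrasted with a small sketch that crawls a toy in-memory "database" through a keyword interface. Everything here is an assumption made for illustration: the `Policy` signature, the `crawl` loop, and exact-word matching as the search semantics. In particular, the adaptive policy below simply picks the word that occurs in the most downloaded documents, which is a crude stand-in and not the estimator used by Ntoulas et al.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Optional, Set

# A policy sees the keywords issued so far and the texts downloaded so far,
# and returns the next keyword to issue (or None to stop).
Policy = Callable[[Set[str], List[str]], Optional[str]]


def make_random_policy(dictionary: List[str], seed: int = 0) -> Policy:
    rng = random.Random(seed)

    def pick(issued: Set[str], downloaded: List[str]) -> Optional[str]:
        candidates = [w for w in dictionary if w not in issued]
        return rng.choice(candidates) if candidates else None

    return pick


def make_generic_frequency_policy(ranked_words: List[str]) -> Policy:
    # ranked_words: keywords sorted by frequency in some generic corpus.
    def pick(issued: Set[str], downloaded: List[str]) -> Optional[str]:
        return next((w for w in ranked_words if w not in issued), None)

    return pick


def make_adaptive_policy(seed_word: str) -> Policy:
    def pick(issued: Set[str], downloaded: List[str]) -> Optional[str]:
        if not issued:
            return seed_word  # nothing downloaded yet, start from a seed
        counts = Counter(
            w for text in downloaded for w in set(text.split()) if w not in issued
        )
        if not counts:
            return None
        # Most documents first; ties broken alphabetically for determinism.
        return max(sorted(counts), key=counts.get)

    return pick


def crawl(database: Dict[int, str], policy: Policy, max_queries: int) -> Set[int]:
    """Issue up to max_queries keywords and return the ids of all pages seen."""
    issued: Set[str] = set()
    seen: Set[int] = set()
    for _ in range(max_queries):
        keyword = policy(issued, [database[i] for i in sorted(seen)])
        if keyword is None:
            break
        issued.add(keyword)
        seen |= {i for i, text in database.items() if keyword in text.split()}
    return seen
```

The adaptive policy needs no external word list after its seed: each downloaded page supplies new candidate keywords, which is what lets it follow the vocabulary of the site being crawled.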

Comparison of policies for dmoz (modified from Ntoulas et al.)

Comparison of policies for PubMed (modified from Ntoulas et al.)

Conclusion

• The surface Web is only the tip of the iceberg.
• Beneath it lies an even vaster hidden Web.
• Two main approaches to accessing the hidden Web
  ◦ Yahoo!-like Web directory
  ◦ Crawling the hidden Web
• Much work needs to be done.
• Hidden Web searching technology would enable us to connect different data sources and allow businesses to use data in new ways.


