
Local Search: The Internet Is the Yellow Pages

Marty Himmelstein, Long Hill Consulting, LLC

The proposed Internet-Derived Yellow Pages aggregate, annotate, and certify Web content for use in geographically oriented searching. The IDYP provides a framework for combining Internet-derived content with the trust and fairness that characterize the printed Yellow Pages, still the predominant source of consumer-oriented local information.

Every day, millions of people use their local newspapers, classified ad circulars, Yellow Pages directories, regional magazines, and the Internet to find information pertaining to the activities of daily life: nearby places, local merchants and services, items for sale, and happenings about town.

The Internet is not meeting its potential for delivering this type of geographically oriented information. Sometimes the information that people seek is on the Internet, but the tools for locating it are inadequate. In other cases, our industry has not developed the counterparts to replace traditional delivery methods such as the printed Yellow Pages.

The trends that point to the rapid growth of geographically oriented search, known as local search, are unmistakable. The most important predictor of the intensity of an individual's Internet usage is the availability of a broadband connection. As of early 2004, 55 percent of all US adult Internet users had access to such a connection.1 Further, the number of adult Americans who had broadband Internet connections at home increased 60 percent from the same time in 2003, to 24 percent.

Broadband access makes the Internet a pervasive, "always-on information appliance."2 People with high-speed access do more things on the Internet, and they do them more frequently. The Internet has always been used to support local activities, ranging from Yellow Pages searches, mapping, and vacation planning to researching products prior to purchasing them in a nearby store. Ubiquitous broadband access will serve to increase users' expectations for better support for all types of location-based computing.

On the search side, a market study of 5,000 online shoppers conducted by the Kelsey Group (TKG) and Bizrate.com found that 25 percent of the respondents' searches were for merchants "near my home or work."3 A recent Forrester Group study found that 65 percent of online shoppers researched a product online before purchasing it offline.4

On the content side, at least 20 percent of Web pages include one or more easily recognizable and unambiguous geographic identifiers, such as a postal address. Many of these Web pages have locally relevant content; Web authors don't put addresses on pages haphazardly. This content is already on the Internet despite the lack of an overarching mechanism for accessing it.

On the revenue side, US businesses spend $22 billion annually on local advertising, $14 billion of which is for the Yellow Pages, but only a sliver is for the Internet. Greg Sterling, managing editor for the Kelsey Group, a research firm that provides Yellow Pages metrics, puts the upper limit of advertisers worldwide who purchase paid search slots on the Internet at 250,000, but few of the slots are for local content.5 Contrast this with the more than 12 million small and medium businesses (SMBs) in the US, and another 20 million or so in other developed countries. The predominant market for SMBs is local: 60 percent of businesses in the US report that 75 percent of their customers come from within a 50-mile radius.5

LOCAL SEARCH TODAY AND TOMORROW

Local search today is discussed in the context of paid listings—the advertisements that appear near the algorithmically computed, or natural, results search engines return in response to user queries. However, paid listings and their variants are not the bedrock upon which local search will be built. To see why, we only need to examine the original and still predominant local search tool, the printed Yellow Pages.

The Yellow Pages have many shortcomings, but they also have two virtues that are indispensable for local search: They are both trustworthy and inclusive—they contain at least minimal information on all businesses.

Paid listings do not provide the infrastructure for replicating these core Yellow Pages virtues—in fact, the value of paid listings is that they are the opposite of fair. Rather, to reach the widest audience, paid listings require a stratum of YP-like data beneath them, and the richer that stratum is, the better.

The challenge for the local search community, then, is to facilitate the creation of this stratum of data. It must create better ways to collect and disseminate geographically oriented information about the activities of daily life. To meet this challenge, local search must supplant both the printed Yellow Pages and the current generation of Internet Yellow Pages (IYP)—a transplanted direct-mail mailing list—as a means for gathering and presenting consumer-oriented business information.

In ways that are readily evident, the Internet can furnish richer content than the Yellow Pages, but it cannot yet duplicate its orderliness and fairness. And fairness is the crucible by which local search will be judged. If users don't trust local search, it won't matter how much better than the Yellow Pages it is. People won't depend on it.

People use the Yellow Pages occasionally, but they are involved in local activities continually. It is therefore natural for local search to reflect the range of activities in which people participate. For example, much of our local activity has a temporal component. The Internet has the potential to provide access to transient local information more efficiently than older distribution mediums. A definition of local search that encompasses its temporal, commercial, and noncommercial aspects is that "local search tells me what is located within 100 miles from here and what is happening within 100 miles from here."

Broadly speaking, there are two sources of local content on the Internet. Offline-derived local content originates from other, usually older, sources, but is distributed on the Internet. The IYP is the primary source of offline-derived local content on the Internet.

Internet-derived local content is gathered directly from the Internet. While many searches return pages with local content, to date only a few systems have attempted to gather and present content that is specifically relevant for local search. Geosearch, a joint project between Vicinity and Northern Light, was the first large-scale effort to derive local content directly from the Internet.

Currently, content aggregators, such as the various city and vacation guides that abound on the Internet, provide some of the best local content. These aggregators have good information for popular categories, such as lodging and restaurants, and they rely on the IYP to fill gaps in their coverage. While they are good sources for some types of content, they do not provide a mechanism for replacing either the print or Internet Yellow Pages.

The Internet-Derived Yellow Pages provide a framework for local search that incorporates the trustworthiness associated with the Yellow Pages without jettisoning the potential for distributed, unencumbered content creation that is the Internet's inherent strength. The IDYP uses the Internet for both content distribution and content aggregation. Aggregation is a more significant challenge than distribution, but one that is not adequately addressed by the local search community.

The IDYP's ken is wider than commerce, but local search's first requirement is to be a better Yellow Pages.

GEOSPATIAL PROXIMITY SEARCHING

All varieties of local search require the ability to find information associated with locations within a given distance of a specified search center, known as geospatial proximity searching. Preparing data sources for proximity searching requires several steps.

The first step is to locate text that the geoenabled application can map to a physical location. This step is easy for the IYP because it is a simple structured database with defined meanings for each field. For Internet-derived content, the problem is trickier because text with geographic significance can be anywhere on a page.

The second step is to transform a location's textual designation into physical coordinates on Earth's surface. As the "Detecting Geographic Content in Text Documents" sidebar describes, the topic of detecting geographic content within text documents has generated interest in both the commercial and research sectors.

In the developed world, a street or postal address is the most common way to refer to a location, particularly for local search. Geocoding applications attempt to resolve a group of tokens into a pair of geographic coordinates, usually expressed as latitude and longitude. Along with each pair of coordinates, a geocoder also returns a value that represents the quality of the returned geocode. The best geocodes are accurate to within a few meters; less specific coordinates usually refer to the centroid of a larger region. Geocoding databases are large and dynamic, like the street networks they represent.
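To make that concrete, the result of a geocoding call can be pictured as a coordinate pair plus a quality indicator. The following Python sketch is illustrative only; the type names, quality levels, and the toy lookup table (including its coordinates) are assumptions, not part of any product discussed here.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class GeocodeQuality(Enum):
    """Illustrative quality levels, from rooftop-accurate down to a region centroid."""
    POINT = 1         # accurate to within a few meters
    STREET = 2        # interpolated along a street segment
    POSTAL_CODE = 3   # centroid of the postal code area
    CITY = 4          # centroid of the city or a larger region

@dataclass
class Geocode:
    latitude: float
    longitude: float
    quality: GeocodeQuality

# Toy table standing in for a large, frequently updated street-network database.
_TOY_DB = {
    ("10 MAIN ST", "POUGHKEEPSIE", "NY", "12601"):
        Geocode(41.7045, -73.9215, GeocodeQuality.STREET),  # illustrative coordinates
}

def geocode(tokens: tuple) -> Optional[Geocode]:
    """Resolve normalized address tokens to coordinates plus a quality value."""
    return _TOY_DB.get(tokens)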

Efficiently processing proximity queries against large data sets, such as a nationwide business directory of 14 million businesses, or the Internet, requires spatial access methods (SAMs). The basic idea of SAMs is to map a two-dimensional (or n-dimensional) coordinate system—in this case latitude and longitude—onto a single-dimensional coordinate system. By doing so, a region on Earth's surface can be denoted with a single attribute, a spatial key, instead of the four attributes that are necessary to describe a bounding rectangle: the x and y coordinates of the rectangle's lower left and upper right corners.

Spatial keys are computed as the data set to be geoenabled is being built. In the case of a business directory, the spatial key for each business is stored as an additional field along with the other fields for the business. If a search application is indexing unstructured text, it adds the spatial keys as additional terms to the index it builds for the page. Unstructured text, such as a Web page, could require several spatial keys because it may contain several addresses.

At search time, to determine the set of businesses or Web pages that satisfy a proximity query, the search application maps the user's search center and desired search radius to a set of spatial keys that cover the area to be searched. It then adds these spatial keys to the user's nongeographic query terms so that they can be compared to the precomputed spatial keys stored with the dataset being searched.

The search application refers to the user's nongeographic terms to determine which data items within the radius match the user's main search criteria. Proximity searching algorithms can order results by distance, so results closer to the search center are listed before those farther away. Ordering is routine for IYP applications, but can be problematic for Web pages because of the potentially large number of pages that may need to be sorted. An example of a paraphrased Geosearch query is: "Return Web pages that are about hot-air balloon rides and which contain postal addresses or telephone numbers within 100 miles of 10 Main Street, Poughkeepsie, NY."
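The following Python sketch shows one common way to realize such spatial keys, a Z-order (Morton) encoding in which nearby locations tend to share key prefixes. The article does not say which encoding Vicinity used, so this is an illustration of the general technique; the bit widths, the sampling-based covering step, and the degree-based radius are all simplifying assumptions.

def _quantize(value: float, lo: float, hi: float, bits: int) -> int:
    """Map a coordinate onto an integer grid index in [0, 2**bits)."""
    cells = 1 << bits
    idx = int((value - lo) / (hi - lo) * cells)
    return min(max(idx, 0), cells - 1)

def spatial_key(lat: float, lon: float, bits: int = 16) -> int:
    """Interleave latitude and longitude bits into a single Z-order key.

    Truncating the key (dropping low-order bit pairs) yields progressively
    coarser enclosing cells, which is what makes a one-dimensional key usable
    for two-dimensional proximity search.
    """
    y = _quantize(lat, -90.0, 90.0, bits)
    x = _quantize(lon, -180.0, 180.0, bits)
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: longitude
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: latitude
    return key

def covering_keys(lat: float, lon: float, radius_deg: float,
                  bits: int = 16, cell_bits: int = 8) -> set:
    """Coarse cell keys covering a square around the search center.

    A production system would cover a true great-circle radius; this sketch
    samples points in a degree-based box and truncates their keys to the
    cell_bits level, the same granularity stored with each indexed item.
    """
    keys = set()
    steps = 8
    for i in range(steps + 1):
        for j in range(steps + 1):
            y = lat - radius_deg + 2 * radius_deg * i / steps
            x = lon - radius_deg + 2 * radius_deg * j / steps
            keys.add(spatial_key(y, x, bits) >> (2 * (bits - cell_bits)))
    return keys

# Index term for a page geocoded near Poughkeepsie, NY, at the coarse cell level:
page_cell = spatial_key(41.70, -73.92) >> (2 * (16 - 8))
# Query-time cells for roughly a 100-mile (about 1.5 degree) search radius:
query_cells = covering_keys(41.70, -73.92, 1.5)
assert page_cell in query_cells   # the page falls inside the covered area

At indexing time the truncated key is stored as an extra term with the business record or Web page; at query time the covering keys are added to the user's nongeographic terms, exactly as described above.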

INTERNET-DERIVED LOCAL CONTENT

In 1998, a research group at Vicinity developed a prototype system to geoenable Web content. Vicinity modified the spatial access methods it developed for its IYP and business locator products to work as a software component in conjunction with search applications. In 1999, Vicinity teamed up with Northern Light to broaden its experiment to include the general Web corpus. Microsoft purchased Vicinity in December 2002.

Geosearch was publicly available from April 2000 until March 2002 from both Northern Light and Vicinity's MapBlast property. During this time, Geosearch processed about 300 million distinct Web pages.

The experience with Geosearch provided the basis for two observations:

• the Internet is already a rich source of local content, and
• local information on the Internet possesses certain characteristics that simplify the job of aggregating it.

The basic idea of Geosearch is that it transforms location information in text documents into a form that search engines can use for efficient proximity searching. Its first step is to scan documents to recognize text patterns that represent geocodable entities.

Geosearch avoids semantic text analysis, preferring to leave the determination of a document's subject matter to the information analysis algorithms of the search application with which it works. Geosearch relies on the fact that a significant portion of the content that is valuable for local search contains well-formed postal addresses, land-line telephone numbers, or both.

Sidebar: Detecting Geographic Content in Text Documents

The topic of detecting geographic content within text documents has generated interest in both the commercial and research sectors.

Commercially available products

Google Local (local.google.com) scans Web pages for US and Canadian addresses and North American Numbering Plan telephone numbers. Whereas Geosearch used location purely as a filter, Google adds an extra step of trying to correlate the address information on Web pages with IYP data.

Most local search offerings combine IYP data with more in-depth content from vertical content aggregators, but so far, Google is the only search engine that uses the Geosearch approach of deriving local content directly from Web pages. One sure way to determine whether a search product obtains local content directly from the Internet is to do an idiosyncratic search for which there is unlikely to be any IYP data. For example, both Geosearch and Google Local return results for "worm composting in Thetford, VT"—others do not.

MetaCarta's Geographic Text Search (www.metacarta.com) is a commercially available product that uses a place-name directory in combination with context-based analysis to determine the presence of geographic content. It will, for example, assign a location to the phrase "three miles south of Kandahar." GTS is appropriate for corpora that might have geographic content but not the obvious markers of postal addresses or telephone numbers.

Research

Content-based searching for location information requires identifying tokens that might have a geographic meaning. Systems that use place-name directories, called gazetteers, need to check the gazetteer for every token in a document. A token that is in the gazetteer must then be disambiguated to see if it really represents a location, and if so, which one. This process can be costly.

Systems based on standardized addresses typically look first for postal codes. Tokens that look like postal codes are rare, so few trigger additional work. Then, since the sequence of tokens in an address is rigidly constrained, it is not difficult to determine if a potential postal code is in fact part of an address. Efficiency might not be a concern for some document collections, but it is if the collection is the Internet.

Web-a-Where,1 a gazetteer-based system that associates geography with Web content, uses several techniques to resolve ambiguities. Ambiguities are classified as geo/geo (Paris, France or Paris, Texas) or geo/non-geo (Berlin, Germany or Isaiah Berlin). The system also assigns a geographic focus to each page—a locality that a page is in some way "about."

Junyan Ding and coauthors2 analyzed the geographic distribution of hyperlinks to a resource to determine its geographic scope. As expected, their analysis showed that The New York Times has a nationwide geographic scope. However, so does the San Jose Mercury News because readers across the country follow this California newspaper's technology reports. These authors also estimated a resource's geographic scope by using a gazetteer to examine its content.

In contrast to Ding and coauthors, Geosearch and Google Local rely on a user-centric approach to determine geographic scope because they allow users to specify the search radius of a query.

Kevin McCurley3 discussed using addresses, postal codes, and telephone numbers to discover geographic context. Remco Bouckaert4 demonstrated the potential of using the low-level structure of proximate tokens, such as postal addresses, to perform information extraction tasks.

Organizing Web content for local search

With the exception of the work by Dan Bricklin,5 relatively little has been written about organizing existing Web content for local search. Bricklin proposed the small and medium business metadata (SMBmeta) initiative as a way for enterprises to present essential information about themselves consistently on the Web. The idea is to create an XML file at the top level of a domain that contains basic information about the enterprise. Since SMBmeta files have a consistent location, name, and structure across Web sites, search applications can easily find and interpret the files.
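As a rough sketch of how a search application might consume such files, the Python below fetches a domain's top-level smbmeta.xml and extracts a few fields. The well-known file location comes from the initiative as described here; the element names are placeholders rather than the actual SMBmeta schema.

from urllib.request import urlopen
from xml.etree import ElementTree
from typing import Optional

def fetch_smbmeta(domain: str) -> Optional[dict]:
    """Fetch and parse a domain's top-level smbmeta.xml file, if one exists."""
    url = f"http://{domain}/smbmeta.xml"
    try:
        with urlopen(url, timeout=10) as resp:
            root = ElementTree.parse(resp).getroot()
    except (OSError, ElementTree.ParseError):
        return None
    fields = {}
    for tag in ("name", "address", "phone", "hours"):   # hypothetical element names
        node = root.find(tag)
        if node is not None and node.text:
            fields[tag] = node.text.strip()
    return fields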

In a perfect virtual world—a Web presence for all businesses, the willing participation of search engines to promulgate the use of metadata standards, and no spam—the original SMBmeta initiative would offer a simple way to disseminate information about local businesses.

In lieu of this, Bricklin proposed the SMBmeta ecosystem, which sketches some control mechanisms that are similar to IDYP's trusted authorities. Upon request, a registry returns a list of the domains it knows about that have SMBmeta data. A proxy maintains the equivalent of the smbmeta.xml file for domains that do not have their own files. An affirmation authority performs the policing functions.

Meeting the IDYP goal of creating an Internet version of the printed Yellow Pages requires leveraging the political and organizational infrastructure of trusted authorities. Rather than replicating the capabilities of the Yellow Pages, SMBmeta's goal is to help small and medium businesses establish a Web presence. However, the two share the approach of annotating Web content with structured information to make it more accessible for various search applications.

References

1. E. Amitay et al., "Web-a-Where: Geotagging Web Content," Proc. 27th Int'l Conf. Research and Development in Information Retrieval, ACM Press, 2004, pp. 273-280.
2. J. Ding, L. Gravano, and N. Shivakumar, "Computing Geographical Scopes of Web Resources," Proc. 26th VLDB Conf., Morgan Kaufmann, 2000; www1.cs.columbia.edu/~gravano/Papers/2000/vldb00.pdf.
3. K. McCurley, "Geospatial Mapping and Navigation of the Web," Proc. 10th Int'l Conf. WWW, ACM Press, 2001, pp. 221-229.
4. R. Bouckaert, "Low-Level Information Extraction: A Bayesian Network-Based Approach," 2002; www-ai.ijs.si/DunjaMladenic/TextML02/papers/Bouckaert.pdf.
5. D. Bricklin, "The SMBmeta Initiative," 2004; www.smbmeta.org.


The presence of one or more addresses is a hint about a document's subject that the designers of a search application's relevance ranking algorithms can use as they see fit. One advantage to this approach is that Geosearch is portable. It is a software component that is inserted at a convenient point into a search application's workflow.

Address recognizers

Geosearch address recognizers detect US-conformant addresses consisting of at least a postal code and a preceding state, Canadian postal codes and a preceding province, and North American Numbering Plan (NANP) telephone numbers (US, Canada, Caribbean). Canadian postal codes are particularly well-suited for local search because they have a short but unique format, and, especially in urban areas, they map to accurate latitudes and longitudes.

Geosearch scans all pages for address data. Using brute force to search for US addresses is justified by the fact that such addresses or telephone numbers are found on more than 20 percent of pages.

To internationalize Geosearch, it might be necessary to develop heuristics to determine what types of addresses to look for on a page. Address formats vary by country, and searching for an exhaustive set on each page could be too time-consuming.

Upon finding an address, the address recognizer forwards what it considers the relevant tokens to the geocoder so that it can assign geographic coordinates to the presumed address. Because geocoding is usually expensive compared to scanning, the address recognizer works to reduce the number of false addresses it sends to the geocoder.
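A simplified version of such a recognizer can be written with a few regular expressions, as in the Python sketch below. It covers only the cases named above (a state followed by a ZIP code, a province followed by a Canadian postal code, and NANP telephone numbers) and is an assumption about one plausible implementation, not Geosearch's actual code.

import re

# Requiring a state or province immediately before a conforming postal code keeps
# near-misses such as "Chicago-style pizza" from ever reaching the geocoder.
US_STATES = ("AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|MA|MI|"
             "MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|"
             "VT|VA|WA|WV|WI|WY")
CA_PROVINCES = "AB|BC|MB|NB|NL|NS|NT|NU|ON|PE|QC|SK|YT"

US_ADDRESS = re.compile(rf"\b(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?\b")
CA_ADDRESS = re.compile(rf"\b(?:{CA_PROVINCES})\s+[A-Z]\d[A-Z]\s?\d[A-Z]\d\b")
# NANP: area code and exchange cannot start with 0 or 1; a fuller recognizer would
# also skip toll-free prefixes, which Geosearch did not treat as local markers.
NANP_PHONE = re.compile(r"\(?[2-9]\d{2}\)?[ .-]?[2-9]\d{2}[ .-]?\d{4}\b")

def find_geocodable_spans(text: str) -> list:
    """Return (kind, matched text) pairs worth forwarding to the geocoder."""
    hits = []
    for kind, pattern in (("us_address", US_ADDRESS),
                          ("ca_address", CA_ADDRESS),
                          ("nanp_phone", NANP_PHONE)):
        hits.extend((kind, m.group(0)) for m in pattern.finditer(text))
    return hits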

Geosearch observations

For the two years that Geosearch was publicly available, and for the preceding year, Vicinity researchers used these techniques to closely observe Internet-derived local search and identify its strengths, weaknesses, and future opportunities. As a large-scale proof of concept, Geosearch exceeded their expectations.

Local data permeates the Web. When Vicinity researchers embarked on developing Geosearch in 1998, they evaluated sets of Web pages provided by several popular search engines and portals. On a consistent basis, more than 20 percent of these pages contained either a well-formed US or Canadian address or NANP telephone number. This percentage remained constant for the two years Geosearch was available on the Internet.

Although Geosearch only looked for North American addresses, the pages it examined were not restricted to North America. Therefore, the percentage of pages with a well-formed address from some country is certain to be higher than the 20 percent that Geosearch found.

Well-formed addresses are the rule, not the exception. The efficiency of the address recognition process was a concern to all of the engineering groups the Vicinity researchers worked with. By requiring a well-formed address, the researchers eliminated fruitless work examining text around tokens that marginally looked like part of an address but were not, such as "Chicago-style pizza." It turns out that requiring the combination of a state and postal code is not much of a sacrifice.

In most cases, addresses on Web pages conform to the postal standards used for the delivery of land mail. Occasionally, a postal code that a group of addresses shared was factored out of individual addresses and placed at the start of a table. Overwhelmingly, however, when Web authors include an address, they make the effort, aided by habit, to include one that is well formed. Thus, by promulgating addressing standards for the efficient delivery of land mail, national postal services have made a major contribution toward geoenabling the Web.

If telephone numbers are excluded, Geosearch found at least one address on 15 percent of pages. Some enterprises use a telephone number as the primary contact point. Plumbers, for example, serve a geographic area, and they rely on a phone number rather than a storefront to establish a local presence. A counterexample is that customer service phone numbers are probably not interesting for local search. Nationwide customer support numbers, however, are often toll free, and Geosearch did not consider these.

Sometimes the address recognizer found a telephone number but not an accompanying address that was in fact available. In these cases, the presence of a phone number could trigger more intensive scanning of the surrounding text for an address. The basic results were so encouraging, however, that we did not consider additional work on the address recognizer to be a high priority.

Addresses are keys to rich exogenous content. For most people, an address provides enough information to build a mental image of a location in a familiar neighborhood or to use as an index for finding the location on a map. An address is not directly usable for the distance computations and the mapping and routing applications that location-based computing requires. This is the job of geocoding applications that associate an address with a physical location on Earth's surface.

The databases that these applications use represent significant intellectual capital. For example, the US street network product from Geographic Data Technology, a leading provider of geocoding databases, contains more than 14 million addressed street segments, postal and census boundaries, landmarks, and water features. The company processes more than one million changes for this database each month (www.geographic.com/home/productsandservices.cfm).

An address is the key that associates this rich vein of exogenous information with Web content. Addresses are proportionally more valuable for local search because they are computationally easy to detect.

Addresses are metadata. The WWW Consortium defines metadata as machine-understandable information for the Web (www.w3.org/metadata). To date, attempts to incorporate metadata into search engine relevancy metrics have not gone well. HTML metatags are ignored because they are either misused or used fraudulently, and metadata standards have no value if they are disregarded. It's interesting to envision the semantic Web that metadata enables, but it's not yet ready for prime time.

These concerns are not persuasive for local search. Geosearch works because of the anomalous but fortunate circumstance that the metadata it depends on is already pervasive on the Internet. An address is metadata; its definition predates the Web, but its structure is portable to it.

Pages with addresses tend to be good quality. Organizations that put postal addresses on Web pages see the Internet not as a frivolity, but as a way to convey information. Even when a page with an address is unappealing, a quick glance at the site usually leads a user to conclude that the authors don't know how to create a good Web presence, not that they are swindlers or kids with too much time on their hands.

Local search is about more than commerce. Internet content reflects what people do—and they do more than shop: They have hobbies, seek like-minded individuals, look for support in times of stress. Sometimes when people do shop, either from preference or necessity, they are not looking for the closest chain store. They are looking for the practitioner of a local craft—a scrimshaw artist in Nova Scotia—or for some activity that is not quite mainstream—worm composting or home solar power generation.

The individual constituencies for the activities people pursue on a daily basis might be small, but taken together they comprise much of the regional information people search for. Some of the most satisfying Geosearch results were for idiosyncratic local content: breast cancer support groups, bird sightings, first-edition rare books, maple syrup (in Vermont), Washington Irving (in Tarrytown, New York).

One hundred years ago the Sears catalog was an innovation for distributing information about mainstream consumer goods. Improvements since then have been around the edges. The overlooked promise of local search is that it makes niche information not routinely found in mail-order or Internet catalogs, the Yellow Pages, or on television or radio, easy to come by. In this it is unrivaled by previous distribution mediums.

OFFLINE-DERIVED LOCAL CONTENT

In contrast to Internet-derived local content, the data that characterizes the Internet Yellow Pages is broad, uniform, shallow, and slow to change. This data wends a circuitous route from initial compilation to its final destination in IYP listings.

List compilation vendors, whose traditional customers use their products for business mailing lists, sales lead generation, and other direct mail and telemarketing applications, furnish IYP data. The compilers' main data sources are printed telephone directories, which are converted to digital information with OCR devices.

InfoUSA, a leading provider of premium lists, compiles its US list of 14 million businesses from 5,200 phone directories (www.infousa.com). The company augments this phone book data with secondary data sources such as annual reports, SEC filings, business registrations, and postal service change-of-address files. The compilers verify the information they gather by calling businesses. InfoUSA makes 17 million such telephone calls annually.

The IYP is slow to incorporate new and changed information, a shortcoming that is inherent in the source of its data, since telephone books are published annually. List vendors do generate periodic update files, but these updates are not free, and the effort required to merge them into the IYP is not trivial.

More fundamentally, staying current is an elusive goal for decentralized information that is compiled centrally. Telephone directories are out of date even at the moment of publication. Then, list vendors must correlate changes from their incoming data streams—the 5,200 directories, telephone verification calls, change-of-address files, and so forth. Each periodic update includes only a fraction of the changes in a vendor's possession, and it includes no changes that have occurred but are still in the pipeline.

Another problem with using compiled lists as a source for the IYP is that the consumer is not their primary market. The lists are flat structures without sufficient expressive power to convey the hierarchical and variable structure of many enterprises, specifically those with multiple external points of contact.

This missing information corresponds to some of the most dog-eared pages in printed directories: individual departments and physicians in hospitals and medical practices, group practices of all sorts, and municipal information. Even if this deficiency were somehow fixed, IYP service providers would still need to reflect the richer structure in the online databases they build from the compiled lists.

For all their shortcomings, the compiled lists from which the IYP is derived are authoritative and trusted sources of business information—characteristics that are not duplicated elsewhere. The clerks making those calls provide real value. Even if the information in the IYP already exists on the Internet, or will sometime soon, it is in a chaotic form, and there is no consistently reliable way to access it. The value of the compiled lists is data aggregation, an area in which local search has not yet contributed.

DECENTRALIZED DATA GATHERING: THE INTERNET-DERIVED YELLOW PAGES

The central challenge for local search is to move the job of aggregating and verifying local information closer to the sources of knowledge about that information. The human and electronic knowledge about local information is decentralized—geographically localized—and the Internet is a decentralized medium. Having decentralized tools for gathering this data is desirable as well.

Trusted authorities

The IDYP is a directory of local businesses, similar to the IYP but richer in content. The essential difference is that the information in the IDYP is derived directly from the Internet, not from offline sources.

The IDYP's viability depends on intermediaries, trusted authorities who can vouch for the information the IDYP provides and can perform the role of content aggregator for entities without a direct Web presence. Organizations that have relationships of trust with both the public and the entities whose information they are certifying or creating can perform this gatekeeper role.

Two examples of organizations that can serve as gatekeepers are those based on geography, such as a chamber of commerce, and those based on market segment, such as a trade organization. A primary function of both types of representative organizations is to collate and disseminate accurate information on behalf of their members. Both types of organizations are often conversant with Web technology, and they can function as proxies for constituents who don't have their own Web presence. While there are 14 million businesses in the US—most of them small—chambers of commerce and trade organizations number in the thousands.

Preventing fraudulent interlopers from compromising the integrity of their constituents' information is also in the best interests of these gatekeepers. For a chamber, this is conceptually and practically as simple as ensuring that each member it verifies or submits to the IDYP does indeed have a shop on Main Street.

The first function of a trusted authority is either to submit information to the IDYP on behalf of a member or to certify information a member has directly submitted to the IDYP. The second function, policing, is aimed at minimizing the amount of fraudulent or misleading data that makes its way into the IDYP.

Proxy mode. In proxy mode, trusted authorities are intermediaries for members who want a presence in the IDYP but do not want to interact directly with the Internet.

For example, a licensed hotdog-stand vendor with no interest in using the Internet would work with a representative at the chamber of commerce to get the right information into the IDYP. A hypothetical entry for this vendor would indicate that the stand is open from two hours before an event until one hour afterwards, provide the stand's location in the stadium, and state the type and price of the hotdogs, drinks, and condiments he sells. If, at the last minute, the vendor finds he will not be at an event, he can ask his contact to update his IDYP entry. This example is contrived, of course—but try finding information on street vendors in the Yellow Pages.

Authenticate mode. A trusted authority uses a Web interface to help create the structured information the IDYP requires for the members under its purview. The trusted authority releases this information to an IDYP server, at which point it becomes generally available. An entity can directly submit information to an IDYP server as long as the submittal refers to at least one valid registration with at least one trusted authority.

Policing

Unlike a purely virtual search, the subject matter in local search has a physical existence that can be confirmed. Therefore, local search is more resistant to fraud than are purely virtual searches. In the IDYP model, if no trusted authorities vouch for a business, it will not be included in the IDYP. Still, we must assume that efforts will be made to game the system and that some businesses will be tempted to misrepresent themselves.

The Internet's potential to provide assurances about local enterprises exceeds that of current directory services. It isn't possible to rely on the Yellow Pages to provide guidance about a business's reputation. The IDYP, however, can augment its information with various data sources, such as Better Business Bureaus, independent reviews, and public data. In addition, the IDYP can use practices that have become popular on the Internet for rating products, services, and sellers.

IDYP OPERATION MODES

Geosearch found that at least 20 percent of Web pages include an overt geographic qualifier. Even if every local enterprise eventually registers with a trusted authority, the Web will still contain much local content that is not known to the IDYP.

Geosearch's strength is that it finds local content in place, without requiring Web authors to change their routines for publishing that content. To integrate its content with local content on the Web that is not part of the IDYP, the IDYP supports two modes of operation. In one mode it supplies local search metadata to authorized applications; in the other, it is a stand-alone directory application.

Local search metadata

In the local search metadata mode, the IDYP makes its content available to subscribing applications. Subscribers are bound to use IDYP data in conformance with the policies and standards the IDYP sets forth. Trusted authorities and individual businesses can also specify directives on how subscribers use their information.

As a part of the page indexing process, a subscribing search application seeks associated IDYP information for the page it is indexing. If the page is authorized for local search, the search application includes some portion of the IDYP metadata in the index it builds for the page. In this way, IDYP data is incorporated into the general Web corpus.

A URL provides the connection between IDYP data and data on the Web. When an enterprise or its trusted authority creates its IDYP entry, it specifies a Web page address with which the IDYP entry is associated. This is the page to which the search application adds the IDYP metadata.
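A minimal sketch of that indexing step, assuming a hypothetical idyp_client whose lookup() returns a metadata dictionary for URLs the IDYP knows about and None otherwise:

from typing import Iterable

def index_page(url: str, page_terms: Iterable, idyp_client) -> set:
    """Merge IDYP metadata into the index entry for the page at this URL."""
    terms = set(page_terms)
    entry = idyp_client.lookup(url)                        # hypothetical IDYP API
    if entry is not None:                                  # page is known to the IDYP
        terms.add("idyp:certified")
        terms.update(f"idyp:category:{c}" for c in entry.get("categories", []))
        terms.update(str(k) for k in entry.get("spatial_keys", []))
    return terms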

For a member who doesn't have a direct Web presence, the trusted authority creates one or more pages that contain formatted content derived from the member's IDYP entry. The trusted authority either establishes a domain for the member or guarantees that the pages it creates for the member have persistent URLs.

A trusted authority might choose to generate pages for all its members. This would allow it to establish a consistent look and feel for its constituents. IDYP pages generated for members that already have established Web sites would contain links back to these existing pages.

The IDYP provides an imprimatur for pages that are relevant for local search. To accommodate pages with local content unknown to the IDYP, search applications can support either strict or non-strict local searches.

In strict mode, the search application only considers pages that are known to the IDYP. In non-strict mode, the search application uses its own heuristics for gauging which pages are relevant, and can return a mixed set of pages, some known to the IDYP, some not. If the search application tags the results that are known to the IDYP, users can decide for themselves how important the IDYP imprimatur is. It will be more valuable for Yellow Pages-like searches, less so for idiosyncratic ones.
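In code, the distinction is a simple filter-or-tag decision, sketched below under the assumption that the application knows which result URLs carry certified IDYP entries:

from typing import Iterable

def local_results(pages: Iterable, idyp_known: set, strict: bool) -> list:
    """Apply strict or non-strict handling to (url, score) result pairs.

    Strict mode drops pages unknown to the IDYP; non-strict mode keeps every
    page but tags it with its IDYP status so users can weigh the imprimatur
    themselves.
    """
    tagged = [(url, score, url in idyp_known) for url, score in pages]
    return [r for r in tagged if r[2]] if strict else tagged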

Stand-alone local directory

Given the popularity of search engines and portals as user interfaces, observers might anticipate that the IDYP's main role is to provide metadata for these applications. However, as a self-contained local directory, the IDYP can provide powerful features that are difficult to incorporate into a general-purpose search engine. Important, too, is that the IDYP should not depend on any particular search application for its promulgation. Its status as a stand-alone application ensures its independence.


Standard data representation. IDYP information is represented in XML. In addition to a standard core of attributes, industry groups can define customized extensions—known as XML schemas. An "hours of operation" attribute, for example, is part of the standard core, since virtually all businesses use it—although today's IYP does not include even this basic information. The XML schema for restaurants should allow queries about the catch of the day at the local seafood house.
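Purely as an illustration, an IDYP entry with a core block and an industry extension might be handled as in the Python sketch below; the element names are hypothetical, since the article does not define a concrete schema.

from xml.etree import ElementTree

# Hypothetical IDYP entry: a standard core (name, address, hours of operation)
# plus a restaurant extension of the sort an industry group might define.
SAMPLE_ENTRY = """\
<idyp-entry url="http://example.com/">
  <name>Harbor Seafood House</name>
  <address>10 Main St, Poughkeepsie, NY 12601</address>
  <hours-of-operation>Tue-Sun 11:30-22:00</hours-of-operation>
  <ext schema="restaurant">
    <cuisine>seafood</cuisine>
    <daily-special>catch of the day: bluefish</daily-special>
  </ext>
</idyp-entry>
"""

entry = ElementTree.fromstring(SAMPLE_ENTRY)
print(entry.findtext("hours-of-operation"))   # Tue-Sun 11:30-22:00
print(entry.findtext("ext/daily-special"))    # catch of the day: bluefish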

Rich categorization hierarchy. The business categorization schemes used by the print and Internet Yellow Pages are cursory. The Internet has spawned much work on commerce-oriented ontologies and user interfaces that are broadly applicable to local search.

Local search query language. A rich stratum of metadata will facilitate the construction of a local search query language with more expressive power than the Boolean keyword languages that current search engines use.

The parlance of local search is constrained—a variation of who, what, where, when, and how much: Who provides what service or product? Where is the provider located? When can I see the product? How much does it cost? For example: "Where can I buy stylish children's clothing on sale, within 10 miles of home, open late on Saturday evening?"
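One way to picture such a query language is as a structured object rather than a keyword string. The Python sketch below encodes the example query using assumed field names; it is illustrative, not a proposed standard.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LocalQuery:
    """Hypothetical structured form of a who/what/where/when/how-much query."""
    what: list                          # product or service terms
    where_center: str                   # search center: an address or "home"
    radius_miles: float = 10.0
    open_at: Optional[str] = None       # e.g. "Sat 21:00" for "open late Saturday"
    max_price: Optional[float] = None
    on_sale: bool = False

# "Where can I buy stylish children's clothing on sale, within 10 miles of home,
# open late on Saturday evening?"
query = LocalQuery(
    what=["children's clothing", "stylish"],
    where_center="home",
    radius_miles=10.0,
    open_at="Sat 21:00",
    on_sale=True,
)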

Short update latency. The time interval between an enterprise making a change and having that change reflected in the IDYP is brief, converging on instantaneous. We can define "the last croissant" heuristic, which states that the IDYP reaches optimal efficiency when an urban bakery can use it successfully to advertise a sale on its remaining bakery items 30 minutes before closing.

Geosearch, a geoenabled search engine that allows people to search for Web pages that contain geographic markers within a specified geographic area, demonstrates that the Internet is a rich source of local content. It also demonstrates the many advantages that postal addresses have as a key for accessing this content, especially when the content pertains to the activities of daily life. Postal addresses are ubiquitous, unambiguous, standardized, computationally easy to detect, and necessary for accessing the rich and precise content of geocoding databases.

The Internet Yellow Pages, currently the main source of local content on the Internet, are reliable, but they are also shallow, slow to change, centralized, and expensive. Their primary data sources are printed telephone directories. They do not use the Internet's resources in any meaningful way.

Local search today provides a poor user experience because it does little more than package old data for a new medium. The richest source of local content can and should be the Internet itself, but marshaling this resource requires developing an infrastructure such as the Internet-Derived Yellow Pages to organize and manage its content.

The IDYP is a structured database that relies on trusted authorities, such as chambers of commerce or trade associations, to certify the information it contains. The IDYP can function either as a stand-alone directory or as a source of metadata for search applications. A search application uses IDYP metadata to augment the information it maintains for Web pages that have local content. In this way, local search metadata is integrated into the general Web corpus.

References

1. J. Horrigan, Pew Internet Project, "Pew Internet Project Data Memo," Apr. 2004; www.pewinternet.org/pdfs/PIP_Broadband04.DataMemo.pdf.
2. J. Horrigan and L. Rainie, Pew Internet Project, "The Broadband Difference," June 2002; www.pewinternet.org/pdfs/PIP_Broadband_Report.pdf.
3. Kelsey Group & Bizrate.com, "Local Search Now 25% of Internet Commercial Activity," Feb. 2004; www.kelseygroup.com/press/pr040211.htm.
4. S. Kerner, "Majority of US Consumers Research Online, Buy Offline," Oct. 2004; www.clickz.com/stats/markets/retailing/article.php/3418001.
5. G. Sterling, "Is 2004 the Year of Local Search?" Dec. 2003; www.imediaconnection.com/content/2343.asp.

Acknowledgments

The author thanks the many people involved with making Geosearch happen, including the following—from Vicinity: Jeff Doyle, Charlie Goldensher, Jerry Halstead, Dwight Aspinwall, Dave Goldberg, Faith Alexandre, Darius Martin, Kavita Desai, and Eric Gottesman; and from Northern Light: Marc Krellenstein, Mike Mulligan, Sharon Kass, and Margo Rowley.

Marty Himmelstein is a software consultant. His interests include database systems, spatial databases, and Web technologies. He received an MS in computer science from SUNY Binghamton. He is a member of the IEEE Computer Society and the ACM. Contact him at [email protected].