
    SEARCH ENGINE

    A research project on SEARCH ENGINE

    SUBMITTED BY

    SATHISH KOTHA 108-00-0746

    University Of Northern Virginia

    CSCI 587 SEC 1220, SPECIAL TOPICS IN INFORMATION TECHNOLOGY-1

    6/20/2010


    Abstract of the Project

A web search engine is designed to search for information on the World Wide Web. The search results are usually presented in a list of results and are commonly called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input.

In this project, we discuss the types of search engines: how a search engine works and finds information for the user, what processes go on behind the screen, how many search engines exist today to provide information and facts to computer users, and the history of search engines. We also cover the different stages search engines go through when searching for information, the features of Web searching, and related topics such as the Advanced Research Projects Agency Network and what a BOT really means.

Other topics include the types of search queries we use when seeking information with a search engine, Web directories, very famous search engines like Google and Yahoo! and their role as search engines, challenges in language processing, and the characteristics of search engines.

I chose this topic for my project because I find the working of search engines interesting, and I want everyone to come across this topic and learn from it. Many people use search engines, but they do not know what really goes on behind the screen.

At the end of the project I also give the references from which I selected the material for this discussion. I hope you like this project and accept it as my topic for this course.


    ACKNOWLEDGEMENT

The project entitled SEARCH ENGINE is entirely my own effort. It is my duty to bring forward each and every one who is either directly or indirectly related to this project and without whom it would not have gained a structure.

Accordingly, sincere thanks to PROF. SOUROSHI for the support, valuable suggestions, and timely advice without which the project would not have been completed in time.

I also thank many others who helped throughout the project and made it successful.

    PROJECT ASSOCIATES


    CONTENTS

    PRELIMINARIES

    Acknowledgement

1. History of search engine

    Types of search queries

    World Wide Web wanderer

ALIWEB

    Primitive web search

2. Working of a search engine

    Web crawling

    Indexing

    Searching

3. New features for web searching

4. Conclusion

5. References


1. Early Technology

2. Directories

3. Vertical Search

4. Search Engine Marketing

    History of Search Engines: From 1945 to Google 2007

    As We May Think (1945):

The concept of hypertext and a memory extension really came to life in July of 1945 when, after enjoying the scientific camaraderie that was a side effect of WWII, Vannevar Bush's As We May Think was published in The Atlantic Monthly.

He urged scientists to work together to help build a body of knowledge for all mankind. Here are a few selected sentences and paragraphs that drive his point home.

Specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial.

    A record, if it is to be useful to science, must be continuously extended, it must be stored,and above all it must be consulted.

He not only was a firm believer in storing data, but he also believed that if the data source was to be useful to the human mind we should have it represent how the mind works to the best of our abilities.

Our ineptitude in getting at the record is largely caused by the artificiality of the systems of indexing. ... Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

The human mind does not work this way. It operates by association. ... Man cannot hope fully to duplicate this mental process artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency.

He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex.


    Gerard Salton (1960s - 1990s):

Gerard Salton, who died on August 28th of 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART information retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms.

    Ted Nelson:

Ted Nelson created Project Xanadu in 1960 and coined the term hypertext in 1963. His goal with Project Xanadu was to create a computer network with a simple user interface that solved many social problems like attribution.

While Ted was against complex markup code, broken links, and many other problems associated with traditional HTML on the WWW, much of the inspiration to create the WWW was drawn from Ted's work.

    There is still conflict surrounding the exact reasons why Project Xanadu failed to take off.

    Advanced Research Projects Agency Network:

ARPANet is the network which eventually led to the internet. Wikipedia has a great background article on ARPANet, and Google Video has a free, interesting video about ARPANet from 1972.

    Archie (1990):

The first few hundred web sites began in 1993 and most of them were at colleges, but long before most of them existed came Archie. Archie, the first search engine, was created in 1990 by Alan Emtage, a student at McGill University in Montreal. The original intent of the name was "archives," but it was shortened to Archie.


Archie helped solve this data scatter problem by combining a script-based data gatherer with a regular expression matcher for retrieving file names matching a user query. Essentially, Archie became a database of web filenames which it would match with users' queries.

    Bill Slawski has more background on Archie here.

    Veronica & Jughead:

As word of mouth about Archie spread, it started to become word of computer, and Archie had such popularity that the University of Nevada System Computing Services group developed Veronica. Veronica served the same purpose as Archie, but it worked on plain text files. Soon another user interface named Jughead appeared with the same purpose as Veronica. Both of these were used for files sent via Gopher, which was created as an Archie alternative by Mark McCahill at the University of Minnesota in 1991.

    File Transfer Protocol:

Tim Berners-Lee existed at this point; however, there was no World Wide Web. The main way people shared data back then was via File Transfer Protocol (FTP).

If you had a file you wanted to share, you would set up an FTP server. If someone was interested in retrieving the data, they could do so using an FTP client. This process worked effectively in small groups, but the data became as fragmented as it was collected.

    Tim Berners-Lee & the WWW (1991):

From Wikipedia:

While an independent contractor at CERN from June to December 1980, Berners-Lee proposed a project based on the concept of hypertext, to facilitate sharing and updating information among researchers. With help from Robert Cailliau he built a prototype system named Enquire.

After leaving CERN in 1980 to work at John Poole's Image Computer Systems Ltd., he returned in 1984 as a fellow. In 1989, CERN was the largest Internet node in Europe, and Berners-Lee saw an opportunity to join hypertext with the Internet. In his words, "I just had to take the hypertext idea and connect it to the TCP and DNS ideas and ta-da! the World Wide Web".


Nancy Blachman's Google Guide offers searchers free Google search tips, and Greg R. Notess's Search Engine Showdown offers a search engine features chart.

There are also many popular smaller vertical search services. For example, Del.icio.us allows you to search URLs that users have bookmarked, and Technorati allows you to search blogs.

    World Wide Web Wanderer:

Soon the web's first robot came. In June 1993 Matthew Gray introduced the World Wide Web Wanderer. He initially wanted to measure the growth of the web and created this bot to count active web servers. He soon upgraded the bot to capture actual URLs. His database became known as the Wandex.

The Wanderer was as much of a problem as it was a solution because it caused system lag by accessing the same page hundreds of times a day. It did not take long for him to fix this software, but people started to question the value of bots.

    ALIWEB:

In October of 1993 Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in response to the Wanderer. ALIWEB crawled meta information and allowed users to submit the pages they wanted indexed along with their own page description. This meant it needed no bot to collect data and was not using excessive bandwidth. The downside of ALIWEB is that many people did not know how to submit their site.

    Robots Exclusion Standard:

Martijn Koster also hosts the web robots page, which created standards for how search engines should index or not index content. This allows webmasters to block bots from their site on a whole-site level or on a page-by-page basis.

By default, if information is on a public web server and people link to it, search engines generally will index it.

In 2005 Google led a crusade against blog comment spam, creating a nofollow attribute that can be applied at the individual link level. After this was pushed through, Google quickly changed the scope of the purpose of the link nofollow to claim it was for any link that was sold or not under editorial control.

    Primitive Web Search:

By December of 1993, three full-fledged bot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation gathered info about the title and header from Web pages and retrieved these using a simple linear search. As the web grew, JumpStation slowed to a stop. The WWW Worm indexed titles and URLs. The problem with JumpStation and the World Wide


Web Worm is that they listed results in the order that they found them, and provided no discrimination. The RBSE spider did implement a ranking system.

Since early search algorithms did not do adequate link analysis or cache full page content, if you did not know the exact name of what you were looking for it was extremely hard to find it.

    Excite:

Excite came from the project Architext, which was started in February 1993 by six Stanford undergrad students. They had the idea of using statistical analysis of word relationships to make searching more efficient. They were soon funded, and in mid-1993 they released copies of their search software for use on web sites.

Excite was bought by a broadband provider named @Home in January 1999 for $6.5 billion, and was named Excite@Home. In October 2001 Excite@Home filed for bankruptcy. InfoSpace bought Excite from bankruptcy court for $10 million.

    Web Directories:

    VLib:

When Tim Berners-Lee set up the web he created the Virtual Library, which became a loose confederation of topical experts maintaining relevant topical link lists.

    EINet Galaxy

The EINet Galaxy web directory was born in January of 1994. It was organized similarly to how web directories are today. The biggest reason the EINet Galaxy became a success was that it also contained Gopher and Telnet search features in addition to its web search feature. The web size in early 1994 did not really require a web directory; however, other directories soon did follow.

    Yahoo! Directory


    Looksmart

Looksmart was founded in 1995. They competed with the Yahoo! Directory by frequently increasing their inclusion rates back and forth. In 2002 Looksmart transitioned into a pay-per-click provider, which charged listed sites a flat fee per click. That caused the demise of any good faith or loyalty they had built up, although it allowed them to profit by syndicating those paid listings to some major portals like MSN. The problem was that Looksmart became too dependent on MSN, and in 2003, when Microsoft announced they were dumping Looksmart, that basically killed their business model.

In March of 2002, Looksmart bought a search engine by the name of WiseNut, but it never gained traction. Looksmart also owns a catalog of content articles organized in vertical sites, but due to limited relevancy Looksmart has lost most (if not all) of their momentum. In 1998 Looksmart tried to expand their directory by buying the non-commercial Zeal directory for $20 million, but on March 28, 2006 Looksmart shut down the Zeal directory, hoping to drive traffic using Furl, a social bookmarking program.

    WebCrawler:

Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994. It was the first crawler which indexed entire pages. Soon it became so popular that during daytime hours it could not be used. AOL eventually purchased WebCrawler and ran it on their network. Then in 1997, Excite bought out WebCrawler, and AOL began using Excite to power its NetFind. WebCrawler opened the door for many other services to follow suit. Within a year of its debut came Lycos, Infoseek, and OpenText.

    Lycos:

Lycos was the next major search development, having been designed at Carnegie Mellon University around July of 1994. Michael Mauldin was responsible for this search engine and remained the chief scientist at Lycos Inc.

On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos


had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents -- more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word "surf."

    Infoseek:

Infoseek also started out in 1994, claiming to have been founded in January. They really did not bring a whole lot of innovation to the table, but they offered a few add-ons, and in December 1995 they convinced Netscape to use them as their default search, which gave them major exposure. One popular feature of Infoseek was allowing webmasters to submit a page to the search index in real time, which was a search spammer's paradise.

    AltaVista:

AltaVista's debut online came during this same month. AltaVista brought many important features to the web scene. They had nearly unlimited bandwidth (for that time), they were the first to allow natural language queries and advanced searching techniques, and they allowed users to add or delete their own URL within 24 hours. They even allowed inbound link checking. AltaVista also provided numerous search tips and advanced search features.

Due to poor mismanagement, a fear of result manipulation, and portal-related clutter, AltaVista was largely driven into irrelevancy around the time Inktomi and Google started becoming popular. On February 18, 2003, Overture signed a letter of intent to buy AltaVista for $80 million in stock and $60 million cash. After Yahoo! bought out Overture they rolled some of the AltaVista technology into Yahoo! Search, and occasionally use AltaVista as a testing platform.

    Inktomi:

The Inktomi Corporation came about on May 20, 1996 with its search engine HotBot. Two Cal Berkeley cohorts created Inktomi from the improved technology gained from their research. Hotwire listed this site and it became hugely popular quickly.

In October of 2001 Danny Sullivan wrote an article titled Inktomi Spam Database Left Open To Public, which highlights how Inktomi accidentally allowed the public to access their database of spam sites, which listed over 1 million URLs at that time.

Although Inktomi pioneered the paid inclusion model it was nowhere near as efficient as the pay-per-click auction model developed by Overture. Licensing their search results also was not profitable enough to pay for their scaling costs. They failed to develop a profitable business


model, and sold out to Yahoo! for approximately $235 million, or $1.65 a share, in December of 2003.

    Ask.com (Formerly Ask Jeeves):

In April of 1997 Ask Jeeves was launched as a natural language search engine. Ask Jeeves used human editors to try to match search queries. Ask was powered by DirectHit for a while, which aimed to rank results based on their popularity, but that technology proved too easy to spam as the core algorithm component. In 2000 the Teoma search engine was released, which uses clustering to organize sites by Subject Specific Popularity, which is another way of saying they tried to find local web communities. In 2001 Ask Jeeves bought Teoma to replace the DirectHit search technology.

Jon Kleinberg's Authoritative sources in a hyperlinked environment [PDF] was a source of inspiration that led to the eventual creation of Teoma. Mike Grehan's Topic Distillation [PDF] also explains how subject-specific popularity works.

    AllTheWeb

AllTheWeb was a search technology platform launched in May of 1999 to showcase Fast's search technologies. They had a sleek user interface with rich advanced search features, but on February 23, 2003, AllTheWeb was bought by Overture for $70 million. After Yahoo! bought out Overture they rolled some of the AllTheWeb technology into Yahoo! Search, and occasionally use AllTheWeb as a testing platform.

    Google also has a Scholar search program which aims to make scholarly research easier to do.

On November 15, 2005 Google launched a product called Google Base, which is a database of just about anything imaginable. Users can upload items and title, describe, and tag them as they see fit. Based on usage statistics this tool can help Google understand which vertical search products they should create or place more emphasis on. They believe that owning other verticals will allow them to drive more traffic back to their core search service. They also believe that targeted, measured advertising associated with search can be carried over to other mediums. For example, Google bought dMarc, a radio ad placement firm. Yahoo! has also tried to extend their


reach by buying other high-traffic properties, like the photo sharing site Flickr and the social bookmarking site del.icio.us.


    Google AdSense

On March 4, 2003 Google announced their content-targeted ad network. In April 2003, Google bought Applied Semantics, which had CIRCA technology that allowed them to drastically improve the targeting of those ads. Google adopted the name AdSense for the new ad program.

AdSense allows web publishers large and small to automate the placement of relevant ads on their content. Google initially started off by allowing textual ads in numerous formats, but eventually added image ads and video ads. Advertisers could choose which keywords they wanted to target and which ad formats they wanted to market.

To help grow the network and make the market more efficient, Google added a link which allows advertisers to sign up for an AdWords account from content websites, and Google allowed advertisers to buy ads targeted to specific websites, pages, or demographic categories. Ads targeted on websites are sold on a cost-per-thousand-impression (CPM) basis in an ad auction against other keyword-targeted and site-targeted ads.

Google also allows some publishers to place AdSense ads in their feeds, and some select publishers can place ads in emails.

To prevent the erosion of value of search ads, Google allows advertisers to opt out of placing their ads on content sites, and Google also introduced what they called smart pricing. Smart pricing automatically adjusts the click cost of an ad based on what Google perceives a click from that page to be worth. An ad on a digital camera review page would typically be worth more than a click from a page with pictures on it.

    Yahoo! Search Marketing

Yahoo! Search Marketing is the rebranded name for Overture after Yahoo! bought them out. As of September 2006 their platform is generally the exact same as the old Overture platform, with the same flaws: ad CTR is not factored into click cost, it is hard to run local ads, and it is just generally clunky.

    Microsoft AdCenter

Microsoft AdCenter was launched on May 3, 2006. While Microsoft has limited market share, they intend to increase their market share by baking search into Internet Explorer 7. On the features front, Microsoft added demographic targeting and dayparting features to the pay-per-click mix. Microsoft's ad algorithm includes both cost per click and ad clickthrough rate.


Later that year Andy Bechtolsheim gave them $100,000 seed funding, and Google received $25 million from Sequoia Capital and Kleiner Perkins Caufield & Byers the following year. In 1999 AOL selected Google as a search partner, and Yahoo! followed suit a year later. In 2000 Google also launched their popular Google Toolbar. Google has gained search market share year over year ever since.

In 2000 Google relaunched their AdWords program to sell ads on a CPM basis. In 2002 they retooled the service, selling ads in an auction which would factor in bid price and ad clickthrough rate. On May 1, 2002, AOL announced they would use Google to deliver their search-related ads, which was a strong turning point in Google's battle against Overture.

In 2003 Google also launched their AdSense program, which allowed them to expand their ad network by selling targeted ads on other websites.

    Going Public

Google used a two-class stock structure, decided not to give earnings guidance, and offered shares of their stock in a Dutch auction. They received virtually limitless negative press for the perceived hubris they expressed in their "AN OWNER'S MANUAL" FOR GOOGLE'S SHAREHOLDERS. After some controversy surrounding an interview in Playboy, Google dropped their IPO offer range to $85 to $95 per share from $108 to $135. Google went public at $85 a share on August 19, 2004, and its first trade was at 11:56 am ET at $100.01.

    Verticals Galore!

In addition to running the world's most popular search service, Google also runs a large number of vertical search services, including:

Google News: Google News launched in beta in September 2002. On September 6, 2006, Google announced an expanded Google News Archive Search that goes back over 200 years.

Google Book Search: On October 6, 2004, Google launched Google Book Search.

Google Scholar: On November 18, 2004, Google launched Google Scholar, an academic search program.

Google Blog Search: On September 14, 2005, Google announced Google Blog Search.

Google Base: On November 15, 2005, Google announced the launch of Google Base, a database of uploaded information describing online or offline content, products, or services.

Google Video: On January 6, 2006, Google announced Google Video.

Google Universal Search: On May 16, 2007, Google began mixing many of their vertical results into their organic search results.

    Just Search, We Promise!

    product.


    Microsoft

In 1998 MSN Search was launched, but Microsoft did not get serious about search until after Google proved the business model. Until Microsoft saw the light they primarily relied on partners like Overture, Looksmart, and Inktomi to power their search service.

They launched the technology preview of their search engine around July 1st of 2004. They formally switched from Yahoo! organic search results to their own in-house technology on January 31st, 2005. MSN announced they dumped Yahoo!'s search ad program on May 4th, 2006.

On September 11, 2006, Microsoft announced they were launching their Live Search product.


    2. Working of a search engine

A search engine operates in the following order:

1. Web crawling

2. Indexing

3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.

When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search which allows users to define the distance between keywords.
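To make the index-then-query flow concrete, here is a minimal sketch of an inverted index with Boolean AND/OR/NOT query handling. The toy documents, the tokenizer, and the tiny query language are assumptions for illustration only, not how any particular engine implements them.

```python
import re
from collections import defaultdict

# Toy corpus standing in for crawled pages (hypothetical documents).
docs = {
    1: "web search engines crawl and index pages",
    2: "a web crawler follows every link it sees",
    3: "boolean operators refine a search query",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        index[term].add(doc_id)

def search(query):
    """Evaluate a very small query language: terms joined by AND, OR, NOT."""
    tokens = query.lower().split()
    result = index.get(tokens[0], set())
    i = 1
    while i < len(tokens) - 1:
        op, term = tokens[i], tokens[i + 1]
        postings = index.get(term, set())
        if op == "and":
            result &= postings
        elif op == "or":
            result |= postings
        elif op == "not":
            result -= postings
        i += 2
    return sorted(result)

print(search("web and search"))   # -> [1]
print(search("web not crawler"))  # -> [1]
```

A real engine would additionally rank the matching documents and return snippets, but the index lookup step is essentially this set arithmetic.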

A web crawler (also known as a web spider, web robot, or, especially in the FOAF community, web scutter) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms.

This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
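The seed/frontier loop described above can be sketched in a few lines. This is an illustrative skeleton under simplifying assumptions: fetching and link extraction use only the Python standard library, and the selection, re-visit, and politeness policies that real crawlers need are reduced to a plain page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative links against the page URL
            if absolute not in visited:
                frontier.append(absolute)
    return visited

# Example (hypothetical seed): crawl(["http://example.com/"])
```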


    Crawling policies

There are three important characteristics of the Web that make crawling it very difficult:

its large volume,

its fast rate of change, and

dynamic page generation,

which combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages has also created difficulty, in that endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
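The arithmetic behind the gallery example is simply the product of the independent options: 4 sort orders x 3 thumbnail sizes x 2 file formats x 2 settings for user content = 48 distinct URLs for the same underlying content. A short sketch with hypothetical parameter names makes the blow-up visible:

```python
from itertools import product

# Hypothetical query parameters for the photo-gallery example.
sort_orders = ["date", "name", "size", "rating"]
thumb_sizes = ["small", "medium", "large"]
formats = ["jpg", "png"]
user_content = ["on", "off"]

urls = [
    f"http://gallery.example/?sort={s}&thumb={t}&fmt={f}&usercontent={u}"
    for s, t, f, u in product(sort_orders, thumb_sizes, formats, user_content)
]
print(len(urls))  # 48 URLs, all pointing at the same gallery content
```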

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

    The behavior of a web crawler is the outcome of a combination of policies:

A selection policy that states which pages to download.

A re-visit policy that states when to check for changes to the pages.

A politeness policy that states how to avoid overloading websites.

A parallelization policy that states how to coordinate distributed web crawlers.

    Selection policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available Internet. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the


latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to.
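A minimal sketch of the OPIC idea as just described: every page starts with an equal amount of "cash", on each visit a page's cash is split evenly among its out-links, and pages are prioritized by the cash they currently hold. The tiny graph and fixed number of simulated visits are assumptions for illustration; the published algorithm also tracks credit history and handles pages without out-links, which this sketch does not.

```python
# Toy link graph: page -> pages it links to (hypothetical).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Every page starts with an equal share of cash.
cash = {page: 1.0 / len(graph) for page in graph}
credit = {page: 0.0 for page in graph}   # cash accumulated over the crawl

for _ in range(10):                        # fixed number of simulated visits
    # Visit the page currently holding the most cash.
    page = max(cash, key=cash.get)
    credit[page] += cash[page]
    share = cash[page] / len(graph[page])  # split its cash among its out-links
    cash[page] = 0.0
    for target in graph[page]:
        cash[target] += share

# Pages with more accumulated credit are considered more important.
print(sorted(credit, key=credit.get, reverse=True))
```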

Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy.

Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.

Daneshpajouh et al. designed a community-based algorithm for discovering good seeds. Their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. One can extract good seeds from a previously crawled web graph using this new method. Using these seeds, a new crawl can be very effective.

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm or a slash. This strategy may cause numerous HTML Web resources to be unintentionally skipped. A similar strategy compares the extension of the web resource to a list of known HTML-page types: .html, .htm, .asp, .aspx, .php, and a slash.
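A small sketch of the HEAD-before-GET idea, combined with the cheap extension check, using only the Python standard library. The URL is hypothetical, and a real crawler would add redirect handling, error handling, and politeness delays.

```python
from urllib.request import Request, urlopen

def looks_like_html(url):
    """Issue an HTTP HEAD request and inspect the Content-Type header."""
    head = Request(url, method="HEAD")
    with urlopen(head, timeout=5) as response:
        content_type = response.headers.get("Content-Type", "")
    return content_type.startswith("text/html")

def fetch_if_html(url):
    # Cheap extension check first; fall back to a HEAD request to confirm the MIME type.
    if url.endswith((".html", ".htm", "/")) or looks_like_html(url):
        with urlopen(url, timeout=5) as response:
            return response.read()
    return None

# Example (hypothetical URL): fetch_if_html("http://example.com/index.html")
```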

Some crawlers intend to download as many resources as possible from a particular Web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.

Many path-ascending crawlers are also known as Harvester software, because they're used to "harvest" or collect all the content - perhaps the collection of photos in a gallery - from a specific page or host.
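The path-ascending behaviour can be sketched as a small helper that, given a seed URL, generates every ancestor path to add to the frontier. The seed URL is the one from the example above; the helper itself is an illustrative assumption, not Cothey's implementation.

```python
from urllib.parse import urlsplit, urlunsplit

def ancestor_urls(url):
    """Yield every parent path of a URL, up to the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if not path.endswith("/"):
            path += "/"
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

for u in ancestor_urls("http://llama.org/hamster/monkey/page.html"):
    print(u)
# http://llama.org/hamster/monkey/
# http://llama.org/hamster/
# http://llama.org/
```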

The main problem in focused crawling is that in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach


taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al. propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of focused crawling depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general Web search engine for providing starting points.

Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized for the present in the Semantic Web and Website Parse Template concepts. Web 3.0 crawling and indexing technologies will be based on clever human-machine associations.

The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened. These events can include creations, updates, and deletions.

From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: F_p(t) = 1 if p is up-to-date at time t, and 0 otherwise.

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: A_p(t) = 0 if p has not been modified since it was last crawled, and t minus the modification time of p otherwise.

[Figure: Evolution of freshness and age in Web crawling]
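As an illustration of the two cost functions just defined, a small helper can compute freshness and age for a local copy given the time it was last crawled and the time the live page last changed. The timestamps are made up for the example.

```python
def freshness(now, last_crawled, last_modified):
    """1 if the local copy is still accurate at time `now`, else 0."""
    return 1 if last_modified <= last_crawled else 0

def age(now, last_crawled, last_modified):
    """0 while the copy is up to date; otherwise how long it has been stale."""
    return 0 if last_modified <= last_crawled else now - last_modified

# Hypothetical timestamps in days since the crawl started.
print(freshness(now=10, last_crawled=7, last_modified=9))  # 0 (the copy is stale)
print(age(now=10, last_crawled=7, last_modified=9))        # 1 day out of date
```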

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are out-dated, while in the second case, the crawler is concerned with how old the local copies of pages are.

    Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)

    To improve freshness, we should penalize the elements that change too often (Cho and Garcia-Molina, 2003a). The optimal re-visiting policy is neither the uniform policy nor the proportional


policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page.

    Politeness policy

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers.

The costs of using web crawlers include:

Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.

Server overload, especially if the frequency of accesses to a given server is too high.

Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.

Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo! are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.
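Python's standard library already includes a parser for the robots exclusion protocol, so a polite crawler can check permissions and the non-standard Crawl-delay hint before fetching. The host and crawler name below are hypothetical placeholders.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"   # hypothetical crawler name

robots = RobotFileParser("http://example.com/robots.txt")
robots.read()               # fetch and parse the site's robots.txt

def allowed(url):
    return robots.can_fetch(USER_AGENT, url)

# Honour Crawl-delay if the site declares one; otherwise default to 1 second.
delay = robots.crawl_delay(USER_AGENT) or 1

if allowed("http://example.com/some/page.html"):
    time.sleep(delay)
    # ... fetch the page here ...
```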

The first proposed interval between connections was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable.

For those using web crawlers for research purposes, a more detailed cost-benefit analysis is needed and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3-4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading web servers, some complaints from Web server administrators are received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."

    Parallelization policy


Main article: Distributed web crawling

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
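One common way to express such an assignment policy is to hash each URL's host and map it to a crawling process, so every process owns a disjoint slice of the Web and the same URL can never be claimed by two processes. This is a sketch under that assumption, not a description of any specific crawler.

```python
import hashlib
from urllib.parse import urlsplit

NUM_PROCESSES = 4  # hypothetical number of parallel crawling processes

def assign_process(url):
    """Deterministically map a URL's host to one crawling process."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PROCESSES

print(assign_process("http://example.com/a.html"))  # same host ...
print(assign_process("http://example.com/b.html"))  # ... same process
```

Hashing by host rather than by full URL also keeps all pages of one site with one process, which makes per-site politeness easier to enforce.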

    Web crawler architectures

[Figure: High-level architecture of a standard Web crawler]

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture.

Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.

    URL normalization

Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component (Pant et al., 2004).
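A minimal sketch of the normalizations just listed (lowercasing, removing "." and ".." segments, and adding a trailing slash to a non-empty path); real canonicalization rule sets are considerably longer, and the trailing-slash heuristic here is an assumption for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    parts = urlsplit(url)
    # Lowercase the scheme and host.
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Resolve "." and ".." path segments.
    segments = []
    for seg in parts.path.split("/"):
        if seg == "..":
            if segments:
                segments.pop()
        elif seg not in (".", ""):
            segments.append(seg)
    path = "/" + "/".join(segments)
    # Add a trailing slash to a non-empty path that has no file extension.
    if segments and "." not in segments[-1]:
        path += "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

print(normalize("HTTP://Example.COM/a/./b/../c"))  # http://example.com/a/c/
```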

    Crawler identification

    Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. Web site administrators typically examine their web server's log and use the user-agent field to determine which crawlers have visited the web server and how often. The user-agent field may include a URL where the Web site administrator can find more information about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user-agent field, or they may mask their identity as a browser or other well-known crawler.

    It is important for web crawlers to identify themselves so that Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler.
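
    For example, a crawler might advertise its identity and a contact page through the User-agent header of each HTTP request; the bot name and URLs below are made up for illustration:

```python
import urllib.request

# Illustrative identity: a crawler name plus a URL where the site
# administrator can read about the crawler and contact its owner.
USER_AGENT = "ExampleBot/1.0 (+https://www.example.com/bot-info)"

request = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": USER_AGENT},
)
with urllib.request.urlopen(request) as response:
    page = response.read()
```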


    Identification is also useful for administrators who are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.

    RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads pages from the Web. WebCrawler was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and on another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text to the provided query.

    Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing.

    Popular engines focus on the full-text indexing of online, natural-language documents. Media types such as video, audio and graphics are also searchable.

    Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the required time and processing costs, while agent-based search engines index in real time.

    Indexing

    The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

    Index Design Factors

    Major factors in designing a search engine's architecture include:

    Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content.


    Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

    Storage techniques: How to store the index data; that is, whether information should be compressed or filtered.

    Index size: How much computer storage is required to support the index.

    Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

    Maintenance: How the index is maintained over time.

    Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning (including schemes such as hash-based or composite partitioning), and replication.

    Index Data Structures

    Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Several types of index structures are used; the two discussed below are the inverted index and the forward index.

    Challenges in Parallelism

    A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
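
    The producer-consumer relationship described above can be sketched with a work queue: one thread (the crawler) produces documents while another (the indexer) consumes them, so indexing can proceed while new material arrives. This is a simplified, single-machine sketch, not the architecture of any real engine:

```python
import queue
import threading

doc_queue = queue.Queue()   # crawler (producer) -> indexer (consumer)
index = {}                  # a trivial in-memory inverted index

def crawler(pages):
    for doc_id, text in pages:
        doc_queue.put((doc_id, text))   # produce a fetched document
    doc_queue.put(None)                 # sentinel: no more documents

def indexer():
    while True:
        item = doc_queue.get()          # consume the next document
        if item is None:
            break
        doc_id, text = item
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)

pages = [("doc1", "the cow says moo"), ("doc2", "the cat and the hat")]
worker = threading.Thread(target=indexer)
worker.start()
crawler(pages)
worker.join()
print(sorted(index["the"]))   # ['doc1', 'doc2']
```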

    Inverted indices


    Many search engines incorporate an inverted index when evaluating a search query, to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query and retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

    Inverted Index

    Word    Documents
    the     Document 1, Document 3, Document 4, Document 5
    cow     Document 2, Document 3, Document 4
    says    Document 5
    moo     Document 7

    This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.
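
    The sketch below builds both a boolean inverted index and a positional variant over a few made-up documents, to make the distinction concrete; it is illustrative only:

```python
from collections import defaultdict

documents = {
    "Document 1": "the cow says moo",
    "Document 2": "the cat and the hat",
    "Document 3": "the dish ran away with the spoon",
}

boolean_index = defaultdict(set)                            # word -> documents
positional_index = defaultdict(lambda: defaultdict(list))   # word -> doc -> positions

for doc_id, text in documents.items():
    for position, word in enumerate(text.split()):
        boolean_index[word].add(doc_id)
        positional_index[word][doc_id].append(position)

print(sorted(boolean_index["the"]))           # ['Document 1', 'Document 2', 'Document 3']
print(positional_index["the"]["Document 2"])  # [0, 3]
```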

    The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of hash table. In some cases the index is a form of binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. Inverted indices can be programmed in several computer programming languages.

    Index Merging

    The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

    After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time-consuming, so this process is commonly split into two parts: the development of a forward index, and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

    The Forward Index

    The forward index stores a list of words for each document. The following is a simplified form of the forward index:

    Forward Index

    Document      Words
    Document 1    the, cow, says, moo
    Document 2    the, cat, and, the, hat
    Document 3    the, dish, ran, away, with, the, spoon

    The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
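
    Using the sample documents above, the conversion can be sketched as sorting (word, document) pairs; the code is a simplified illustration rather than how any production indexer works:

```python
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# The forward index is essentially a list of (document, word) pairs;
# sorting those pairs by word groups the postings for each term.
pairs = sorted((word, doc)
               for doc, words in forward_index.items()
               for word in words)

inverted_index = {}
for word, doc in pairs:
    postings = inverted_index.setdefault(word, [])
    if doc not in postings:
        postings.append(doc)

print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']
```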

    Compression

    Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine.

    An estimated 2,000,000,000 different web pages existed as of the year 2000.

    Suppose there are 250 words on each web page (based on the assumption that they are similar to the pages of a novel).

    It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character.

    The average number of characters in any given word on a page may be estimated at 5 (Wikipedia:Size comparisons).

    The average personal computer comes with 100 to 250 gigabytes of usable space.

    Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression.

    Notably, large-scale search engine designs incorporate the cost of storage as well as the cost of the electricity to power the storage. Thus compression is a measure of cost.
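
    The back-of-the-envelope storage estimate above can be reproduced with a few lines of arithmetic; the input figures are the assumptions listed in the scenario, not measured values:

```python
pages = 2_000_000_000        # estimated web pages (year-2000 figure)
words_per_page = 250         # assumed, similar to the pages of a novel
bytes_per_word = 5           # ~5 characters at 1 byte per character

word_entries = pages * words_per_page          # 500 billion entries
index_bytes = word_entries * bytes_per_word
print(word_entries)                            # 500000000000
print(index_bytes / 10**9, "gigabytes")        # 2500.0 gigabytes
print(index_bytes / 10**9 / 100, "PCs at 100 GB free each")  # 25.0
```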

    Document Parsing

    Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

    Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

    Challenges in Natural Language Processing

    Word Boundary Ambiguity

    Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

    Language Ambiguity

    To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

    Diverse File Formats

    In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines that support multiple file formats must be able to correctly open and access the document and tokenize its characters.

    Faulty Storage


    The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

    Tokenization

    Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser, or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

    During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
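
    A toy tokenizer illustrating a few of these recorded attributes (position, character offset, and letter case) might look like the sketch below; real tokenizers are considerably more involved:

```python
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")   # words, or single punctuation marks

def tokenize(text):
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        raw = match.group()
        tokens.append({
            "token": raw.lower(),
            "position": position,        # token number within the text
            "offset": match.start(),     # character offset in the text
            "case": ("upper" if raw.isupper()
                     else "proper" if raw.istitle()
                     else "lower"),
        })
    return tokens

for token in tokenize("The cow says: Moo!"):
    print(token)
```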

    Language Recognition

    If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language-dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.
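
    As a very rough illustration (not a real language-identification algorithm), a document can be categorized by the dominant Unicode script of its characters; production systems instead rely on statistical models such as character n-grams:

```python
import unicodedata

def dominant_script(text):
    """Crude sketch: count the first word of each character's Unicode
    name ('LATIN', 'CJK', 'ARABIC', ...) and return the most common."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "UNKNOWN"

print(dominant_script("the cow says moo"))   # LATIN
print(dominant_script("البقرة تقول مو"))      # ARABIC
```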

    Format Analysis

    If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents, which controls the way the document is rendered on a computer screen or interpreted by a software program.
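
    As an illustration of tag stripping (one small part of format analysis), the sketch below uses Python's standard html.parser to keep the visible text of an HTML document and discard markup, including the contents of script and style elements; real format handlers are far more elaborate:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep textual content, drop tags, and skip <script>/<style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        return " ".join(" ".join(self.parts).split())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Moo</h1><p>The <b>cow</b> says moo.</p>"
               "<script>var x = 1;</script></body></html>")
print(extractor.text())   # Moo The cow says moo.
```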


    Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary, with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

    Microsoft Word
    Microsoft Excel
    Microsoft PowerPoint
    IBM Lotus Notes
    HTML
    ASCII text files (a text document without any formatting)
    Adobe's Portable Document Format (PDF)
    PostScript (PS)
    LaTeX
    The UseNet archive (NNTP) and other deprecated bulletin board formats
    XML and derivatives like RSS
    SGML (this is more of a general protocol)
    Multimedia meta data formats like ID3

    Options for dealing with various formats include using a publicly available commercial parsing tool offered by the organization which developed, maintains, or owns the format, or writing a custom parser.

    Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

    ZIP - Zip file
    RAR - Archive file
    CAB - Microsoft Windows Cabinet file
    Gzip - Gzip file
    BZIP - Bzip file
    TAR and TAR.GZ - Unix gzip'ped archives

    Format analysis can involve quality-improvement methods to avoid including 'bad information' in the index. Content producers can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing include:

    Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or Javascript to do so).


    Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

    Section Recognition

    Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side sections which do not contain primary material (that which the document is about). For example, this article displays a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

    Content in di