Mine Report1


  • 8/8/2019 Mine Report1


    CHAPTER 1

    INTRODUCTION

A domain-specific search engine is an information access system that allows access to all the information on the web that is relevant to a particular domain. As the web continues to grow exponentially, the idea of crawling the entire web on a regular basis becomes less and less feasible, so domain-specific search engines were proposed to cover the information on a specific domain. As more information becomes available on the World Wide Web, it becomes more difficult to provide effective search tools for information access. Today, people access web information through two main kinds of search interfaces: browsers (clicking and following hyperlinks) and query engines (queries in the form of a set of keywords indicating the topic of interest). Web search tools need better support for expressing one's information need and for returning high-quality search results. There appears to be a need for systems that reason under uncertainty and are flexible enough to recover from the contradictions, inconsistencies, and irregularities that such reasoning involves.

    In a multi-view problem, the features of the domain can be partitioned into

    disjoint subsets (views) that are sufficient to learn the target concept. Semi-supervised,

    multi-view algorithms, which reduce the amount of labeled data required for learning,

    rely on the assumptions that the views are compatible and uncorrelated.

This paper describes the use of a semi-supervised machine learning approach with active learning for domain-specific search engines. The proposed work shows that, with the help of this approach, relevant data can be extracted with the minimum number of queries fired by the user. It requires a small number of labeled data and a pool of unlabeled data, on which the learning algorithm is applied to extract the required data.


    CHAPTER 2

    WWW

The World Wide Web is a vast source of information organized in the form of a large distributed hypertext system. Every hypertext link on every web page in the world contains a URL. When you click on a link of any kind on a web page, you send a request to retrieve the document on some computer in the world that is uniquely identified by that URL. URLs are like addresses of web pages.

    2.1 INFORMATION RETRIEVAL

As we move into the information age, there is a need to secure data transmission and also to store and retrieve data and information. Information retrieval systems are multiplying with the advent of technology.

Let us first talk about the basic differences between data and information.

    2.1.1 DATA:

Data is a raw fact. For its retrieval it needs to be fully specified: if the file name or the document name is not known, or the search is case sensitive, the system may fail and retrieve no document. Examples of data are a piece of paper, a book, and an algorithm. In these examples the location is not known, and hence no meaning can be given to the data.

    2.1.2 INFORMATION:

    Information is processed data.


For its retrieval, partial information is enough for its evaluation; hence the system rarely fails. Examples of information are a piece of paper on a table, a book on a shelf, and a bubble-sort algorithm. In these examples the locations are known, and hence the items have a specified meaning.

Data retrieval systems can be found in operating system search. Windows search is a good example of a data retrieval system: you have to specify the exact name of the file that you desire. Information retrieval systems, on the other hand, are like web search engines, of which the best known is Google. It processes natural language and produces as output the set of documents matching the query.

It is very important these days to retrieve data at a faster rate. Earlier, linear search mechanisms were used, in which the entire set of documents in the database is read, then sorted according to the query and displayed. These had varied complexities and took a greater amount of time compared with the advanced techniques available these days.

In an information retrieval system, the documents are scanned for the query. To decrease the computation time of the system, the documents are scanned only for the repeated keywords that are considered relevant to the document. The output displayed sends feedback as input for the next query; in this way, with every query there is an increase in the performance of the system. To reduce the time spent searching for a document, retrieval systems rely on indexing.
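The indexing idea above can be sketched as a minimal inverted index that maps each term to the documents containing it, so a query is answered by set intersection rather than a linear scan. The documents and query below are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "information retrieval with an inverted index",
    2: "linear search reads every document",
    3: "an index speeds up retrieval",
}
index = build_index(docs)
print(search(index, "retrieval index"))  # {1, 3}
```

Each query term is looked up in constant expected time, which is why indexing beats reading the whole collection per query.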


    2.2 AN INFORMATION RETRIEVAL SYSTEM

    Let me illustrate by means of a black box what a typical IR system would look like. The

    diagram shows three components: input, processor and output. Such a trichotomy may

    seem a little trite, but the components constitute a convenient set of pegs upon which to

    hang a discussion.

Starting with the input side of things, the main problem here is to obtain a representation of each document and query suitable for a computer to use. Let me emphasize that most computer-based retrieval systems store only a representation of the document (or query), which means that the text of a document is lost once it has been processed for the purpose of generating its representation. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated.

    Figure 2.3 A typical IR system


When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback.

    Secondly, the processor, that part of the retrieval system concerned with the

    retrieval process. The process may involve structuring the information in some

    appropriate way, such as classifying it. It will also involve performing the actual

    retrieval function, that is, executing the search strategy in response to a query. In the

    diagram, the documents have been placed in a separate box to emphasize the fact that

    they are not just input but can be used during the retrieval process in such a way that

    their structure is more correctly seen as part of the retrieval process. Finally, we come to

    the output, which is usually a set of citations or document numbers. In an operational

    system the story ends here. However, in an experimental system it leaves the evaluation

    to be done.


    CHAPTER 3

SEARCH ENGINE

    3.1 INFORMATION RETRIEVAL ARCHITECTURE

A search engine is a program that can search the Web on a specific topic. By typing in a word or phrase (known as a keyword), the user causes the search engine to produce pages of links on that topic. The more relevant links should be at the top of the list, but that is not always true. An information retrieval system works as shown in Fig. 2.

We can classify search engines as general-purpose search engines and domain-specific search engines. Search engines come in three major flavors: web crawlers, web portals, and meta-search engines.

Fig. 2 Information Retrieval Architecture

A search engine or a web crawler has


several parts: firstly, a crawler that traverses the web graph and downloads web documents; secondly, an indexer that processes and indexes the downloaded documents; thirdly, a query manager that handles the user query and returns relevant documents indexed in the database to the user. The Web is similar to a graph, in that links are like edges and web pages are like nodes. The crawler starts from a seed page and then uses the external links within it to reach other pages. Web crawlers retrieve web pages and add them or their representations to a local repository. Web content analysis is done either by document-query similarity, which is important for identifying the similarity of a user query to documents, or by document-document similarity, which is important for finding similar documents in the document pool of a search engine. Analysis of the link structure of the Web is done either by ranking or by hub and authority pages. In the former, web pages differ from each other in the number of in-links that they have, and the PageRank algorithm is used for this purpose; in the latter, the importance of pages is extracted from the link structure of the web. In this approach two kinds of pages are identified from web page links: first, pages that are very important and authoritative on a special topic; second, pages that have a great number of links to authority pages.
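The document-query similarity mentioned above is commonly computed as the cosine of the angle between term-frequency vectors; a minimal sketch (the query and document text are illustrative):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)          # shared-term overlap
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "domain specific search"
doc = "a domain specific search engine restricts search to one domain"
print(cosine_similarity(query, doc))
```

The same function serves for document-document similarity; production systems typically weight terms by tf-idf rather than raw counts.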

    3.2 DOMAIN SPECIFIC SEARCH ENGINE

A domain-specific search engine is an information access system that allows access to all the information on the web that is relevant to a particular domain. A domain-specific search engine, as distinct from a general Web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common examples include legal, medical, patent (intellectual property), travel, and automobile search engines. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only web pages that are relevant to a pre-defined topic or set of


    topics.

Domain-specific search offers several potential benefits over general search engines:

Greater precision due to limited scope

Leveraging domain knowledge, including taxonomies and ontologies

Support for specific, unique user tasks

Domain-specific verticals focus on a specific topic. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.

    3.3 WEB CRAWLER

WebCrawler, the first comprehensive full-text search engine for the World-Wide Web, has played a fundamental role in making the Web easier to use for millions of people. Its invention and subsequent evolution, spanning a three-year period, helped fuel the Web's growth by creating a new way of navigating hypertext.

Before search engines like WebCrawler, users found Web documents by following hypertext links from one document to another. When the Web was small and its documents shared the same fundamental purpose, users could find documents with relative ease. However, the Web quickly grew to millions of pages, making navigation difficult. WebCrawler assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. To use WebCrawler, a user issues a query to a pre-computed index, quickly retrieving a list of documents that match the query.

This dissertation describes WebCrawler's scientific contributions: a method for choosing a subset of the Web to index; an approach to creating a search service that is easy to use; a new way to rank search results that can generate highly effective results


    for both naive and expert searchers; and an architecture for the service that has

    effectively handled a three-order-of-magnitude increase in load.

This dissertation also describes how WebCrawler evolved to accommodate the extraordinary growth of the Web. This growth affected WebCrawler not only by increasing the size and scope of its index, but also by increasing the demand for its service. Each of WebCrawler's components had to change to accommodate this growth:

    the crawler had to download more documents, the full-text index had to become more

    efficient at storing and finding those documents, and the service had to accommodate

    heavier demand. Such changes were not only related to scale, however: the evolving

    nature of the Web meant that functional changes were necessary, too, such as the ability

    to handle naive queries from searchers.

    At its heart, WebCrawler is a full-text retrieval engine: it takes queries expressed as

    strings of words and looks them up in a full-text index. The field of full-text indexing

    and information retrieval (IR) has a long history: many of the fundamental underlying

algorithms were first described over 25 years ago [Salton 1975, Salton 1989]. Much of this early work is relevant to Web search engines; however, key differences not only between the Web and typical collections but also between the Web's users and typical searchers challenge many of the assumptions underlying conventional IR work.

    The basic problem in IR is to find relevant documents from a collection of documents in

response to a searcher's query. Typically, the query is expressed as a string of words and

    a set of constraints on metadata, and the results are a selection of documents from a large

    collection. Most of the work in IR has focused on systems for storing and viewing

    documents, methods for processing queries and determining the relevant results, and

    user interfaces for querying, viewing, and refining results.

    3.3.1 PARTS


A search engine or a web crawler has several parts: firstly, a crawler that traverses the web graph and downloads web documents; secondly, an indexer that processes and indexes the downloaded documents; thirdly, a query manager that handles the user query and returns relevant documents indexed in the database to the user. The Web is similar to a graph, in that links are like edges and web pages are like nodes. The crawler starts from a seed page and then uses the external links within it to reach other pages. Web crawlers retrieve web pages and add them or their representations to a local repository.
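The crawl loop described above can be sketched as a breadth-first traversal of the web graph from a seed page. The in-memory link graph below is illustrative; a real crawler would fetch pages over HTTP and extract links from the HTML:

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first traversal of the web graph from a seed page,
    returning pages in the order they would be downloaded."""
    frontier = deque([seed])
    visited = {seed}
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)            # "download" and index the page
        for link in get_links(page):  # hyperlinks are the graph's edges
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order

# Illustrative link graph: pages are nodes, hyperlinks are edges.
graph = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}
print(crawl("seed", lambda p: graph.get(p, [])))  # ['seed', 'a', 'b', 'c']
```

The `visited` set prevents re-downloading a page reachable by several paths, and `max_pages` stands in for the politeness and budget limits a real crawler enforces.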

Fig. WebCrawler's overall architecture. The crawler retrieves and processes documents from the Web, creating an index that the server uses to answer queries.

    3.4 META SEARCH ENGINE

From a searcher's perspective, meta-search is currently desirable because each search

    engine, although overlapping in content with others, has some unique documents in its

    index. Also, each search engine updates documents with different frequencies; thus,

    some will have a more up-to-date picture of the Web. Meta search engines also

    synthesize the relevance ranking schemes of these search engines and, as a result, can

    theoretically produce better results. The drawback to using meta-search engines is the

same as mentioned above for the widely distributed retrieval models: searchers encounter a slowdown in performance, though MetaCrawler made heavy use of caching

    to eliminate the performance problems.


    3.5 WEB PORTALS

A web portal, also known as a links page, presents information from diverse sources in a unified way. Apart from the standard search engine feature, web portals offer other services such as e-mail, news, stock prices, information, databases and entertainment.

    Portals provide a way for enterprises to provide a consistent look and feel with access

    control and procedures for multiple applications and databases, which otherwise would

    have been different entities altogether. Examples of public web portals are AOL,

    iGoogle, MSNBC, Netvibes, and Yahoo!.

3.6 ANALYSIS OF THE LINK STRUCTURE OF THE WEB

    Hubs and Authorities (also known as HITS algorithm) is a scheme used for ranking web

    pages based on relevance that existed on the web as a precursor to PageRank. The idea

    behind Hubs and Authorities stemmed from a particular insight into the creation of web

pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs. The scheme therefore assigns two scores for each page: 

    a hub score and an authority score.

    3.6.1 HUBS

    Hubs are highly-valued lists for a given query. For example, a directory page from a

    major encyclopedia or paper that links to many different highly-linked pages would

    typically have a higher hub score than a page that links to relatively few other sources.


    3.6.2 AUTHORITIES

    Authorities are highly endorsed answers to the query. A page that is particularly popular

    and linked by many different directories will typically have a higher authority score than

    a page that is unpopular.
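The two scores are computed by the iterative HITS update: each authority score is the sum of the hub scores of the pages linking to it, each hub score is the sum of the authority scores of the pages it links to, and both are normalized after each round. A minimal sketch (the link graph is illustrative):

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a link graph
    given as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing in.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of pages pointed to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "dir" links to both content pages, so it should emerge as the best hub;
# "page1" is linked by everyone, so it should emerge as the best authority.
links = {"dir": ["page1", "page2"], "page2": ["page1"], "page1": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # dir page1
```

The mutual reinforcement between the two updates is exactly the hub/authority intuition above: good hubs point to good authorities, and good authorities are pointed to by good hubs.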

    CHAPTER 4

    MACHINE LEARNING

    4.1 MACHINE LEARNING

    Machine learning is the study of how to make computers learn; the goal


is to make computers improve their performance through experience. As shown in Fig. 3, the computer gets as input a class of tasks, and with the help of an algorithm the performance of the system can be improved. In information retrieval, machine learning refers to the ability of a machine to improve its performance based on previous results. Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn". More specifically, machine learning is a method for creating computer programs by the analysis of data sets.

    Figure 2.3 Machine Learning

Machine learning is of three types:

    Supervised learning

    Unsupervised Learning

    Semi-supervised learning

4.1.1 SUPERVISED LEARNING: Supervised learning, also called classification, is based on learning from a training data set. From a theoretical point of view, supervised and

    unsupervised learning differ only in the causal structure of the model. In supervised

    learning, the model defines the effect one set of observations, called inputs, has on


    another set of observations, called outputs. In other words, the inputs are assumed to

    be at the beginning and outputs at the end of the causal chain. The models can

include mediating variables between the inputs and outputs.

4.1.2 UNSUPERVISED LEARNING: In this approach data exists but there is no

knowledge of the correct answer of applying the model to the data. An unsupervised learner agent is neither told the correct action in each state, nor given an evaluation of the action taken in each state. In unsupervised learning, all the

    observations are assumed to be caused by latent variables, that is, the observations

    are assumed to be at the end of the causal chain. In practice, models for supervised

    learning often leave the probability for inputs undefined. This model is not needed

    as long as the inputs are available, but if some of the input values are missing, it is

    not possible to infer anything about the outputs. If the inputs are also modeled, then

    missing inputs cause no problem since they can be considered latent variables as in

    unsupervised learning.

4.1.3 SEMI-SUPERVISED LEARNING: Semi-supervised learning is a goal-directed activity that can be precisely evaluated; related settings include learning from labeled and unlabeled documents, reinforcement learning, and case-based reasoning. Semi-supervised learning refers to the use of both labeled and unlabeled data for training. It contrasts with supervised learning (all data labeled) and unsupervised learning (all data unlabeled). Other names are learning from labeled and unlabeled data, or learning from partially labeled/classified data. Notice that semi-supervised learning can be either transductive or inductive.

    Here we focus on semi-supervised classification. It is a special form of classification.

    Traditional classifiers use only labeled data (feature / label pairs) to train. Labeled

    instances however are often difficult, expensive, or time consuming to obtain, as they

require the efforts of experienced human annotators. Meanwhile, unlabeled data may be relatively easy to collect, but there have been few ways to use them. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with

    the labeled data, to build better classifiers. Because semi-supervised learning requires

    less human effort and gives higher accuracy, it is of great interest both in theory and in

    practice.


Labels are hard to obtain while unlabeled data are abundant; therefore semi-supervised learning is a good way to reduce human labor and improve accuracy. Do not take it for granted, though: even if you (or your domain expert) do not spend as much time in labeling the training data, you need to spend a reasonable amount of effort to design good models / features / kernels / similarity functions for semi-supervised learning.

    One example of a semi-supervised learning technique is co-training, in which

    two or possibly more learners are each trained on a set of examples, but with each

    learner using a different, and ideally independent, set of features for each example.

An alternative approach is to model the joint probability distribution of the features and the labels. For the unlabeled data, the labels can then be treated as 'missing data'. It is common to use the EM algorithm to maximize the likelihood of the model.
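Co-training itself requires two independent feature views. A simpler relative of the same idea, self-training, can be sketched in a few lines: train on the labeled data, pseudo-label the unlabeled point the classifier is most confident about, and repeat. The nearest-centroid classifier and the one-dimensional data below are illustrative, not from the text:

```python
def centroid(points):
    """Mean of a list of 1-D feature values."""
    return sum(points) / len(points)

def self_train(labeled, unlabeled, rounds=5):
    """Self-training with a nearest-centroid classifier on 1-D data.
    labeled: list of (x, label) pairs with labels 0 or 1."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        # Pseudo-label the point the classifier is most confident about:
        # the one with the largest margin between its centroid distances.
        best = max(pool, key=lambda x: abs(abs(x - c0) - abs(x - c1)))
        label = 0 if abs(best - c0) < abs(best - c1) else 1
        labeled.append((best, label))
        pool.remove(best)
    return labeled

labeled = [(0.0, 0), (10.0, 1)]      # two expensive labeled examples
unlabeled = [1.0, 2.0, 8.0, 9.0]     # cheap-to-collect unlabeled pool
result = dict(self_train(labeled, unlabeled))
print(result)  # 1.0 and 2.0 get pseudo-label 0; 8.0 and 9.0 get 1
```

Two labeled points end up steering labels for the whole pool, which is the semi-supervised bargain: less human effort, with accuracy resting on how well the model's confidence is calibrated.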

    CHAPTER 5

ACTIVE LEARNING

    5.1 ACTIVE LEARNING


The aim of active learning is to design and analyze learning algorithms that can effectively filter or choose the samples from which they learn the hypothesis. Active learning is a method of online learning, where a learner strategically selects new training examples that provide maximal information about the unlabeled dataset, resulting in higher classification accuracy for a given training set size as compared to using randomly selected examples. Active learning is most useful when there is a sufficient number of unlabeled samples but it is expensive to obtain class labels. Many active learning algorithms, however, assume that the model built upon labeled data is not biased and that the probability distributions of the unlabeled and existing datasets are identical.

    Active learning (sometimes called query learning or optimal experimental

    design in the statistics literature) is a subfield of machine learning and, more generally,

    artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to

choose the data from which it learns (to be curious, if you will) it will perform better

    with less training. Why is this a desirable property for learning algorithms to have?

    Consider that, for any supervised learning system to perform well, it must often be

    trained on hundreds (even thousands) of labeled instances. Sometimes these labels come

at little or no cost, such as the "spam" flag you mark on unwanted email messages,

    or the five-star rating you might give to films on a social networking website. Learning

    systems use these flags and ratings to better filter your junk email and suggest movies

    you might enjoy. In these cases you provide such labels for free, but for many other

    more sophisticated supervised learning tasks, labeled instances are very difficult, time-

    consuming, or expensive to obtain. Here are a few examples:

Speech recognition. Accurate labeling of speech utterances is extremely time consuming and requires trained linguists. Annotation at the word level is reported to take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.


    Information extraction. Good information extraction systems must be trained

    using labeled documents with detailed annotations. Users highlight entities or

relations of interest in text, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can

    take a half-hour or more for even simple newswire stories. Annotations for other

    knowledge domains may require additional expertise, e.g., annotating gene and

    disease mentions for biomedical information extraction usually requires PhD-

    level biologists.

    Classification and filtering. Learning to classify documents (e.g., articles or web

    pages) or any other kind of media (e.g., image, audio, and video files) requires

that users label each document or media file with particular labels, like "relevant" or "not relevant". Having to annotate thousands of these instances can

    be tedious and even redundant.

    Active learning systems attempt to overcome the labeling bottleneck by asking queries

    in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator).

    In this way, the active learner aims to achieve high accuracy using as few labeled

    instances as possible, thereby minimizing the cost of obtaining labeled data. Active

    learning is well-motivated in many modern machine learning problems where data may

    be abundant but labels are scarce or expensive to obtain.
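One common query strategy for choosing which instance to send to the oracle is uncertainty sampling: ask for the label of the instance the current model is least sure about. A minimal sketch; the logistic model and the pool of points are illustrative, not from the text:

```python
import math

def least_confident(pool, predict_proba):
    """Pick the unlabeled instance whose positive-class probability
    is closest to 0.5, i.e. where the model is least certain."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))

def predict_proba(x):
    """Illustrative logistic model with a fixed decision boundary at x = 3."""
    return 1.0 / (1.0 + math.exp(-(x - 3.0)))

pool = [0.0, 2.9, 6.0, 10.0]
print(least_confident(pool, predict_proba))  # 2.9, nearest to the boundary
```

Points far from the decision boundary would be labeled the same way regardless, so labeling them teaches the model little; the point nearest the boundary is where one label buys the most information.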

    CHAPTER 6

LIMITATIONS OF EXISTING SYSTEM


    As the web continues to grow exponentially, the idea of crawling the entire

web on a regular basis becomes less and less feasible. General-purpose search engines have certain limitations, such as their querying mechanism, exact keyword matching, a low web coverage rate, and long result lists with low relevancy to the user query. Due to these basic problems, and also because of the need to include information on specific domains, domain-specific search engines were proposed. A domain-specific search engine is defined as an information access system that allows access to all the information on the web that is relevant to a particular domain.

Search engines can be classified as general-purpose search engines and domain-specific search engines. The domain in domain-specific search engines can be specialized in two existing kinds of search engines: search engines that focus on a specific document type, such as resumes, homepages, or movies; and search engines that focus on a specific topic, such as computer science, climbing, or sport. Examples of domain-specific search engines are the IBM Focused Crawler and the Cora domain-specific search engine.

6.1 IBM FOCUSED CRAWLER

Soumen Chakrabarti et al. [7, 14] proposed focused crawling, a new approach to topic-specific resource discovery, at IBM's Almaden center. The focus topic in this system is represented by a set of example pages that is provided by a user to the system. In the system described, there is a user-browsable topic taxonomy where the user can mark some of the documents as good and select them as the focus topic. The system has three main components. The first is a classifier that makes judgments on the relevancy of crawled documents and decides on following the links within pages; the classifier is an extended version of the Naive Bayes classifier. The second component is a distiller that evaluates the centrality of each page to determine the crawling priority of the links within it. The distiller uses the


bibliometric concepts of hub and authority pages as an approximate social judgment of web page quality. For each web page it calculates the hub and authority scores with an extended version of the HITS algorithm. It tries to crawl the links within pages with the highest hub score first in order to find new authorities. The third component of the system is a dynamic crawler that crawls the web according to a re-orderable priority queue. The system works in two phases: training and testing. In the training phase, the classifier is trained with some labeled data relevant to the focus topic. The training data set is acquired from existing taxonomy-like search engines (portals) such as Yahoo! and the Open Directory Project.
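The relevancy classifier can be illustrated with a plain multinomial Naive Bayes sketch. The training documents below are illustrative, and the IBM system used an extended version of the classifier; this only shows the underlying idea of scoring a crawled page against relevant and off-topic training text:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes: docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(model, text):
    """Return the label with the highest Laplace-smoothed log probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("machine learning crawler topic relevance", "relevant"),
    ("focused crawler topic specific pages", "relevant"),
    ("recipe for chocolate cake", "off-topic"),
    ("holiday travel photos", "off-topic"),
]
model = train_nb(docs)
print(classify(model, "focused crawler relevance"))  # relevant
```

In the focused crawler, this judgment decides whether a downloaded page's links are worth following at all.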

Figure 2.3 IBM Focused Crawler

Problems associated with this search engine are:

Fixed model of classifier. The IBM focused crawler uses a fixed model of relevancy class as a classifier to evaluate topical relevancy of documents. A more adaptive classifier uses documents that are marked as relevant by the


classifier to update itself. However, ensuring flexibility in the classifier without simultaneously corrupting it is difficult.

Does not model future rewards. A major problem faced by this focused crawler is that it does not learn that some sets of off-topic documents often lead reliably to highly relevant documents. In other words, it does not model the future reward of links. For example, the home page of a computer science department is not relevant to Reinforcement Learning, but links from that home page eventually lead to the home page of a machine learning laboratory or of a researcher, where the crawler may find valuable target pages.

Lack of comparison of results. The results reported in papers on this approach are not compared with the results of human-maintained portals such as the Open Directory Project; admittedly, judgments on the quality of the gathered pages are hard to make.

Exemplary documents. Representing the focus topic in the form of a few high-quality documents related to the topic is sometimes hard for the user.

    CHAPTER 7

    PROPOSED WORK


The problems discussed in Chapter 6 can be addressed with machine learning approaches combined with active learning. As discussed earlier, machine learning can be of three types; by employing the semi-supervised machine learning approach in the IBM focused crawler, system performance can be comparatively improved. Semi-supervised learning is a goal-directed activity that can be precisely evaluated. In the web context, our training data is a small set of labeled documents; the label is the document class, and our goal is to guess the label of an unseen document. In this category we review learning from labeled and unlabeled documents. In some semi-supervised approaches, a learner agent learns from interaction with a dynamic environment. In such environments, providing a set of training data for the agent is very difficult or even impossible because of the dynamics inherent in the environment and the correspondingly huge number of states and actions. One requirement of this model is a measure of the goodness of the action that the agent takes in a state.
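The labeled-documents setting described above, where the goal is to guess the label of an unseen document, can be illustrated with a minimal Naive Bayes text classifier; the training documents and labels below are invented for the sketch.

```python
import math
from collections import Counter

# Minimal Naive Bayes text classifier: trained on a few labeled documents,
# it guesses the label of an unseen document. The training texts are invented.
def train(docs):
    """docs: list of (text, label). Returns per-label word counts, priors, vocab."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    priors = Counter(label for _, label in docs)
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def classify(model, text):
    counts, priors, vocab = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        lp = math.log(priors[label] / total)
        denom = sum(c.values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((c.get(w, 0) + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("reinforcement learning agents", "relevant"),
               ("machine learning crawler", "relevant"),
               ("cooking pasta recipes", "irrelevant")])
label = classify(model, "learning agents")
```

With only three labeled documents the prediction for "learning agents" already comes out "relevant", which is exactly why enlarging the labeled set with confidently self-labeled unlabeled documents is attractive.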

7.1 Active Learner in Domain Specific Search Engine

To overcome the problems that occur in the IBM focused crawler, one technique is to replace the classifier with an active learner. As discussed in Chapter 5, an active learner is capable of predicting the hypothesis [11]; it generates the hypothesis from its accumulated experience. Semi-supervised learning is used because it is very expensive to generate labeled data for every set, and active learning approaches are combined with semi-supervised learning to improve the efficiency of the system. The following semi-supervised algorithms are used in text classification:

Co-Training

Semi-supervised EM

Co-EM

Co-EMT, which uses a multi-view active learning algorithm


The multi-view setting applies to learning problems that have a natural way to divide their features into subsets (views), each of which is sufficient to learn the target concept. Multi-view active learning maximizes the accuracy of the learned hypotheses while minimizing the amount of labeled training data.

The algorithm to improve the efficiency is discussed as follows:

Given:
- a learning problem with two views V1 and V2
- a learning algorithm
- sets T and U of labeled and unlabeled examples
- the number k of iterations to be performed
- the number N of queries to be made

Co-EMT for a domain-specific engine:

LOOP for k iterations:
- use V1(T) and V2(T) to create classifiers h1 and h2
- FOR EACH class Ci DO:
  - let E1 and E2 be the e unlabeled examples on which h1 and h2 make the most confident predictions for Ci (i.e., the data most relevant to the user query)
  - label E1 and E2 according to h1 and h2, respectively, and add them to T
- combine the predictions of h1 and h2; this combination yields the set of required documents.
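A runnable sketch of this loop is given below, using a toy two-view dataset and simple nearest-centroid classifiers standing in for the view-specific learners; all function names, the data, and the confidence measure are illustrative assumptions, not the original Co-EMT implementation.

```python
# Sketch of the co-training/Co-EMT loop: two views, each trained on the
# labeled set T; the most confidently classified unlabeled examples are
# moved into T each round. Nearest-centroid stands in for the real learners.
def train_view(T, view):
    """Return per-class centroids for one view (view is 0 or 1)."""
    sums, counts = {}, {}
    for x, label in ((ex[view], ex[2]) for ex in T):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence is negative distance."""
    label = min(centroids, key=lambda c: abs(x - centroids[c]))
    return label, -abs(x - centroids[label])

def co_train(T, U, k=3, e=1):
    T, U = list(T), list(U)
    for _ in range(k):
        h1, h2 = train_view(T, 0), train_view(T, 1)
        for ci in set(ex[2] for ex in T):
            for view, h in ((0, h1), (1, h2)):
                # the e most confident unlabeled examples for class ci
                cand = [u for u in U if predict(h, u[view])[0] == ci]
                cand.sort(key=lambda u: predict(h, u[view])[1], reverse=True)
                for u in cand[:e]:
                    U.remove(u)
                    T.append((u[0], u[1], ci))  # labeled by the hypothesis
    h1, h2 = train_view(T, 0), train_view(T, 1)

    # Combine the predictions of h1 and h2 by summing their confidences.
    def combined(x1, x2):
        scores = {}
        for h, x in ((h1, x1), (h2, x2)):
            label, conf = predict(h, x)
            scores[label] = scores.get(label, 0.0) + conf
        return max(scores, key=scores.get)
    return combined

# Toy data: two 1-D views; class "rel" clusters near 1.0, "irr" near 0.0.
T = [(0.9, 1.1, "rel"), (0.1, 0.0, "irr")]
U = [(1.0, 0.95), (0.05, 0.1), (0.8, 0.9), (0.2, 0.15)]
h = co_train(T, U)
```

The combined hypothesis plays the role of the final "combination of the required documents": each view votes with its confidence, so an unlabeled page that both views place near the relevant cluster is returned as relevant.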


Fig. Improvements in the IBM Focused Crawler

The system has three main components: (1) An active learner that makes judgments on the relevancy of crawled documents; it first maintains a history and forms hypotheses based on it, and second allows meta-reasoning to be done. (2) A distiller that evaluates the centrality of each page to determine the crawling priority of the links within it. The distiller uses the bibliometric concepts of hub and authority pages as an approximate social judgment of web page quality; for each crawled page it computes hub and authority scores with an extended version of the HITS algorithm. (3) A dynamic crawler that crawls the web according to a re-orderable priority queue.
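The re-orderable priority queue behind such a dynamic crawler can be sketched as follows; the URLs and scores are hypothetical, and re-ordering is handled lazily by re-pushing entries with updated priorities, a common heap idiom.

```python
import heapq

# Sketch of a re-orderable crawl frontier: URLs are popped in priority
# order, and a URL's priority can be raised later (e.g. after the
# distiller recomputes hub scores) by re-pushing it with a new score.
class CrawlFrontier:
    def __init__(self):
        self._heap = []   # entries of (negative priority, url)
        self._best = {}   # current best priority per url
        self._done = set()

    def add(self, url, priority):
        """Insert or re-prioritize a URL (stale entries are lazily skipped)."""
        if url in self._done:
            return
        if priority > self._best.get(url, float("-inf")):
            self._best[url] = priority
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        """Return the highest-priority URL not yet crawled, or None."""
        while self._heap:
            neg, url = heapq.heappop(self._heap)
            if url not in self._done and -neg == self._best[url]:
                self._done.add(url)
                return url
        return None

# Hypothetical usage: the distiller later boosts a page's priority.
f = CrawlFrontier()
f.add("http://a.example", 0.2)
f.add("http://b.example", 0.5)
f.add("http://a.example", 0.9)  # re-ordered after a new hub score
```

After the re-push, "http://a.example" is dequeued ahead of "http://b.example", which is what "re-orderable" buys over a plain FIFO frontier.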

    CHAPTER 8

    CONCLUSION


Search engine technology has gone through several evolutions and has finally reached the point where Artificial Intelligence can offer tremendous help. We have reviewed this evolution from the beginning up to now and surveyed several different techniques that have been developed to improve search engine functionality. In particular, we highlighted some machine learning approaches to information retrieval on the web and concentrated on topic-specific search engines. Finally, we proposed an information integration environment based on active learning. Our approach uses current technology in a better manner to provide appropriate results.

    REFERENCES


1. Information Retrieval Wiki: http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html

2. Web Crawling: http://www.chato.cl/papers/crawling_thesis/effective_web_crawling.pdf

3. Information Retrieval in Domain Specific Search Engine with Machine Learning Approaches: http://www.waset.org/journals/waset/v42/v42-26.pdf

4. S. Lawrence, Context in Web Search, IEEE Data Engineering Bulletin.

5. A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, Proc. of the Conference on Computational Learning Theory, pp. 92-100, 1998.

6. B. Settles, Active Learning Literature Survey: http://pages.cs.wisc.edu/~bsettles/pub/settles.activelearning.pdf

7. S. Chakrabarti, Data mining for hypertext: A tutorial survey, SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM 1(2):1-11, 2000. http://www.cs.uic.edu/~liub/WebMiningBook.html

8. S. Lawrence and C. L. Giles, Searching the World Wide Web, Science 280:98-100, 1998.
