Mine Report1


  • 8/8/2019 Mine Report1


    CHAPTER 1

    INTRODUCTION

A domain-specific search engine is an information access system that allows access to all the information on the web that is relevant to a particular domain. As the web continues to grow exponentially, the idea of crawling the entire web on a regular basis becomes less and less feasible, so domain-specific search engines were proposed to cover the information on a specific domain. As more information becomes available on the World Wide Web, it becomes more difficult to provide effective search tools for information access. Today, people access web information through two main kinds of search interfaces: browsers (clicking and following hyperlinks) and query engines (queries in the form of a set of keywords indicating the topic of interest). Web search tools need better support for expressing one's information need and for returning high-quality search results. There appears to be a need for systems that reason under uncertainty and are flexible enough to recover from the contradictions, inconsistencies, and irregularities that such reasoning involves.

    In a multi-view problem, the features of the domain can be partitioned into

    disjoint subsets (views) that are sufficient to learn the target concept. Semi-supervised,

    multi-view algorithms, which reduce the amount of labeled data required for learning,

    rely on the assumptions that the views are compatible and uncorrelated.

This paper describes the use of a semi-supervised machine learning approach with active learning for domain-specific search engines. The proposed work shows that, with the help of this approach, relevant data can be extracted with the minimum number of queries fired by the user. It requires a small number of labeled data and a pool of unlabeled data, on which the learning algorithm is applied to extract the required data.


    CHAPTER 2

    WWW

The World Wide Web is a vast source of information organized in the form of a large distributed hypertext system. Every hypertext link on every web page in the world contains a URL. When you click on a link of any kind on a web page, you send a request to retrieve the document on some computer in the world that is uniquely identified by that URL. URLs are like addresses of web pages.

    2.1 INFORMATION RETRIEVAL

As we move into the information age, there is a need to secure data transmission and also to store and retrieve data and information. Information retrieval systems are multiplying with the advent of technology.

Let us first talk about the basic differences between data and information.

    2.1.1 DATA:

Data is a raw fact. For its retrieval it needs to be fully specified: if the file name or the document name is not known, or the search is case sensitive, the system may fail and retrieve no document. Examples of data are a piece of paper, a book, and an algorithm. In these examples the location is not known, and hence no meaning can be given to the data.

    2.1.2 INFORMATION:

    Information is processed data.


For its retrieval, partial information is enough for its evaluation; hence the system rarely fails. Examples of information are a piece of paper on a table, a book on a shelf, and a bubble-sort algorithm. In these examples the locations are known, and hence the items have a specified meaning.

Data retrieval systems can be found in operating system search. Windows search is a good example of a data retrieval system: you have to specify the exact name of the file that you desire. Information retrieval systems, on the other hand, are like web search engines, of which the best known is Google. It processes natural language and produces as output the set of documents matching the query.

It is very important these days to retrieve data at a faster rate. Earlier, linear search mechanisms were used, in which the entire set of documents in the database is read, then sorted according to the query and displayed. These had varied complexities and took a greater amount of time compared with the advanced techniques available these days.

In an information retrieval system, the documents are scanned for the query. To decrease the computation time of the system, the documents are scanned only for the repeated keywords that are considered relevant to the document. The output displayed sends feedback as input for the next query; in this way, with every query there is an increase in the performance of the system. To reduce the time spent searching for a document, retrieval systems rely on indexing.
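The indexing idea above can be sketched as a minimal inverted index that maps each term to the documents containing it, so a query is answered by set intersection rather than a linear scan. The documents and query below are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "information retrieval with an inverted index",
    2: "linear search reads every document",
    3: "an index speeds up retrieval",
}
index = build_index(docs)
print(search(index, "retrieval index"))  # {1, 3}
```

Each query term is looked up in constant expected time, which is why indexing beats reading the whole collection per query.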


    2.2 AN INFORMATION RETRIEVAL SYSTEM

    Let me illustrate by means of a black box what a typical IR system would look like. The

    diagram shows three components: input, processor and output. Such a trichotomy may

    seem a little trite, but the components constitute a convenient set of pegs upon which to

    hang a discussion.

Starting with the input side of things, the main problem here is to obtain a representation of each document and query suitable for a computer to use. Let me emphasize that most computer-based retrieval systems store only a representation of the document (or query), which means that the text of a document is lost once it has been processed for the purpose of generating its representation. A document representative could, for example, be a list of extracted words considered to be significant. Rather than have the computer process the natural language, an alternative approach is to have an artificial language within which all queries and documents can be formulated.

    Figure 2.3 A typical IR system


When the retrieval system is on-line, it is possible for the user to change his request during one search session in the light of a sample retrieval, thereby, it is hoped, improving the subsequent retrieval run. Such a procedure is commonly referred to as feedback.

    Secondly, the processor, that part of the retrieval system concerned with the

    retrieval process. The process may involve structuring the information in some

    appropriate way, such as classifying it. It will also involve performing the actual

    retrieval function, that is, executing the search strategy in response to a query. In the

    diagram, the documents have been placed in a separate box to emphasize the fact that

    they are not just input but can be used during the retrieval process in such a way that

    their structure is more correctly seen as part of the retrieval process. Finally, we come to

    the output, which is usually a set of citations or document numbers. In an operational

    system the story ends here. However, in an experimental system it leaves the evaluation

    to be done.


    CHAPTER 3

SEARCH ENGINE

    3.1 INFORMATION RETRIEVAL ARCHITECTURE

A search engine is a program that can search the Web on a specific topic. By typing in a word or phrase (known as a keyword), the user causes the search engine to produce pages of links on that topic. The more relevant links should be at the top of the list, but that is not always true. An information retrieval system works as shown in Fig. 2.

We can classify search engines as general-purpose search engines and domain-specific search engines. Search engines come in three major flavors: web crawlers, web portals, and meta-search engines.

Fig. 2 Information Retrieval Architecture

A search engine or a web crawler has


several parts: firstly, a crawler that traverses the web graph and downloads web documents; secondly, an indexer that processes and indexes the downloaded documents; thirdly, a query manager that handles the user query and returns relevant documents indexed in the database to the user. The Web is similar to a graph, in that links are like edges and web pages are like nodes. The crawler starts from a seed page and then uses the external links within it to reach other pages. Web crawlers retrieve web pages and add them or their representations to a local repository. Web content analysis is done either by document-query similarity, which is important for identifying the similarity of a user query to documents, or by document-document similarity, which is important for finding similar documents in the document pool of a search engine. Analysis of the link structure of the Web is done either by ranking or by hub and authority pages. In the former, web pages differ from each other in the number of in-links that they have, and the PageRank algorithm is used for this purpose; in the latter, the importance of pages is extracted from the link structure of the web. In this approach two kinds of pages are identified from web page links: first, pages that are very important and authoritative on a special topic; second, pages that have a great number of links to authority pages.
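The document-query similarity mentioned above is commonly computed as the cosine of the angle between term-frequency vectors; a minimal sketch (the query and document text are illustrative):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)          # shared-term overlap
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "domain specific search"
doc = "a domain specific search engine restricts search to one domain"
print(cosine_similarity(query, doc))
```

The same function serves for document-document similarity; production systems typically weight terms by tf-idf rather than raw counts.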

    3.2 DOMAIN SPECIFIC SEARCH ENGINE

A domain-specific search engine is an information access system that allows access to all the information on the web that is relevant to a particular domain. A domain-specific search engine, as distinct from a general Web search engine, focuses on a specific segment of online content. The vertical content area may be based on topicality, media type, or genre of content. Common examples include legal, medical, patent (intellectual property), travel, and automobile search engines. In contrast to general Web search engines, which attempt to index large portions of the World Wide Web using a web crawler, vertical search engines typically use a focused crawler that attempts to index only web pages that are relevant to a pre-defined topic or set of


    topics.

Domain-specific search offers several potential benefits over general search engines:

Greater precision due to limited scope

Leveraging domain knowledge, including taxonomies and ontologies

Support for specific, unique user tasks

Domain-specific verticals focus on a specific topic. Domain-specific search solutions focus on one area of knowledge, creating customized search experiences that, because of the domain's limited corpus and clear relationships between concepts, provide extremely relevant results for searchers.

    3.3 WEB CRAWLER

WebCrawler, the first comprehensive full-text search engine for the World-Wide Web, has played a fundamental role in making the Web easier to use for millions of people. Its invention and subsequent evolution, spanning a three-year period, helped fuel the Web's growth by creating a new way of navigating hypertext.

Before search engines like WebCrawler, users found Web documents by following hypertext links from one document to another. When the Web was small and its documents shared the same fundamental purpose, users could find documents with relative ease. However, the Web quickly grew to millions of pages, making navigation difficult. WebCrawler assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. To use WebCrawler, a user issues a query to a pre-computed index, quickly retrieving a list of documents that match the query.

This dissertation describes WebCrawler's scientific contributions: a method for choosing a subset of the Web to index; an approach to creating a search service that is easy to use; a new way to rank search results that can generate highly effective results


    for both naive and expert searchers; and an architecture for the service that has

    effectively handled a three-order-of-magnitude increase in load.

This dissertation also describes how WebCrawler evolved to accommodate the extraordinary growth of the Web. This growth affected WebCrawler not only by increasing the size and scope of its index, but also by increasing the demand for its service. Each of WebCrawler's components had to change to accommodate this growth:

    the crawler had to download more documents, the full-text index had to become more

    efficient at storing and finding those documents, and the service had to accommodate

    heavier demand. Such changes were not only related to scale, however: the evolving

    nature of the Web meant that functional changes were necessary, too, such as the ability

    to handle naive queries from searchers.

    At its heart, WebCrawler is a full-text retrieval engine: it takes queries expressed as

    strings of words and looks them up in a full-text index. The field of full-text indexing

    and information retrieval (IR) has a long history: many of the fundamental underlying

algorithms were first described over 25 years ago [Salton 1975, Salton 1989]. Much of this early work is relevant to Web search engines; however, key differences not only between the Web and typical collections but also between the Web's users and typical searchers challenge many of the assumptions underlying conventional IR work.

    The basic problem in IR is to find relevant documents from a collection of documents in

response to a searcher's query. Typically, the query is expressed as a string of words and

    a set of constraints on metadata, and the results are a selection of documents from a large

    collection. Most of the work in IR has focused on systems for storing and viewing

    documents, methods for processing queries and determining the relevant results, and

    user interfaces for querying, viewing, and refining results.

    3.3.1 PARTS


A search engine or a web crawler has several parts: firstly, a crawler that traverses the web graph and downloads web documents; secondly, an indexer that processes and indexes the downloaded documents; thirdly, a query manager that handles the user query and returns relevant documents indexed in the database to the user. The Web is similar to a graph, in that links are like edges and web pages are like nodes. The crawler starts from a seed page and then uses the external links within it to reach other pages. Web crawlers retrieve web pages and add them or their representations to a local repository.
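The crawl loop described above can be sketched as a breadth-first traversal of the web graph from a seed page. The in-memory link graph below is illustrative; a real crawler would fetch pages over HTTP and extract links from the HTML:

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first traversal of the web graph from a seed page,
    returning pages in the order they would be downloaded."""
    frontier = deque([seed])
    visited = {seed}
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)            # "download" and index the page
        for link in get_links(page):  # hyperlinks are the graph's edges
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order

# Illustrative link graph: pages are nodes, hyperlinks are edges.
graph = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}
print(crawl("seed", lambda p: graph.get(p, [])))  # ['seed', 'a', 'b', 'c']
```

The `visited` set prevents re-downloading a page reachable by several paths, and `max_pages` stands in for the politeness and budget limits a real crawler enforces.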

Fig. WebCrawler's overall architecture. The crawler retrieves and processes documents from the Web, creating an index that the server uses to answer queries.

    3.4 META SEARCH ENGINE

From a searcher's perspective, meta-search is currently desirable because each search

    engine, although overlapping in content with others, has some unique documents in its

    index. Also, each search engine updates documents with different frequencies; thus,

    some will have a more up-to-date picture of the Web. Meta search engines also

    synthesize the relevance ranking schemes of these search engines and, as a result, can

    theoretically produce better results. The drawback to using meta-search engines is the

same as mentioned above for the widely distributed retrieval models: searchers encounter a slowdown in performance, though MetaCrawler made heavy use of caching

    to eliminate the performance problems.


    3.5 WEB PORTALS

A web portal, also known as a links page, presents information from diverse sources in a unified way. Apart from the standard search engine feature, web portals offer other services such as e-mail, news, stock prices, information, databases and entertainment.

    Portals provide a way for enterprises to provide a consistent look and feel with access

    control and procedures for multiple applications and databases, which otherwise would

    have been different entities altogether. Examples of public web portals are AOL,

    iGoogle, MSNBC, Netvibes, and Yahoo!.

3.6 ANALYSIS OF THE LINK STRUCTURE OF THE WEB

    Hubs and Authorities (also known as HITS algorithm) is a scheme used for ranking web

    pages based on relevance that existed on the web as a precursor to PageRank. The idea

    behind Hubs and Authorities stemmed from a particular insight into the creation of web

pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs. The scheme therefore assigns two scores for each page: 

    a hub score and an authority score.

    3.6.1 HUBS

    Hubs are highly-valued lists for a given query. For example, a directory page from a

    major encyclopedia or paper that links to many different highly-linked pages would

    typically have a higher hub score than a page that links to relatively few other sources.


    3.6.2 AUTHORITIES

    Authorities are highly endorsed answers to the query. A page that is particularly popular

    and linked by many different directories will typically have a higher authority score than

    a page that is unpopular.
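The two scores are computed by the iterative HITS update: each authority score is the sum of the hub scores of the pages linking to it, each hub score is the sum of the authority scores of the pages it links to, and both are normalized after each round. A minimal sketch (the link graph is illustrative):

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a link graph
    given as {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing in.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: sum of authority scores of pages pointed to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# "dir" links to both content pages, so it should emerge as the best hub;
# "page1" is linked by everyone, so it should emerge as the best authority.
links = {"dir": ["page1", "page2"], "page2": ["page1"], "page1": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # dir page1
```

The mutual reinforcement between the two updates is exactly the hub/authority intuition above: good hubs point to good authorities, and good authorities are pointed to by good hubs.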

    CHAPTER 4

    MACHINE LEARNING

    4.1 MACHINE LEARNING

    Machine learning is the study of how to make computers learn; the goal


is to make computers improve their performance through experience. As shown in Fig. 3, the computer gets as input a class of tasks, and with the help of an algorithm the performance of the system can be improved. In information retrieval, machine learning refers to the ability of a machine to improve its performance based on previous results. Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn". More specifically, machine learning is a method for creating computer programs by the analysis of data sets.

    Figure 2.3 Machine Learning

Machine learning is of three types:

    Supervised learning

    Unsupervised Learning

    Semi-supervised learning

4.1.1 SUPERVISED LEARNING: Supervised learning, also called classification, is based on learning from a training data set. From a theoretical point of view, supervised and

    unsupervised learning differ only in the causal structure of the model. In supervised

    learning, the model defines the effect one set of observations, called inputs, has on


    another set of observations, called outputs. In other words, the inputs are assumed to

    be at the beginning and outputs at the end of the causal chain. The models can

include mediating variables between the inputs and outputs.

4.1.2 UNSUPERVISED LEARNING: In this approach data exists but there is no

knowledge of the correct answer of applying the model to the data. An unsupervised learner agent is neither told the correct action in each state, nor given an evaluation of the action taken in each state. In unsupervised learning, all the

    observations are assumed to be caused by latent variables, that is, the observations

    are assumed to be at the end of the causal chain. In practice, models for supervised

    learning often leave the probability for inputs undefined. This model is not needed

    as long as the inputs are available, but if some of the input values are missing, it is

    not possible to infer anything about the outputs. If the inputs are also modeled, then

    missing inputs cause no problem since they can be considered latent variables as in

    unsupervised learning.

4.1.3 SEMI-SUPERVISED LEARNING: Semi-supervised learning is a goal-directed activity that can be precisely evaluated; related settings include learning from labeled and unlabeled documents, reinforcement learning, and case-based reasoning. Semi-supervised learning refers to the use of both labeled and unlabeled data for training. It contrasts with supervised learning (all data labeled) and unsupervised learning (all data unlabeled). Other names are learning from labeled and unlabeled data, or learning from partially labeled/classified data. Notice that semi-supervised learning can be either transductive or inductive.

    Here we focus on semi-supervised classification. It is a special form of classification.

    Traditional classifiers use only labeled data (feature / label pairs) to train. Labeled

    instances however are often difficult, expensive, or time consuming to obtain, as they

require the efforts of experienced human annotators. Meanwhile, unlabeled data may be relatively easy to collect, but there have been few ways to use them. Semi-supervised learning addresses this problem by using a large amount of unlabeled data, together with

    the labeled data, to build better classifiers. Because semi-supervised learning requires

    less human effort and gives higher accuracy, it is of great interest both in theory and in

    practice.


Labels are hard to obtain while unlabeled data are abundant; therefore semi-supervised learning is a good way to reduce human labor and improve accuracy. Do not take it for granted, though: even if you (or your domain expert) do not spend as much time in labeling the training data, you need to spend a reasonable amount of effort to design good models / features / kernels / similarity functions for semi-supervised learning.

    One example of a semi-supervised learning technique is co-training, in which

    two or possibly more learners are each trained on a set of examples, but with each

    learner using a different, and ideally independent, set of features for each example.

An alternative approach is to model the joint probability distribution of the features and the labels. For the unlabeled data, the labels can then be treated as 'missing data'. It is common to use the EM algorithm to maximize the likelihood of the model.
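Co-training itself requires two independent feature views. A simpler relative of the same idea, self-training, can be sketched in a few lines: train on the labeled data, pseudo-label the unlabeled point the classifier is most confident about, and repeat. The nearest-centroid classifier and the one-dimensional data below are illustrative, not from the text:

```python
def centroid(points):
    """Mean of a list of 1-D feature values."""
    return sum(points) / len(points)

def self_train(labeled, unlabeled, rounds=5):
    """Self-training with a nearest-centroid classifier on 1-D data.
    labeled: list of (x, label) pairs with labels 0 or 1."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        c0 = centroid([x for x, y in labeled if y == 0])
        c1 = centroid([x for x, y in labeled if y == 1])
        # Pseudo-label the point the classifier is most confident about:
        # the one with the largest margin between its centroid distances.
        best = max(pool, key=lambda x: abs(abs(x - c0) - abs(x - c1)))
        label = 0 if abs(best - c0) < abs(best - c1) else 1
        labeled.append((best, label))
        pool.remove(best)
    return labeled

labeled = [(0.0, 0), (10.0, 1)]      # two expensive labeled examples
unlabeled = [1.0, 2.0, 8.0, 9.0]     # cheap-to-collect unlabeled pool
result = dict(self_train(labeled, unlabeled))
print(result)  # 1.0 and 2.0 get pseudo-label 0; 8.0 and 9.0 get 1
```

Two labeled points end up steering labels for the whole pool, which is the semi-supervised bargain: less human effort, with accuracy resting on how well the model's confidence is calibrated.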

    CHAPTER 5

ACTIVE LEARNING

    5.1 ACTIVE LEARNING


The aim of active learning is to design and analyze learning algorithms that can effectively filter or choose the samples from which they learn the hypothesis. Active learning is a method of online learning, where a learner strategically selects new training examples that provide maximal information about the unlabeled dataset, resulting in higher classification accuracy for a given training set size as compared to using randomly selected examples. Active learning is most useful when there is a sufficient number of unlabeled samples but it is expensive to obtain class labels. Many active learning algorithms, however, assume that the model built upon labeled data is not biased and that the probability distributions of the unlabeled and existing datasets are identical.

    Active learning (sometimes called query learning or optimal experimental

    design in the statistics literature) is a subfield of machine learning and, more generally,

    artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to

choose the data from which it learns (to be curious, if you will) it will perform better

    with less training. Why is this a desirable property for learning algorithms to have?

    Consider that, for any supervised learning system to perform well, it must often be

    trained on hundreds (even thousands) of labeled instances. Sometimes these labels come

at little or no cost, such as the "spam" flag you mark on unwanted email messages,

    or the five-star rating you might give to films on a social networking website. Learning

    systems use these flags and ratings to better filter your junk email and suggest movies

    you might enjoy. In these cases you provide such labels for free, but for many other

    more sophisticated supervised learning tasks, labeled instances are very difficult, time-

    consuming, or expensive to obtain. Here are a few examples:

Speech recognition. Accurate labeling of speech utterances is extremely time consuming and requires trained linguists. Annotation at the word level is reported to take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours). The problem is compounded for rare languages or dialects.


    Information extraction. Good information extraction systems must be trained

    using labeled documents with detailed annotations. Users highlight entities or

relations of interest in text, such as person and organization names, or whether a person works for a particular organization. Locating entities and relations can

    take a half-hour or more for even simple newswire stories. Annotations for other

    knowledge domains may require additional expertise, e.g., annotating gene and

    disease mentions for biomedical information extraction usually requires PhD-

    level biologists.

    Classification and filtering. Learning to classify documents (e.g., articles or web

    pages) or any other kind of media (e.g., image, audio, and video files) requires

that users label each document or media file with particular labels, like "relevant" or "not relevant". Having to annotate thousands of these instances can

    be tedious and even redundant.

    Active learning systems attempt to overcome the labeling bottleneck by asking queries

    in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator).

    In this way, the active learner aims to achieve high accuracy using as few labeled

    instances as possible, thereby minimizing the cost of obtaining labeled data. Active

    learning is well-motivated in many modern machine learning problems where data may

    be abundant but labels are scarce or expensive to obtain.
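One common query strategy for choosing which instance to send to the oracle is uncertainty sampling: ask for the label of the instance the current model is least sure about. A minimal sketch; the logistic model and the pool of points are illustrative, not from the text:

```python
import math

def least_confident(pool, predict_proba):
    """Pick the unlabeled instance whose positive-class probability
    is closest to 0.5, i.e. where the model is least certain."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))

def predict_proba(x):
    """Illustrative logistic model with a fixed decision boundary at x = 3."""
    return 1.0 / (1.0 + math.exp(-(x - 3.0)))

pool = [0.0, 2.9, 6.0, 10.0]
print(least_confident(pool, predict_proba))  # 2.9, nearest to the boundary
```

Points far from the decision boundary would be labeled the same way regardless, so labeling them teaches the model little; the point nearest the boundary is where one label buys the most information.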

    CHAPTER 6

LIMITATIONS OF EXISTING SYSTEM


    As the web continues to grow exponentially, the idea of crawling the entire

web on a regular basis becomes less and less feasible. General-purpose search engines have certain limitations, such as their querying mechanism, exact keyword matching, a low web coverage rate, and long result lists with low relevancy to the user query. Due to these basic problems, and also because of the need to include information on specific domains, domain-specific search engines were proposed. A domain-specific search engine is defined as an information access system that allows access to all the information on the web that is relevant to a particular domain.

Search engines can be classified as general-purpose search engines and domain-specific search engines. The domain in domain-specific search engines can be specialized in two existing kinds of search engines: search engines that focus on a specific document type, such as resumes, homepages, or movies; and search engines that focus on a specific topic, such as computer science, climbing, or sport. Examples of domain-specific search engines are the IBM Focused Crawler and the Cora domain-specific search engine.

6.1 IBM FOCUSED CRAWLER

Soumen Chakrabarti et al. [7, 14] proposed focused crawling, a new approach to topic-specific resource discovery, at IBM's Almaden center. The focus topic in this system is represented by a set of example pages that is provided by a user to the system. In the system described, there is a user-browsable topic taxonomy where the user can mark some of the documents as good and select them as the focus topic. The system has three main components. The first is a classifier that makes judgments on the relevancy of crawled documents and decides on following the links within pages; the classifier is an extended version of the Naive Bayes classifier. The second component is a distiller that evaluates the centrality of each page to determine the crawling priority of the links within it. The distiller uses the


bibliometric concepts of hub and authority pages as an approximate social judgment of web page quality. For each web page it calculates the hub and authority scores with an extended version of the HITS algorithm. It tries to crawl the links within pages with the highest hub score first in order to find new authorities. The third component of the system is a dynamic crawler that crawls the web according to a re-orderable priority queue. The system works in two phases: training and testing. In the training phase, the classifier is trained with some labeled data relevant to the focus topic. The training data set is acquired from existing taxonomy-like search engines (portals) such as Yahoo! and the Open Directory Project.
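The relevancy classifier can be illustrated with a plain multinomial Naive Bayes sketch. The training documents below are illustrative, and the IBM system used an extended version of the classifier; this only shows the underlying idea of scoring a crawled page against relevant and off-topic training text:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Multinomial Naive Bayes: docs is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(model, text):
    """Return the label with the highest Laplace-smoothed log probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [
    ("machine learning crawler topic relevance", "relevant"),
    ("focused crawler topic specific pages", "relevant"),
    ("recipe for chocolate cake", "off-topic"),
    ("holiday travel photos", "off-topic"),
]
model = train_nb(docs)
print(classify(model, "focused crawler relevance"))  # relevant
```

In the focused crawler, this judgment decides whether a downloaded page's links are worth following at all.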

Figure 2.3 IBM Focused Crawler

Problems associated with this search engine are:

Fixed model of classifier. The IBM focused crawler uses a fixed model of relevancy class as a classifier to evaluate topical relevancy of documents. A more adaptive classifier uses documents that are marked as relevant by the


classifier to update itself. However, ensuring flexibility in the classifier without simultaneously corrupting it is difficult.

Does not model future rewards. A major problem faced by this focused crawler is that it does not learn that some sets of off-topic documents often lead reliably to highly relevant documents. In other words, it does not model the future reward of links. For example, the home page of a computer science department is not relevant to Reinforcement Learning, but links from that home page eventually lead to the home page of a machine learning laboratory or of a researcher, where the crawler may find valuable target pages.

Lack of comparison of results. The results reported in papers on this approach are not compared with the results of human-maintained portals such as the Open Directory Project; admittedly, judgments on the quality of the gathered pages are hard to make.

Exemplary documents. Representing the focus topic in the form of a few high-quality documents related to the topic is sometimes hard for the user.

    CHAPTER 7

    PROPOSED WORK


The problems discussed in Chapter 6 can be addressed with machine learning approaches combined with active learning. As discussed earlier, machine learning can be of three types; by employing the semi-supervised machine learning approach in the IBM focused crawler, system performance can be comparatively improved. Semi-supervised learning is a goal-directed activity that can be precisely evaluated. In the web context, our training data is a small set of labeled documents; the label is the document class, and our goal is to guess the label of an unseen document. In this category we review learning from labeled and unlabeled documents. In some semi-supervised approaches, a learner agent learns from interaction with a dynamic environment. In such environments, providing a set of training data for the agent is very difficult or even impossible because of the dynamics inherent in the environment and the correspondingly huge number of states and actions. One requirement of this model is a measure of the goodness of the action that the agent takes in a state.
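The labeled-documents setting described above, where the goal is to guess the label of an unseen document, can be illustrated with a minimal Naive Bayes text classifier; the training documents and labels below are invented for the sketch.

```python
import math
from collections import Counter

# Minimal Naive Bayes text classifier: trained on a few labeled documents,
# it guesses the label of an unseen document. The training texts are invented.
def train(docs):
    """docs: list of (text, label). Returns per-label word counts, priors, vocab."""
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    priors = Counter(label for _, label in docs)
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def classify(model, text):
    counts, priors, vocab = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        lp = math.log(priors[label] / total)
        denom = sum(c.values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((c.get(w, 0) + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("reinforcement learning agents", "relevant"),
               ("machine learning crawler", "relevant"),
               ("cooking pasta recipes", "irrelevant")])
label = classify(model, "learning agents")
```

With only three labeled documents the prediction for "learning agents" already comes out "relevant", which is exactly why enlarging the labeled set with confidently self-labeled unlabeled documents is attractive.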

7.1 Active Learner in Domain Specific Search Engine

To overcome the problems that occur in the IBM focused crawler, one technique is to replace the classifier with an active learner. As discussed in Chapter 5, an active learner is capable of predicting the hypothesis [11]; it generates the hypothesis from its accumulated experience. Semi-supervised learning is used because it is very expensive to generate labeled data for every set, and active learning approaches are combined with semi-supervised learning to improve the efficiency of the system. The following semi-supervised algorithms are used in text classification:

Co-Training

Semi-supervised EM

Co-EM

Co-EMT, which uses a multi-view active learning algorithm


The multi-view setting applies to learning problems that have a natural way to divide their features into subsets (views), each of which is sufficient to learn the target concept. Multi-view active learning maximizes the accuracy of the learned hypotheses while minimizing the amount of labeled training data.

The algorithm to improve the efficiency is discussed as follows:

Given:
- a learning problem with two views V1 and V2
- a learning algorithm
- sets T and U of labeled and unlabeled examples
- the number k of iterations to be performed
- the number N of queries to be made

Co-EMT for a domain-specific engine:

LOOP for k iterations:
- use V1(T) and V2(T) to create classifiers h1 and h2
- FOR EACH class Ci DO:
  - let E1 and E2 be the e unlabeled examples on which h1 and h2 make the most confident predictions for Ci (i.e., the data most relevant to the user query)
  - label E1 and E2 according to h1 and h2, respectively, and add them to T
- combine the predictions of h1 and h2; this combination yields the set of required documents.
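A runnable sketch of this loop is given below, using a toy two-view dataset and simple nearest-centroid classifiers standing in for the view-specific learners; all function names, the data, and the confidence measure are illustrative assumptions, not the original Co-EMT implementation.

```python
# Sketch of the co-training/Co-EMT loop: two views, each trained on the
# labeled set T; the most confidently classified unlabeled examples are
# moved into T each round. Nearest-centroid stands in for the real learners.
def train_view(T, view):
    """Return per-class centroids for one view (view is 0 or 1)."""
    sums, counts = {}, {}
    for x, label in ((ex[view], ex[2]) for ex in T):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence is negative distance."""
    label = min(centroids, key=lambda c: abs(x - centroids[c]))
    return label, -abs(x - centroids[label])

def co_train(T, U, k=3, e=1):
    T, U = list(T), list(U)
    for _ in range(k):
        h1, h2 = train_view(T, 0), train_view(T, 1)
        for ci in set(ex[2] for ex in T):
            for view, h in ((0, h1), (1, h2)):
                # the e most confident unlabeled examples for class ci
                cand = [u for u in U if predict(h, u[view])[0] == ci]
                cand.sort(key=lambda u: predict(h, u[view])[1], reverse=True)
                for u in cand[:e]:
                    U.remove(u)
                    T.append((u[0], u[1], ci))  # labeled by the hypothesis
    h1, h2 = train_view(T, 0), train_view(T, 1)

    # Combine the predictions of h1 and h2 by summing their confidences.
    def combined(x1, x2):
        scores = {}
        for h, x in ((h1, x1), (h2, x2)):
            label, conf = predict(h, x)
            scores[label] = scores.get(label, 0.0) + conf
        return max(scores, key=scores.get)
    return combined

# Toy data: two 1-D views; class "rel" clusters near 1.0, "irr" near 0.0.
T = [(0.9, 1.1, "rel"), (0.1, 0.0, "irr")]
U = [(1.0, 0.95), (0.05, 0.1), (0.8, 0.9), (0.2, 0.15)]
h = co_train(T, U)
```

The combined hypothesis plays the role of the final "combination of the required documents": each view votes with its confidence, so an unlabeled page that both views place near the relevant cluster is returned as relevant.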


Fig. Improvements in the IBM Focused Crawler

The system has three main components: (1) An active learner that makes judgments on the relevancy of crawled documents; it first maintains a history and forms hypotheses based on it, and second allows meta-reasoning to be done. (2) A distiller that evaluates the centrality of each page to determine the crawling priority of the links within it. The distiller uses the bibliometric concepts of hub and authority pages as an approximate social judgment of web page quality; for each crawled page it computes hub and authority scores with an extended version of the HITS algorithm. (3) A dynamic crawler that crawls the web according to a re-orderable priority queue.
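The re-orderable priority queue behind such a dynamic crawler can be sketched as follows; the URLs and scores are hypothetical, and re-ordering is handled lazily by re-pushing entries with updated priorities, a common heap idiom.

```python
import heapq

# Sketch of a re-orderable crawl frontier: URLs are popped in priority
# order, and a URL's priority can be raised later (e.g. after the
# distiller recomputes hub scores) by re-pushing it with a new score.
class CrawlFrontier:
    def __init__(self):
        self._heap = []   # entries of (negative priority, url)
        self._best = {}   # current best priority per url
        self._done = set()

    def add(self, url, priority):
        """Insert or re-prioritize a URL (stale entries are lazily skipped)."""
        if url in self._done:
            return
        if priority > self._best.get(url, float("-inf")):
            self._best[url] = priority
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        """Return the highest-priority URL not yet crawled, or None."""
        while self._heap:
            neg, url = heapq.heappop(self._heap)
            if url not in self._done and -neg == self._best[url]:
                self._done.add(url)
                return url
        return None

# Hypothetical usage: the distiller later boosts a page's priority.
f = CrawlFrontier()
f.add("http://a.example", 0.2)
f.add("http://b.example", 0.5)
f.add("http://a.example", 0.9)  # re-ordered after a new hub score
```

After the re-push, "http://a.example" is dequeued ahead of "http://b.example", which is what "re-orderable" buys over a plain FIFO frontier.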

    CHAPTER 8

    CONCLUSION


Search engine technology has gone through several evolutions and has finally reached the point where Artificial Intelligence can offer tremendous help. We have reviewed this evolution from the beginning up to now and surveyed several different techniques that have been developed to improve search engine functionality. In particular, we highlighted some machine learning approaches to information retrieval on the web and concentrated on topic-specific search engines. Finally, we proposed an information integration environment based on active learning. Our approach uses current technology in a better manner to provide appropriate results.

    REFERENCES


1. Information Retrieval Wiki: http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html

2. Web Crawling: http://www.chato.cl/papers/crawling_thesis/effective_web_crawling.pdf

3. Information Retrieval in Domain Specific Search Engine with Machine Learning Approaches: http://www.waset.org/journals/waset/v42/v42-26.pdf

4. S. Lawrence, Context in Web Search, IEEE Data Engineering Bulletin.

5. A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, Proc. of the Conference on Computational Learning Theory, pp. 92-100, 1998.

6. B. Settles, Active Learning Literature Survey: http://pages.cs.wisc.edu/~bsettles/pub/settles.activelearning.pdf

7. S. Chakrabarti, Data mining for hypertext: A tutorial survey, SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM 1(2):1-11, 2000. http://www.cs.uic.edu/~liub/WebMiningBook.html

8. S. Lawrence and C. L. Giles, Searching the World Wide Web, Science 280:98-100, 1998.
