Mine Report1
8/8/2019
CHAPTER 1
INTRODUCTION
A domain-specific search engine is an information access system that provides access to
all the information on the web that is relevant to a particular domain. As the web
continues to grow exponentially, crawling the entire web on a regular basis becomes
less and less feasible, so domain-specific search engines were proposed to cover the
information of a specific domain. As more information becomes available on the World
Wide Web, it becomes more difficult to provide effective search tools for information
access. Today, people access web information through two main kinds of search
interfaces: browsers (clicking and following hyperlinks) and query engines (queries in
the form of a set of keywords indicating the topic of interest). Web search tools need
better support for expressing one's information need and for returning high-quality
search results. There appears to be a need for systems that reason under uncertainty
and are flexible enough to recover from the contradictions, inconsistencies, and
irregularities that such reasoning involves.
In a multi-view problem, the features of the domain can be partitioned into
disjoint subsets (views), each of which is sufficient to learn the target concept.
Semi-supervised multi-view algorithms, which reduce the amount of labeled data required
for learning, rely on the assumptions that the views are compatible and uncorrelated.
This report describes the use of a semi-supervised machine learning approach with
active learning for domain-specific search engines. The proposed work shows
that, with the help of this approach, relevant data can be extracted with a minimum of
queries fired by the user. It requires only a small number of labeled examples and a pool
of unlabeled data, to which the learning algorithm is applied to extract the required data.
CHAPTER 2
WWW
The World Wide Web is a vast source of information organized in the form of a large
distributed hypertext system. Every hypertext link on every web page in the world
contains a URL. When you click on a link on a web page, you send a request to
retrieve the document, somewhere in the world, that is uniquely identified by that
URL. A URL is like the address of a web page.
2.1 INFORMATION RETRIEVAL
As we move into the information age, there is a need to secure data transmission
and also to store and retrieve data and information. Information retrieval
systems are multiplying with the advance of technology.
Let us first look at the basic differences between data and information.
2.1.1 DATA:
Data is raw fact.
For retrieval, it needs to be specified fully. If the file name or the document
name is not known exactly, or differs in case, the system may fail and retrieve
no document.
Examples of data are a piece of paper, a book, an algorithm.
In these examples the location is not known, and hence no meaning can be attached
to the data.
2.1.2 INFORMATION:
Information is processed data.
For retrieval, partial information is enough for its evaluation; hence the
system rarely fails.
Examples of information are a piece of paper on a table, a book on a shelf, a
bubble-sort algorithm.
In these examples the locations are known, and hence the items have a specified
meaning.
Data retrieval systems can be found in operating system search. Windows search is
a good example of a data retrieval system: you have to specify the exact name
of the file you want. Information retrieval systems, by contrast, are like web search
engines, the best known of which is Google. It processes natural language and produces
as output the set of documents matching the query.
It is very important these days to retrieve data quickly. Earlier
systems used linear search mechanisms, in which the entire set of documents in the
database is read, then sorted according to the query and displayed. This had varying
complexity and took far more time than the advanced techniques
available today.
In an information retrieval system, the documents are scanned for the query. To
decrease computation time, the documents are scanned only for repeated keywords
that are considered relevant to the document. The displayed output sends feedback
as input for the next query, so with every query the performance of the system
improves. More information retrieval systems are needed to reduce the time spent
searching for a document, and this is achieved with the help of indexing.
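As a small illustration of the indexing idea above (the document names and texts here are invented for the example), an inverted index maps each keyword to the set of documents containing it, so a query is answered by set intersection rather than a linear scan:

```python
# Minimal inverted-index sketch: map each word to the set of
# documents that contain it, so a query avoids a linear scan.
from collections import defaultdict

docs = {
    "d1": "web search engines index web pages",
    "d2": "machine learning improves search",
    "d3": "crawlers download web pages",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return documents containing every query keyword."""
    words = query.lower().split()
    results = index[words[0]].copy() if words else set()
    for w in words[1:]:
        results &= index[w]
    return sorted(results)

print(search("web pages"))  # → ['d1', 'd3']
```

Real engines extend this with term weighting and ranking, but the intersection of posting sets is the core reason indexed retrieval beats a linear scan.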
2.2 AN INFORMATION RETRIEVAL SYSTEM
Let me illustrate by means of a black box what a typical IR system would look like. The
diagram shows three components: input, processor and output. Such a trichotomy may
seem a little trite, but the components constitute a convenient set of pegs upon which to
hang a discussion.
Starting with the input side: the main problem here is to obtain a
representation of each document and query suitable for a computer to use. Let me
emphasize that most computer-based retrieval systems store only a representation of the
document (or query), which means that the text of a document is lost once it has been
processed for the purpose of generating its representation. A document representative
could, for example, be a list of extracted words considered to be significant. Rather than
have the computer process natural language, an alternative approach is to define an
artificial language within which all queries and documents can be formulated.
Figure 2.3 A typical IR system
When the retrieval system is on-line, it is possible for the user to change his request
during one search session in the light of a sample retrieval, thereby, it is hoped,
improving the subsequent retrieval run. Such a procedure is commonly referred to as
feedback.
Secondly, the processor, that part of the retrieval system concerned with the
retrieval process. The process may involve structuring the information in some
appropriate way, such as classifying it. It will also involve performing the actual
retrieval function, that is, executing the search strategy in response to a query. In the
diagram, the documents have been placed in a separate box to emphasize the fact that
they are not just input but can be used during the retrieval process in such a way that
their structure is more correctly seen as part of the retrieval process. Finally, we come to
the output, which is usually a set of citations or document numbers. In an operational
system the story ends here. However, in an experimental system it leaves the evaluation
to be done.
CHAPTER 3
SEARCH ENGINE
3.1 INFORMATION RETRIEVAL ARCHITECTURE
A search engine is a program that can search the Web on a specific topic. By
typing in a word or phrase (known as a keyword), the search engine produces
pages of links on that topic. The more relevant links are usually at the top of the list,
but that is not always true. An information retrieval system works as shown in Fig. 2.

Fig. 2 Information Retrieval Architecture

We can classify search engines as general-purpose search engines and domain-
specific search engines. Search engines come in three major flavors: web
crawlers, web portals, and meta-search engines. A search engine or a web crawler has
several parts: first, a crawler that traverses the web graph and downloads web
documents; second, an indexer that processes and indexes the downloaded
documents; third, a query manager that handles the user query and returns relevant
documents indexed in the database to the user. The Web is similar to a graph, in
that links are like edges and web pages are like nodes. The crawler starts from a seed
page and then uses the external links within it to reach other pages. Web
crawlers retrieve web pages and add them, or representations of them, to a local
repository. Web content analysis is done either by document-query similarity, which
is important for measuring the similarity of a user query to documents, or by document-
document similarity, which is important for finding similar documents in the
document pool of a search engine. Analysis of the link structure of the Web is done
either by ranking or by hub and authority pages. In the former, web pages differ from
each other in the number of in-links that they have, and the PageRank algorithm is used
for this purpose; in the latter, the importance of pages is extracted from the link
structure of the Web. In this approach two kinds of pages are identified from web page
links: first, pages that are very important and authoritative on a special topic; second,
pages that have a great number of links to authority pages.
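The document-query similarity mentioned above is commonly computed as the cosine between term-frequency vectors. The following is a minimal sketch; the query and documents are illustrative, and real engines would also apply weighting such as tf-idf:

```python
# Sketch of document-query similarity using simple term-frequency
# vectors and cosine similarity; the texts are illustrative.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "domain specific search"
docs = ["domain specific search engines focus on one topic",
        "general purpose crawlers download many pages"]
ranked = sorted(docs, key=lambda d: cosine_similarity(query, d),
                reverse=True)
print(ranked[0])  # the document sharing the query terms ranks first
```

The same function serves for document-document similarity by passing two document texts instead of a query and a document.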
3.2 DOMAIN SPECIFIC SEARCH ENGINE
A domain-specific search engine is an information access system that allows
access to all the information on the web that is relevant to a particular domain.
A domain-specific search engine, as distinct from a general web search engine, focuses
on a specific segment of online content. The vertical content area may be based on
topicality, media type, or genre of content. Common examples include legal, medical,
patent (intellectual property), travel, and automobile search engines. In contrast to
general web search engines, which attempt to index large portions of the World Wide
Web using a web crawler, vertical search engines typically use a focused crawler that
attempts to index only web pages that are relevant to a pre-defined topic or set of
topics.
Domain-specific search offers several potential benefits over general search engines:
Greater precision due to limited scope
Leverage of domain knowledge, including taxonomies and ontologies
Support for specific, unique user tasks
Domain-specific verticals focus on a specific topic. Domain-specific search solutions
focus on one area of knowledge, creating customized search experiences that, because of
the domain's limited corpus and clear relationships between concepts, provide extremely
relevant results for searchers.
3.3 WEB CRAWLER
WebCrawler, the first comprehensive full-text search engine for the World-Wide Web,
has played a fundamental role in making the Web easier to use for millions of people. Its
invention and subsequent evolution, spanning a three-year period, helped fuel the Web's
growth by creating a new way of navigating hypertext.
Before search engines like WebCrawler, users found Web documents by following
hypertext links from one document to another. When the Web was small and its
documents shared the same fundamental purpose, users could find documents with
relative ease. However, the Web quickly grew to millions of pages making navigation
difficult. WebCrawler assists users in their Web navigation by automating the task of
link traversal, creating a searchable index of the web, and fulfilling searchers' queries
from the index. To use WebCrawler, a user issues a query to a pre-computed index,
quickly retrieving a list of documents that match the query.
This dissertation describes WebCrawler's scientific contributions: a method for
choosing a subset of the Web to index; an approach to creating a search service that is
easy to use; a new way to rank search results that can generate highly effective results
for both naive and expert searchers; and an architecture for the service that has
effectively handled a three-order-of-magnitude increase in load.
This dissertation also describes how WebCrawler evolved to accommodate the
extraordinary growth of the Web. This growth affected WebCrawler not only by
increasing the size and scope of its index, but also by increasing the demand for its
service. Each of WebCrawler's components had to change to accommodate this growth:
the crawler had to download more documents, the full-text index had to become more
efficient at storing and finding those documents, and the service had to accommodate
heavier demand. Such changes were not only related to scale, however: the evolving
nature of the Web meant that functional changes were necessary, too, such as the ability
to handle naive queries from searchers.
At its heart, WebCrawler is a full-text retrieval engine: it takes queries expressed as
strings of words and looks them up in a full-text index. The field of full-text indexing
and information retrieval (IR) has a long history: many of the fundamental underlying
algorithms were first described over 25 years ago [Salton 1975, Salton 1989]. Much of
this early work is relevant to Web search engines; however, key differences not only
between the Web and typical collections but also between the Web's users and typical
searchers challenge many of the assumptions underlying conventional IR work.
The basic problem in IR is to find relevant documents from a collection of documents in
response to a searcher's query. Typically, the query is expressed as a string of words and
a set of constraints on metadata, and the results are a selection of documents from a large
collection. Most of the work in IR has focused on systems for storing and viewing
documents, methods for processing queries and determining the relevant results, and
user interfaces for querying, viewing, and refining results.
3.3.1 PARTS
A search engine or a web crawler has several parts: first, a crawler that traverses
the web graph and downloads web documents; second, an indexer that
processes and indexes the downloaded documents; third, a query manager that
handles the user query and returns relevant documents indexed in the database to the
user. The Web is similar to a graph, in that links are like edges and web pages are
like nodes. The crawler starts from a seed page and then uses the external links within
it to reach other pages. Web crawlers retrieve web pages and add
them, or representations of them, to a local repository.
Fig. WebCrawler's overall architecture. The crawler retrieves and processes documents
from the Web, creating an index that the server uses to answer queries.
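The crawl described above can be sketched as a breadth-first traversal from a seed page. In the sketch below, an in-memory link graph stands in for real HTTP fetches and HTML link extraction, which a production crawler would perform:

```python
# Breadth-first crawl sketch over an in-memory link graph
# (a stand-in for real HTTP fetches and HTML link extraction).
from collections import deque

link_graph = {          # hypothetical pages and their out-links
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def crawl(seed, limit=10):
    """Visit pages starting from the seed, avoiding repeats."""
    seen, order = {seed}, []
    queue = deque([seed])
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)          # 'download' and index the page
        for url in link_graph.get(page, []):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return order

print(crawl("seed"))  # → ['seed', 'a', 'b', 'c']
```

A focused crawler would replace the plain FIFO queue with a priority queue ordered by estimated topical relevance, as discussed later for the IBM focused crawler.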
3.4 META SEARCH ENGINE
From a searcher's perspective, meta-search is currently desirable because each search
engine, although overlapping in content with others, has some unique documents in its
index. Also, each search engine updates documents with a different frequency; thus,
some will have a more up-to-date picture of the Web. Meta-search engines also
synthesize the relevance ranking schemes of these search engines and, as a result, can
theoretically produce better results. The drawback to using meta-search engines is the
same as mentioned above for widely distributed retrieval models: searchers
encounter a slowdown in performance, though MetaCrawler made heavy use of caching
to eliminate the performance problems.
3.5 WEB PORTALS
A web portal, also known as a links page, presents information from diverse sources in a
unified way. Apart from the standard search engine feature, web portals offer other
services such as e-mail, news, stock prices, information, databases and entertainment.
Portals provide a way for enterprises to provide a consistent look and feel with access
control and procedures for multiple applications and databases, which otherwise would
have been different entities altogether. Examples of public web portals are AOL,
iGoogle, MSNBC, Netvibes, and Yahoo!.
3.6 ANALYSIS LINK STRUCTURE OF WEB
Hubs and Authorities (also known as the HITS algorithm) is a scheme for ranking web
pages based on relevance that existed on the web as a precursor to PageRank. The idea
behind Hubs and Authorities stemmed from a particular insight into the creation of web
pages when the Internet was originally forming: certain web pages, known as
hubs, served as large directories that were not actually authoritative in the information
they held, but were used as compilations of a broad catalog of information that led
users directly to other authoritative pages. In other words, a good hub represents a page
that points to many other pages, and a good authority represents a page that is
linked to by many different hubs.[1] The scheme therefore assigns two scores to each page:
a hub score and an authority score.
3.6.1 HUBS
Hubs are highly valued lists for a given query. For example, a directory page from a
major encyclopedia, or a paper that links to many different highly-linked pages, would
typically have a higher hub score than a page that links to relatively few other sources.
3.6.2 AUTHORITIES
Authorities are highly endorsed answers to the query. A page that is particularly popular
and linked to by many different directories will typically have a higher authority score
than a page that is unpopular.
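The two scores are computed by a simple mutual-reinforcement iteration. The sketch below runs the basic HITS update on a small hypothetical link graph; real implementations restrict the graph to a query-specific subgraph first:

```python
# Minimal HITS sketch: iteratively update hub and authority scores
# on a small hypothetical link graph, normalizing each round.
import math

links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
pages = list(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):
    # authority score: sum of hub scores of pages linking to it
    for p in pages:
        auth[p] = sum(hub[q] for q in pages if p in links[q])
    # hub score: sum of authority scores of pages it links to
    for p in pages:
        hub[p] = sum(auth[q] for q in links[p])
    # normalize so the scores do not grow without bound
    an = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    hn = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / an for p, v in auth.items()}
    hub = {p: v / hn for p, v in hub.items()}

print(max(auth, key=auth.get))  # a1: linked to by both hubs
print(max(hub, key=hub.get))    # h1: links to both authorities
```

The iteration converges to the principal eigenvectors of the link matrix products, which is why a fixed number of rounds with normalization suffices in practice.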
CHAPTER 4
MACHINE LEARNING
4.1 MACHINE LEARNING
Machine learning is the study of how to make computers learn; the goal
is to make computers improve their performance through experience. As shown
in Fig. 3, the computer gets as input a class of tasks and, with the help of an
algorithm, the performance of the system can be improved.
In information retrieval, machine learning refers to the ability of a machine to
improve its performance based on previous results. Machine learning is an area of
artificial intelligence concerned with the development of techniques that allow
computers to "learn". More specifically, machine learning is a method for creating
computer programs by the analysis of data sets.
Figure 3 Machine Learning
Machine learning is of three types:
Supervised learning
Unsupervised Learning
Semi-supervised learning
4.1.1 SUPERVISED LEARNING: Supervised learning, also called classification, is based
on learning from a training data set. From a theoretical point of view, supervised and
unsupervised learning differ only in the causal structure of the model. In supervised
learning, the model defines the effect one set of observations, called inputs, has on
another set of observations, called outputs. In other words, the inputs are assumed to
be at the beginning and the outputs at the end of the causal chain. The models can
include mediating variables between the inputs and outputs.
4.1.2 UNSUPERVISED LEARNING: In this approach data exists but there is no
knowledge of the correct answer of applying the model to the data. An
unsupervised learner agent is neither told the correct action in each state, nor
given an evaluation of the action taken in each state. In unsupervised learning, all the
observations are assumed to be caused by latent variables; that is, the observations
are assumed to be at the end of the causal chain. In practice, models for supervised
learning often leave the probability distribution of the inputs undefined. This model is
not needed as long as the inputs are available, but if some of the input values are
missing, it is not possible to infer anything about the outputs. If the inputs are also
modeled, then missing inputs cause no problem, since they can be considered latent
variables as in unsupervised learning.
4.1.3 SEMI-SUPERVISED LEARNING: Semi-supervised learning is a goal-directed
activity that can be precisely evaluated; related approaches include learning from
labeled and unlabeled documents, reinforcement learning, and case-based reasoning.
Semi-supervised learning refers to the use of both labeled and unlabeled data for
training. It contrasts with supervised learning (all data labeled) and unsupervised
learning (all data unlabeled). Other names are learning from labeled and unlabeled
data, or learning from partially labeled/classified data. Note that semi-supervised
learning can be either transductive or inductive.
Here we focus on semi-supervised classification, a special form of classification.
Traditional classifiers use only labeled data (feature/label pairs) to train. Labeled
instances, however, are often difficult, expensive, or time-consuming to obtain, as they
require the efforts of experienced human annotators. Meanwhile, unlabeled data may be
relatively easy to collect, but there have been few ways to use it. Semi-supervised
learning addresses this problem by using a large amount of unlabeled data, together with
the labeled data, to build better classifiers. Because semi-supervised learning requires
less human effort and gives higher accuracy, it is of great interest both in theory and in
practice.
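One simple way to combine labeled and unlabeled data as described above is self-training, in which a classifier's most confident predictions on the unlabeled pool are added to the training set. The sketch below uses a tiny nearest-mean classifier on invented 1-D points; it illustrates the principle only, not any particular system:

```python
# Self-training sketch: a tiny nearest-mean classifier labels the
# unlabeled pool, and its most confident prediction is moved into
# the labeled set each round. Data are illustrative 1-D points.

labeled = [(1.0, "A"), (2.0, "A"), (8.0, "B"), (9.0, "B")]
unlabeled = [2.5, 3.5, 7.0, 5.0]

def centroids(data):
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

for _ in range(3):                       # a few self-training rounds
    cent = centroids(labeled)
    # confidence: gap between the distances to the two class means
    def gap(x):
        return abs(abs(x - cent["A"]) - abs(x - cent["B"]))
    best = max(unlabeled, key=gap)       # most confident prediction
    label = min(cent, key=lambda y: abs(best - cent[y]))
    labeled.append((best, label))        # treat the guess as a label
    unlabeled.remove(best)

print(sorted(x for x, y in labeled if y == "A"))
```

Note that the ambiguous point midway between the classes is deliberately left unlabeled; adding low-confidence guesses is exactly how self-training can corrupt the classifier.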
Labels are hard to obtain while unlabeled data are abundant; therefore semi-
supervised learning is a good way to reduce human labor and improve accuracy. Do not
take this for granted, however: even though you (or your domain expert) do not spend as
much time labeling the training data, you need to spend a reasonable amount of effort
designing good models, features, kernels, and similarity functions for semi-supervised
learning.
One example of a semi-supervised learning technique is co-training, in which
two or possibly more learners are each trained on a set of examples, but with each
learner using a different, and ideally independent, set of features for each example.
An alternative approach is to model the joint probability distribution of the features and
the labels. For the unlabeled data, the labels can then be treated as 'missing data'. It is
common to use the EM algorithm to maximize the likelihood of the model.
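The co-training idea above can be sketched as follows. The two 'views' here are simply the two coordinates of invented 2-D points, each view's classifier is a 1-nearest-neighbor rule, and a self-labeled example is accepted only when both views agree:

```python
# Co-training sketch: two classifiers, each trained on a different
# feature view (one coordinate each), label examples for one
# another; a label is accepted only when the views agree.

labeled = [((1.0, 1.2), "A"), ((8.0, 8.5), "B")]
unlabeled = [(1.4, 1.1), (7.8, 8.2), (2.0, 1.8), (8.8, 9.0)]

def predict(view, x):
    """1-nearest-neighbor on a single view (coordinate)."""
    best = min(labeled, key=lambda item: abs(item[0][view] - x[view]))
    return best[1]

for _ in range(len(unlabeled)):
    point = unlabeled.pop(0)
    l1 = predict(0, point)       # classifier on view 1
    l2 = predict(1, point)       # classifier on view 2
    if l1 == l2:                 # both views agree: accept the label
        labeled.append((point, l1))

print(sorted(p for p, y in labeled if y == "A"))
```

The agreement check is what enforces the compatibility assumption mentioned earlier: when the views disagree, the example is simply skipped rather than mislabeled.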
CHAPTER 5
ACTIVE LEARNING
5.1 ACTIVE LEARNING
The goal of active learning is to design and analyze learning algorithms that can
effectively filter or choose the samples on which to query for labels. Active
learning is a method of online learning in which a learner strategically selects new
training examples that provide maximal information about the unlabeled dataset,
resulting in higher classification accuracy for a given training set size compared with
using randomly selected examples. Active learning is most useful when there is a
sufficient number of unlabeled samples but it is expensive to obtain class labels.
Many active learning algorithms, however, assume that the model built upon labeled data
is not biased and that the probability distributions of the unlabeled and existing datasets
are identical.
Active learning (sometimes called query learning, or optimal experimental
design in the statistics literature) is a subfield of machine learning and, more generally,
artificial intelligence. The key hypothesis is that if the learning algorithm is allowed to
choose the data from which it learns (to be curious, if you will), it will perform better
with less training. Why is this a desirable property for learning algorithms to have?
Consider that, for any supervised learning system to perform well, it must often be
trained on hundreds (even thousands) of labeled instances. Sometimes these labels come
at little or no cost, such as the spam flag you mark on unwanted email messages,
or the five-star rating you might give to films on a social networking website. Learning
systems use these flags and ratings to better filter your junk email and suggest movies
you might enjoy. In these cases you provide such labels for free, but for many other,
more sophisticated supervised learning tasks, labeled instances are very difficult, time-
consuming, or expensive to obtain. Here are a few examples:
Speech recognition. Accurate labeling of speech utterances is extremely time-
consuming and requires trained linguists. It has been reported that annotation at the
word level can take ten times longer than the actual audio (e.g., one minute of speech
takes ten minutes to label), and annotating phonemes can take 400 times as long
(e.g., nearly seven hours). The problem is compounded for rare languages or
dialects.
Information extraction. Good information extraction systems must be trained
using labeled documents with detailed annotations. Users highlight entities or
relations of interest in text, such as person and organization names, or whether a
person works for a particular organization. Locating entities and relations can
take a half-hour or more for even simple newswire stories. Annotations for other
knowledge domains may require additional expertise; e.g., annotating gene and
disease mentions for biomedical information extraction usually requires PhD-
level biologists.
Classification and filtering. Learning to classify documents (e.g., articles or web
pages) or any other kind of media (e.g., image, audio, and video files) requires
that users label each document or media file with a particular label, such as
"relevant" or "not relevant". Having to annotate thousands of these instances can
be tedious and even redundant.
Active learning systems attempt to overcome the labeling bottleneck by asking queries
in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator).
In this way, the active learner aims to achieve high accuracy using as few labeled
instances as possible, thereby minimizing the cost of obtaining labeled data. Active
learning is well-motivated in many modern machine learning problems where data may
be abundant but labels are scarce or expensive to obtain.
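The query-selection loop described above can be sketched with uncertainty sampling: the learner asks the oracle to label only the pool example about which its current model is least certain. The 1-D data, the nearest-mean model, and the oracle function below are all invented for illustration:

```python
# Active-learning sketch (uncertainty sampling): query the oracle
# only for the pool example the current model is least sure about.

def oracle(x):                 # stands in for a human annotator
    return "A" if x < 5 else "B"

labeled = [(1.0, "A"), (9.0, "B")]
pool = [2.0, 4.6, 5.2, 8.0]

def margin(x):
    """Gap between distances to the class means; small = uncertain."""
    mean = lambda c: (sum(v for v, y in labeled if y == c)
                      / sum(1 for _, y in labeled if y == c))
    return abs(abs(x - mean("A")) - abs(x - mean("B")))

for _ in range(2):             # label only two examples, not four
    x = min(pool, key=margin)  # most uncertain pool example
    labeled.append((x, oracle(x)))
    pool.remove(x)

print(sorted(x for x, _ in labeled))
```

Notice that the two queried points lie near the decision boundary: the examples far from it (2.0 and 8.0) are never sent to the oracle, which is precisely how active learning economizes on labels.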
CHAPTER 6
LIMITATIONS OF EXISTING SYSTEMS
As the web continues to grow exponentially, the idea of crawling the entire
web on a regular basis becomes less and less feasible. General-purpose search
engines have certain limitations, such as the querying mechanism, exact keyword
matching, a low web coverage rate, and long result lists with low relevancy to the
user query. Because of these basic problems, and also because of the need to cover
information on a specific domain, domain-specific search engines were proposed. A
domain-specific search engine is defined as an information access system that
allows access to all the information on the web that is relevant to a particular domain.
Search engines can be classified as general-purpose search engines and domain-
specific search engines. The domain in a domain-specific search engine can be
specialized in two existing kinds of search engines: search engines that focus
on a specific document type, such as resumes, homepages, or movies, and search
engines that focus on a specific topic, such as computer science, climbing, or sport.
Examples of domain-specific search engines are the IBM Focused Crawler and the
Cora domain-specific search engine.
6.1 IBM FOCUSED CRAWLER
Focused crawling is a new approach to topic-specific resource discovery proposed by
Soumen Chakrabarti et al. [7, 14] at IBM's Almaden Research Center. The focus
topic in this system is represented by a set of example pages provided to the
system by a user. In the system described, there is a user-browsable topic
taxonomy in which the user can mark some of the documents as good and select them
as the focus topic. The system has three main components. The first is a classifier that
makes judgments on the relevancy of crawled documents and decides whether to follow
the links within pages; the classifier is an extended version of the Naive Bayes
classifier. The second component is a distiller that evaluates the centrality of each
page to determine the crawling priority of the links within it. The distiller uses the
bibliometric concepts of hub and authority pages as an approximate social
judgment of web page quality. It calculates the hub and authority scores of each
web page with an extended version of the HITS algorithm, and it tries to crawl the
links within the pages with the highest hub scores first in order to find new
authorities. The third component of the system is a dynamic crawler that crawls the
web according to a re-orderable priority queue. The system works in two phases:
training and testing. In the training phase, the classifier is trained with some labeled
data relevant to the focus topic. The training data set is acquired from existing
taxonomy-like search engines (portals) such as Yahoo! and the Open Directory
Project.
Figure 2.3 IBM Focused Crawler
Problems associated with this search engine are:
Fixed classifier model. The IBM focused crawler uses a fixed model of the
relevancy class as a classifier to evaluate the topical relevancy of documents.
A more adaptive classifier uses documents that are marked as relevant by the
classifier to update the classifier. However, ensuring flexibility in the
classifier without simultaneously corrupting the classifier is difficult.
No model of future rewards. One major problem faced by this
focused crawler is that it does not learn that some sets of off-topic
documents often lead reliably to highly relevant documents. In other words, it
does not model the future reward of links. For example, the home page
of a computer science department is not relevant to reinforcement
learning, but links from that home page eventually lead to the home page
of a machine learning laboratory or the home page of a researcher,
where valuable target pages may be found.
Lack of comparison of results. The results reported in papers on this
approach are not compared with the results of human-maintained portals like
the Open Directory Project. However, judging the quality of the gathered
pages in this approach is hard.
Exemplary documents. Representing the focus topic in the form of
some high-quality documents related to the topic is sometimes hard for the user.
CHAPTER 7
PROPOSED WORK
The problems discussed in Chapter 6 can be addressed with the help of
machine learning approaches combined with active learning. As discussed
earlier, machine learning can be of three types; by employing the semi-
supervised machine learning approach in the IBM focused crawler, the system
performance can be comparatively improved. Semi-supervised learning is a goal-
directed activity that can be precisely evaluated. In the web context, our
training data is a small set of labeled documents. The label is the document
class, and our goal is to guess the label of an unseen document. In this
category we review learning from labeled and unlabeled documents. In some
semi-supervised approaches, a learner agent learns from interaction with a
dynamic environment. In such environments, providing a set of training data
for the agent is very difficult or even impossible because of the dynamics
inherent in the environment and the correspondingly huge number of states and
actions. One requirement of this model is a measure of the goodness of the
action that the agent takes in a state.
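The idea of guessing the labels of unseen documents from a small labeled set can be illustrated with a minimal self-training sketch, one common form of semi-supervised learning. The toy word-overlap classifier and the sample documents below are illustrative assumptions, not part of the proposed system.

```python
# Minimal self-training sketch: start with a few labeled documents, then
# iteratively label the unlabeled documents on which the current model is
# most confident. The centroid classifier and documents are toy examples.

def features(doc):
    return set(doc.lower().split())

def confidence(doc, centroids):
    # Score each class by word overlap with the class centroid.
    scores = {c: len(features(doc) & words) for c, words in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def self_train(labeled, unlabeled, rounds=3):
    labeled = dict(labeled)          # doc -> class label
    pool = list(unlabeled)
    for _ in range(rounds):
        # Rebuild class centroids from the current labeled set.
        centroids = {}
        for doc, cls in labeled.items():
            centroids.setdefault(cls, set()).update(features(doc))
        if not pool:
            break
        # Label the single most confidently predicted unlabeled document.
        scored = [(confidence(d, centroids), d) for d in pool]
        (cls, _), doc = max(scored, key=lambda x: x[0][1])
        labeled[doc] = cls
        pool.remove(doc)
    return labeled

labeled = {"reinforcement learning agent reward": "ml",
           "database query index storage": "db"}
unlabeled = ["agent learns reward policy", "query optimizer index scan"]
result = self_train(labeled, unlabeled)
```

Each round enlarges the labeled set with the model's most confident prediction, which is exactly why only a small initial labeled set is needed.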
7.1 Active Learner in Domain-Specific Search Engine
To overcome the problems that occur in the IBM focused crawler, one technique
is to replace the classifier with an active learner. As discussed in Chapter 5,
an active learner is capable of predicting the hypothesis [11]; it generates
the hypothesis from its experiences. Semi-supervised learning is used because
it is very expensive to generate labeled data for every set. Active learning
approaches are combined with semi-supervised learning to improve the
efficiency of the system. The following semi-supervised algorithms are used in
text classification:
Co-Training
Semi-supervised EM
Co-EM
Co-EMT, which uses a multi-view active learning algorithm
The multi-view setting applies to learning problems that have a natural way to
divide their features into subsets (views), each of which is sufficient to
learn the target concept. Multi-view active learning maximizes the accuracy of
the learned hypotheses while minimizing the amount of labeled training data.
The algorithm to improve the efficiency is as follows:
Given
A learning problem with two views V1 and V2
A learning algorithm
Sets T and U of labeled and unlabeled examples
The number k of iterations to be performed
The number N of queries to be made
Co-EMT for the domain-specific engine
LOOP for k iterations
- use V1(T) and V2(T) to create classifiers h1 and h2
- FOR EACH class Ci DO
- let E1 and E2 be the e unlabeled examples on which h1 and h2 make
the most confident predictions for Ci (i.e., the data relevant to the
user query)
- label E1 and E2 according to h1 and h2, respectively, and add
them to T
- combine the predictions of h1 and h2;
this combination is the set of required documents.
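The loop above can be sketched as runnable code. This is a co-training-style simplification under stated assumptions: the centroid classifiers, the toy two-view examples, and labeling one example per view per iteration (rather than e per class) are all illustrative choices, not the report's exact Co-EMT.

```python
# Two-view loop sketch: h1 and h2 are trained on disjoint feature views;
# each iteration, the most confidently predicted unlabeled examples are
# labeled and moved into the training set T. Toy data and classifiers.

def train(T, view):
    # Build per-class word centroids from one view of the labeled data.
    centroids = {}
    for (v1, v2), cls in T:
        centroids.setdefault(cls, set()).update((v1, v2)[view])
    return centroids

def predict(x, centroids, view):
    scores = {c: len(set(x[view]) & w) for c, w in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]            # (class, confidence)

def co_train(T, U, k=2):
    T, U = list(T), list(U)
    for _ in range(k):
        h1, h2 = train(T, 0), train(T, 1)
        if not U:
            break
        # Each view labels the example it is most confident about.
        for h, view in ((h1, 0), (h2, 1)):
            if not U:
                break
            best = max(U, key=lambda x: predict(x, h, view)[1])
            cls, _ = predict(best, h, view)
            T.append((best, cls))
            U.remove(best)
    return T

T = [((["reward", "agent"], ["policy"]), "ml"),
     ((["index", "query"], ["storage"]), "db")]
U = [(["agent", "learns"], ["policy", "value"]),
     (["query", "plan"], ["storage", "disk"])]
result = co_train(T, U)
```

The key property the sketch preserves is that each classifier's confident predictions become training data for both views, which is what lets the algorithm learn from a small labeled set plus a pool of unlabeled examples.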
Fig. Improvements in the IBM Focused Crawler
The system has three main components: (1) an active learner that makes
judgments on the relevancy of crawled documents; it maintains a history and
makes hypotheses based on it, and it allows meta-reasoning to be done. (2) A
distiller that evaluates the centrality of each page to determine the crawling
priority of the links within it. The distiller uses the bibliometric concepts
of hub and authority pages as an approximate social judgment of web-page
quality, computing hub and authority scores for each page with an extended
version of the HITS algorithm. (3) A dynamic crawler that crawls the web
according to a re-orderable priority queue.
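The hub/authority scoring the distiller performs can be illustrated with the basic HITS iteration on a toy link graph. The report's "extended version" of HITS is not specified, so this sketch shows only the core mutual-reinforcement update; the page names are made up.

```python
# Basic HITS sketch: a page's authority score is the sum of the hub scores
# of pages linking to it; a page's hub score is the sum of the authority
# scores of the pages it links to. Toy graph, not the extended algorithm.

def hits(links, iters=20):
    # links: dict mapping each page to the list of pages it links to
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update uses the previous hub scores.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # Hub update uses the fresh authority scores.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for d in (auth, hub):
            norm = sum(v * v for v in d.values()) ** 0.5 or 1.0
            for p in d:
                d[p] /= norm
    return hub, auth

links = {"dept": ["lab", "researcher"], "portal": ["lab"], "lab": []}
hub, auth = hits(links)
```

On this toy graph, "lab" receives links from two hubs and ends up with the highest authority score, which is exactly the signal the distiller uses to prioritize links in the crawl queue.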
CHAPTER 8
CONCLUSION
Search engine technology has gone through several evolutions and has finally
reached the point where Artificial Intelligence can offer tremendous help. We
have reviewed this evolution from the beginning up to now and surveyed several
different techniques that have been developed to improve search engine
functionality. In particular, we highlighted some machine learning approaches
to information retrieval on the web and concentrated on topic-specific search
engines. Finally, we proposed an information integration environment based on
active learning. Our approach uses current technology in a better manner to
provide appropriate results.
REFERENCES
1. Information Retrieval wiki, http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html
2. Web crawling, http://www.chato.cl/papers/crawling_thesis/effective_web_crawling.pdf
3. Information Retrieval in Domain Specific Search Engine with Machine
Learning Approaches, http://www.waset.org/journals/waset/v42/v42-26.pdf
4. S. Lawrence, Context in Web Search, IEEE Data Engineering Bulletin.
[11] A. Blum and T. Mitchell, Combining labeled and unlabeled data with
co-training, Proc. of the Conference on Computational Learning Theory,
pp. 92-100, 1998. See also:
http://pages.cs.wisc.edu/~bsettles/pub/settles.activelearning.pdf
5. S. Chakrabarti, Data mining for hypertext: A tutorial survey, SIGKDD
Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge
Discovery & Data Mining, ACM 1(2):1-11, 2000.
http://www.cs.uic.edu/~liub/WebMiningBook.html
6. S. Lawrence and C. L. Giles, Searching the World Wide Web, Science
280:98-100, 1998.