Identification of Relevant Sections in Web Pages Using a Machine Learning Approach
Jerrin Shaji George
NIT Calicut
November 8, 2012
Introduction
• There is a massive amount of data available on the internet.
• Extracting only the relevant content has become very important.
• A machine learning approach is suitable as it can adapt to the rapidly changing dynamics of the internet.
Machine Learning
• The science of getting computers to act without being explicitly programmed.
• A method of teaching computers to make and improve predictions or behaviors based on some data.
• Machine learning algorithms:
  • Supervised machine learning
  • Unsupervised machine learning
Supervised Learning
• The machine learning task of inferring a function from labeled training data.
Figure: Supervised Learning Model (courtesy scikit-learn)
Supervised Learning
• Example of a classification problem: discrete-valued output.
Figure: Copyright © Victor Lavrenko
Supervised Learning
• Example of a regression problem: continuous-valued output.
Figure: Copyright © Victor Lavrenko
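As a rough illustration of inferring a function from labeled data with a continuous output, the sketch below fits a straight line by ordinary least squares. The data points and the helper name `fit_line` are made up for this example and do not come from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to labeled pairs (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: inputs xs with continuous labels ys.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted label for an unseen input
```

The learned function is then applied to inputs that were not in the training set, which is the point of the supervised setting.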
Unsupervised Learning
• The data has no labels. The algorithm tries to find similarities between the objects in question.
Figure: Unsupervised Learning Model (courtesy scikit-learn)
Unsupervised Learning
• Example of a clustering problem.
Figure: Copyright © Victor Lavrenko
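One way to picture clustering is a tiny k-means run on unlabeled one-dimensional values. The sketch below assumes two clusters and a simple deterministic initialization (the smallest and largest value); the data and the function name `two_means_1d` are invented for illustration.

```python
def two_means_1d(values, iterations=10):
    """Group unlabeled numbers into two clusters with a basic k-means loop."""
    centers = [min(values), max(values)]  # assumed initialization
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, labels

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # no labels are given
centers, labels = two_means_1d(values)
```

No labels are supplied; the grouping emerges only from how close the values are to each other.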
Support Vector Machines (SVM)
• A supervised learning model.
• Used for classification and regression analysis.
• The basic SVM:
  • A non-probabilistic binary linear classifier.
  • Classifies each given input into one of two possible classes, which forms the output.
The SVM Algorithm
• Inputs are formulated as feature vectors.
• The feature vectors are mapped into a feature space by using a kernel function.
• A division is computed in the feature space to optimally separate the classes of training vectors.
The SVM Algorithm
φ: the mapping into the feature space that the kernel function defines implicitly
Formal Definition of SVM
• An SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space.
• It can be used for classification and regression.
• A good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (called the functional margin).
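For reference, the largest-margin idea is usually written as the following optimization problem for linearly separable data. The notation is the standard textbook one rather than anything defined in the slides: training vectors x_i, labels y_i in {-1, +1}, weight vector w, and offset b.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```

The distance from a point x_i to the hyperplane w · x + b = 0 is y_i (w · x_i + b) / ‖w‖, so under these constraints the nearest points lie at distance 1 / ‖w‖. Minimizing ‖w‖ therefore maximizes the distance to the nearest training point.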
Optimal Separating Hyperplane
Figure: Courtesy Steve Gunn
Functional Margin
• The vectors (points) that constrain the width of the margin are the support vectors.
Figure: Image from scikit-learn
Mapping to Higher Dimensions
• Sometimes data is not linearly separable.
• If the original finite-dimensional space is mapped into a much higher-dimensional space, the separation is made easier in that space.
• This is achieved by the SVM using the kernel trick.
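The kernel trick can be checked numerically on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel as an assumed example (the slides do not name a specific kernel) and compares it with the dot product of an explicit feature map `phi`; all names and values here are illustrative.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the input space."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))  # map to 3D first, then take the dot product
implicit = poly_kernel(x, z)    # same value without ever building phi(x)
```

Both routes give the same number up to floating-point rounding, which is why an SVM can work in the higher-dimensional space while only evaluating the kernel on the original inputs.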
Mapping to Higher Dimensions
• Mapping from 1D to 2D
• Mapping from 2D to 3D
Figure: Courtesy Steve Gunn
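A minimal sketch of the 1D to 2D case, assuming the mapping x -> (x, x^2) and a made-up data set in which one class surrounds the other. Neither the data nor the choice of mapping is taken from the slides.

```python
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]  # outer points form one class, inner points the other

def separable_by_threshold(points, labels):
    """True if a single threshold on the line splits the two classes."""
    pos = [p for p, lab in zip(points, labels) if lab == 1]
    neg = [p for p, lab in zip(points, labels) if lab == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

def lift(x):
    """Assumed mapping from 1D to 2D: keep x and add x squared."""
    return (x, x * x)

separable_in_1d = separable_by_threshold(points, labels)

# In 2D the horizontal line x2 = 2.5 separates the classes.
lifted = [lift(p) for p in points]
predictions = [1 if x2 > 2.5 else -1 for _, x2 in lifted]
```

On the original line no threshold works, while after lifting a straight line classifies every point correctly.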
Identification of Relevant Sections in a Web Page for Web Search
• Shallow techniques like keyword matching give unsatisfactory results.
• Search methodologies must focus more on contextual information than just keyword occurrences.
• The search term might not be a very differentiating term.
• It might not appear in the section at all.
• SQUINT: an SVM-based approach to identify sections of a Web page relevant to a Web search.
Overall Architecture
Feature Generation
• Word Rank Based Features
• Bigram Rank Based Features
• Coverage of Top Ranked Tokens
• Query Word Frequency
• Distance from the Query
Word Rank Based Features
• The rank of a word is defined to be its position in the list if the words were ordered by frequency of occurrence across all search results.
• The value of this feature is the frequency of the particular word in the given section.
• Bucketing can be used to reduce dimensionality.
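The word rank features could be computed roughly as follows. This is a sketch under several assumptions that the slides do not spell out: sections arrive as lists of lowercase tokens, rank 0 is the most frequent word, ties are broken alphabetically, and a bucket simply sums the frequencies of the ranks it covers. The function names are hypothetical.

```python
from collections import Counter

def rank_words(all_sections):
    """Rank words by total frequency across all sections (rank 0 = most frequent)."""
    totals = Counter(word for section in all_sections for word in section)
    # Descending count, then alphabetical order so that ties are deterministic.
    ordered = sorted(totals, key=lambda w: (-totals[w], w))
    return {word: rank for rank, word in enumerate(ordered)}

def word_rank_features(section, ranks, bucket_size=1):
    """One feature per bucket of ranks: how often its words occur in the section."""
    n_buckets = (len(ranks) + bucket_size - 1) // bucket_size
    features = [0] * n_buckets
    for word in section:
        if word in ranks:
            features[ranks[word] // bucket_size] += 1
    return features

# Hypothetical tokenized sections gathered from the search results.
sections = [
    ["svm", "margin", "svm", "kernel"],
    ["svm", "kernel", "cookie", "policy"],
]

ranks = rank_words(sections)
per_word = word_rank_features(sections[0], ranks)                 # one feature per rank
bucketed = word_rank_features(sections[0], ranks, bucket_size=2)  # fewer dimensions
```

With `bucket_size=2` the five per-word features collapse into three, which is the dimensionality reduction the last bullet refers to.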
Bigram Rank Based Features
• A bigram is defined to be two consecutive words occurring in a section.
• E.g., "machine learning" may be more important than "machine" and "learning" separately.
• The value of the feature is calculated in the same way as for Word Rank Based Features.
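Extracting bigrams from a tokenized section takes only a few lines. The example section below is invented; the counts would then be ranked and bucketed like single words.

```python
from collections import Counter

def bigrams(tokens):
    """All pairs of consecutive tokens, in order of appearance."""
    return list(zip(tokens, tokens[1:]))

section = ["machine", "learning", "beats", "manual", "machine", "learning"]

pairs = bigrams(section)
counts = Counter(pairs)  # frequency of each bigram within the section
```

Here the pair ("machine", "learning") is counted as a unit, independently of how often either word appears on its own.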
Coverage of Top Ranked Tokens
• Relevance may also be determined by the number of top-ranked words which occur in the section.
• The value of this feature is the coverage of top-ranked words per bucket.
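The slides do not define coverage precisely. The sketch below assumes it means the fraction of a bucket's words that appear at least once in the section; the word list and the function name are hypothetical.

```python
def coverage_per_bucket(section, ranked_words, bucket_size):
    """Fraction of each bucket's words that occur at least once in the section."""
    present = set(section)
    coverage = []
    for start in range(0, len(ranked_words), bucket_size):
        bucket = ranked_words[start:start + bucket_size]
        hits = sum(1 for word in bucket if word in present)
        coverage.append(hits / len(bucket))
    return coverage

# Hypothetical ranking, most frequent word first.
ranked_words = ["svm", "kernel", "margin", "cookie", "policy", "banner"]
section = ["svm", "kernel", "margin"]

coverage = coverage_per_bucket(section, ranked_words, bucket_size=2)
```

A section that covers the top buckets well and the low buckets poorly would, under this reading, look more relevant than one with the opposite pattern.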
Distance from the Query
• The intuition here is that the closer a section is to the query in the Web page, the more likely it is to be relevant.
• The value of this feature is the section-wise distance between the section in question and the nearest section which contains the query.
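A possible reading of the section-wise distance is the difference in section index to the nearest section containing the query word. The sketch assumes tokenized sections in page order and returns `None` when no section contains the query, a case the slides do not address.

```python
def distance_from_query(sections, query):
    """For each section, the index distance to the nearest section containing the query."""
    hits = [i for i, section in enumerate(sections) if query in section]
    if not hits:
        # Assumed convention: the distance is undefined if the query never occurs.
        return [None] * len(sections)
    return [min(abs(i - h) for h in hits) for i in range(len(sections))]

# Hypothetical page, split into tokenized sections in document order.
sections = [
    ["site", "menu"],
    ["svm", "tutorial"],
    ["kernel", "methods"],
    ["contact", "us"],
]

distances = distance_from_query(sections, "svm")
```

The section that contains the query gets distance 0, and the value grows by one for each section further away.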
Query Word Frequency
• The value of this feature is the frequency of the query word in the section.
• The value is normalized by the number of words in the section.
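In code, the normalized query word frequency is a single division. The sketch assumes a single-word query and a tokenized section; returning 0.0 for an empty section is a choice made here, not something stated in the slides.

```python
def query_word_frequency(section, query):
    """Occurrences of the query word divided by the section length."""
    if not section:
        return 0.0  # assumed convention for empty sections
    return section.count(query) / len(section)

section = ["svm", "uses", "a", "kernel", "svm"]
value = query_word_frequency(section, "svm")
```

Normalizing by length keeps long sections from scoring higher merely because they contain more words.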
Training Set Generation
• Query Google to get a set of pages.
• Clean each page: remove scripts, pictures, links, etc.
• Break each page into sections.
• Label each section of every page.
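The cleaning and sectioning steps might look like the sketch below, which uses only Python's standard `html.parser`. How the original system split pages is not described in the slides, so the set of tags treated as section boundaries is an assumption, as are the class and function names. Link markup is dropped but link text is kept, and labeling is left as a manual step.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collects visible text and starts a new section at each block-level tag."""

    SKIP = {"script", "style"}                    # contents are discarded
    BREAK = {"p", "div", "h1", "h2", "h3", "li"}  # assumed section boundaries

    def __init__(self):
        super().__init__()
        self.sections = []
        self._current = []
        self._skip_depth = 0

    def flush(self):
        """Close the current section, normalizing whitespace."""
        text = " ".join("".join(self._current).split())
        if text:
            self.sections.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BREAK:
            self.flush()
        # Other tags (img, a, ...) contribute no text of their own.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BREAK:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)

def page_to_sections(html):
    """Return the cleaned text sections of one page."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.sections

page = """
<html><body>
<script>var tracker = 1;</script>
<h1>Support vector machines</h1>
<p>An SVM finds a <a href="/margin">maximum-margin</a> hyperplane.</p>
<img src="plot.png">
<p>Cookie policy</p>
</body></html>
"""
sections = page_to_sections(page)
```

Each resulting string would then receive a relevance label by hand before feature generation.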
Learning Algorithm
• A Support Vector Machine with a linear kernel is used.
• Given the relatively high dimensionality of the feature vector, it is a reasonable choice to use an SVM.
• The predicted margins of each sample are used to get a non-binary metric of how relevant each section is.
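To show how a margin becomes a non-binary relevance score, the sketch below evaluates the linear decision function w · x + b for a few sections and ranks them. The weights, bias, and feature vectors are hypothetical stand-ins for a trained model and are not taken from the SQUINT work; training itself is not shown.

```python
def margin(weights, bias, features):
    """Signed value of the linear decision function w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical parameters of an already trained linear SVM.
weights = [0.8, 0.5, -0.3]
bias = -0.2

# Hypothetical feature vectors, one per section of a page.
sections = {
    "intro": [1.0, 0.4, 0.0],
    "body": [2.0, 1.0, 0.0],
    "footer": [0.0, 0.0, 2.0],
}

scores = {name: margin(weights, bias, x) for name, x in sections.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # most relevant first
```

The sign of the score gives the binary class, while its magnitude lets sections be ordered by relevance. In scikit-learn, for instance, the same quantity is available from a fitted `SVC` or `LinearSVC` through `decision_function`.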
Conclusion
• Support Vector Machines are an attractive approach to data modelling.
• Evaluations suggest that using information-retrieval-inspired features and some basic hints from summarization gives respectable accuracy with respect to detecting the most relevant section in a page.
• Thus SQUINT can have a large impact on the user’s overall search experience.
References
• Cristianini, Nello; and Shawe-Taylor, John; An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, 2000.
• Siddharth Jonathan J.B., Riku Inoue and Jyotika Prasad. SQUINT - SVM for Identification of Relevant Sections in Web Pages for Web Search.
• Wikipedia article on Machine Learning, http://en.wikipedia.org/wiki/Support_vector_machine
• Machine Learning Course on Coursera, https://class.coursera.org/ml-2012-002/class/index