HMM-based Artificial Designer for Search Interface Segmentation
Ritu Khare, Yuan An, Il-Yeol Song
ACCESSING THE DEEP WEB
Deep Web: Data that exist on the Web but are not returned by search engines through traditional crawling and indexing.
Accessing Deep Web contents: The primary way to access this data (by manually filling up HTML forms on search interfaces) is not scalable. Hence, more sophisticated solutions, such as designing meta-search engines or creating dynamic page repositories, are required. A pre-requisite to these solutions is an understanding of the search interfaces. Interface Segmentation is an important portion of the problem of search interface understanding.

HMM: ARTIFICIAL DESIGNER
An HMM (Hidden Markov Model) can act like a human designer who has the ability to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components.
Fig 2. Simulating a Human Designer using HMMs (a bag of components, knowledge of semantic labels, and the designing step)
The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the interface while keeping the semantic role (attribute-name, operand, or operator) of the component in mind. See Figure 2.
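This generative story can be sketched as a toy program: hidden states are the semantic roles, and each step emits one component from the bag. All state names, component names, and probabilities below are invented for illustration; they are not the paper's learnt parameters.

```python
import random

# Toy HMM acting as an "artificial designer" (illustrative parameters only).
# Hidden states = semantic roles; emissions = interface components.
STATES = ["attribute-name", "operator", "operand"]

# P(next role | current role): a designer tends to follow an
# attribute-name with an operator, then an operand.
TRANS = {
    "attribute-name": {"attribute-name": 0.1, "operator": 0.6, "operand": 0.3},
    "operator":       {"attribute-name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand":        {"attribute-name": 0.7, "operator": 0.1, "operand": 0.2},
}

# P(component | role): the "bag of components".
EMIT = {
    "attribute-name": {"label": 1.0},
    "operator":       {"dropdown": 0.7, "radio": 0.3},
    "operand":        {"textbox": 0.8, "checkbox": 0.2},
}

def weighted_choice(dist, rng):
    """Draw a key from a {key: probability} dict."""
    r = rng.random()
    acc = 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key  # guard against floating-point round-off

def design_interface(n, rng=None):
    """Place n components on the interface, role by role."""
    rng = rng or random.Random(0)
    role = "attribute-name"  # a logical attribute starts with its name
    placed = []
    for _ in range(n):
        placed.append((role, weighted_choice(EMIT[role], rng)))
        role = weighted_choice(TRANS[role], rng)
    return placed

for role, comp in design_interface(6):
    print(f"{role:15s} -> {comp}")
```

Decoding then runs this story in reverse: given the placed components, recover the hidden roles and boundaries.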
INTERFACE SEGMENTATION
While a user is naturally trained to perform segmentation, a machine is unable to "see" a segment due to the following reasons:
1. The components that are visually close to each other might be located very far apart in the HTML source code.
2. A machine does not implicitly have any search experience that can be leveraged to identify a segment's boundary.
Research Question: How can we make a machine learn how to segment an interface?
Fig 4. Learnt topology of semantic labels (learnt transition probabilities among the states Text-Trivial, Attribute-name, Operand, and Operator)
2-LAYERED HMM APPROACH
The problem of decoding is two-fold: 1) segmentation, and 2) assignment of semantic labels to components. Hence, a 2-layered HMM is employed, as shown in Figure 3. The first layer, T-HMM, tags each component with the appropriate semantic label (attribute-name, operator, or operand). The second layer, S-HMM, segments the interface into logical attributes.
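A minimal sketch of this two-layer pipeline, with toy stand-ins for the trained models: a lookup table plays the role of T-HMM tagging, and a simple boundary rule plays the role of S-HMM segmentation (the real layers are decoded HMMs with learnt parameters).

```python
# Schematic two-layer pipeline. The functions below are hypothetical
# stand-ins for the trained T-HMM and S-HMM decoders.

def t_hmm_tag(components):
    """Layer 1 (T-HMM): tag each component with a semantic label.
    A toy lookup stands in for Viterbi decoding here."""
    lookup = {"label": "attribute-name", "dropdown": "operator",
              "textbox": "operand", "checkbox": "operand"}
    return [lookup.get(c, "text-trivial") for c in components]

def s_hmm_segment(tags):
    """Layer 2 (S-HMM): group the tagged sequence into logical attributes.
    Toy rule standing in for the segmenting HMM: a new segment opens
    at each attribute-name."""
    segments, current = [], []
    for tag in tags:
        if tag == "attribute-name" and current:
            segments.append(current)
            current = []
        current.append(tag)
    if current:
        segments.append(current)
    return segments

components = ["label", "dropdown", "textbox", "label", "textbox"]
tags = t_hmm_tag(components)
print(s_hmm_segment(tags))
# -> [['attribute-name', 'operator', 'operand'], ['attribute-name', 'operand']]
```

The split mirrors the decoding problem itself: labels first, then boundaries over the label sequence.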
RESULTS

Semantic Label               Accuracy
Segment / Logical Attribute  86.05 %
Operator                     85.10 %
Operand                      98.60 %
Attribute-name               90.11 %
Fig 1 example fields: Marker Range (e.g., between "D19Mit32" and "Tbx10"); cM Position (e.g., "10.0" to "40.0").
CONTRIBUTIONS
1. This approach outperforms LEX, a contemporary heuristic-based method, and achieves a 10% improvement in segmentation accuracy.
2. This is the first work to apply HMMs on deep Web search interfaces. HMMs helped in incorporating the first-hand knowledge of the designer to perform interface understanding.
Fig 1. Segmented Interface (segments marked by dotted lines)
EXPERIMENTATION

Data Set:  200 interfaces from the Biology domain
Parsing:   DOM-trees of components
Training:  Maximum Likelihood method
Testing:   Viterbi algorithm
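The training and testing steps can be sketched in miniature: Maximum Likelihood training reduces to normalized counting over manually tagged sequences, and the Viterbi algorithm recovers the most likely label sequence for a new interface. The tiny training set below is invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training data: sequences of (component, semantic label) pairs.
tagged = [
    [("label", "attribute-name"), ("dropdown", "operator"), ("textbox", "operand")],
    [("label", "attribute-name"), ("textbox", "operand")],
]

def train_ml(sequences):
    """Maximum Likelihood training: normalized counts of starts,
    transitions, and emissions over manually tagged sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for obs, state in seq:
            emit[state][obs] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(start),
            {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

def viterbi(obs, start, trans, emit, floor=1e-9):
    """Most likely state path (log-space; unseen events get a small floor)."""
    states = list(emit)
    lp = lambda d, k: math.log(d.get(k, floor))
    V = [{s: lp(start, s) + lp(emit[s], obs[0]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(V[-1], key=lambda p: V[-1][p] + lp(trans.get(p, {}), s))
            row[s] = V[-1][prev] + lp(trans.get(prev, {}), s) + lp(emit[s], o)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(V[-1], key=V[-1].get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

start, trans, emit = train_ml(tagged)
print(viterbi(["label", "dropdown", "textbox"], start, trans, emit))
# -> ['attribute-name', 'operator', 'operand']
```

The same decode routine serves both layers: over components for T-HMM, and over the resulting tag sequence for S-HMM.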
FUTURE WORK
1. To recover the schema of deep Web databases by extracting finer details such as the data types and constraints of logical attributes.
2. To test this approach on interfaces from other domains, given the diverse domain distribution of the deep Web.
Fig 3. 2-Layered HMM Architecture (HTML-coded interfaces → T-HMM → tagged sequences → S-HMM → segmented interfaces; both layers trained manually)
3. To investigate the use of the Baum-Welch training algorithm to reduce the amount of manual training required.