HMM-based Artificial Designer for Search Interface Segmentation

Ritu Khare, Yuan An, Il-Yeol Song

ACCESSING THE DEEP WEB
Deep Web: Data that exist on the Web but are not returned by search engines through traditional crawling and indexing.
Accessing Deep Web contents: The primary way to access this data (manually filling in HTML forms on search interfaces) is not scalable. Hence, more sophisticated solutions, such as designing meta-search engines or creating dynamic page repositories, are required. A prerequisite to these solutions is an understanding of the search interfaces. Interface segmentation is an important part of the problem of search interface understanding.

INTERFACE SEGMENTATION
Fig 1. Segmented Interface (segments marked by dotted lines). Segments shown: "Marker Range" (e.g., between “D19Mit32” and “Tbx10”) and "cM Position" (e.g., “10.0 -40.0”).
While a user is naturally trained to perform segmentation, a machine is unable to “see” a segment for the following reasons:
1. Components that are visually close to each other might be located very far apart in the HTML source code.
2. A machine does not implicitly have any search experience that can be leveraged to identify a segment's boundary.
Research Question: How can we make a machine learn how to segment an interface?

HMM: ARTIFICIAL DESIGNER
An HMM (Hidden Markov Model) can act like a human designer who has the ability to design an interface using acquired knowledge and to determine (decode) the segment boundaries and semantic labels of components. The designing process is similar to statistically choosing one component from a bag of components (a superset of all possible components) and placing it on the interface while keeping the semantic role (attribute-name, operand, or operator) of the component in mind. See Figure 2.
Fig 2. Simulating a Human Designer using HMMs. Figure labels: Bag of Components, Knowledge of Semantic Labels, DESIGNING, Search Interface, 2-Layered HMM, DECODING, Segments & Tagged Components.

2-LAYERED HMM APPROACH
The problem of decoding is two-fold: 1) segmentation, 2) assignment of semantic labels to components. Hence, a 2-layered HMM is employed, as shown in Figure 3. The first layer, T-HMM, tags each component with the appropriate semantic label (attribute-name, operator, or operand). The second layer, S-HMM, segments the interface into logical attributes.
Fig 3. 2-Layered HMM Architecture. Figure labels: HTML coded Interfaces, T-HMM, S-HMM, Segmented and Tagged Interfaces; training inputs labelled Manually, Interfaces, Tagged Sequences, Segmented Interfaces.

EXPERIMENTATION
Data Set: 200 interfaces from Biology Domain
Parsing: DOM-trees of components
Training: Maximum Likelihood Method
Testing: Viterbi Algorithm

RESULTS
Semantic Label                 Accuracy
Segment / Logical Attribute    86.05 %
Operator                       85.10 %
Operand                        98.60 %
Attribute-name                 90.11 %
Fig 4. Learnt Topology of semantic labels. States: Text-Trivial, Attribute-name, Operand, Operator. Transition probabilities appearing in the figure: 0.3, 0.44, 0.16, 0.21, 0.23, 0.21, 0.15, 0.59, 0.54, 0.08, 0.89, 0.09.

CONTRIBUTIONS
1. This approach outperforms LEX, a contemporary heuristic-based method, and achieves a 10% improvement in segmentation accuracy.
2. This is the first work to apply HMMs to deep Web search interfaces. HMMs helped in incorporating the first-hand knowledge of the designer to perform interface understanding.

FUTURE WORK
1. To recover the schema of deep Web databases by extracting finer details such as the data types and constraints of logical attributes.
2. To test this approach on interfaces from other domains, given the diverse domain distribution of the deep Web.
3. To investigate the use of the Baum-Welch training algorithm to increase the degree of automation.
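The training and testing steps named under Experimentation (maximum likelihood training, Viterbi decoding) can be illustrated for the tagging layer with a small sketch. This is not the authors' implementation: the component types ("text", "textbox", "selection-list"), the toy training sequences, the smoothing constants, and the function names `train_mle` and `viterbi` are all assumptions made for illustration. Only the three semantic labels and the choice of algorithms come from the poster.

```python
import math
from collections import Counter, defaultdict


def train_mle(tagged_sequences, smoothing=1e-6):
    """Maximum likelihood estimation by counting over manually tagged sequences.

    Each sequence is a list of (component_type, semantic_label) pairs.
    The small additive smoothing (an assumption) avoids log(0) at decode time.
    """
    states = sorted({label for seq in tagged_sequences for _, label in seq})
    symbols = sorted({obs for seq in tagged_sequences for obs, _ in seq})
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        prev = None
        for obs, label in seq:
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            emit[label][obs] += 1
            prev = label

    def normalize(counts, keys):
        total = sum(counts[k] for k in keys) + smoothing * len(keys)
        return {k: (counts[k] + smoothing) / total for k in keys}

    start_p = normalize(start, states)
    trans_p = {s: normalize(trans[s], states) for s in states}
    emit_p = {s: normalize(emit[s], symbols) for s in states}
    return states, start_p, trans_p, emit_p


def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-9):
    """Return the most likely label sequence, computed in log space."""
    if not observations:
        return []

    def log_emit(state, obs):
        # Component types never seen in training get a tiny floor probability.
        return math.log(emit_p[state].get(obs, unseen))

    best = [{s: math.log(start_p[s]) + log_emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = best[-1][prev] + math.log(trans_p[prev][s]) + log_emit(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical manually tagged training data (component type, semantic label).
training = [
    [("text", "attribute-name"), ("selection-list", "operator"), ("textbox", "operand")],
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("textbox", "operand"),
     ("text", "attribute-name"), ("selection-list", "operator"), ("textbox", "operand")],
]

states, start_p, trans_p, emit_p = train_mle(training)
print(viterbi(["text", "selection-list", "textbox"], states, start_p, trans_p, emit_p))
```

The sketch covers only a T-HMM-style tagger. A second layer along the lines of S-HMM could presumably reuse the same two functions, taking the predicted labels as its observations and segment-boundary markers as its hidden states, but that encoding is a guess and is not stated on the poster.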

