A Self-Healing Approach for A Domain-Specific Deep Web Search Tool
Fan Wang [email protected], Gagan Agrawal [email protected] BIBE 2010, Philadelphia, USA
Fan Wang and Gagan Agrawal, The Ohio State University
Presented by: Tantan Liu
The Deep Web
• The definition of “Deep web” from Wikipedia
The Deep Web refers to World Wide Web content that is not part of the surface web, which is indexed by standard search engines.
Deep Web in Biological Domain
• The deep web is estimated to be about 500 times larger than the surface web
• Nearly 800 deep web data sources in the bio-domain
• 95 percent of the deep web is publicly accessible
Search the Deep Web: Solved and Unsolved Issues
• Data Source Integration
  – Schema matching and schema mining
• Query Planning and Answering
  – Keyword search and structured query answering
• Fault Tolerance
  – Data access over wide-area networks
  – Unpredictable data source inaccessibility/unavailability
  – Network contention
  – Goal: an uncompromised user search experience despite these failures
Our Solution: A Redundancy based Self-Healing Approach
• Identify data redundancy across independent data sources
• Find the minimal “have to be replaced” sub-plan caused by data source unavailability/inaccessibility
• Find the sub-query corresponding to the “have to be replaced” sub-plan
• Generate a new replacing sub-plan based on redundancy using other data sources
Roadmap
• Introduction and Motivation
• Problem Formulation in Detail
• Our Self-Healing Approach
• Evaluation
• Conclusion
Data Redundancy Model
• A data source is represented by a three-tuple (IN, O, Con)
  – IN: input attributes
  – O: output attributes
  – Con: attribute conditions imposed on the data source
• Data redundancy condition between data sources A and B
  – They have the same input attributes
  – They have overlapping output attributes
  – They have non-conflicting attribute conditions
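The redundancy condition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `DataSource` class, the `is_redundant` function, the attribute names, and the choice to model Con as a mapping from attribute to required value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Hypothetical three-tuple (IN, O, Con) for one data source."""
    name: str
    inputs: frozenset                                # IN
    outputs: frozenset                               # O
    conditions: dict = field(default_factory=dict)   # Con, assumed attribute -> value


def is_redundant(a: DataSource, b: DataSource) -> bool:
    """Check the three redundancy conditions listed on the slide."""
    same_inputs = a.inputs == b.inputs
    overlapping_outputs = bool(a.outputs & b.outputs)
    # Assumed meaning of "non-conflicting": no attribute is constrained
    # to two different values by the two sources.
    non_conflicting = all(
        b.conditions[attr] == value
        for attr, value in a.conditions.items()
        if attr in b.conditions
    )
    return same_inputs and overlapping_outputs and non_conflicting


# Made-up sources and attributes, for illustration only.
src_a = DataSource("SrcA", frozenset({"gene"}), frozenset({"snp", "position"}),
                   {"organism": "human"})
src_b = DataSource("SrcB", frozenset({"gene"}), frozenset({"snp", "allele"}),
                   {"organism": "human"})
src_c = DataSource("SrcC", frozenset({"gene"}), frozenset({"snp"}),
                   {"organism": "mouse"})

print(is_redundant(src_a, src_b))
print(is_redundant(src_a, src_c))
```

Here `SrcA` and `SrcB` satisfy all three conditions, while `SrcC` is rejected because its condition on `organism` conflicts with that of `SrcA`.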
Query and Query Plan
• Query
  – SQL query format
      select t1, t2, …, tn                         (search term set ST)
      from the deep web
      where in1=e1 and in2=e2 and … and inm=em     (input term set INT)
• Query Plan
  – A DAG of data source nodes that “covers” the user query
[Figure: example query plan, a DAG over data sources D1, D2, D3, D4. Starting nodes: their input attributes are input terms in the query. Query plan nodes: their output attributes may be user-requested search terms. Edges: data source dependencies.]
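One way to read "a DAG that covers the query" is sketched below. The slide does not define coverage formally, so the rule used here is an assumption: every node's inputs must come from the query's input terms or from the outputs of its direct predecessors, and every search term must be produced by some node. The `covers` function and the attribute names are hypothetical; only the node names D1 to D4 come from the slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def covers(sources, preds, search_terms, input_terms):
    """Return True if the plan covers the query under the assumed rule.

    sources: node name -> (input attribute set, output attribute set)
    preds:   node name -> set of predecessor node names (the DAG edges)
    """
    produced = set()
    # Visit nodes so that every predecessor is handled before its dependents.
    for name in TopologicalSorter(preds).static_order():
        inputs, outputs = sources[name]
        available = set(input_terms)
        for p in preds.get(name, ()):
            available |= sources[p][1]
        if not inputs <= available:
            return False  # this node cannot be invoked
        produced |= outputs
    return set(search_terms) <= produced


# Made-up attributes and edges for the four nodes named on the slide.
sources = {
    "D1": ({"in1"}, {"a"}),
    "D2": ({"in2"}, {"b"}),
    "D3": ({"a"}, {"t1"}),
    "D4": ({"a", "b"}, {"t2"}),
}
preds = {"D1": set(), "D2": set(), "D3": {"D1"}, "D4": {"D1", "D2"}}

print(covers(sources, preds, {"t1", "t2"}, {"in1", "in2"}))
print(covers(sources, preds, {"t1", "t9"}, {"in1", "in2"}))
```

The first call succeeds because D3 and D4 together output both search terms; the second fails because no node outputs `t9`.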
Algorithm Overview (1)
• Find the part of the query plan that needs to be replaced
  – Impacted sub-plan
    • The sub-graph reachable from the unavailable data source nodes
  – Minimal impacted sub-plan
    • The impacted sub-plan without the data source nodes that remain usable given the data redundancy
[Figure: example query plan with node labels A, B, D, E, F, H and terms t1, t3, t4, t6]
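The impacted sub-plan, defined above as the sub-graph reachable from the unavailable nodes, can be computed with a plain breadth-first traversal. The sketch below assumes the plan is stored as a successor map; the function name `impacted_subplan` and the edges in the example are made up, since the figure's edges are not recoverable from the transcript.

```python
from collections import deque


def impacted_subplan(succ, unavailable):
    """Return all nodes reachable from the unavailable nodes (inclusive).

    succ: node name -> iterable of nodes that depend on it
    """
    seen = set(unavailable)
    queue = deque(unavailable)
    while queue:
        node = queue.popleft()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical edges over the node labels used on the slide.
succ = {"A": ["B", "D"], "B": ["E"], "E": ["F"], "D": ["F"]}
print(sorted(impacted_subplan(succ, {"B"})))
```

With these made-up edges, losing B impacts B, E and F, while A and D are untouched.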
Algorithm Overview (2)
• Find Maximal Fixable Sub-Query
  – The sub-query corresponding to the minimal impacted sub-plan
• New Sub-Plan Generation
  – Use our existing query planning algorithm
[Figure: example query plan with node labels A, B, D, E, F and terms t1, t3, t4, t6]
Example sub-query: Select t3, t4 where input=t1
Minimal Impacted Sub-Plan Algorithm
[Figure: example query plan over data sources A, B (t9), D (t10), E, F (t11), I (t12), J, K (t13), L (t14) with terms t1 to t8; B and I are the unavailable data sources, and the figure also calls out J and L (t14)]
1. Identify unavailable data sources {B, I}
2. Find the sub-graph reachable from them (impacted sub-plan)
3. Cascading-crash conditions for data source X which is dependent on data source D
   A. At least one data source, sharing redundant data with D, is not crashed
   B. At least one such data source has the same usage as D
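The three steps can be sketched as a fixpoint computation. This is an interpretation, not the authors' algorithm: it assumes that a dependent X of a crashed source D is spared from the cascading crash when conditions A and B both hold, that is, when some non-crashed source redundant with D has the same usage as D. The `substitutes` map (assumed to be precomputed from the redundancy model), the function name, and the edges in the example are all hypothetical.

```python
def minimal_impacted(succ, unavailable, substitutes):
    """Grow the crashed set until no further cascading crash occurs.

    succ:        node -> iterable of nodes that depend on it
    unavailable: self-crashed nodes
    substitutes: node -> set of redundant sources with the same usage
                 (assumed to be supplied by the caller)
    """
    crashed = set(unavailable)
    changed = True
    while changed:
        changed = False
        for d in list(crashed):
            # Assumed reading of conditions A and B: a live stand-in
            # for d protects the sources that depend on d.
            if substitutes.get(d, set()) - crashed:
                continue
            for x in succ.get(d, ()):
                if x not in crashed:
                    crashed.add(x)
                    changed = True
    return crashed


# Made-up edges and redundancy; only the node labels come from the slide.
succ = {"B": ["E"], "I": ["J"], "J": ["L"]}
substitutes = {"B": {"D"}}  # hypothetical: D can stand in for B
print(sorted(minimal_impacted(succ, {"B", "I"}, substitutes)))
```

In this made-up plan, E survives because D can stand in for B, whereas J and L crash in cascade because nothing replaces I. The loop is repeated until nothing changes so that a stand-in which itself crashes later stops protecting its dependents.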
Minimal Impacted Sub-Plan Fixability
• Minimal Impacted Sub-Plan Fixability
  – How much of the minimal impacted sub-plan can be fixed using other data sources, taking advantage of data redundancy
• Dead Attribute
  – No un-crashed data source can provide the attribute as its output attribute
• Plan Fixability Categorization
  – Fully fixable: only self-crashed nodes, no dead attribute
  – Partially fixable: only self-crashed nodes, dead attribute
  – Cascading fully fixable: cascading crashed nodes, no dead attribute
  – Cascading partially fixable: cascading crashed nodes, dead attribute
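The four categories follow from two yes/no questions: did any node crash in cascade, and is any needed attribute dead? A minimal sketch is given below, assuming the crashed sets and each source's output attributes are already known; the function names and the example data are hypothetical.

```python
def dead_attributes(needed, outputs_by_source, crashed):
    """Attributes in `needed` that no un-crashed source can output."""
    live = set()
    for name, outs in outputs_by_source.items():
        if name not in crashed:
            live |= outs
    return set(needed) - live


def fixability(self_crashed, minimal_impacted, dead):
    """Map the two questions onto the four categories from the slide."""
    cascading = bool(set(minimal_impacted) - set(self_crashed))
    if cascading:
        return "cascading partially fixable" if dead else "cascading fully fixable"
    return "partially fixable" if dead else "fully fixable"


# Made-up sources: D duplicates one of B's outputs, nothing duplicates t4.
outputs_by_source = {"B": {"t3", "t4"}, "D": {"t3"}, "E": {"t6"}}
dead = dead_attributes({"t3", "t4"}, outputs_by_source, crashed={"B"})
print(sorted(dead))
print(fixability({"B"}, {"B"}, dead))
```

Here `t4` is dead because only the crashed source B outputs it, so the plan is only partially fixable.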
Maximal Fixable Sub-Query Generation
• For each source in the minimal impacted sub-plan, we compute
  – Input set IN
  – Requested output set RO
  – Linking set L
• Maximal Fixable Sub-Query
  – Input term set: input attributes of all data sources in the minimal impacted sub-plan without incoming edges (self-crashed data sources)
  – Search term set
    • User-requested search terms which are supposed to be covered by the minimal impacted sub-plan
    • Terms in the linking set of the nodes in the minimal impacted sub-plan which have outgoing edges to data sources outside of the minimal impacted sub-plan
[Figure: example query plan with node labels A, B, D, E, F and terms t1, t3, t4, t6]
IN={t1} L={t3,t4}
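The construction above can be sketched as follows. Two assumptions are made: "without incoming edges" is read as "without an incoming edge from another node of the minimal impacted sub-plan", and the per-node sets IN, RO and L are taken as given inputs. The function name and the edges are hypothetical; the sets IN={t1} and L={t3,t4} are the ones shown on the slide.

```python
def maximal_fixable_subquery(subplan, succ, inputs, requested, linking):
    """Return (search term set, input term set) for the sub-query.

    subplan:   nodes of the minimal impacted sub-plan
    succ:      node -> iterable of nodes that depend on it
    inputs:    node -> input set IN
    requested: node -> requested output set RO
    linking:   node -> linking set L
    """
    # Nodes fed by another node of the sub-plan (assumed reading of
    # "incoming edges").
    fed_internally = {x for n in subplan for x in succ.get(n, ()) if x in subplan}
    input_terms = set()
    search_terms = set()
    for n in subplan:
        if n not in fed_internally:
            input_terms |= inputs.get(n, set())
        search_terms |= requested.get(n, set())
        # Linking terms matter only when they feed nodes outside the sub-plan.
        if any(x not in subplan for x in succ.get(n, ())):
            search_terms |= linking.get(n, set())
    return search_terms, input_terms


# Sets taken from the slide; the edges out of B are made up.
search, given = maximal_fixable_subquery(
    subplan={"B"},
    succ={"B": ["D", "E"]},
    inputs={"B": {"t1"}},
    requested={},
    linking={"B": {"t3", "t4"}},
)
print(sorted(search), sorted(given))
```

With these inputs the result corresponds to the example sub-query on the earlier slide, which selects t3 and t4 given t1 as input.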
Evaluation
• 12 biological deep web data sources
• 20 queries in 4 groups
• Each group corresponds to one fixability category
• Methods compared
  – Baseline: start from scratch
  – Our method
Query Answering Time Comparison
1. Our method is more efficient in fixing failed query plans than the baseline method
2. Our method is at least 20% faster for all queries in this figure
Query Result Quality Comparison
For 18 out of 20 cases, the recall from our method is exactly the same as the ideal recall from the baseline method
Conclusion
• Propose a self-healing approach to support fault tolerance for deep web searches
• Find the minimal impacted sub-plan caused by unavailable/inaccessible data sources
• Find a new plan to replace the minimal impacted sub-plan
• Our method outperforms a baseline method in terms of both efficiency and result quality
Questions?
Contact us: Fan Wang [email protected]
Gagan Agrawal [email protected]