NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and...

3
NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality services Critical Challenges: Efficiency Result presentation …… 1. Internet users search information with search engines Expect to query databases in the same way Not know the database schema and SQL 2. Hidden Web problem Most of data on the Web are stored in databases. They are hidden to Web search engines because of the mismatch of search interfaces between Web search engines and databases 3. Integrating different classes of information systems Modern information systems manage many kinds of data structured relational data semi-structured XML documents unstructured text documents …… Keyword search: to provide a unified query language/interface NUITS High efficiency User-friendly result display Supporting advanced keyword queries Characteristics of NUITS Input: In addition to simple keyword queries (a set of keywords), advanced queries with specified conditions are also supported Output: Search results can be organized into clusters, facilitating users' quick browsing Search Engine: Adopt a new and efficient search algorithm A r c h i t e c t u r e

Transcript of NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and...

Page 1: NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.

NUITS: A Novel User Interface for Efficient Keyword Search over Databases

The integration of DB and IR provides users with a wide range of high quality services

Critical Challenges:

Efficiency

Result presentation

……

1. Internet users search information with search engines

Expect to query databases in the same way

Not know the database schema and SQL

2. Hidden Web problem

Most of data on the Web are stored in databases. They are hidden to Web search engines because of the mismatch of search interfaces between Web search engines and databases

3. Integrating different classes of information systems

Modern information systems manage many kinds of datastructured relational data

semi-structured XML documents unstructured text documents

……

Keyword search: to provide a unified query language/interface

NUITS High efficiency User-friendly result display Supporting advanced keyword queries

Characteristics of NUITS

Input:

In addition to simple keyword queries (a set of keywords), advanced queries with specified conditions are also supported

Output:

Search results can be organized into clusters, facilitating users' quick browsing

Search Engine:

Adopt a new and efficient search algorithm

Architecture

Page 2: NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.

Keyword Query Specification Simple keyword: just a keyword

eg: database Typed keyword: a type can be either a relation-name or attribute-name

eg: Paper:database

Author:*

Writer:* Conditional keyword: conditions associated with a keyword

eg: database year>2000

database year~2000

Keyword query: a set of keywords associated with Boolean operators

Q ::= p | (Q) | Q AND Q | Q OR Q | NOT Q

p: a keyword which can be a simple keyword, typed keyword, or conditional keyword

Q: a keyword query

eg: Ullman AND (database OR algorithm)

TreeCluster: Clustering Results

In order to assist users to find the needed results, we propose to organize the result trees into two levels of clusters

Structural-Level Clustering: Clustering trees using tree isomorphism. The two trees below are isomorphic, and both mean the two authors Hristidis and Papakonstantinou coauthor papers

Search Algorithm NUITS adopts a data-graph-based search algorithm, and each result is a tuple-connection-tree The algorithm proposes a dynamic programming approach to find the optimal top-1 with a low time complexity It computes top-k minimum cost tuple-connection-trees one-by-one incrementally, and does not need to compute or sort all result trees in order to find the top-k results

An example of result tree Another example of result tree

The structural pattern of the above two trees

Content-Level Clustering: Based on keyword frequencies and content similarity. If the size of the cluster is larger than the user-given threshold after structural-level clustering, the content-level clustering further clusters result trees

eg: Considering a keyword query Gray Database, we cluster the results based on content of author names, because Gray occurs less frequently than Database in DBLP dataset. Then there may be clusters for different authors, such as Jim Gray or W.A. Gray

Note: All examples in this poster are based on DBLP dataset

Page 3: NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.

Structural pattern

Implementation of TreeCluster

TreeCLuster is cost effective:

Use labels to represent schema information of each result tree and reformulate the clustering problem as a problem of judging whether labeled trees are isomorphic

Rank user keywords according to their frequencies in databases, and further partition the large clusters based on keyword nodes

Give each cluster a readable description, and present the description and each result graphically

Structural-Level

Clusters

Content-Level

Clusters

Tuple-connection-tree

Architecture