Summarization – CS 257 Chapter 21 (Information Integration)
Database Systems: The Complete Book
Submitted by: Deepti Kundu
Submitted to: Dr. T.Y. Lin


21.1 Introduction to Information Integration

Need for information integration:
• Ideally, all the data in the world could be put in a single database (the ideal database system).
• In the real world this is impossible for a single database:
  • databases are created independently;
  • it is hard to design a database to support future uses.

University Database

• Registrar: records students and grades
• Bursar: records tuition payments by students
• Human Resources Department: records employees
• Other departments…

Inconvenient

• Record grades only for students who pay tuition
• Want to swim in the SJSU aquatic center for free during summer vacation?

(None of the cases above can be handled by a single one of these databases.)

Solution: one database

How to integrate

• Start over
  • build one database that contains all the legacy databases; rewrite all the applications
  • result: painful
• Build a layer of abstraction (middleware)
  • on top of all the legacy databases
  • this layer is often defined by a collection of classes

BUT…

Heterogeneity Problem

What is the heterogeneity problem?
• Aardvark Automobile Co.
• 1000 dealers have 1000 databases
• To find a model at another dealer, can we use this command?

SELECT * FROM CARS
WHERE MODEL = 'A6';

Types of Heterogeneity

• Communication heterogeneity
• Query-language heterogeneity
• Schema heterogeneity
• Data type differences
• Value heterogeneity
• Semantic heterogeneity

21.2 Modes of Information Integration

Federations
• The simplest architecture for integrating several DBs
• One-to-one connections between all pairs of DBs
• If n DBs talk to each other, n(n-1) wrappers are needed
• Good when communications between DBs are limited

Wrapper
• Software that translates incoming queries and outgoing answers. As a result, it allows information sources to conform to some shared schema.

Federations Diagram

[Diagram: DB1, DB2, DB3, and DB4, with every pair of databases connected; each of the six connections is labeled "2 Wrappers".]

A federated collection of 4 DBs needs 12 components to translate queries from one to another.
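The count on this slide follows directly from the n(n-1) rule. A minimal sketch in Python (the function name is an illustrative assumption, not part of the slides):

```python
def federation_wrappers(n: int) -> int:
    """Each of the n databases needs a translator to each of the other n - 1."""
    return n * (n - 1)


# The count grows quadratically, which is why federations suit
# settings where only limited communication between DBs is needed.
for n in (2, 4, 10, 1000):
    print(n, "databases ->", federation_wrappers(n), "wrappers")
```

For the four databases in the diagram this gives 4 × 3 = 12, matching the 12 components stated above.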

Data Warehouse

• Sources are translated from their local schema to a global schema and copied to a central DB.
• User transparent: the user uses the data warehouse just like an ordinary DB.
• The user is not allowed to update the data warehouse.

Warehouse Diagram

[Diagram: Source 1 and Source 2 each feed an Extractor; a Combiner loads the extracted data into the Warehouse, which receives user queries and returns results.]

Construct Data Warehouse

There are mainly 3 ways to construct the data in the warehouse:

1) Periodically reconstructed from the current data in the sources, once a night or at even longer intervals.
   Advantages:
   • simple algorithms.
   Disadvantages:
   • need to shut down the warehouse;
   • data can become out of date.

2) Updated periodically (e.g., each night) based on the changes made to the sources.
   Advantages:
   • involves smaller amounts of data (important when the warehouse is large and needs to be modified in a short period).
   Disadvantages:
   • the process to calculate changes to the warehouse is complex;
   • data can become out of date.

3) Changed immediately, in response to each change or a small set of changes at one or more of the sources.
   Advantages:
   • data won’t become out of date.
   Disadvantages:
   • requires too much communication and is therefore generally too expensive (practical only for warehouses whose underlying sources change slowly).

Mediators

• A virtual warehouse, which supports a virtual view or a collection of views that integrates several sources.
• A mediator doesn’t store any data.
• A mediator’s tasks:
  1) receive the user’s query,
  2) send queries to the wrappers,
  3) combine the results from the wrappers,
  4) send the final result to the user.
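The four mediator tasks can be sketched as a small Python function. This is only an illustration under stated assumptions: the wrappers are modeled as plain callables, the two dealer tables and the `make_wrapper` helper are hypothetical, and a query is reduced to a dictionary of attribute = value conditions.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Query = Dict[str, str]                    # attribute -> required value
Wrapper = Callable[[Query], List[Row]]


def mediate(query: Query, wrappers: Iterable[Wrapper]) -> List[Row]:
    """Receive a query, forward it to every wrapper, and combine the answers."""
    combined: List[Row] = []              # the mediator stores nothing between calls
    for wrapper in wrappers:
        for row in wrapper(query):
            if row not in combined:       # drop duplicates across sources
                combined.append(row)
    return combined


def make_wrapper(rows: List[Row]) -> Wrapper:
    """Hypothetical wrapper over an in-memory source."""
    def wrapper(query: Query) -> List[Row]:
        return [r for r in rows if all(r.get(k) == v for k, v in query.items())]
    return wrapper


DEALER_1 = [{"serial": "1", "model": "A6", "color": "red"}]
DEALER_2 = [{"serial": "2", "model": "A6", "color": "blue"},
            {"serial": "3", "model": "Gobi", "color": "red"}]

result = mediate({"model": "A6"}, [make_wrapper(DEALER_1), make_wrapper(DEALER_2)])
print(result)
```

Nothing is kept in the mediator after `mediate` returns, which is the difference from a warehouse: every query is answered from the sources at query time.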

A Mediator Diagram

[Diagram: the user query goes to the Mediator, which sends queries to two Wrappers; each Wrapper queries its source (Source 1, Source 2), and results flow back through the Wrappers and the Mediator to the user.]

21.3 Wrappers in Mediator-Based Systems

• Intro
• Templates for query patterns
• Wrapper generator
• Filter

Wrappers in Mediator-Based Systems

• More complicated than the wrappers in most data warehouse systems.
• Able to accept a variety of queries from the mediator and translate them into the terms of the source.
• Communicate the result to the mediator.

How to design a wrapper?
• Classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.

Wrapper Generators

Templates for query patterns:
• Use the notation T => S to express the idea that the template T is turned by the wrapper into the source query S.
• The wrapper generator creates a table that holds the various query patterns contained in the templates and the source queries that are associated with each.

Filter
• A wrapper can include a filter to support more queries than its templates cover directly.

A driver is used in each wrapper. The task of the driver is to:
• Accept a query from the mediator.
• Search the table for a template that matches the query.
• Send the resulting source query to the source, again using a “plug-in” communication mechanism.
• Hand the response to the wrapper for processing.
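The template table and the driver's lookup can be sketched in Python. Everything concrete here is an assumption made for illustration: the mediator relation `AutosMed`, the source relation `Cars`, the single template, and the use of regular expressions to represent query patterns.

```python
import re
from typing import Optional

# Hypothetical template table: each entry is a pair T => S, where T is a
# mediator query pattern with a parameter and S is the source query.
TEMPLATES = [
    (re.compile(r"SELECT \* FROM AutosMed WHERE color = '(?P<c>\w+)'"),
     "SELECT serialNo, model, color FROM Cars WHERE color = '{c}'"),
]


def translate(mediator_query: str) -> Optional[str]:
    """Driver step: find a matching template and instantiate its source query."""
    for pattern, source_query in TEMPLATES:
        match = pattern.fullmatch(mediator_query)
        if match:
            return source_query.format(**match.groupdict())
    return None  # no template matches: the wrapper cannot answer this query


print(translate("SELECT * FROM AutosMed WHERE color = 'blue'"))
print(translate("SELECT * FROM AutosMed"))
```

A query that matches no template returns `None`, which is where a filter could help: the wrapper might answer it from a more general template and filter the result afterwards.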

21.4 Capability-Based Optimization

Introduction
• A typical DBMS estimates the cost of each query plan and picks what it believes to be the best.
• A mediator usually has little knowledge of how long its sources will take to answer.
• Optimization of mediator queries cannot rely on a cost measure alone to select a query plan.
• Optimization by a mediator follows capability-based optimization.

21.4.1 The Problem of Limited Source Capabilities

• Many sources have only Web-based interfaces.
• Web sources usually allow querying through a query form.
• E.g., the Amazon.com interface allows us to query about books in many different ways.
• But we cannot ask questions that are too general, e.g., SELECT * FROM books;

(cont’d)

Reasons why a source may limit the ways in which queries can be asked:
• The earliest databases did not use a relational DBMS that supports SQL queries.
• Indexes on a large database may make certain queries feasible, while others are too expensive to execute.
• Security reasons.

21.4.2 A Notation for Describing Source Capabilities

• For relational data, the legal forms of queries are described by adornments.
• Adornments: sequences of codes that represent the requirements for the attributes of the relation, in their standard order:
  • f (free): the attribute can be specified or not
  • b (bound): we must specify a value for the attribute, but any value is allowed
  • u (unspecified): we are not permitted to specify a value for the attribute

(cont’d)

• c[S] (choice from set S): a value must be specified, and that value must be from the finite set S.
• o[S] (optional from set S): either we do not specify a value or we specify a value from the finite set S.
• A prime (e.g., f’) specifies that an attribute is not a part of the output of the query.
• A capabilities specification is a set of adornments.
• A query must match one of the adornments in the source's capabilities specification.
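As a rough illustration of how a query is checked against a capabilities specification, the Python sketch below encodes the codes f, b, u, c[S], and o[S]. The encoding (strings for f/b/u, tuples for c and o, `None` for "no value supplied") and the example specification for a relation Options(serial, option) are assumptions of this sketch; primes are left out because they restrict output, not input.

```python
from typing import List, Optional, Sequence, Tuple, Union

Code = Union[str, Tuple[str, frozenset]]   # "f", "b", "u", ("c", S) or ("o", S)


def position_ok(code: Code, value: Optional[str]) -> bool:
    """Check one attribute: value is None when the query supplies no value."""
    if code == "f":
        return True
    if code == "b":
        return value is not None
    if code == "u":
        return value is None
    kind, allowed = code
    if kind == "c":
        return value in allowed
    if kind == "o":
        return value is None or value in allowed
    raise ValueError(f"unknown adornment code: {code!r}")


def query_allowed(spec: List[Sequence[Code]], values: Sequence[Optional[str]]) -> bool:
    """A query is legal if it matches at least one adornment in the specification."""
    return any(
        len(adornment) == len(values)
        and all(position_ok(c, v) for c, v in zip(adornment, values))
        for adornment in spec
    )


# Hypothetical specification for Options(serial, option): bu and uc[autoTrans, navi].
OPTIONS_SPEC = [["b", "u"], ["u", ("c", frozenset({"autoTrans", "navi"}))]]

print(query_allowed(OPTIONS_SPEC, ("123", None)))     # serial given, option not
print(query_allowed(OPTIONS_SPEC, (None, "navi")))    # option chosen from the set
print(query_allowed(OPTIONS_SPEC, (None, "sunroof"))) # option outside the set
```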

21.4.3 Capability-Based Query-Plan Selection

• Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query.
• The answers to those queries bind more variables, and the new bindings may make further source queries possible.
• The process is repeated until either:
  • Enough queries are asked at the sources to resolve all the conditions of the mediator query, and therefore the query is answered. Such a plan is called feasible.
  • We can construct no more valid forms of source queries, yet we still cannot answer the mediator query. In that case the mediator query is impossible.

(cont’d)

• The simplest form of mediator query for which we need to apply the above strategy is a join of relations.
• E.g., we have the following sources for dealer 2:
  • Autos(serial, model, color)
  • Options(serial, option)
• Suppose that ubf is the sole adornment for Autos, and Options has two adornments, bu and uc[autoTrans, navi].
• The query is: find the serial numbers and colors of Gobi models with a navigation system.

21.4.4 Adding Cost-Based Optimization

• The mediator’s query optimizer is not done when the capabilities of the sources are examined.
• Having found the feasible plans, it must choose among them.
• Making an intelligent, cost-based query optimization requires that the mediator know a great deal about the costs of the queries involved.
• Sources are independent of the mediator, so it is difficult to estimate the cost.

21.5 Optimizing Mediator Queries

• Chain algorithm: a greedy algorithm that finds a way to answer the query by sending a sequence of requests to its sources.
  • It will always find a solution, assuming at least one solution exists.
  • The solution may not be optimal.

21.5.1 Simplified Adornment Notation

• A query at the mediator is limited to b (bound) and f (free) adornments.
• We use the following convention for describing adornments: Name^adornments(attributes), with the adornment string written as a superscript, where:
  • Name is the name of the relation
  • the number of adornments = the number of attributes

21.5.2 Obtaining Answers for Subgoals

Rules for subgoals and sources:
• Suppose we have the subgoal R^x1x2…xn(a1, a2, …, an), and a source adornment for R is y1y2…yn.
• The adornment on the subgoal matches the adornment at the source if, for every i:
  • if yi is b or c[S], then xi = b;
  • if xi = f, then yi is not output restricted (not primed).
• If yi is f, u, or o[S], then xi can be either b or f.
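The matching rule can be written as a short Python check. In this sketch the encoding is an assumption: the subgoal adornment is a string over b and f, and each source code is a string such as "b", "u", "c", or "o" with an optional trailing apostrophe for a prime (the set S of c[S] and o[S] is omitted, since the rule above does not depend on its contents).

```python
from typing import List


def subgoal_matches(subgoal: str, source: List[str]) -> bool:
    """Return True if the subgoal adornment x1..xn matches the source adornment y1..yn."""
    if len(subgoal) != len(source):
        return False
    for x, y in zip(subgoal, source):
        primed = y.endswith("'")
        base = y.rstrip("'")
        if base in ("b", "c") and x != "b":
            return False          # the source demands a value we cannot supply
        if x == "f" and primed:
            return False          # we need this attribute, but the source hides it
    return True


print(subgoal_matches("bf", ["b", "u"]))    # bound serial, free option
print(subgoal_matches("ff", ["b", "u"]))    # nothing bound, but source needs a value
print(subgoal_matches("bf", ["b", "f'"]))   # needed output is not returned
```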

21.5.3 The Chain Algorithm

• Maintains 2 types of information:
  • An adornment for each subgoal.
  • A relation X that is the join of the relations for all the subgoals that have been resolved.
• Initially, the adornment for a subgoal has b in a position iff the mediator query provides a constant binding for the corresponding argument of that subgoal.
• Initially, X is a relation over no attributes, containing just an empty tuple.

(cont’d)

• First, initialize the adornments of the subgoals and X.
• Then, repeatedly select a subgoal that can be resolved. Let R^α(a1, a2, …, an) be the subgoal:

1. Wherever α has a b, the corresponding argument of R is either a constant or a variable in the schema of X. Project X onto those of its variables that appear in R.

(cont’d)

2. For each tuple t in the projection of X, issue a query to the source as follows (β is a source adornment):
   • If a component of β is b, then the corresponding component of α is b, and we can use the corresponding component of t for the source query.
   • If a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of α is b, and we can use the corresponding component of t for the source query.
   • If a component of β is f, and the corresponding component of α is b, provide a constant value for the source query.

(cont’d)

   • If a component of β is u, then provide no binding for this component in the source query.
   • If a component of β is o[S], and the corresponding component of α is f, then treat it as if it were f.
   • If a component of β is o[S], and the corresponding component of α is b, then treat it as if it were c[S].

3. Every variable among a1, a2, …, an is now bound. For each remaining unresolved subgoal, change its adornment so any position holding one of these variables is b.

(cont’d)

4. Replace X with X ⋈ π_S(R), where S is all of the variables among a1, a2, …, an.

5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.

• If every subgoal is resolved, then X is the answer.
• If some subgoal remains unresolved and no further subgoal can be resolved, then the algorithm fails.
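The steps of the Chain Algorithm can be sketched in Python under several simplifying assumptions that are not in the slides: source adornments use only the codes b, f, and u; each relation has one source, simulated by an in-memory list of rows; variables are strings starting with "?"; and one source query is issued per tuple of X without first deduplicating the projection. The Autos/Options data is made up for illustration.

```python
def is_var(arg):
    return arg.startswith("?")


def ask_source(rows, pattern):
    """Simulated source: return rows that agree with every supplied (non-None) value."""
    return [r for r in rows if all(p is None or p == v for p, v in zip(pattern, r))]


def chain(head, subgoals, sources):
    x = [{}]                     # X: one empty tuple over no attributes
    bound = set()                # variables bound by resolved subgoals
    pending = list(subgoals)
    while pending:
        choice = None
        for name, args in pending:          # greedy: take the first resolvable subgoal
            alpha = ["b" if (not is_var(a) or a in bound) else "f" for a in args]
            for beta in sources[name]["adornments"]:
                if all(not (y == "b" and xa == "f") for xa, y in zip(alpha, beta)):
                    choice = (name, args, beta)
                    break
            if choice:
                break
        if choice is None:
            return None                     # no subgoal can be resolved: fail
        name, args, beta = choice
        joined = []
        for t in x:
            pattern = []
            for a, y in zip(args, beta):
                value = t.get(a) if is_var(a) else a
                pattern.append(None if y == "u" else value)   # u: send no binding
            for row in ask_source(sources[name]["rows"], pattern):
                new_t = dict(t)             # join the returned row with tuple t
                if all(new_t.setdefault(a, v) == v if is_var(a) else a == v
                       for a, v in zip(args, row)):
                    joined.append(new_t)
        pending.remove((name, args))
        bound.update(a for a in args if is_var(a))
        keep = set(head) | {a for _, rest in pending for a in rest if is_var(a)}
        x = []
        for t in joined:                    # project away variables no longer needed
            p = {k: v for k, v in t.items() if k in keep}
            if p not in x:
                x.append(p)
    return {tuple(t[v] for v in head) for t in x}


SOURCES = {
    "Autos": {"adornments": ["ubf"],
              "rows": [("1", "Gobi", "red"), ("2", "Gobi", "blue"), ("3", "A6", "red")]},
    "Options": {"adornments": ["bu"],
                "rows": [("1", "navi"), ("2", "autoTrans"), ("3", "navi")]},
}
QUERY = [("Options", ("?s", "navi")), ("Autos", ("?s", "Gobi", "?c"))]
answer = chain(("?s", "?c"), QUERY, SOURCES)
print(answer)
```

In this run Options cannot be resolved first, because its only adornment here requires a bound serial number, so the sketch resolves Autos with the model bound to Gobi and then asks Options once per serial number found.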

21.5.4 Incorporating Union Views at the Mediator

• This implementation of the Chain Algorithm does not consider that several sources can contribute tuples to a relation.
• If specific sources have tuples to contribute that other sources may not have, it adds complexity.
• To resolve this, we can consult all sources, or make best efforts to return all the answers.

(cont’d)

Consulting All Sources
• We can only resolve a subgoal when each source for its relation has an adornment matched by the current adornment of the subgoal.
• Less practical because it makes queries harder to answer, and impossible if any source is down.

Best Efforts
• We need only 1 source with a matching adornment to resolve a subgoal.
• We need to modify the chain algorithm to revisit each subgoal when that subgoal has new bound arguments.

21.6 Local-as-View Mediators

• GAV (global-as-view) mediators: the global data is like a view. It doesn’t exist physically, but pieces of it are constructed by the mediator by asking queries of the sources.
• LAV (local-as-view) mediators: we define global predicates at the mediator, but we do not define these predicates as views of the source data.
• Instead, for each source we define expressions involving the global predicates that describe the tuples that source is able to produce. Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.

Motivation for LAV Mediators

• LAV mediators help us discover how and when to use a source in a given query.
• Example: with a global predicate Par(c,p), a GAV definition of Par(c,p) gives information about children and parents but does not give information about grandparents.
• An LAV mediator with Par(c,p) can use sources that provide child-parent information and even sources that provide grandparent information.

Terminology for LAV Mediation

• A form of logic serves as the language for defining views.
• Datalog is used for both the mediator queries and the source views; these are conjunctive queries.
• An LAV mediator has global predicates, which are the subgoals of mediator queries.
• Conjunctive queries define the views. Each view has a unique view predicate as its head and global predicates in its body, and is associated with a particular source.

Containment of Conjunctive Queries

• Let the conjunctive query S be a solution to the mediator query Q.
• The expansion E of S produces only answers that Q produces, so E ⊆ Q.
• A containment mapping from Q to E is a function Γ from the variables of Q to the variables and constants of E such that:
  • if x is the i-th argument of the head of Q, then Γ(x) is the i-th argument of the head of E;
  • with Γ extended by the rule Γ(c) = c for any constant c, if P(x1, x2, …, xn) is a subgoal of Q, then P(Γ(x1), Γ(x2), …, Γ(xn)) is a subgoal of E.
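The containment-mapping test can be illustrated with a brute-force search in Python. The representation is an assumption of this sketch: a conjunctive query is a pair (head arguments, list of subgoals), variables are strings starting with "?", and any other term is a constant. The grandparent-style queries Q and E are made-up examples.

```python
from itertools import product


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def containment_mapping(q, e):
    """Return a containment mapping from Q to E as a dict, or None if none exists."""
    q_head, q_body = q
    e_head, e_body = e
    if len(q_head) != len(e_head):
        return None
    mapping = {}
    for x, y in zip(q_head, e_head):        # head of Q must map onto head of E
        if is_var(x):
            if mapping.setdefault(x, y) != y:
                return None
        elif x != y:
            return None
    q_vars = sorted({a for _, args in q_body for a in args if is_var(a)} - set(mapping))
    e_terms = sorted({a for _, args in e_body for a in args} | set(e_head))
    e_subgoals = {(p, tuple(args)) for p, args in e_body}
    for choice in product(e_terms, repeat=len(q_vars)):   # try every assignment
        gamma = {**mapping, **dict(zip(q_vars, choice))}
        if all((p, tuple(gamma.get(a, a) for a in args)) in e_subgoals
               for p, args in q_body):      # constants map to themselves
            return gamma
    return None


# Q: ans(x, z) :- Par(x, y), Par(y, z)
Q = (("?x", "?z"), [("Par", ("?x", "?y")), ("Par", ("?y", "?z"))])
# E: ans(a, c) :- Par(a, b), Par(b, c), Par(c, d)
E = (("?a", "?c"), [("Par", ("?a", "?b")), ("Par", ("?b", "?c")), ("Par", ("?c", "?d"))])

print(containment_mapping(Q, E))   # a mapping exists, so E is contained in Q
print(containment_mapping(E, Q))   # no mapping in the other direction
```

The search is exponential in the number of non-head variables of Q, which is acceptable for a demonstration but not for large queries.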

Why the Containment Mapping Test Works

Questions:

1. If there is a containment mapping, why must there be a containment of conjunctive queries?

2. If there is containment, why must there be a containment mapping?

Finding Solutions to a Mediator Query

• For a query Q and a solution S, the expansion E of S is contained in Q.
• “If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.”
• This is known as the LMSS Theorem.

Why the LMSS Theorem Holds

• Suppose query Q has n subgoals and S is a solution with more than n subgoals. The expansion E of S must be contained in Q, so there is a containment mapping from Q to E.
• Let S’ be the solution obtained by removing from S all subgoals whose expansion is not the target of one of Q’s subgoals under that mapping.
• E’ ⊆ Q, where E’ is the expansion of S’ (the same containment mapping works), and S ⊆ S’ (by the identity mapping).
• Thus there is no need for solution S among the solutions to query Q.

21.7 Entity Resolution

Determining whether two records or tuples do or do not represent the same person, organization, place or other entity is called ENTITY RESOLUTION.

Deciding Whether Records Represent a Common Entity

• Two records represent the same individual if the two records have similar values for each of the fields associated with those records.
• It is not sufficient to require that the values of corresponding fields be identical, for the following reasons:
  1. Misspellings
  2. Variant names
  3. Misunderstanding of names
  4. Evolution of values
  5. Abbreviations

Deciding Whether Records Represent a Common Entity - Edit Distance

• The first approach to measuring the similarity of records is edit distance.
• Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into another.
• The records are taken to represent the same entity if their edit distance is below a given threshold.
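The insertion-and-deletion edit distance described here can be computed with a standard dynamic program. The Python sketch below is one way to do it; the function names and the example strings are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1])     # matching characters cost nothing
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete ca or insert cb
        prev = cur
    return prev[-1]


def similar(a: str, b: str, threshold: int) -> bool:
    """Treat two values as matching when their distance is below the threshold."""
    return edit_distance(a, b) < threshold


print(edit_distance("Smythe", "Smith"))
print(similar("Smythe", "Smith", threshold=4))
```

Because substitutions are not allowed, replacing one character costs 2 (one deletion plus one insertion), so thresholds need to be chosen with that in mind.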

Deciding Whether Records Represent a Common Entity - Normalization

• We can normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace abbreviations by what they normally stand for.
• Once the records are normalized, we can use the edit distance to measure the difference between normalized values in the fields.
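A small Python sketch of table-driven normalization follows. The abbreviation table and the choice to lowercase and strip punctuation are assumptions made for this example, not rules from the slides.

```python
# Hypothetical abbreviation table for street addresses.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "dr": "drive"}


def normalize(value: str) -> str:
    """Lowercase, drop simple punctuation, and expand known abbreviations."""
    words = value.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


print(normalize("12 Maple St."))
print(normalize("12 maple Street"))
```

After normalization the two spellings of the same address become identical, so a distance measure applied afterwards only has to absorb genuine differences such as misspellings.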

Merging Similar Records

• Merging means replacing two records that are similar enough by one single record that contains the information of both.
• There are many merge rules:
  1. Set the field in which the records disagree to the empty string.
  2. (i) Merge by taking the union of the values in each field.
     (ii) Declare two records similar if at least two of the three fields have a nonempty intersection.
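Merge rule 2 can be sketched in Python by letting every field hold a set of values. The three-field layout (name, address, phone) and the sample records are assumptions made for illustration.

```python
from typing import FrozenSet, Tuple

Record = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]  # name, address, phone


def make_record(name: str, address: str, phone: str) -> Record:
    return (frozenset({name}), frozenset({address}), frozenset({phone}))


def similar(r: Record, s: Record) -> bool:
    """Rule 2(ii): at least two of the three fields share a value."""
    return sum(1 for a, b in zip(r, s) if a & b) >= 2


def merge(r: Record, s: Record) -> Record:
    """Rule 2(i): take the union of the values in each field."""
    return tuple(a | b for a, b in zip(r, s))


r1 = make_record("Susan", "123 Oak St", "555-1234")
r2 = make_record("Susan", "456 Maple St", "555-1234")
r3 = make_record("Bob", "456 Maple St", "555-0000")

print(similar(r1, r2), similar(r2, r3))
print(merge(r1, r2))
```

Keeping sets in each field means a merged record never loses a value, so no information is lost when the same pair is merged again or in a different order.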

Useful Properties of Similarity and Merge Functions

The following properties say that the merge operation is a semilattice:

1. Idempotence : That is, the merge of a record with itself should surely be that record.

2. Commutativity : If we merge two records, the order in which we list them should not matter.

3. Associativity : The order in which we group records for a merger should not matter.

There are some other properties that we expect the similarity relationship to have:

• Idempotence for similarity : A record is always similar to itself

• Commutativity of similarity : In deciding whether two records are similar it does not matter in which order we list them

• Representability : If r is similar to some other record s, but s is instead merged with some other record t, then r remains similar to the merger of s and t and can be merged with that record.

R-swoosh Algorithm for ICAR Records

Input: A set of records I, a similarity function, and a merge function.
Output: A set of merged records O.
Method:

O := emptyset;
WHILE I is not empty DO BEGIN
    let r be any record in I;
    find, if possible, some record s in O that is similar to r;
    IF no record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
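The pseudocode translates almost line for line into Python. In the sketch below, the similarity and merge functions are passed in as parameters; the concrete record layout (three set-valued fields) and the rule "similar if at least two fields share a value, merge by per-field union" are assumptions chosen for the example.

```python
def r_swoosh(records, similar, merge):
    """Merge records until no two records in the output are similar."""
    pending = list(records)      # the set I
    output = []                  # the set O
    while pending:
        r = pending.pop()        # let r be any record in I (and delete it from I)
        match = next((s for s in output if similar(r, s)), None)
        if match is None:
            output.append(r)     # move r from I to O
        else:
            output.remove(match)                # delete s from O
            pending.append(merge(r, match))     # add the merger of r and s to I
    return output


def rec(name, address, phone):
    return (frozenset({name}), frozenset({address}), frozenset({phone}))


def similar(r, s):
    return sum(1 for a, b in zip(r, s) if a & b) >= 2


def merge(r, s):
    return tuple(a | b for a, b in zip(r, s))


records = [
    rec("Susan", "123 Oak St", "555-1234"),
    rec("Susan", "456 Maple St", "555-1234"),
    rec("Sue", "456 Maple St", "555-1234"),
    rec("Bob", "9 Elm St", "555-0000"),
]
result = r_swoosh(records, similar, merge)
for record in result:
    print(record)
```

A merged record goes back into the pending set and is not placed directly in the output, because the merger may now be similar to a record already in the output that neither original resembled.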

Other Approaches to Entity Resolution

The other approaches to entity resolution are:

• Non-ICAR datasets: We can define a dominance relation r <= s, meaning that record s contains all the information contained in record r. If so, then we can eliminate record r from further consideration.
• Clustering: Sometimes we group the records into clusters such that members of a cluster are in some sense similar to each other and members of different clusters are not similar.
• Partitioning: We can group the records, perhaps several times, into groups that are likely to contain similar records and look only within each group for pairs of similar records.