ETL Quality Stage blocking and matching

ETL QualityStage: Matching Stage - A simplified explanation of the Matching Stage

Transcript of ETL Quality Stage blocking and matching

Page 1: ETL Quality Stage   blocking and matching

ETL QualityStage: Matching Stage

A simplified explanation of the Matching Stage

Page 2: ETL Quality Stage   blocking and matching

Data matching in ETL Quality Stage

Data matching is used to find records, in a single data source or across independent data sources, that refer to the same entity (such as a person, organization, location, product, or material), regardless of whether a predetermined key is available.

Let’s take a look at a simplified example and examine the process.

Page 3: ETL Quality Stage   blocking and matching

Let’s say our neighborhood club decided to display our pictures on our clubhouse bulletin board.

… But we only want to post one picture per club member.

Page 4: ETL Quality Stage   blocking and matching

Neighbors submit their pictures, but some neighbors submit more than one picture.

Page 5: ETL Quality Stage   blocking and matching

Since we agreed to post only ONE picture, we’ll have to weed out “duplicates” (pictures of the same person).

Page 6: ETL Quality Stage   blocking and matching

How do we find pictures of the same person?

Well, traditionally, we’d compare them one by one to determine whether they match certain criteria (same eyes, nose, etc.).

Page 7: ETL Quality Stage   blocking and matching

Same person?

No.

Page 8: ETL Quality Stage   blocking and matching

Same person?

No.

Page 9: ETL Quality Stage   blocking and matching

Same person?

No.

Page 10: ETL Quality Stage   blocking and matching

Same person?

No.

Page 11: ETL Quality Stage   blocking and matching

Same person?

No.

Page 12: ETL Quality Stage   blocking and matching

We have 12 pictures, so we’ll have to compare 12 pictures 12 times.

That’s 144 times!

You get the idea.
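The count above can be reproduced with a short sketch. The picture names are placeholders; the only figure taken from the slides is that there are 12 pictures, each compared against all 12.

```python
from itertools import combinations, product

pictures = [f"picture_{i}" for i in range(1, 13)]  # 12 placeholder pictures

# Compare every picture against every picture, as the slides count it.
# This includes comparing a picture with itself and both orderings of a pair.
all_comparisons = list(product(pictures, repeat=2))
print(len(all_comparisons))  # 144

# Counting each pair of distinct pictures only once gives a smaller number.
distinct_pairs = list(combinations(pictures, 2))
print(len(distinct_pairs))  # 66
```

Either way of counting grows roughly with the square of the number of pictures, which is the problem blocking is meant to ease.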

Page 13: ETL Quality Stage   blocking and matching

The Matching Stage in QualityStage simplifies the work.

Matching is a two-step process: first you block records and then you match them.

Page 14: ETL Quality Stage   blocking and matching

Blocking identifies subsets of data so that matches can be performed more efficiently.

These subsets are called blocks.

Page 15: ETL Quality Stage   blocking and matching

Blocking

Let’s say we decide to block the data.

We decide to form four subsets:

• Females < 18 years old
• Females > 18 years old
• Males < 18 years old
• Males > 18 years old

Page 16: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old
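As a rough illustration of the four subsets, the sketch below assigns each record to a block from its gender and age. The sample names and ages are made up, and `block_key` is a hypothetical helper, not a QualityStage function. The slides' labels ("< 18" and "> 18") do not say where someone aged exactly 18 belongs; this sketch assumes the older group.

```python
from collections import defaultdict

def block_key(person):
    """Return the block label for a record, based on gender and age group."""
    # Assumption: age exactly 18 goes in the older group.
    age_group = "under 18" if person["age"] < 18 else "18 and over"
    return (person["gender"], age_group)

# Hypothetical sample records, for illustration only.
people = [
    {"name": "Ann", "gender": "F", "age": 12},
    {"name": "Beth", "gender": "F", "age": 34},
    {"name": "Carl", "gender": "M", "age": 15},
    {"name": "Dan", "gender": "M", "age": 41},
    {"name": "Ann", "gender": "F", "age": 12},
]

# Group the records so that later comparisons stay inside one block.
blocks = defaultdict(list)
for person in people:
    blocks[block_key(person)].append(person)

for key, members in blocks.items():
    print(key, [m["name"] for m in members])
```

The two "Ann" records land in the same block, so they can still be compared with each other, while records in different blocks are never compared.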

Page 17: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

Making comparisons is easier now.

Page 18: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

Compare 5 pictures 5 times = 25 comparisons

Compare 3 pictures 3 times = 9 comparisons

Compare 2 pictures 2 times = 4 comparisons

Compare 2 pictures 2 times = 4 comparisons

Page 19: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

25 comparisons

9 comparisons

4 comparisons

4 comparisons

Page 20: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

25 comparisons

9 comparisons

4 comparisons

4 comparisons

25 + 9 + 4 + 4 = 42 comparisons

Page 21: ETL Quality Stage   blocking and matching

Blocking

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

42 comparisons.

That’s much more efficient than the 144 comparisons we had earlier, when we were doing one-on-one matching.
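The arithmetic can be checked with a few lines. The block sizes (5, 3, 2, and 2 pictures) come from the per-block counts on the earlier slides; adding up those counts (25, 9, 4, and 4) gives 42.

```python
block_sizes = [5, 3, 2, 2]  # pictures per block, from the slides

# Within a block, every picture is compared against every picture in that block.
per_block = [n * n for n in block_sizes]
blocked_total = sum(per_block)

# Without blocking, all 12 pictures are compared against all 12.
unblocked_total = sum(block_sizes) ** 2

print(per_block)        # [25, 9, 4, 4]
print(blocked_total)    # 42
print(unblocked_total)  # 144
```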

Page 22: ETL Quality Stage   blocking and matching

To review:

Matching is a two-step process: first you block records and then you match them.

Blocking identifies subsets of data within which matches can be more efficiently performed.

Females < 18 years old

Females > 18 years old

Males < 18 years old

Males > 18 years old

Page 23: ETL Quality Stage   blocking and matching

Females < 18 years old

Females >18 years old

Males < 18 years old

Males > 18 years old

Matching identifies relationships among records.

Matching is a 2-step process:

- First you block the records.

- Then you match them.
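The two steps can be sketched together as follows. This is only an illustration of the block-then-match idea, not how QualityStage implements it; `find_matches`, the record fields, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_matches(records, block_key, is_match):
    """Step 1: group records into blocks. Step 2: compare pairs inside each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    matches = []
    for members in blocks.values():
        # Each pair of distinct records in a block is compared once.
        for a, b in combinations(members, 2):
            if is_match(a, b):
                matches.append((a["id"], b["id"]))
    return matches

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "gender": "F", "adult": True, "eyes": "brown"},
    {"id": 2, "gender": "F", "adult": True, "eyes": "brown"},
    {"id": 3, "gender": "M", "adult": True, "eyes": "brown"},
    {"id": 4, "gender": "F", "adult": False, "eyes": "blue"},
]

pairs = find_matches(
    records,
    block_key=lambda r: (r["gender"], r["adult"]),
    is_match=lambda a, b: a["eyes"] == b["eyes"],
)
print(pairs)  # [(1, 2)]
```

Records 1 and 3 have the same eye color but fall into different blocks, so they are never compared. That is the trade-off of blocking: fewer comparisons, but only records that share a block can be matched.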

Page 24: ETL Quality Stage   blocking and matching

Matching

Females >18 years old

Let’s pause for a minute to examine the matching process more closely.

Page 25: ETL Quality Stage   blocking and matching

Matching

First, we have to make certain decisions to set up rules.

Will all of the criteria have to match exactly?

(If NO) Will some criteria be more important than other criteria?

(If YES) Can we use some of QualityStage’s “fuzzy logic”?

Which criteria will be more important? We will have to assign weights.

Page 26: ETL Quality Stage   blocking and matching

Matching

Let’s see what could happen if we were to apply the strict rule:

All the criteria have to match exactly.

In our example, the people in the pictures will need to have the same shape and color of eyes, same length and color of hair, same hairstyle, etc.

If someone had different hairstyles in the pictures, for example, we would have to say that it is a different person, if we were to apply this strict rule.

Page 27: ETL Quality Stage   blocking and matching

If the rule were “All the criteria have to match exactly”:

We would have to conclude that these are not pictures of the same person.

Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Small, petite | Large, open, tongue visible | No Match
Hair | Dark brown, long, straight | Light brown, long, straight | No Match
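The strict rule can be written as a one-line check. In this sketch the descriptions are the ones listed for the two pictures on this slide, while the function and variable names are hypothetical.

```python
CRITERIA = ["eyes", "nose", "mouth", "hair"]

def exact_match(a, b):
    """Strict rule: two pictures match only if every criterion is identical."""
    return all(a[c] == b[c] for c in CRITERIA)

first = {
    "eyes": "large, oval, brown, long eyelashes",
    "nose": "not visible",
    "mouth": "small, petite",
    "hair": "dark brown, long, straight",
}
second = {
    "eyes": "large, oval, brown, long eyelashes",
    "nose": "not visible",
    "mouth": "large, open, tongue visible",
    "hair": "light brown, long, straight",
}

print(exact_match(first, second))  # False: mouth and hair differ
```

A single differing criterion is enough to reject the pair, which is why the strict rule misses duplicates whenever anything about the person changes between pictures.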

Page 28: ETL Quality Stage   blocking and matching

If the rule were “All the criteria have to match exactly”:

We would have to conclude that these are not pictures of the same person.

Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Small, petite | Large, closed, tongue visible | No Match
Hair | Dark brown, long, straight | Dark brown, long, straight | Match

Page 29: ETL Quality Stage   blocking and matching

If the rule were “All the criteria have to match exactly”:

We would have to conclude that these are not pictures of the same person.

Criterion | First picture | Second picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Match
Mouth | Large, open, tongue visible | Large, closed, tongue visible | No Match
Hair | Dark brown, long, straight | Light brown, long, straight | No Match

Page 30: ETL Quality Stage   blocking and matching

Matching

As an alternative, we can use some of QualityStage’s “fuzzy logic” and assign “weights” to the criteria.

We will have to decide: Which criteria are more important?

Page 31: ETL Quality Stage   blocking and matching

We could assign weights to the criteria.

Criterion | First picture | Second picture | Third picture
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes
Nose | Not visible | Not visible | Not visible
Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible
Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight

For example, we could assign higher weights to “nose” and “eyes,” a lower weight to “mouth,” and the lowest weight to “hair.”
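One simple way to picture weighted matching is a score: add up the weights of the criteria that agree and compare the result with a cutoff. The sketch below is an illustration only and does not reproduce QualityStage's actual scoring. The numeric weights and the threshold are hypothetical; only their ordering (nose and eyes highest, mouth lower, hair lowest) follows the slide.

```python
# Hypothetical weights: higher for nose and eyes, lower for mouth, lowest for hair.
WEIGHTS = {"nose": 4, "eyes": 4, "mouth": 2, "hair": 1}
THRESHOLD = 0.7  # hypothetical cutoff: share of total weight that must agree

def weighted_score(a, b):
    """Return the fraction of the total weight carried by criteria that agree."""
    agreed = sum(w for criterion, w in WEIGHTS.items() if a[criterion] == b[criterion])
    return agreed / sum(WEIGHTS.values())

def is_match(a, b):
    """Treat two pictures as the same person when the score reaches the cutoff."""
    return weighted_score(a, b) >= THRESHOLD

first = {
    "eyes": "large, oval, brown, long eyelashes",
    "nose": "not visible",
    "mouth": "small, petite",
    "hair": "dark brown, long, straight",
}
third = {
    "eyes": "large, oval, brown, long eyelashes",
    "nose": "not visible",
    "mouth": "large, open, tongue visible",
    "hair": "light brown, long, straight",
}

print(round(weighted_score(first, third), 2))  # 0.73
print(is_match(first, third))  # True
```

With these numbers, agreement on eyes and nose outweighs the differences in mouth and hair, so the pair is accepted even though it would fail the strict "all criteria must match" rule.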

Page 32: ETL Quality Stage   blocking and matching

We could assign weights to the criteria.

Using these assigned weights, ETL can help us conclude that these are pictures of the same person.

Criterion | First picture | Second picture | Third picture | Result
Eyes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Large, oval, brown, long eyelashes | Match
Nose | Not visible | Not visible | Not visible | Match
Mouth | Small, petite | Large, closed, tongue visible | Large, open, tongue visible | Match
Hair | Dark brown, long, straight | Dark brown, long, straight | Light brown, long, straight | Match