49508437 Quality Stage Wipro

Session 1: QualityStage Essentials

Objectives

Data Quality
Introduction to QualityStage
Developing with QualityStage
Investigate and Data Quality Assessment
Data Preparation
Standardize
Rule Set Overrides
Match
Survive

Data Migration Challenges

[Bar chart; axis: Percent, 0 to 40. Categories shown:]
Data Quality
Legacy Data Scrubbing
Managing End-user Expectations
Business/Data Modeling
Managing Mgmt Expectations
Business Rule Analysis
Managing Metadata

Data Quality Increases ROI

Better decision making
Improved marketing accuracy and scope
Increased knowledge of customers
Improved inventory and asset management
Improved risk analysis, auditing and reporting

Data Quality

There are two significant definitions of data quality:

Inherent Data Quality
  Correctness or accuracy of data: the degree to which data accurately reflects the real-world object that it represents

Pragmatic Data Quality
  The value that accurate data has in supporting the work of the enterprise
  Data that does not help the enterprise accomplish its mission has no quality, no matter how accurate it is

Data Quality Challenges

Different or inconsistent standards in structure, format or values
Missing data, default values
Spelling errors, data in wrong fields
Buried information
Data myopia
Data anomalies

Different or Inconsistent Standards

Source    Name Field             Location
Source 1  MARK DI LORENZO        MA93
Source 1  DENIS E. MARIO         CT15
Source 1  TOM & MARY ROBERTS     IL21
Source 2  DILORENZO, MARK        6793
Source 2  MARIO, DENISE          0215
Source 2  ROBERTS, TOM & MARY    8721
Source 3  MARC DILORENZO ESQ     BOSTON
Source 3  MRS DENNIS MARIO       HARTFORD
Source 3  MR & MRS T. ROBERTS    CHICAGO

Missing Data & Default Values

NAME column values:
  Denise Mario DBA
  Marc Di Lorenzo ETAL
  Tom & Mary Roberts
  First Natl Provident
  Astorial Fedrl Savings
  Kevin Cooke, Receiver
  John Doe Trustee for K

SOC. SEC. # column values:
  228-02-1975
  999999999
  025-37-1888
  34-2671434
  101010101
  LN#12-756
  18-7534216
  111111111

TELEPHONE column values:
  6173380300
  3380321
  415-392-2000
  508-466-1200
  212-235-1000
  FAX 528-9825
  5436

Do the field values match the meta data labels?

Buried Information

Legacy Meta Desc. Legacy Record Values

Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTADTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345

NAME 1

ADDRESS 1

ADDRESS 2

ADDRESS 3

ADDRESS 4

ADDRESS 5

The Anomalies Nightmare

CUSNUM    NAME                    ADDRESS                          SALES $
90328574  IBM                     187 N.Pk. Str. Salem NH 01456    8,494.00
90328575  I.B.M. Inc.             187 N.Pk. St. Sarem NH 01456     3,432.00
90238495  International Bus. M.   187 No. Park St Salem NH 04156   2,243.00
90233479  Int. Bus. Machines      187 Park Ave Salem NH 04156      5,900.00
90233489  Inter-Nation Consults   15 Main St. Andover MA 02341     6,800.00
90234889  Int. Bus. Consultants   PO Box 9 Boston MA 02210         10,243.00
90345672  I.B. Manufacturing      Park Blvd. Boston MA 04106       15,999.00

Callouts:
  Spelling Errors
  Anomalies
  No common key
  Lack of Standards

What data challenges do you face?

Acct # Name Address City State Zip Note

5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP

5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111

5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID

87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert

87458 P. Lalonde FBO S.Lalonde 40 Becon Rd. Melrose MA 02 176

No unique key linking records together

Business terms and spillover text

Data entry errors and misspellings

No consistent naming convention

Buried information

Missing values or data in the wrong fields

Common Data Quality Approaches

Analysis and Assessment
  Enterprise-level: Data Quality Assessment (DQA)
  Project-level: Data investigation
Data Re-engineering Methods
  Standardization
  Record linkage/matching
  Consolidation
Information Engineering Methods
  Initial load
  Net change
  Real-time
Ongoing Metrics
  Project-level: Post-Data Quality Assessment (PDQA)
  Enterprise-level: Repeated DQAs to establish trends

Data Re-engineering Methodology

Discover / Investigate
  Understanding the quality of your data and its impact on achieving success
Standardize / Condition
  Standardizing content, structure and meaning of data in preparation for matching and downstream processing
Linkage / Matching
  Identifying and linking duplicate entities or like entities
Consolidate / Survivorship
  Selecting the “best-of-breed” data for downstream processing

Do your data sources contain what you think they do?

Does your data mean what you think it does?

Can you correct and improve the quality of your data?

Can you make the data meaningful to users?

Can you deliver & update the data in a timely manner?

How do you match records with the same meaning?

Which source should you use for this project?

Is your data sent to users based on events or content?

Are you able to keep data synchronized across systems?

Why Investigate

Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context

How to Investigate

• Single domain (character and type)
• Freeform text (Word)

Name – ‘Word’

Frequency   Percent   Pattern  Data Sample
3,533,119   39.332%   FI?      ADINA A /ACEVERO
2,590,837   28.842%   F?       CARMEN /ABRAHANTE
614,006     6.835%    ??       ISHAI /BIRAN
552,579     6.152%    FIF      SCOTT J /ALBERT
331,279     3.688%    FF       JENNIFER /ASUNCION
314,199     3.498%    ?I?      CLAUDVILLE P /ADAMS
154,026     1.715%    FF?      ELIZABETH ANN /ABELA

Summary:
  100% populated
  The top four patterns account for 81% of the populated values
  Contains ‘slash’ that may indicate last name
  Contains full and partial first name

Pattern Legend

Class  Description
?      Unknown
W      Organization Words
N      Salutation
A      Abbreviations
<      Mixed (Leading alpha)
^      Numeric
>      Mixed (Leading Numeric)
@      Complex
I      Initials
F      First Name
other  The character itself. For example: (, ), *, %

What is Standardize

“The revealed patterns drive the conditioning rules.”

Pattern Manipulation: Applying business logic to data chaos.
Standards Definition: Enforcing business standards on data elements.
Field Structuring: Transforming the input to an output which meets the business requirement.

How to standardize

Parsing specific data fields into smaller, lower-level (atomic) data elements
Categorization of identified elements
Separation of Name, Address, and Area from freeform Name & Address lines
Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
Refinement of a data element
  Name = ‘MS GRACY E MATHEWS’ becomes:
    Title = ‘MS’, First Name = ‘GRACY’, Middle Name = ‘E’, Last Name = ‘MATHEWS’
  Part Description = ‘BLK ACER MONITOR’ becomes:
    Color = ‘BLACK’, Type = ‘ACER’, Part = ‘MONITOR’

Why Standardize

Normalize values in data fields to standard values
  Transform First Name = ‘MIKE’ to ‘MICHAEL’
  Transform Title = ‘Doctor’ to ‘Dr’
  Transform Address = ‘ST. Michael Street’ to ‘Saint Michael St.’
  Transform Color = ‘BLK’ to ‘BLACK’
Applying phonetic coding to key words
  NYSIIS
  Soundex
  Typically applied to Name fields (first, last, street, city)
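To illustrate why phonetic coding helps matching, here is a minimal sketch of the standard American Soundex algorithm in Python. It is a generic textbook version, not QualityStage's own implementation, and the function name `soundex` is chosen here for illustration only.

```python
# Letter-to-digit table used by American Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets 'previous' so repeated codes count again
    return (result + "000")[:4]

# Spelling variants from the earlier slides collapse to the same key.
print(soundex("MARC"), soundex("MARK"))      # same code for both spellings
print(soundex("DENISE"), soundex("DENNIS"))  # same code for both spellings
```

Because variants such as MARC/MARK share a code, the phonetic key can be used to group candidate records even when the raw name values differ.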

QualityStage Standardize

Highly flexible pattern recognition language

Field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)

Customizable classification and standardization tables

Utilizes results from data investigation

QualityStage Standardize Example

Name Standardization Example

Each row below shows the Input Name followed by the “Bucketed” Name Information after INTEGRITY Standardization.
Output columns: Name Type, Gdr, Prefix, First, Middle, Last, Suffix, Gen, NYSIIS, Match First, Additional Name

CHESTER FINANCIAL /INC O CHESTER FINANCIAL INC CASAR

MIGUEL A /DEJESUS-VAZQUEZ I M MIGUEL A DEJESUS-VAZQUEZ DAJAS MIGUEL

DEBBIE KOTIN /INSDORF ESQ. I F DEBBIE KOTIN INSDORF ESQ INSDARF DEBORAH

DURAND,RAYMOND J. I M RAYMOND J DURAND DARAN RAYMOND

JOHN FRANCIS /ECKSTEIN IV I M JOHN FRANCIS ECKSTEIN IV ECSAN JOHN

BOB T /HSIEH I M BOB T HSIEH HSAH ROBERT

MOST REV. VINCENT D. /BREEN I M MOST REV VINCENT D BREEN BRAN VINCENT

MISS DOROTHY /MEAGHER I F MISS DOROTHY MEAGHER MAGAR DOROTHY

MINISTER ROXANN D /ROBINSON I F MINISTER ROXANN D ROBINSON RABANSAN ROXANN

LUCIEN D /MOCOMBE MD I M LUCIEN D MOCOMBE MD MACANB LUCIEN

FRANK /MCCORD III I M FRANK MCCORD III MCAD FRANK

JOHN L /HANCOCK 111 I M JOHN L HANCOCK III HANCAC JOHN

ONLINE BANKING /TEST1015 ONLINE BANKING TEST1015

ON LINE BANKING /TEST #1120 ON LINE BANKING TEST 1120

ITA5 /TEST ITA5 TEST

RABBI JEROME M /BLUM I M RABBI JEROME M BLUM BLAN JEROME

FRANCES /WILLIAMS JONES I M FRANCES WILLIAMS JONES JAN FRANCES

QualityStage Standardize Example

Address Standardization Examples

Each row below shows the Input Address followed by the “Bucketed” Address Information after INTEGRITY Standardization.
Output columns: Address House, House Suffix, Pre-Dir., Street Name, Street Type, Suffix Dir., Unit Type, Unit Value, Floor Value, Rte Value, Box Value, NYSIIS Street Name, Additional Address Info

326 W 17 ST 326 W 17TH ST 17T

200 E.27TH STREET APT. 10H 200 E 27TH ST APT 10H 27T

168 FIRST AVE. 168 1ST AVE 1

35 PIERREPONT STREET APT.#3-B 35 PIERREPONT ST APT 3B PARAPAN

76-D LA BONNE VIE II DR 76 D LA BONNE VIE II DR LABANAVY

1560 BROADWAY SUITE 416 1560 BROADWAY STE 416 BRADWY

50 FAIRVIEW DRIVE SOUTH 50 FAIRVIEW DR S FARV

247 DOVER GRN 247 DOVER GRN DAVAR

3530 HENRY HUDSON PKWY E APT 8D 3530 HENRY HUDSON PKWY E APT 8D

2951 W 33 ST APT 3C 2951 W 33RD ST APT 3C 33D

425 E 8TH ST (2ND FLOOR) 425 E 8TH ST 2 8T

305 WEST 98TH ST APT #4AN 305 W 98TH ST APT 4AN 98T

37-06 100 STREET /FIRST FL 37 06 100TH ST 1 100T

ONE FIFTH AVENUE 1 5TH AVE 5T

1 5TH AVE APT 15G 1 5TH AVE APT 15G 5T

P O BOX 2257 666 ANDERSON AVE 666 ANDERSON AVE 2257 ANDARSAN

Match

“Conditioned data and QualityStage’s matching engine link the previously unlinkable.”

Match Construction: Reliability of input data defines a match result.
Statistical Analysis & Match Scoring: Linkage probability determined on a sliding scale by field level comparison.
Report Generation: All business rules applied have easy to understand report structure.

What is Match

Identifying all records on one file that correspond to similar records on another file
Identifying duplicate records in one file
Building relationships between records in multiple files
Performing statistical and probabilistic matching
Calculating a score based on the probability of a match

How to Match

Single file (Unduplication) or two file (Geomatch)
Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison …)
Generation of composite weights from multiple fields
Use of probabilistic or statistical algorithms
Application of match cutoffs or thresholds to identify automatic and clerical match levels
Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)

QualityStage Match

Over 25 match comparison algorithms providing a full spectrum of fuzzy matching functions
Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
Field-by-field comparisons for agreement or disagreement
Assignment of weights or penalties
Overrides for unique data conditions
Score results to determine the probability of matched records
Thresholds for final match determination
Ability to measure informational content of data
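The weighting and threshold ideas above can be sketched in Python using the textbook form of probabilistic record linkage (agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))). Everything concrete here is an assumption made for illustration: the field names, the m and u probabilities, the cutoff values, and the exact-equality comparison. None of this is QualityStage's actual configuration or its comparison algorithms.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records are the same entity)
#   u = P(field agrees | records are different entities)
FIELDS = {"ssn": (0.95, 0.001), "last_name": (0.90, 0.01), "zip": (0.90, 0.05)}

# Hypothetical override: default or missing values carry no information.
DEFAULT_VALUES = {"", "000-00-0000", "999-99-9999"}

# Hypothetical cutoffs separating match, clerical review and nonmatch.
MATCH_CUTOFF = 12.0
CLERICAL_CUTOFF = 4.0

def field_weight(a: str, b: str, m: float, u: float) -> float:
    """Weight contributed by one field comparison (exact equality for simplicity)."""
    if a in DEFAULT_VALUES or b in DEFAULT_VALUES:
        return 0.0
    if a == b:
        return math.log2(m / u)            # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))    # disagreement weight (negative penalty)

def composite_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum the field weights into one composite score for the record pair."""
    return sum(field_weight(rec_a.get(f, ""), rec_b.get(f, ""), m, u)
               for f, (m, u) in FIELDS.items())

def classify(weight: float) -> str:
    if weight >= MATCH_CUTOFF:
        return "match"
    if weight >= CLERICAL_CUTOFF:
        return "clerical"
    return "nonmatch"

a = {"ssn": "092-52-1195", "last_name": "LOFFREDO", "zip": "11204"}
b = {"ssn": "000-00-0000", "last_name": "LOFFREDO", "zip": "11204"}
print(classify(composite_weight(a, a)), classify(composite_weight(a, b)))
```

A field with a rare chance agreement (small u) contributes a large weight when it agrees, which is one way to read "ability to measure informational content of data".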

QualityStage Match Examples

Type Wgt SSN Input Name Input Address Input City St Zip Title Sal. Maiden Name DOB

XA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 SCUNNAIMINA 19640110

DA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 19640100

XA 29.73 999-99-9999 ANGEL A /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 0 19250101

DA 29.73 000-00-0000 ANGEL /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 X 19250101

XA 29.73 058-09-8019 HARRY W /BOGARDS PO BOX 845 PORT WASHINGTON NY 11050-0202 JR VAN GURP 19120920

DA 7.16 058-09-8019 HARRY /BOGAARDS P O BOX 845 PORT WASHINGT NY 110500202 0

XA 19.29 261-60-5676 ADRIAN /GARCIA ROCKEFELLER CENTER P O BOX 1062 NEW YORK NY 10020 19300908

DA 19.29 000-00-0000 ADRIAN /GARCIA P O BOX 1062 ROCKEFELLER CNTR NEW YORK NY 10185 0

XA 62.78 050-36-6598 GLORIA P /LEONNELL 1655 FLATBUSH AVE APT B302 BROOKLYN NY 11210-3271 19460410

DA 33.09 050-36-6598 GLORIA P /LEONNELL-WILLIAMS 1655 FLATBUSH AVE BROOKLYN NY 11210-3276 HILL 19460410

XA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111

DA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111

DA 44.08 000-00-0000 WILLIAM /LOCKLEY 105-16 FLATLANDS 9TH ST BROOKLYN NY 112364624 0

XA 54.42 414-76-9969 MARY /RICHARDSON 651 E 14TH ST NEW YORK NY 10009-3119 19451222

DA 24.73 414-76-9969 MARY P /RICHARDSON GRAY 651 E 14TH ST APT 10G NEW YORK NY 10009-3125 ROBINSON 19451222

What is Survive

Creation of best-of-breed “surviving” data based on record or field level information
Development of cross-reference file of related keys
Production of load exception reports
Creating output formats:
  Relational table with primary and foreign keys
  Transactions to update databases
  Cross-reference files, synonym tables

Why survive

Provide consolidated view of data
Provide consolidated view containing the “best-of-breed” data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys

How to survive

Highly flexible rules
Record or field level survivorship decisions
Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
Rules can incorporate multiple tests
QualityStage features
  Point-and-click (GUI-based) creation of business rules to determine best-of-breed “surviving” data
  Performed at record or field level

QualityStage Survive Examples

Example 1: The longest populated Middle and Last Name

            First Name  Middle Name  Last Name
Matched     MARI                     LEMELSON-LAPPNER
Matched     MARI        S            LEMELSON
Survived    MARI        S            LEMELSON-LAPPNER

Example 2: The longest populated Middle Name, Date of Birth, and SSN

            First Name  Middle Name  Last Name  DOB       SSN
Matched     DENISE                   TRIANO     19580211  98524173
Matched     DENISE      F            TRIANO
Survived    DENISE      F            TRIANO     19580211  98524173
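The "longest populated value" rule used in both examples can be sketched as a field-level survivorship function in Python. The function name `survive_longest` and the dictionary record layout are assumptions for illustration; QualityStage defines such rules through its GUI rather than through code like this.

```python
def survive_longest(records: list[dict], fields: list[str]) -> dict:
    """Build one surviving record by taking, per field, the longest populated value.

    Ties keep the value from the earliest record in the matched group.
    """
    survivor = {}
    for field in fields:
        values = [(rec.get(field) or "") for rec in records]
        survivor[field] = max(values, key=len, default="")
    return survivor

# Matched group from Example 2 above.
matched = [
    {"first": "DENISE", "middle": "", "last": "TRIANO",
     "dob": "19580211", "ssn": "98524173"},
    {"first": "DENISE", "middle": "F", "last": "TRIANO", "dob": "", "ssn": ""},
]
print(survive_longest(matched, ["first", "middle", "last", "dob", "ssn"]))
```

Other rule types named on the slide (frequency, recency, source) would replace the `key=len` selection with a different ranking of the candidate values.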

Data Re-engineering Methods

I. Investigation (analyze free form fields)
II. Conditioning
III. Matching
IV. Survivorship

Data Quality Assessment (DQA)
Data Re-Engineering (DRE)

Exercise 1-1: Course Project

Course business case: WINN Insurance CRM project

See QualityStage Essentials Exercises, page 4

Course Project Design

[Flow diagram; boxes listed in the order extracted:]
Inputs: AUTOHOME, LIFE
Investigate: Assess Data Quality
Standardize Country
Add Unique Key: Append Data to a common format
Investigate Conditioned Results
Apply User Overrides
Identify Duplicate Customer Records
Survive the Best Customer Record
Condition Name, Address and Area
Select US Data for further processing


Module Summary

Five common data quality contaminants
  1. Different standards
  2. Missing and default values
  3. Spillover and buried information
  4. Anomalies
  5. No consolidated view
Approaches to Data Quality
Data re-engineering methods

Introduction to QualityStage

Why QualityStage

Probabilistic record linkage results in highest level of accurate, complete and justifiable match rates

Most flexible parsing/standardization capabilities

Handles complex free-form data

Ability to verify 200+ country addresses allows for global support

Transparent parallelism exploits multiple CPUs which provides unmatched performance and scalability

Bi-directional meta data exchange ensures users understand data

Productivity, connectivity and interoperability via tight integration with DataStage and RTI Services

QualityStage Architecture

[Diagram: QualityStage Designer (Windows) communicates over TCP/IP (FTP) with the QualityStage Server platforms]
QualityStage Server Platforms: OS/390, Windows & NT Server, UNIX
BUILD ONCE, RUN ANYWHERE

QualityStage Designer

Designer
  Client GUI for designing projects
  Windows NT, 2000, XP
  Enter meta data
  Define Stages
  Build Jobs
  Standardization Rules
Designer Repository

Designer - Toolbar

NEW: Project, Data File definition, Data Field definition, Stage, or Job
CUT, COPY, PASTE: Items listed on the right pane of the work area
RUN: The job selected on the right pane
DISPLAY: Change display of right pane to Large icons, Small icons or show Details

Designer - Rule Sets

Pre-defined rules for parsing and standardizing:
  Name
  Address
  Area (City, State and Zip)
Multi-national address processing
Validate structure:
  Tax ID
  US Phone
  Date
  Email
Append ISO country codes
Pre-process or filter name, address and area
Rule sets are stored locally with the Designer (separate from the repository)

Designer Rule Set Options

The name and location are defined in the Designer

– File, Designer Options, Standardize Process Definition Dictionary

QualityStage Server

Deployment modes
  Batch
  Real-time
  Real-time via API
Master Projects Directory
  Project information is deployed to the server
  Project work files are stored on the server in project libraries

Directory Structure

Designer
  QualityStage Designer      C:\Ascential\QualityStageDesigner70
  Designer Repository        C:\Ascential\QualityStageDesigner70\QualityStageDesigner.mdb
  Rule Sets                  C:\Ascential\QualityStageDesigner70

Server
  QualityStage Server        C:\Ascential\QualityStageServer70
  Master Projects Directory  C:\Projects
  Sample Project Directory   C:\Projects\Quality
  Sample Project Results     C:\Projects\Quality\Data

Master Projects Directory

Master projects directory resides on the server

Multiple users can share the same Master Projects and Project directory

All project libraries are stored under the Master Projects directory

Project Libraries

Project libraries are stored under the Master Projects directory

Project Library Description

Ipe_env.sh QualityStage Environment shell

Controls Stage and job control members

Data Location of input and output files

DIC Stage and job dictionary

IPICFG Environment configuration

Logs Location of job run logs

Scripts Job scripts – dependent on the server type

Temp Temp work space

QualityStage Licensed Stages

QualityStage
WAVES
Postal Certification Solutions
  CASS
  SERP
GeoLocator

Exercise 2-1: Configure QualityStage

Configure the Designer for the development server
  Run profile
  Designer Options
Server – Master Projects directory
Designer Options
Starting the QualityStage Server
  During the course
  Development environment

Run Profile

One or multiple profiles

Defines for the Designer the server component location and access

Required:

– Host Type

– Host Server Path

– Master Project Directory

Optional:

– Alternate Locale

– Local Report Data Location

Run Profile: Adv Project Settings

Location of the input and output data files

Location of the control members for each stage and job

Server temporary work location

Logs for each stage and job

Scripts to execute jobs

Run Profile: FTP Settings

If you are connecting to a remote server then you need the login ID and password for the server.

QualityStage Designer Options

Local working temp directory on your local PC

Location of the rule sets

Default location for importing projects

Preferred editor for reviewing rule sets and result file


QualityStage Components
  Architecture
  Communication: Designer and Server use TCP/IP (FTP) to communicate
Configuration
  User Profile
  Designer Options
  Starting the Server
Projects
  Projects are defined in the Designer
  To run and execute jobs, QualityStage Server must be running
  Project libraries are stored on the server

Developing with QualityStage

Module Objectives

Introduce the concepts, components and methods for developing projects in QualityStage

After this module you will be able to:
  Define data files and field definitions
  Build Stages and design Jobs
  Deploy and run Jobs
  Locate and review results

Application Components

QualityStage Application
Project Components
  Stages
  Jobs
  Data File Definitions
    Meta data
    File Name Requirements

Stages

• Abbreviate
• Build
• CASS
• Collapse
• Format Convert
• Investigate
• Match
• Multinational Standardize
• Parse
• Program
• Select
• SERP
• Sort
• Standardize
• Survive
• Transfer
• Unijoin
• WAVES
• Z4changes

** Licensed Stages – additional licensing required

What is a Job?

A job is an executable QualityStage program

Jobs can be run interactively or in batch mode

In this course, jobs will be run interactively under the control of QualityStage Designer

Job Development Overview

Designer
  Import or enter file definitions and meta data defining your sources and targets
  Add stages defining the process or task
  Deploy the job
Server
  Run the job
  Review results

Job Development Process

1. Define data files
   Enter or import meta data
2. Define and build stages
3. Define job
4. Deploy the job
   Move project definitions to project libraries on the server
5. Run the job
6. Review results

Executing a Job: Deploy and Run

[Diagram: QualityStage Designer (Windows) deploys and runs a job script on the QualityStage Server]
Deploy: Transfer project information to the server
Run: Execute the job script on the server

QualityStage Job Run Modes

FILE MODE: Process each record through a job before passing all the records to the next job
DATA STREAM: Process each record and then pass it immediately on to the next job
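The difference between the two run modes can be pictured with a small Python analogy: file mode materializes the full output of one job before the next job starts, while data stream hands each record on as soon as it is processed. The two "jobs" below are hypothetical placeholders that only log their work; this is an analogy, not how the QualityStage Server is implemented.

```python
def job1(record: str, log: list) -> str:
    log.append(("job1", record))
    return record.upper()

def job2(record: str, log: list) -> str:
    log.append(("job2", record))
    return record + "!"

def run_file_mode(records, log):
    # job1 finishes every record (an intermediate "file") before job2 begins.
    intermediate = [job1(r, log) for r in records]
    return [job2(r, log) for r in intermediate]

def run_data_stream(records, log):
    # A generator passes each record to job2 immediately after job1 handles it.
    stage1 = (job1(r, log) for r in records)
    return [job2(r, log) for r in stage1]

file_log, stream_log = [], []
run_file_mode(["a", "b"], file_log)
run_data_stream(["a", "b"], stream_log)
print(file_log)    # all job1 entries first, then all job2 entries
print(stream_log)  # job1 and job2 entries interleaved per record
```

Both modes produce the same final records; only the order of processing and the need for intermediate storage differ.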

Exercise 3-1: Deploy and Run

1. Open the demo project Quality
2. Select a job
3. Select the Run button on the toolbar
4. Uncheck the Deploy box
5. Choose “Execute File Mode”
6. Choose “Run from Start to End”
7. Review project libraries on the server

Data File Formats and Definitions

Data File Names
  One to eight characters
  No spaces or extensions
  File names are uppercase and case-sensitive
Data File Location
  Data folder in project library
Formats
  QualityStage processes fixed record length sequential files
  Alphanumeric characters
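As a quick way to check candidate file names against the naming rules above, here is a small Python sketch. The function name `is_valid_data_file_name` is hypothetical, and treating any "." as an extension is an assumption; the slide does not list which individual characters are permitted.

```python
def is_valid_data_file_name(name: str) -> bool:
    """Check the stated rules: 1 to 8 characters, no spaces, no extension, uppercase."""
    return (
        1 <= len(name) <= 8
        and not any(ch.isspace() for ch in name)
        and "." not in name          # assumption: a dot implies an extension
        and name == name.upper()
    )

for candidate in ["AUTOHOME", "LIFE", "autohome", "AUTO HOME", "LIFE.DAT"]:
    print(candidate, is_valid_data_file_name(candidate))
```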

Exercise 3-2: Define a Project

1. Choose the New icon from the toolbar

2. Choose Project

3. Project Name: WinnCRM

4. Project Description: Winn Insurance CRM Project

5. Choose OK

Defining Meta Data

Data field definitions can be entered or imported into the Designer

Importing options include:
  Cobol copybooks
  ODBC enabled
  MetaStage Metabroker
  Visual Warehouse

Exercise 3-3: Define a Data File

1. Left pane, select Data File Definitions

2. Right pane, right-click, select New File

3. Filename AUTOHOME

4. File: Auto and Home Policies

5. Choose OK

Exercise 3-4: Data Field Definitions


2. Left pane, select AUTOHOME

3. Right pane, right click, select New Field

4. Complete field information

Lab 3-5: Copy Data File and Field Definitions


2. Right pane, select AUTOHOME

3. Right-click, select COPY


5. Right pane, right-click, select PASTE

6. Name File: LIFE

7. Choose OK


Data file definitions
Data file format
Meta data
Jobs and Stages
Run and Deploy
Project Libraries

Investigate and Data Quality Assessment


Describe how the Investigate stage is used to assess data quality in the project life cycle

Identify the three types of Investigate stage
  Character Discrete Investigate
  Character Concatenate Investigate
  Word Investigate

Design Investigate stages and run Investigate jobs

Review and analyze Investigate results

Project Planning & Requirements

Identify Objectives

Data Assessment

Define Development Plan

Define Business Requirements

Define Data Requirements

Requirements

Planning

Application Design Plan





High-Level DFD

AUTOHOME LIFE


Standardize Country





Reject NON US Data

Pre-Process US Data




Data Assessment

Verify the domain
  Review each field and verify the data matches the meta data
Identify data formats, missing and default values
Identify data anomalies
  Format
  Structure
  Content
Discover “unwritten” business rules
Identify data preparation requirements

Investigate Stage

Features
  Analyze free-form and single domain fields
  Provide frequency distributions of distinct values and patterns
Investigate methods
  Character Discrete
  Character Concatenate
  Word

Investigate Methods

Method: Why
Character Discrete: Analyzing field values, formats, and domains
Character Concatenate: Cross-field correlation, checking logic relationships between fields
Word Investigation: Identifying free-form fields that may require parsing and discovery of key words for classification

Investigate Terminology

Field Masks: Options that represent the data. Options: Character (C), Type (T), Skipped (X)
Tokens: Individual units of data

Character Mask  Usage
C               For viewing the actual character values of the data
T               For viewing the pattern of the data
X               For ignoring characters

Field Mask Examples

Token           Mask            Result
02116           CCCCC           02116
02116           CCCXX           021
01832-4480      TTTTTTTTTT      nnnnn-nnnn
XJ2 6EM         TTTTTTT         aanbnaa
(617) 338-0300  CCCCCCCCCCCCCC  (617) 338-0300
617-338-0300    TTTTTTTTTTTT    nnn-nnn-nnnn
6173380300      CCCXXXXXXXXX    617
(617)3380300    CCCXXXXXXXXX    (61
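The field mask examples can be reproduced with a short Python sketch. The output symbols for a Type mask (n for a digit, a for a letter, b for a blank, any other character shown as itself) are inferred from the example results on the slide, and the function name `apply_mask` is hypothetical.

```python
def apply_mask(token: str, mask: str) -> str:
    """Apply a per-character field mask: C keeps the character, T shows its type, X skips it."""
    out = []
    for ch, m in zip(token, mask):  # positions beyond the shorter string are ignored
        if m == "C":
            out.append(ch)
        elif m == "T":
            if ch.isdigit():
                out.append("n")
            elif ch.isalpha():
                out.append("a")
            elif ch == " ":
                out.append("b")
            else:
                out.append(ch)      # punctuation appears as itself
        elif m == "X":
            continue
        else:
            raise ValueError(f"unknown mask character: {m!r}")
    return "".join(out)

print(apply_mask("01832-4480", "T" * 10))
print(apply_mask("XJ2 6EM", "T" * 7))
print(apply_mask("(617)3380300", "CCCXXXXXXXXX"))
```

The last call shows why unformatted data matters: the same CCC mask that isolates an area code in 6173380300 returns "(61" when the value starts with a parenthesis.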

Character Discrete: Field Mask (C)haracter

Usage: Domain quality
  View the contents of each field to verify that the data values match the field labels
Investigate Stage:
  Generates reports for frequency and pattern references
  Report naming conventions:
    jobp.FRQ – Results sorted by frequency, descending order
    jobp.SRT – Results sorted by field mask, ascending order
    job.PAT – Pattern reference file

Character Discrete - Character Results

Field Name  FRQ Count  FRQ %    Field Mask    Sample “Example”
DOB         00000908   45.309%                [X]|
DOB         00000005   0.250%   00000000      [X]| 00000000
DOB         00000004   0.200%   19440225      [X]| 19440225
DOB         00000004   0.200%   19440609      [X]| 19440609
DOB         00000004   0.200%   19460212      [X]| 19460212
POLNUMB     00000001   0.050%   014669402     [X]| 014669402
POLNUMB     00000208   11.00%   617-338-0300  [X]| 617-338-0300
POLNUMB     00000001   0.050%   AM07B002470   [X]| AM07B002470
POLNUMB     00000001   0.050%   AM07B002736   [X]| AM07B002736

[X] indicates a new set of example records

Character Discrete: Field Mask (T)ype

Usage: Data formats (patterns)
  View the format of fields that you suspect may conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.
Generates reports for frequency and pattern references
Report naming conventions:
  jobp.FRQ – Results sorted by frequency, descending order
  jobp.SRT – Results sorted by field mask, ascending order
  job.PAT – Pattern reference file

Exercise 4-1: Character Discrete Investigate

1. Create Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results

Lab 4-1: Character Discrete Investigate – Type T

1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results

Character Concatenate

Usage: Identify Field Relationships
  Investigate one or more fields to uncover any relationship between the field values.
QualityStage Toolkit
  Uses combinations of character masks
  Generates reports for frequency and pattern references
  Report naming conventions:
    jobp.FRQ – Results sorted by frequency, descending order
    jobp.SRT – Results sorted by field mask, ascending order
    job.PAT – Pattern reference file

Character Concatenate Results

DOB and DOD Fields

FRQ Count  FRQ %    Field Mask         Sample / “Example”
00000908   45.309%  bbbbbbbbbbbbbbbb   [X] |
00000020   2.009%   bbbbnnnnbbbbbbbb   [X] | 1904
00001096   54.691%  nnnnnnnnbbbbbbbb   [X] | 06011944

[X] indicates a new set of example records
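A report of this kind can be approximated in Python by concatenating the type pattern of each field and counting the combined patterns. This is a sketch under assumptions: both fields are treated as fixed width 8, the n/a/b symbols follow the masks shown on the slides, and the function names are hypothetical.

```python
from collections import Counter

def type_pattern(value: str, width: int) -> str:
    """Type-mask a fixed-width field: n = digit, a = letter, b = blank, else the character."""
    padded = value.ljust(width)[:width]
    return "".join(
        "n" if c.isdigit() else "a" if c.isalpha() else "b" if c == " " else c
        for c in padded
    )

def concatenate_report(records, widths):
    """Return (count, percent, pattern) rows, most frequent pattern first."""
    counts = Counter(
        "".join(type_pattern(value, width) for value, width in zip(rec, widths))
        for rec in records
    )
    total = sum(counts.values())
    return [(count, round(100 * count / total, 3), pattern)
            for pattern, count in counts.most_common()]

# Hypothetical (DOB, DOD) pairs; blanks stand for unpopulated positions.
records = [("06011944", ""), ("06011944", ""), ("", ""), ("    1904", "")]
for row in concatenate_report(records, widths=(8, 8)):
    print(row)
```

Reading the combined pattern shows at a glance which field combinations occur together, for example a populated DOB with a blank DOD.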

Exercise 4-2: Character Concatenate

1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results

Word Investigate

Usage: Pattern free-form fields and lexical analysis
  To view the pattern of the data within a freeform text field and parse it into individual tokens
QualityStage
  Apply rule sets to free-form fields
  Discover parsing requirements
  Pattern data
  Generates reports for word frequency, pattern frequency distributions, and word classification

Word Investigation Results

Pattern Reports
  ^D?T  639 N MILLS AVE
  ^D?T  306 W MAIN ST
  ^D?T  3142 W CENTRAL AVE
  ^?T   843 HEARD AVE

Word Frequency Reports
  0000000869 ST
  0000000791 RD
  0000000622 STE
  0000000566 AVE

Word Classification Reports
  ABBOTT ABBOTT ? ;0000000001
  ABERCON ABERCON ? ;0000000001
  ABERCORN ABERCORN ? ;0000000007
  ABERDEEN ABERDEEN ? ;0000000001

Rule Sets

Rules for parsing, classifying, and organizing data

Rule Set Domains
  Country processing
  Pre-processing
  Domain Processing
    Name: Business and Personal
    Street Address
    Area: Locality, City, State and Zip/Postal codes
  Multinational Address Processing

Parsing

Parse free-form data with the SEPLIST and a STRIPLIST
  SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
  STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern
The SEPLIST is always applied first

Parsing Example

Example: 120 Main St. N.W.

SEPLIST “¬.” STRIPLIST “¬.”
Token1 Token2 Token3 Token4 Token5
120    Main   St     N      W

SEPLIST “¬” STRIPLIST “¬.”
Token1 Token2 Token3 Token4
120    Main   St     NW

SEPLIST “¬.” STRIPLIST “¬”
Token1 Token2 Token3 Token4 Token5 Token6 Token7 Token8
120    Main   St     .      N      .      W      .
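The three SEPLIST/STRIPLIST combinations above can be reproduced with a small sketch. This is not QualityStage code: the function name `parse` is hypothetical, and a plain space stands in for the ¬ character on the assumption that ¬ marks a blank.

```python
def parse(text, seplist, striplist):
    """Tokenize text: SEPLIST is applied first, then STRIPLIST."""
    tokens = []
    current = ""
    for ch in text:
        if ch in seplist:
            # A separator ends the current token and becomes a token itself.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    # Characters in the STRIPLIST are dropped from the result.
    stripped = ["".join(c for c in tok if c not in striplist) for tok in tokens]
    return [tok for tok in stripped if tok]


address = "120 Main St. N.W."
print(parse(address, " .", " ."))  # five tokens
print(parse(address, " ", " ."))   # four tokens
print(parse(address, " .", " "))   # eight tokens
```

Because the SEPLIST is applied first, a period only survives as its own token when it is in the SEPLIST but not in the STRIPLIST.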

Data Typing: Classifying Tokens

Identify and type the token in terms of its business meaning and value

MASK KEY:
N – Numeric token
A – Alpha token
M – Mixed token

120 Main Street Apt 6C
N   A    A      A   M

PATTERN KEY:
^ – Numeric token
? – Unclassified alpha token
@, <, > – Mixed token
T – Street Type
U – Unit Type

120 Main Street Apt 6C
^   ?    T      U   >
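A minimal sketch of how tokens could be turned into a pattern using the pattern key above. The `CLASSIFICATION` entries and the function names are hypothetical stand-ins for a rule set's classification table; real rule sets carry far more entries and logic.

```python
# Hypothetical classification entries: word -> class
CLASSIFICATION = {
    "STREET": "T", "ST": "T", "AVE": "T",
    "APT": "U", "APARTMENT": "U",
    "N": "D", "W": "D",
}


def classify(token):
    """Return the class for one token."""
    word = token.upper()
    if word in CLASSIFICATION:
        return CLASSIFICATION[word]
    if word.isdigit():
        return "^"  # numeric token
    if word.isalpha():
        return "?"  # unclassified alpha token
    if word.isalnum() and word[0].isdigit():
        return ">"  # mixed token with leading numeric, e.g. 6C
    if word.isalnum() and word[-1].isdigit():
        return "<"  # mixed token with trailing numeric, e.g. A6
    return "@"      # any other mixed token


def input_pattern(text):
    """Build the pattern, collapsing consecutive unclassified alphas into one '?'."""
    classes = []
    for token in text.split():
        cls = classify(token)
        if cls == "?" and classes and classes[-1] == "?":
            continue
        classes.append(cls)
    return "".join(classes)


print(input_pattern("120 Main Street Apt 6C"))
print(input_pattern("639 N MILLS AVE"))
```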

Example: Word Investigate

21 WINGATE STREET APARTMENT 601

1. Parse
2. Classify known words and assign default tags: ^ ? T U ^
3. Produce reports based on patterns and tokens

Pattern report:
^D?T 639 N MILLS AVE
^D?T 306 W MAIN ST
^D?T 3142 W CENTRAL AVE
^?T 843 HEARD AVE

Word frequency report:
0000000869 ST
0000000791 RD
0000000622 STE
0000000566 AVE

Lab 4-3: Word Investigation Address and Area

1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose rule set and field(s)
5. Choose Advanced Options
6. Stage and run job
7. Review report results

Data Quality Assessment

Review and analyze each field for the following information:
How often is the field populated?
What are the anomalies and out-of-range values? How often does each one occur?
How many unique values were found?
What is the distribution of the data or patterns?

Use Investigate results to:
Update business requirements
Define development plan and application design

Quiz

What is domain integrity?
What is the difference between a Type C and a Type T field mask?
When might you use a Type X field mask?
Where can you find the Investigate reports?


DRE Methodology: Data Quality Assessment

Character discrete, concatenate and word investigation

Field Masks: Character (C), Type (T), Ignore (X)

Parsing – SEPLIST, STRIPLIST
Data Classification
Patterns

Data Preparation

Format of data file
Unique record identifier
Common record layout


[Figure: data flow from the AUTOHOME and LIFE sources through Standardize Country, Reject NON US Data, and Pre-Process US Data]




Data File Format

Preferred data file format for QualityStage is:
Fixed record length
Fixed-fielded data
Sequential file with terminated records
Alphanumeric data

QualityStage provides the following features for working with other file formats:
ODBC enabled for pulling/pushing data from/to a table
Unterminated and variable length
Fixed-length unterminated

The Transfer (GTF) stage is used to read in the various formats and output a fixed-record-length terminated file

Unique Record Key

Every record should start the QualityStage process with a unique record key

This key can be created in QualityStage or by other tools like DataStage

The QualityStage Investigate Stage will help validate if a unique key exists

This unique key provides developers with a way to audit each record as it passes through the QualityStage application

The Transfer Stage can be used to create a new key field and populate the new field with a unique value

Common Data Format

Fields identified for processing should be moved forward from each source and appended into a single new source file
Allows for efficiently processing all data in one stream using one set of rules

In QualityStage, appending data files is accomplished with the Transfer (GTF) stage

Transfer Stage (GTF)

Transforms data file formats to fixed-length flat files

Adds new fields
  Assign literal values such as a source indicator
  Generate and assign a sequential value

Reformats record layouts
  Dropping fields

Formats field data
  Case formatting
  Right/left justification
  Right/left fill

Concatenates fields
Appends data files

Add a Record Key

Input Data File (field start,length: 1,25 26,25 52,20 73,2 76,5)
639 N MILLS AVE ORLANDO FL 32803
4275 OWENS RD, STE 536 EVANS GA 30809
ERIN OFFICE PARK DUBLIN GA 31021
600 E OGLETHORPE HWY HINESVILLE GA 31313

Transfer Stage (GTF): Populate Record Key field

Output Data File (field start,length: 1,12 13,25 38,25 64,20 85,2 88,5)
FW000000001 639 N MILLS AVE ORLANDO FL 32803
FW000000002 4275 OWENS RD, STE 536 EVANS GA 30809
FW000000003 ERIN OFFICE PARK DUBLIN GA 31021
FW000000004 600 E OGLETHORPE HWY HINESVILLE GA 31313

Record Key Best Practices

Add a unique record identifier in the QualityStage process or prior to entering QualityStage processing

Create a 12-byte field
The first 2 bytes indicate the source
Positions 3 through 11 store a sequential number
Position 12 is intentionally left blank for training, providing a space between the record key and the data
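The 12-byte key layout described above can be sketched outside QualityStage as follows. The function names are hypothetical; in the course the key is built with the Transfer stage, not with Python.

```python
def make_record_key(source, sequence):
    """Build a 12-byte key: 2-byte source, 9-digit sequence, 1 blank."""
    if len(source) != 2:
        raise ValueError("source indicator must be exactly 2 bytes")
    if not 0 <= sequence <= 999_999_999:
        raise ValueError("sequence must fit in 9 digits")
    return f"{source}{sequence:09d} "


def add_keys(records, source):
    """Prefix each record with a key, numbering from 1."""
    return [make_record_key(source, n) + rec
            for n, rec in enumerate(records, start=1)]


for line in add_keys(["639 N MILLS AVE", "4275 OWENS RD, STE 536"], "FW"):
    print(line)
```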

Append Data Files

The transfer stage can read one input file and produce one output file

To append data, you will need to define a Transfer stage for each file you want to append

Be careful of the order – the first transfer does not generally append – only subsequent transfer stages referencing the same output file should append data

[Figure: AUTOHOME is written to COMBINED by Transfer Stage 1; LIFE is added to COMBINED by Transfer Stage 2 with the append options selected]
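The ordering rule above (the first transfer creates the output, later transfers append to it) behaves like file write and append modes. The sketch below is an analogy only, with hypothetical function names, a made-up record width, and placeholder record contents.

```python
import os
import tempfile


def transfer(records, out_path, width, append):
    """Write fixed-length, newline-terminated records; append only when asked."""
    mode = "a" if append else "w"
    with open(out_path, mode, encoding="ascii", newline="\n") as out:
        for rec in records:
            out.write(rec.ljust(width)[:width] + "\n")


def read_records(path):
    with open(path, encoding="ascii") as handle:
        return handle.read().splitlines()


def demo():
    combined = os.path.join(tempfile.mkdtemp(), "COMBINED")
    transfer(["AH000000001 A"], combined, 16, append=False)  # stage 1 creates the file
    transfer(["LF000000001 B"], combined, 16, append=True)   # stage 2 appends to it
    return read_records(combined)


print(demo())
```

If the second call also used `append=False`, the AUTOHOME records would be overwritten, which is the mistake the slide warns about.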

Exercise 5-1: Add a Record Key and Append Data Files

1. Read in each source of data

2. Define a new output file with a common format

3. Create Transfer Stage 1

4. Create new Record Key field

5. Populate the Record Key field

6. Add AUTOHOME Data to new COMBINED output file

Input: AUTOHOME
Output: COMBINED
Stage name: AHKEY
Stage type: Transfer
Job name: Append

Lab 5-1: Append LIFE to COMBINED Output

1. Create transfer stage

2. Define new record key field

3. Populate the record key field

4. Append LIFE to AUTOHOME in the COMBINED output file

Input: LIFE
Output: COMBINED
Stage name: LFKEY
Job name: Append


QualityStage requires files to be fixed-record-length, terminated records.

The Transfer stage can be used to:
Convert file formats to fixed record length
Add new fields and populate them with literal values or sequential numbers
Append data files
Format fields
Reformat record layout

Standardize

Describe the Standardize Stage in the Data Re-engineering Methodology

Identify Rule Sets
Apply the Standardize Stage
Interpret Standardize results
Investigate unhandled data and patterns

Project Lifecycle: Development

Review Data Flow Diagram
Construct Application
Review & Refine
Standardize Data
Find Duplicate Candidate (Match)
Survive Best of Breed (Survive)

Development: Unit Test

Phases: I Investigation (DQA), II Standardize, III Match, IV Survive







Standardize

Transformation
  Parsing free-form fields
  Comparison threshold for classifying like words
  Bucketing data tokens

Standardization
  Applying standard values and standard formats

Phonetic Coding for use in Matching
  NYSIIS
  Soundex

Standardize Example

Input File:

Address Line 1             Address Line 2
1721 W ELFINDALE ST        UNIT 20
1721 W ELFINDALE ST        # 20
16200 VENTURA BOULEVARD    SUITE 201
C/O JOSEPH C REIFF         12 WESTERN AVE
1705 W St                  PHILADELPHIA
1655 PONCE DE LEON AVENUE  15TH FLOOR

Result File:

Columns: House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value, NYSIIS, Soundex, City

1721 W ELFINDALE AVE UNIT 20
1721 W ELFINDALE AVE 20
16200 VENTURA BLVD
12 WESTERN AVE
1705 W ST PHILADELPHIA
1655 PONCE DE LEON AVE FLOOR 15


Standardize Process

1. Parse
2. Classify and assign default tags: ^ ? T U ^
3. Process patterns and bucket data

Output File

House Number  Street Name  Street Type  Unit Type  Unit Number
21            WINGATE      ST           APT        601

Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type

Standardize Stage

The Standardize Stage uses rule sets for:
Country processing
Pre-domain processing
– USPREP
Domain processing
– USADDR
– USAREA
– USNAME
Multi-national Address: WAVES

Types of Rule Sets

Country Identifier: COUNTRY
Domain Pre-processor: USPREP
Domain Specific: USNAME
Domain Specific: USADDR
Domain Specific: USAREA

Example: Country Identifier

Input Record
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Record
US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET

Example: Domain Pre-Processor

Input Record
Field 1: JIM HARRIS (781) 322-2426
Field 2: 92 DEVIR STREET MALDEN MA 02148

Output Record
Name Domain: JIM HARRIS
Address Domain: 92 DEVIR STREET
Area Domain: MALDEN MA 02148
Other Domain: (781) 322-2426

Example: Domain-Specific

Input Record
100 SUMMER STREET 15TH FLOOR

Output Record
House Number: 100
Street Name: SUMMER
Street Suffix Type: ST
Floor Type: FL
Floor Value: 15
Address Type: S
NYSIIS of Street Name: SANAR
Reverse Soundex of Street Name: R520
Input Pattern: ^+T>U
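The Reverse Soundex value in the output record can be checked with a short sketch of standard American Soundex applied to the reversed word. This is a generic implementation with hypothetical function names, not QualityStage's own routine, and NYSIIS is not covered here.

```python
# Standard Soundex digit groups
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """First letter plus up to three digits, zero padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            previous = code
    return (result + "000")[:4]


def reverse_soundex(word):
    """Soundex of the word spelled backwards."""
    return soundex(word[::-1])


print(soundex("SUMMER"), reverse_soundex("SUMMER"))
```

Reverse Soundex is useful when the error is at the start of a word, because the code is then anchored on the last letter instead of the first.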

Rule Sets

Rule sets contain logic for:
Parsing
Classifying
Processing data by pattern and bucketing data

Three required files
Classification Table
Dictionary File
Pattern Action File

Optional Lookup Tables

Rule Set Files

Classification Table (.CLS): Contains standard abbreviations that identify and classify key words.
Pattern Action File (.PAT): Contains a series of patterns and programming commands to condition the data
Dictionary File (.DCT): Defines the output file fields to store the parsed and conditioned data
Rule Set Description (.PRC): Description file for the Rule Set
Lookup Tables (.CLS): Optional conversion and lookup tables for converting and returning standardized values
Override Tables (.CLS): Tables for storing overrides entered into the Designer GUI

Classification Table

Contains the words for classification, standardized versions of words, and data class
Data class (data tag) is assigned to each data token
Default classes are the same across all rule sets
User-defined classes are assigned in the classification table
  Users may modify, add or delete these classes
  User-defined classes are a single letter

Default Classes

Class Description

^ A single numeric

+ A single unclassified alpha (word)

? One or more consecutive unclassified alphas

@ Complex mixed token, e.g., ½, O’Connell

> Leading numeric, e.g., 6A

< Trailing numeric, e.g. A6

Zero Null class

User-defined Classes

Class Description

USNAME

G Generational, e.g., Senior, I, II

P Prefix, e.g. Dr., Mr., Miss

USADDR

T Street Type

D Directional

B Box Type

USAREA

S State Abbreviation

Comparison Threshold

May be used in the Classification table
Used to efficiently make entries into the classification table
Helps overcome spelling and data entry errors
Not required
Threshold uses a logical string comparator

Threshold level   Meaning
900               Exact match
850               Almost certainly the same
800               Most likely equivalent
750               Most likely not the same
700               Almost certainly not the same
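A rough illustration of how a threshold lets one classification entry catch misspellings. QualityStage's own string comparator is not reproduced here; Python's `difflib.SequenceMatcher` is swapped in and its ratio is scaled to the 0 to 900 range as an assumption, so the scores will not equal real QualityStage scores. Entry values and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(a, b):
    """Scale a 0..1 similarity ratio to 0..900 (900 = exact match)."""
    return round(900 * SequenceMatcher(None, a.upper(), b.upper()).ratio())


def classify_with_threshold(token, entries, default="?"):
    """entries: (word, standard value, class, threshold) tuples."""
    best = None
    for word, standard, cls, threshold in entries:
        score = similarity_score(token, word)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, standard, cls)
    if best is None:
        return token.upper(), default
    return best[1], best[2]


ENTRIES = [("AVENUE", "AVE", "T", 700)]  # hypothetical entry and threshold
print(classify_with_threshold("AVENUE", ENTRIES))
print(classify_with_threshold("AVENEU", ENTRIES))
print(classify_with_threshold("XYZ", ENTRIES))
```

A lower threshold accepts more variants per entry at the cost of more false classifications, which is why the levels in the table above run from exact match down to almost certainly not the same.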

Classification Table Example

;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B
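For readers who want to inspect such a table outside the tool, the entries above can be loaded with a few lines of Python. The loader name is hypothetical, and treating an optional fourth column as the comparison threshold is an assumption based on the Comparison Threshold slide.

```python
import shlex

SAMPLE = '''; USADDR Classification Table
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
'''


def load_classification_table(text):
    """Map each token to (standard value, class, optional threshold)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # keeps "PO BOX" together
        threshold = int(parts[3]) if len(parts) > 3 else None
        table[parts[0]] = (parts[1], parts[2], threshold)
    return table


table = load_classification_table(SAMPLE)
print(table["POB"])
```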

Dictionary File

Defines the field definitions for the output file

When data is moved to these output fields it is called “bucketing” the data

The order that the fields are listed in the dictionary file defines the order the fields are written to the output file

Dictionary file entries are similar to field definitions

Dictionary File Example

;;QualityStage v7.0
\FORMAT\ SORT=N
;-------------------------------------------------------------------------------
; USADDR Dictionary File
;-------------------------------------------------------------------------------
; Total Dictionary Length = 415
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
HN C 10 S HouseNumber ;0001-0010
HS C 10 S HouseNumberSuffix ;0011-0020
PD C 3 S StreetPrefixDirectional ;0021-0023
PT C 20 S StreetPrefixType ;0024-0043
SN C 25 S StreetName ;0044-0068
ST C 5 S StreetSuffixType ;0069-0073
SQ C 5 S StreetSuffixQualifier ;0074-0078
SD C 3 S StreetSuffixDirectional ;0079-0081
RT C 3 S RuralRouteType ;0082-0084
RV C 10 S RuralRouteValue ;0085-0094
BT C 7 S BoxType ;0095-0101
BV C 10 S BoxValue ;0102-0111
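Because the order of dictionary entries defines the output layout, the position comments at the end of each entry can be recomputed from the field lengths. The sketch below assumes the third column is the field length; the function name is hypothetical and the meaning of the other columns is not interpreted.

```python
SAMPLE = """\
\\FORMAT\\ SORT=N
; USADDR Dictionary File
HN C 10 S HouseNumber ;0001-0010
HS C 10 S HouseNumberSuffix ;0011-0020
PD C 3 S StreetPrefixDirectional ;0021-0023
PT C 20 S StreetPrefixType ;0024-0043
SN C 25 S StreetName ;0044-0068
"""


def field_layout(text):
    """Return {field: (start, end)} using 1-based, inclusive positions."""
    layout = {}
    position = 1
    for line in text.splitlines():
        entry = line.split(";", 1)[0].split()  # drop trailing comment
        if len(entry) < 5:
            continue  # header, comment, or blank line
        name, length = entry[0], int(entry[2])
        layout[name] = (position, position + length - 1)
        position += length
    return layout


print(field_layout(SAMPLE))
```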

Pattern Action File

The Pattern-Action file contains the rules for standardization; that is, the actions to execute with a given pattern of tokens

Records are processed from the top down
Written in Pattern Action Language
Complex parsing can be coded in this file

Pattern Action File Example

Street Address: 10 Hollow Oak Road
Pattern: ^ ? T
Pattern Action Language:
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}
Result: {HN} 10, {SN} Hollow Oak, {ST} Rd
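The example above can be mimicked in Python to show what the three actions do. This is only an analogy under stated assumptions: COPY is read as a plain copy, COPY_S as a copy that keeps the spaces between the words of a multi-word operand, and COPY_A as a copy of the standard abbreviation from the classification table. The `STREET_TYPES` entries and function names are hypothetical, and mixed tokens are not handled.

```python
# Hypothetical classification entries: token -> standard abbreviation
STREET_TYPES = {"ROAD": "Rd", "RD": "Rd", "STREET": "St", "ST": "St"}


def operands(text):
    """Group tokens into (class, tokens) operands."""
    ops = []
    for tok in text.split():
        if tok.isdigit():
            cls = "^"
        elif tok.upper() in STREET_TYPES:
            cls = "T"
        else:
            cls = "?"
        if cls == "?" and ops and ops[-1][0] == "?":
            ops[-1][1].append(tok)  # extend the run of unknown words
        else:
            ops.append((cls, [tok]))
    return ops


def standardize(text):
    ops = operands(text)
    pattern = "".join(cls for cls, _ in ops)
    if pattern != "^?T":
        return {"UnhandledPattern": pattern, "UnhandledData": text}
    return {
        "HN": ops[0][1][0],                        # COPY [1] {HN}
        "SN": " ".join(ops[1][1]),                 # COPY_S [2] {SN}
        "ST": STREET_TYPES[ops[2][1][0].upper()],  # COPY_A [3] {ST}
    }


print(standardize("10 Hollow Oak Road"))
```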

Optional Lookup Tables

Called from the Pattern Action File
Rule sets may contain lookup tables such as:
Common First Names and Enhanced First Names
  Barb & Barbara
  Ted & Edward
Gender based on name
State abbreviations
Common city abbreviations
  NYC = New York City
  LA = Los Angeles


Standardize Process

Parse
Classify and assign default tags (Classification Table): ^ ? T U ^
Process patterns and bucket data (Pattern Action File)
Output fields (Dictionary File):

House Number  Street Name  Street Type  Unit Type  Unit Number
21            WINGATE      ST           APT        601

Standardizing International Data

Two methods
Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
  Uses out-of-the-box, included functionality/rules
Method 2: Use Multinational Standardize and WAVES
  Requires purchase of WAVES database

Making WAVES

When working with files containing multinational addresses, QualityStage provides the following tools:
Multinational Address Standardization stage, which standardizes address files at city and street levels
  Functionality available out-of-the-box
WAVES (Worldwide Address Verification and Enhancement System) stage, which standardizes, corrects, verifies and enhances addresses against a country-specific postal reference file
  Requires purchase of WAVES database

What the WAVES Stage Does

After the original input files have been standardized, the WAVES stage performs these main functions:
Corrects – Corrects defects in the input data due to typographical or spelling errors
Verifies – Using probabilistic matching, the WAVES stage tries to match corrected address records against addresses in a country-specific postal reference file
Enhances – Assigns the postal record data to the appropriate fields in the output file, thereby substituting any erroneous and missing input data with verified postal data
Indicates verification result – Shows whether a record has been successfully verified by the WAVES stage
  Overall degree of verification also indicated

WAVES Input File Requirements

Fixed-fielded, fixed record length data files
  Total line length cannot exceed 4096
  Address data must occur within first 3072
Each record must contain
  Country indicator
    Full spelling
    Abbreviation
    2- or 3-byte ISO country code
    Mismatch of country indicator to country- and street-level formats results in data not being standardized and output as unhandled
    – For example, identifier says record is German and address format is that of France
  Unique record identifier (record key)
Use preprocessor to remove any non-address data from the address fields
  c/o
  Attn:
  Department
Multinational Standardize (MNS) stage automatically used as part of WAVES stage processing

WAVES Output

City-level verification
  Correct, enhance and verify city field
  Correct, enhance and verify neighborhood/locality field
  Correct, enhance and verify state/province field
  Verify postal code field (but not correct it)
  Indicate if record has been verified (and to what degree)

Street-level verification
  Correct, enhance and verify the street info field
  Correct and verify postal code
  Indicate the match weight, which shows the degree of certainty of the probabilistic match between the input and reference data

About the Verification Process

Use ISO codes, which are applied during standardization, as critical match fields on all city and street level verification attempts

Try to verify the city, state/province, and postal code are correct based upon the available information in the record
  For example, if no state/province is available, uses postal code to impute the missing information

If the postal code conflicts with the city, WAVES copies the city and province fields from the postal record, but does not change the postal code since WAVES cannot verify which is the correct data

Modifying Standardization Behavior

MNS rules (used by WAVES) can be modified using the override functionality in QualityStage Designer






Country Rule Set

Country Rule set appends the two-byte ISO country code

Input to the country rule set includes:
Street Address
City or locality
State
Zip or Postal code
Country field (if it exists)

Output:
Two-byte ISO country code
Flag identifying explicit or default decision

Exercise 6-1: Country Standardize

1. Define the output file
2. Create the job
3. Define the job details
   Select Country rule set
   Identify fields to be conditioned
   Enter metadata label
4. Run the job
5. Review results

Input: COMBINED
Output: CNTRYOUT
Stage name: CNTRYSTAN
Stage type: Standardize
Rule set: COUNTRY
Job name: STAN






Selecting US Data

The QualityStage Select Stage provides the capability of selecting and/or rejecting records based on a set of values for a field

Selecting or splitting data requiring compound or complex logic may require multiple select stages or a custom rule set

Select Stage

Select Stage accepts one input file and may output multiple files
Accept – records that meet the criteria
Reject – records that do not meet the criteria
Split – both the Accept and Reject file are created

Input and output files have the same layout

Select allows you to choose a field and create a list of values
If a record is equivalent to a value on the list then the record is accepted, else it is rejected
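The accept/reject logic above amounts to testing one fixed-position field against a list of values. A minimal sketch, with a hypothetical function name and the assumption that the two-byte country code sits at the start of each record as in the Country Identifier example:

```python
def select(records, start, length, accept_values):
    """Split records on one fixed-position field (1-based start)."""
    accepted, rejected = [], []
    for rec in records:
        value = rec[start - 1:start - 1 + length].strip()
        if value in accept_values:
            accepted.append(rec)
        else:
            rejected.append(rec)
    return accepted, rejected


records = [
    "US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111",
    "CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0",
    "GB Y 28 GROSVENOR STREET LONDON W1X 9FE",
    "US N 123 MAIN STREET",
]
us_data, non_us_data = select(records, start=1, length=2, accept_values={"US"})
print(len(us_data), len(non_us_data))
```

Records pass through unchanged, which matches the note that input and output files have the same layout.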

Exercise 6-2: Split US From Non-US Data

1. Create output files
   One for Accept records
   One for Reject records
2. Create Select stage
3. Add to STAN job
4. Run Select stage
5. Review results

Input: CNTRYOUT
Accept output: USDATA
Reject output: NONUSDATA
Stage name: SPLIT
Stage type: Select
Job name: STAN

Domain Pre-Processor Rule Sets

Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
For example, if the city, state and zip is found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain

The pre-processor rule set prepares the data for processing by domain specific rule sets

Exercise 6-3: US Prep Rule Set

1. Define the output file
2. Create the job
3. Define the job details
   Select USPREP rule set
   Identify fields to be conditioned
   Enter metadata labels

Input: USDATA
Output: PREPOUT
Stage name: PREPDATA
Rule set: USPREP
Job name: STAN

Domain Rule Sets

Domain rule sets expect only data for that domain as the input

Domain rule sets that come with QualityStage are:
Name
Street address
Area (city, state and zip)






USNAME Rule Set

The USNAME rule set works on both personal names and organization names for US data

Data is parsed into name components
Phonetic coding of the First Name and Primary Name are created for matching

USADDR Rule Set

This rule set is applied to street address fields

The “Address Type” flag identifies different types of addresses
‘S’ Street address
‘B’ Box address
‘R’ Rural route address

Phonetic coding of the Street Name is created for matching

USAREA Rule Set

This rule set is applied to city, state and postal code fields

Data is parsed into city name, state abbreviation, zip code and zip plus four

Phonetic coding of the city name is created for matching

Exercise 6-4: Domain Standardize

1. Define the output file
2. Create the job
3. Define the job details
   Select USNAME, USADDR, USAREA rule sets
   Identify fields to be conditioned

Input: PREPOUT
Output: STANOUT
Stage name: STANALL
Rule set: USNAME, USADDR, USAREA
Job name: STAN

Standardize Results

Business Intelligence fields
  Parsed from the original data, they may be used in matching and generally they are moved to the target system
Matching fields
  Generally these fields are created to help during the match process and are dropped after successful matching
Reporting fields
  Specifically created to help review results of Standardize and recognize handled and unhandled data

Business Intelligence Fields

Intelligent data parsed and bucketed from the input free-form field

USNAME Examples
• Title
• First Name
• Middle Name
• Primary Name
• Generational

USADDR Examples
• HouseNumber
• Directional
• Street Name
• Unit Types
• Box Types
• Unit Values
• Building Names

USAREA Examples
• City
• State
• Zip5
• Zip4

Standardize Matching Fields

Phonetic coding
  NYSIIS
  Reverse NYSIIS
  Soundex
  Reverse Soundex
Hash keys
  First 2 characters of the first five words
Packed keys
  Data concatenated, or packed
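The hash key and packed key ideas above can be illustrated with two one-line functions. The function names are hypothetical, and details such as padding of short words or which characters are removed when packing are assumptions, since the slide does not specify them.

```python
def hash_key(text, words=5, chars=2):
    """First `chars` characters of each of the first `words` words."""
    return "".join(word[:chars] for word in text.upper().split()[:words])


def packed_key(text):
    """All words concatenated with the spaces removed."""
    return "".join(text.upper().split())


print(hash_key("Reiff Funeral Home"))
print(packed_key("Reiff Funeral"))
```

Keys like these give the match process a compact field that still agrees when minor differences occur later in each word.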

Standardize Reporting Fields

Unhandled Pattern: The pattern generated for tokens not processed by the selected rule set.
Unhandled Data: The remaining tokens not processed by the selected rule set.
Input Pattern: The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data: The tokens not processed by the rule set because they represent a data exception.

Best Practice: Investigate Handled Data

Review the business intelligence fields to ensure accurate bucketing of the data

Build a Character Discrete Investigation for each field and review the contents and the format

Build Character Concatenate Investigation to review:
Unhandled Patterns
Unhandled Data
Input Pattern
Input Fields






Exercise 6-5: Investigate NAME Unhandled Patterns and Data

Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields

Field Name   Field Description          Type
UPUSNAM      Unhandled Pattern          C
UDUSNAM      Unhandled Data             X
IPUSNAM      Input Pattern              X
NAME         Original Name field data   X

2. Increase the number of samples to 5

Exercise 6-6: Investigate Address/Area Unhandled Patterns

Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields

Field Name   Field Description   Type
UPUSADD      Unhandled Pattern   C
UDUSADD      Unhandled Data      X
IPUSADD      Input Pattern       X
ADDR1        Address Line 1      X
ADDR2        Address Line 2      X

2. Increase the number of samples to 5


QualityStage comes with pre-defined rule sets that are highly flexible and customizable

Support multi-national address processing
Country rule sets
Pre-processing rule sets for mixed-domain challenges
Domain rule sets
Custom rule sets

Rule Set Overrides

Identify the location of the User Override functionality
Describe the different types of User Overrides
Apply User Overrides
Test User Overrides
Locate the User Override tables








Customizing Rule Sets

A rule set may require modification if some input data is:
Not processed
Incorrectly processed

QualityStage
User Overrides
Rules Analyzer

User Overrides

Provides the user the ability to modify the rule sets

The following types of rule sets can be modified using User Overrides
Domain Pre-processor rule sets
Domain rule sets own Standardize rules
Validation rule sets
Multinational Address Standardize rule sets

There are five types of user overrides relating to: classifications, patterns, and text strings

User overrides are
GUI driven
Stored in separate lookup tables

User Classification Override

Recognized as a keyword and classified
Additional words
  New abbreviation, variation
  Misspelling of a word

User Classifications may override or add:
Original values (Token values)
Standard value
Class

Token Value   Standard Value   Class

Example: Classification Override

Unhandled Data
Input Pattern   Original Data
+,+             HOCHREITER , CAROLYNNE

Override
Add CAROLYNNE as a valid first name to the classification table

Token Value   Standard Value   Class
Carolynne     Carolynne        F

Corrected Pattern
Input Pattern   Original Data
+,F             HOCHREITER , CAROLYNNE

Text Overrides

Allow the user to specify overrides based on an entire text string

Use this override for special cases and specific handling of a string of text

Input Text Overrides
  Applied to the original text string

Unhandled Text Overrides
  Applied to the Unhandled Data field

Example: Input Text Overrides

Input Text
Unhandled Pattern   Input Text
++                  ZACHARIA GELLMAN
++                  TOMMOTHY CABBOTT
++                  REIFF FUNERAL

Override
Input Text Override: REIFF FUNERAL
Move text string to the Primary Name field

Results
Input Pattern   Primary Name
+ +             REIFF FUNERAL

Pattern Overrides

Allow the user to specify overrides based on an entire pattern

Use this override when most or all records should be processed with identical logic

Input Pattern Overrides
  Applied to the original text string

Unhandled Pattern Overrides
  Applied to the Unhandled Data field

Unhandled Pattern Overrides

Unhandled Pattern
Unhandled Pattern   Input Text
+, +                HAYWARD, WINSLOW
+, +                ESHAGHIAN , JOUBI
+, +                BOULDER, CORONA

Override
Unhandled Pattern Override: +, +
Move + to Primary Name
Comma provides context
Move + to First Name

Results
Unhandled Pattern   First Name   Primary Name
+, +                WINSLOW      HAYWARD
+, +                JOUBI        ESHAGHIAN
+, +                CORONA       BOULDER

User Override Precedence

User Classification: Recognize words to classify
Input Text: Modify logic based on the input string
Input Pattern: Modify logic based on the input pattern
Unhandled Text: Modify logic based on the Unhandled data string
Unhandled Pattern: Modify logic based on the unhandled pattern

Rule Set Precedence

UNHANDLED TEXT

INPUT PATTERN

INPUT TEXT

USER CLASSIFICATION

UNHANDLED PATTERN

CLASSIFICATION TABLE

PATTERN ACTION FILE

Rule Set Override Process

1. Enter override
2. Apply override
3. Test override with the Rules Analyzer
4. Repeat steps 1 through 3 for all desired overrides

Exercise 7-1: Name Rule Set User Override

Review the unhandled NAME patterns in the INUPNAMp.frq report
Apply NAME overrides
Test NAME overrides
Re-run the STAN Job to re-produce the new output file with the overrides applied

Exercise 7-2: Address and Area Overrides

Review the Investigation reports of unhandled Address and Area data
Apply User Overrides to unhandled data
Test the override
Re-run the STAN Job to re-produce the new output file with the overrides applied


There are five types of user overrides
User overrides can be applied to:
The classification table
Input text
Input patterns
Unhandled text
Unhandled patterns

Overrides are applied in a specific order
The Standardize Rules Analyzer can be used to test and review user overrides

Match

Describe where Match fits in the Data Re-engineering Methodology

Describe QualityStage Match concepts
Define the type of matching algorithms
Describe the importance of blocking
Apply multiple match passes to increase efficiency/efficacy
Interpret and improve match results




[Figure: Data Re-engineering Methodology diagram, phases I to IV. Recoverable labels: (DQA); AUTO, HOME, LIFE; Standardize Country; Reject Non-US Data; Pre-Process US Data]




Match Stage

Statistically-based method for determining matches

Over 24 match comparison algorithms providing a full spectrum of fuzzy matching functions

Ability to measure informational content of data

Identify duplicate entities within one or more files

Array matching
Match wizards and templates
Critical field settings

What Constitutes a Good Match?

Which of the following record pairs is a match? And how do you know?

W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST

W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62

WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824

Do you compare all the shared or common fields? Do you give partial credit? Are some fields (or some values) more important to you than others? Why?

Do more fields increase your confidence? By how much? What is enough?

The Value of Information Content

Information content measures the significance of one field over another (Discriminating Value)
  A Gender Code contributes less information than a Tax-Id Number
Information content also measures the significance of one value in a field over another (Frequency)
  In a First-Name Field, JOHN contributes less information than DWEZEL
Significance is determined by a value’s reliability and its ability to discriminate; both can be calculated from your data

Distribution of Weights

The weighted score is a relative measure of the probability of a match

Thresholds defined can be used for automated processing

[Figure: histogram of the number of pairs (0 to 4000) against the weight of comparisons (-20 to 40, from less confidence to more confidence). Non-matches and matches form two distributions, with a grey area between them.]

WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40

Weights

Measures the information content of a data value

Each field contributes to the confidence (probability) of a match

Types of Weights

If a field matches, the agreement weight is used
  Agreement weight is a positive value
If a field doesn’t match, the disagreement weight is used
  Disagreement weight is a negative value
Partial weight is assigned for non-exact or “fuzzy” matches
Missing values have a default weight of zero
Weights for all field comparisons are summed to form a composite weight
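The summing step can be sketched in a few lines of Python. This is only an illustration: the per-field weights are the ones shown in the WILLIAM J HOLDEN example, and the function name `composite_weight` is a hypothetical helper, not a QualityStage call.

```python
# Per-field weights from the HOLDEN example: agreement weights are
# positive, disagreement weights are negative.
field_weights = [1, 1, 17, 2, 4, -1, 7, 9]


def composite_weight(weights):
    """Sum per-field weights; a missing value (None) contributes zero."""
    return sum(w for w in weights if w is not None)


total = composite_weight(field_weights)
print(total)  # 40
```

A field with a missing value is passed as `None` here, which mirrors the default weight of zero.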

Matching Terminology

Informational Content: measures the significance of one field value over another
Weight: measures the informational content of a data value
Composite Weight: measures the confidence of a match
Match Cutoffs: distinguish matches from non-matches
False Positives: records with a score above the high cutoff that really aren’t a match
False Negatives: records below the low cutoff that really are a match

Measuring the Conditions of Uncertainty

Reliability of the data in a given field
  Estimated as the probability that the field agrees given the record pair is a match
Probability of a random agreement of values
  Estimated as the probability the field agrees given the record pair is not a match

Reliability (M-Probability)

Approximated as 1 - error rate for the given field

The higher the m-probability, the larger the penalty (the more negative the disagreement weight) when the field does not match, since the data is considered reliable

Chance Agreement (U-Probability)

The u-probability can be approximated as the probability that a field agrees at random (by chance)

QualityStage uses a frequency analysis to determine the probability of chance agreement for all values

Rare values bring more weight to a match
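A rough sketch of why rare values carry more weight, in Python. The assumptions are mine, not the product's documented calculation: the u-probability of a value is approximated by its relative frequency in the field, the m-probability is fixed at 0.9, and the sample first names are made up.

```python
import math
from collections import Counter


def agreement_weights(values, m_prob=0.9):
    """Per-value agreement weight log2(m/u), with u taken as the
    relative frequency of the value (an assumed approximation)."""
    counts = Counter(values)
    total = len(values)
    return {v: math.log2(m_prob / (c / total)) for v, c in counts.items()}


# Hypothetical frequency distribution of a First-Name field.
first_names = ["JOHN"] * 50 + ["MARY"] * 30 + ["ANN"] * 19 + ["DWEZEL"]
weights = agreement_weights(first_names)

for name in ("JOHN", "DWEZEL"):
    print(name, round(weights[name], 2))
```

Under these assumptions, agreement on DWEZEL scores far higher than agreement on JOHN because DWEZEL is much less likely to agree by chance.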

Calculating Weights

Agreement weight is estimated as: log2(m/u)

Disagreement weight is estimated as: log2((1-m)/(1-u))

M (m-prob) = .9

U (u-prob) = .01

Agreement weight: log2(.9/.01) = 6.49

Disagreement weight: log2((1-.9)/(1-.01)) = -3.31
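The two formulas can be checked with a short Python sketch. The function names are illustrative only; the m and u values are the ones from the slide.

```python
import math


def agreement_weight(m, u):
    """log2(m/u): weight added when the field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """log2((1-m)/(1-u)): weight added (negative) when the field disagrees."""
    return math.log2((1 - m) / (1 - u))


m, u = 0.9, 0.01
print(round(agreement_weight(m, u), 2))     # 6.49
print(round(disagreement_weight(m, u), 2))  # -3.31
```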

Blocking

Grouping together like records that have a high-probability of producing matches

Only “like” records are compared to each other making the match more efficient and computationally feasible

Records in a “block” match exactly on one to several blocking fields

Blocking Example: Sample Data

NYSIIS LNAME | NAME | ADDRESS | ZIP
YANG | YUNG , WAYNE D | 9000 SHEPARD DRIVE | 78753
GARAS | GEROSA, FRAN X | 29 AARONS CT | 06877
YANG | YOUNG , JONATHAN A | 1767 TOBEY ROAD | 30341
GARAS | GERISA, FRANCIS | 29 AARONS CT | 06877
GARAS | GEROSA, FRANCIS XAVIER | 29 AARONS COURT | 06877
MATAC | MARCUS MATIC | 100 SUMMER STREET | 02111
GARAS | GEROSA, MARY | 29 AARONS CT | 06877
JANCAN | RENEE JENKINS | 100 SUMMER STREET | 02111
YANG | YOUNG THERESA C | 1767 TOBEY ROAD | 30341

Block on NYSIIS of Last Name

YANG | YUNG , WAYNE D | 9000 SHEPARD DRIVE | 78753
YANG | YOUNG , JONATHAN A | 1767 TOBEY ROAD | 30341
GARAS | GERISA, FRANCIS | 29 AARONS CT | 06877
MATAC | MARCUS MATIC | 100 SUMMER STREET | 02111

Blocking Example

NYSIIS | NAME | ADDRESS | ZIP
YANG | YOUNG , WAYNE D | 9000 SHEPARD DRIVE | 78753
YANG | YOUNG , JONATHAN A | 4220 BELLE PARK DR | 77072
GARAS | GARISA, FRANCIS | 29 AARONS CT | 06877
MATAC | MARCUS MATIC | 100 SUMMER STREET | 02111

Blocks with only one record are considered residuals
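As a minimal sketch of the idea, the Python below groups the sample records by their blocking key and counts how many pairwise comparisons remain. The NYSIIS codes are copied from the sample data rather than computed, and `block_records` is a hypothetical helper, not part of QualityStage.

```python
from collections import defaultdict

# (NYSIIS of last name, name, ZIP) copied from the sample data.
records = [
    ("YANG", "YUNG , WAYNE D", "78753"),
    ("GARAS", "GEROSA, FRAN X", "06877"),
    ("YANG", "YOUNG , JONATHAN A", "30341"),
    ("GARAS", "GERISA, FRANCIS", "06877"),
    ("GARAS", "GEROSA, FRANCIS XAVIER", "06877"),
    ("MATAC", "MARCUS MATIC", "02111"),
    ("GARAS", "GEROSA, MARY", "06877"),
    ("JANCAN", "RENEE JENKINS", "02111"),
    ("YANG", "YOUNG THERESA C", "30341"),
]


def block_records(rows, key_index=0):
    """Group rows whose blocking field matches exactly."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key_index]].append(row)
    return dict(blocks)


def pair_count(n):
    """Number of distinct pairs among n records."""
    return n * (n - 1) // 2


blocks = block_records(records)
residuals = [key for key, rows in blocks.items() if len(rows) == 1]
blocked_pairs = sum(pair_count(len(rows)) for rows in blocks.values())

print(residuals)                  # ['MATAC', 'JANCAN']
print(blocked_pairs)              # 9
print(pair_count(len(records)))   # 36 comparisons without blocking
```

Only records inside the same block are compared, so the nine sample records need 9 comparisons instead of 36.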

Balance Scope and Accuracy

Balance the scope and accuracy to compare a reasonable amount of “like” records

Accuracy: the quality of the candidate records

Scope: the number of records

Blocking Strategy

Choose fields with reliable data
Choose fields with a good distribution of values
Combinations of fields may be used

Examples of Blocking Strategies

Zip code for matching addresses
NYSIIS of last name for matching individuals
Brand name for matching products
Combination of zip code and NYSIIS of street name for matching addresses
Combination of NYSIIS of last name and first letter of first name for matching individuals

Blocking Summary

Blocking groups together “like” records
Matching is more efficient for small block sizes
  Blocks should be between 100 and 200 records
Blocking fields must match exactly for a candidate set to be created/evaluated

Match Types

Unduplication
  Identifies duplicate candidates in one file
Match (Two File)
  One-to-one correspondence
  For every record on File A we expect to find a match to one record on File B
Geomatch (Two File)
  Many-to-one correspondence
  More than one record on File A can match to the same record on File B

Comparing Data Values

Different comparisons for different data
Over 24 comparison methods
Most common:
  CHAR (character comparison): compares character by character, left to right
  UNCERT (character uncertainty): tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters
  CNT_DIFF: counts keying errors in numeric fields; you set a tolerance threshold
  NAME_UNCERT: can be used to compare character values; if the strings are different lengths, the shorter of the two lengths is used

Exercise 8-1: Undup Match

1. Define the output file
2. Define the Match Stage
3. Define the pass
   • Choose blocking fields
4. Choose fields to compare and comparison method
5. Build the Match Extract
6. Create the Pass

Match Output Files

Match Extract: contains the raw match results, including the WEIGHT, the TYPE of match of the records, and the SETID
Match Report: includes matched records and summary statistics
Match Statistics Report: contains the histogram, tables of weights, and summary statistics

Match Extract

SETID | TYPE | PASS | WEIGHT | ALL_OF_THE_DATA (rest of fields)
393 | XA | 1 | 55.32 | MICHAEL F DOHERTY
393 | DA | 1 | 41.36 | MICHAEL F DOUGHERTY
468 | XA | 1 | 50.40 | EUGENE B BOROWITZ
468 | DA | 1 | 24.01 | BOROWITZ FAMILY TRUST
468 | DA | 1 | 47.26 | GENE BOROWITZ
520 | XA | 1 | 52.75 | FRAN X GEROSA
520 | DA | 1 | 40.95 | FRANCIS XAVIER GEROSA
520 | DA | 1 | 52.75 | FRANCIS X GEROSA
520 | DA | 1 | 41.22 | FRANK X GEROSA
1035 | RA | 1 | | DARRYL F LINDBERG

Custom Extract Specification

MOVE @SET
MOVE " "
MOVE @TYPE
MOVE @PASS
MOVE " "
MOVE @WGT
MOVEALL OF A
MOVE " "

Output positions called out on the slide: SetID 1-9, TYPE 11-12, PASS 13; remaining callouts: 10, 14, 15-21, 22, 23

This is an example of a common match extract specification

It should match the output file defined in the previous exercise

The data is moved to the output file according to these commands

Exercise 8-2: Custom Match Extract

1. Select Extract Type
2. Select Output File
3. Enter Extract commands

Match Improvement Strategy

1. Set critical values for important fields
2. Review calculated weights
   • Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes

Critical Fields

Used to identify fields that must agree in order for records to be linked
  Critical: field values must agree exactly or the records cannot be linked (considered a match)
  Critical Missing OK: field values must agree exactly on values not considered “missing values”
QualityStage feature: VARTYPE

Weight Overrides

Allows you to adjust the agreement and/or disagreement weights for specific situations
  Add to calculated weight
  Replace weight

Exercise 8-3: Critical Vartypes

1. Modify the Stage
2. Modify the Pass
3. Add additional Match fields
4. Re-run the Match Job
5. Review Results

Cutoffs

There are two cutoffs
  Match cutoff (high cutoff)
  Clerical cutoff (low cutoff)

Records with a weight equal to or above the Match cutoff are considered matches

Records with a weight below the low cutoff are not matches

Records with a weight greater than or equal to the low cutoff and less than the high cutoff are considered clerical records for manual review

Cutoffs can be set at the same value eliminating clerical records
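The cutoff rules above can be sketched as a small Python function. The function name and the cutoff values 20 and 12 are made up for illustration; they are not recommended settings.

```python
def classify(weight, match_cutoff, clerical_cutoff):
    """Classify a composite weight against the high and low cutoffs."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical (low) cutoff cannot exceed match (high) cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "non-match"


# Hypothetical cutoffs: match at 20 and above, clerical from 12 up to 20.
for w in (27.82, 14.45, 5.0):
    print(w, classify(w, match_cutoff=20, clerical_cutoff=12))
```

Setting both cutoffs to the same value leaves no weight that can fall into the clerical range.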

Setting the Match Cut-off

Weights | Data fields
27.82 | PO BOX 930202
27.82 | PO BOX 930202
27.82 | PO BOX 930202
(Definite Match)

38.65 | 35 COLLIER RD NW STE 610
38.65 | 35 COLLIER RD NW STE 610
(Definite Match)

25.81 | 928 S 1ST ST
14.45 | S 1ST ST
(Questionable Match)

Exercise 8-4: Set Match Cutoffs

1. Modify the Match Stage
2. Modify Pass 1
3. Set Cutoffs
4. Re-run the Match Job
5. Review Results

Multiple Match Passes

Additional passes are helpful in overcoming data errors and missing values in block fields

You should always create at least two match passes

Change blocking strategies for each pass

Pass 1 blocked on street name

Pass 2 found additional matched records in which the street name was different but the names were the same

Example: Multiple Match Passes

Pass | Weights | Data fields
1 | 26.31 | JASON BIRCH 1350 WALTON WAY 30901
1 | 26.31 | JASON BIRSH 1350 WALTON WAY 30901
1 | 20.42 | JOHN SMITH 2047 PRINCE AVE 30604
1 | 10.83 | MARY SMITH 2047 PRINCE AVE 30604
1 | RES A | JOHN SMITH P.O. BOX 123 30604
2 | 20.42 | JOHN SMITH 2047 PRINCE AVE 30604
2 | 10.19 | JOHN SMITH P.O. BOX 123 30604

Exercise 8-5: Add Match Pass 2

1. Modify the Match Stage
2. Add a new Pass
3. Choose Block Fields
4. Choose Match Fields
5. Run Job
6. Review Results


Three types of matches
  Undup
  Match
  Geomatch
Block to group together like records
  Only like records are compared, adding computational efficiency
Over 24 match comparisons
Critical fields
Match cutoffs
Multiple passes

Survive


Describe where Survive is in the Data Re-engineering Methodology
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive stage




[Figure: Data Re-engineering Methodology diagram, phases I to IV. Recoverable labels: (DQA); AUTO, HOME, LIFE; Standardize Country; Reject Non-US Data; Pre-Process US Data]




Survive Stage

Point-and-click creation of business rules to determine “surviving” data; the user decides how to survive data

Performed at record or field level, which makes it very flexible

Creates a single, consolidated record containing the “best-of-breed” data

Cross-populates best available data
Creates a cross-reference key
Provides consolidated view of the data

Survive Example

Survive Input (Match Output)

Group | Legacy | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit No.
1 | D150 | Bob | | Dixon | 1500 | SE | ROSS CLARK | CIR |
1 | A1367 | Robert | | Dickson | 1500 | | ROSS CLARK | CIR |
23 | D689 | William | A | Obrian | 5901 | SW | 74TH | ST | STE 202
23 | A436 | Billy | Alex | O’Brian | 5901 | SW | 74TH | ST |
23 | D352 | William | | Obrian | 5901 | | 74 | ST | # 202

Survived Consolidated Output

Group | Legacy | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit No.
1 | D150 | Robert | | Dickson | 1500 | SE | ROSS CLARK | CIR |
23 | D689 | William | Alex | O’Brian | 5901 | SW | 74TH | ST | STE 202

Cross-Reference File

Group | Legacy
1 | D150
1 | A1367
23 | D689
23 | A436
23 | D352

Survive Rules

A rule contains a condition and a set of target fields
  When the condition is met, the field becomes a candidate for the “best”
  All records in a group are tested against the condition
  The “best” populates the target fields
Multiple targets are permitted for the same rule

Survive Rules

Custom Rule
  Build your own logical expression
  Comparison (=, !=, <, >, <=, >=)
  Logical (and, or, not)
  Indicate the current and best records with the following notation:
    c.field indicates the current
    b.field indicates the best
Parentheses ( ) can be used for grouping complex conditions
String literals are enclosed in double quotation marks, such as “MARS”.
A semicolon (;) terminates a rule.

Building Survive Rules

Survive Rules Definition screen lets you easily build, delete and manage survivor rules

Survive Techniques

Pre-defined Techniques
  Source
  Recency
  Frequency
  Most complete (longest string)
User-specified logic

Target Fields

Fields you want to write to the output file
Populated based on meeting the conditions of the survivor rule(s)
Fields not listed as targets are excluded from the output file
May have multiple targets for each rule

Example: Complex Survive Rule

The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.

The prefix b. indicates the current “best” record

The prefix c. indicates the current record being tested against the survivor rule

FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;

In this rule, FIELD3 before the colon is the target and the expression after it is the condition
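The behaviour of this rule can be imitated in Python to see which value survives within a match group. This is a sketch under assumptions: the function name and sample records are made up, and letting a later qualifying record replace an earlier one is my assumption, not documented Survive behaviour.

```python
def survive_field3(group):
    """Return the surviving FIELD3 for a group of matched records.

    Mirrors the condition
        (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)
    Assumption: a later qualifying record replaces the current best.
    """
    best = None
    for current in group:
        field3_ok = len(current["FIELD3"].strip()) >= 5
        field1_ok = len(current["FIELD1"].strip()) > 0
        if field3_ok and field1_ok:
            best = current
    return best["FIELD3"] if best is not None else None


# Hypothetical match group.
group = [
    {"FIELD1": "A", "FIELD3": "ABC"},        # FIELD3 shorter than 5
    {"FIELD1": "  ", "FIELD3": "ABCDEFG"},   # FIELD1 empty after trimming
    {"FIELD1": "X", "FIELD3": "HELLO"},      # meets both conditions
]
print(survive_field3(group))  # HELLO
```

If no record in the group meets the condition, the sketch returns `None`, meaning the target is left unpopulated.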

Exercise 9-1: Survive the Best Customer Record

1. Define the output file
2. Define Survive stage
3. Choose target fields
4. Define Survive rules
5. Deploy and run
6. Review results


Consolidate or survive the best record by choosing the best record or best field from multiple records

Use pre-defined techniques or build your own

May use multiple rules
