49508437 Quality Stage Wipro
mohd-sufiyan-ansari -
Transcript of 49508437 Quality Stage Wipro
Session 1: QualityStage Essentials
Objectives
Data Quality
Introduction to QualityStage
Developing with QualityStage
Investigate and Data Quality Assessment
Data Preparation
Standardize
Rule Set Overrides
Match
Survive
Data Migration Challenges
[Bar chart; axis: Percent, 0 to 40. Categories: Data Quality, Legacy Data Scrubbing, Managing End-user Expectations, Business/Data Modeling, Managing Mgmt Expectations, Business Rule Analysis, Managing Metadata]
Data Quality Increases ROI
Better decision making
Improved marketing accuracy and scope
Increased knowledge of customers
Improved inventory and asset management
Improved risk analysis, auditing and reporting
Data Quality
There are two significant definitions of data quality:
Inherent Data Quality
Correctness or accuracy of data: the degree to which data accurately reflects the real-world object that it represents
Pragmatic Data Quality
The value that accurate data has in supporting the work of the enterprise
Data that does not help the enterprise accomplish its mission has no quality, no matter how accurate it is
Data Quality Challenges
Different or inconsistent standards in structure, format or values
Missing data, default values
Spelling errors, data in wrong fields
Buried information
Data myopia
Data anomalies
Different or Inconsistent Standards
           Name Field             Location
Source 1   MARK DI LORENZO        MA93
           DENIS E. MARIO         CT15
           TOM & MARY ROBERTS     IL21
Source 2   DILORENZO, MARK        6793
           MARIO, DENISE          0215
           ROBERTS, TOM & MARY    8721
Source 3   MARC DILORENZO ESQ     BOSTON
           MRS DENNIS MARIO       HARTFORD
           MR & MRS T. ROBERTS    CHICAGO
Missing Data & Default Values
NAME: Denise Mario DBA; Marc Di Lorenzo ETAL; Tom & Mary Roberts; First Natl Provident; Astorial Fedrl Savings; Kevin Cooke, Receiver; John Doe Trustee for K
SOC. SEC. #: 228-02-1975; 999999999; 025-37-1888; 34-2671434; 101010101; LN#12-756; 18-7534216; 111111111
TELEPHONE: 6173380300; 3380321; 415-392-2000; 508-466-1200; 212-235-1000; FAX 528-9825; 5436
Do the field values match the meta data labels?
Buried Information
Legacy Meta Desc.: NAME 1, ADDRESS 1, ADDRESS 2, ADDRESS 3, ADDRESS 4, ADDRESS 5
Legacy Record Values: Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTADTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345
The Anomalies Nightmare
CUSNUM    NAME                   ADDRESS                          SALES $
90328574  IBM                    187 N.Pk. Str. Salem NH 01456    8,494.00
90328575  I.B.M. Inc.            187 N.Pk. St. Sarem NH 01456     3,432.00
90238495  International Bus. M.  187 No. Park St Salem NH 04156   2,243.00
90233479  Int. Bus. Machines     187 Park Ave Salem NH 04156      5,900.00
90233489  Inter-Nation Consults  15 Main St. Andover MA 02341     6,800.00
90234889  Int. Bus. Consultants  PO Box 9 Boston MA 02210         10,243.00
90345672  I.B. Manufacturing     Park Blvd. Boston MA 04106       15,999.00
Callouts: Spelling Errors; Anomalies; No common key; Lack of Standards
What data challenges do you face?
Acct # Name Address City State Zip Note
5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP
5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111
5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID
87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert
87458 P. Lalonde FBO S.Lalonde 40 Becon Rd. Melrose MA 02 176
No unique key linking records together
Business terms and spillover text
Data entry errors and misspellings
No consistent naming convention
Buried information
Missing values or data in the wrong fields
Common Data Quality Approaches
Analysis and Assessment
– Enterprise-level: Data Quality Assessment (DQA)
– Project-level: Data investigation
Data Re-engineering Methods
– Standardization
– Record linkage/matching
– Consolidation
Information Engineering Methods
– Initial load
– Net change
– Real-time
Ongoing Metrics
– Project-level: Post-Data Quality Assessment (PDQA)
– Enterprise-level: Repeated DQAs to establish trends
Data Re-engineering Methodology
Discover / Investigate: Understanding the quality of your data and its impact on achieving success
Standardize / Condition: Standardizing content, structure and meaning of data in preparation for matching and downstream processing
Linkage / Matching: Identifying and linking duplicate entities or like entities
Consolidate / Survivorship: Selecting the “best-of-breed” data for downstream processing
Do your data sources contain what you think they do?
Does your data mean what you think it does?
Can you correct and improve the quality of your data?
Can you make the data meaningful to users?
Can you deliver & update the data in a timely manner?
How do you match records with the same meaning?
Which source should you use for this project?
Is your data sent to users based on events or content?
Are you able to keep data synchronized across systems?
Why Investigate
Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context
How to Investigate
• Single domain (character and type)
• Freeform text (Word)
Name – ‘Word’
Frequency   Percent   Pattern  Data Sample
3,533,119   39.332%   FI?      ADINA A /ACEVERO
2,590,837   28.842%   F?       CARMEN /ABRAHANTE
614,006      6.835%   ??       ISHAI /BIRAN
552,579      6.152%   FIF      SCOTT J /ALBERT
331,279      3.688%   FF       JENNIFER /ASUNCION
314,199      3.498%   ?I?      CLAUDVILLE P /ADAMS
154,026      1.715%   FF?      ELIZABETH ANN /ABELA
Summary:
100% populated
The top four patterns account for 81% of the populated values
Contains ‘slash’ that may indicate last name
Contains full and partial first name
Pattern Legend
Class  Description
?      Unknown
W      Organization Words
N      Salutation
A      Abbreviations
<      Mixed (Leading alpha)
^      Numeric
>      Mixed (Leading Numeric)
@      Complex
I      Initials
F      First Name
other  The character itself. For example: (, ), *, %
What is Standardize
“The revealed patterns drive the conditioning rules.”
Pattern Manipulation: Applying business logic to data chaos.
Standards Definition: Enforcing business standards on data elements.
Field Structuring: Transforming the input to an output which meets the business requirement.
How to standardize
Parsing specific data fields into smaller, lower-level (atomic) data elements
Categorization of identified elements
Separation of Name, Address, and Area from freeform Name & Address lines
Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
Refinement of a data element
Name = ‘MS GRACY E MATHEWS’ becomes: Title = ‘MS’, First Name = ‘GRACY’, Middle Name = ‘E’, Last Name = ‘MATHEWS’
Part Description = ‘BLK ACER MONITOR’ becomes: Color = ‘BLACK’, Type = ‘ACER’, Part = ‘MONITOR’
Why Standardize
Normalize values in data fields to standard values
– Transform First Name = ‘MIKE’ to ‘MICHAEL’
– Transform Title = ‘Doctor’ to ‘Dr’
– Transform Address = ‘ST. Michael Street’ to ‘Saint Michael St.’
– Transform Color = ‘BLK’ to ‘BLACK’
Applying phonetic coding to key words
– NYSIIS
– Soundex
– Typically applied to Name fields (first, last, street, city)
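As a rough illustration of the two ideas on this slide, the sketch below normalizes a first name through a lookup table and then applies phonetic coding. It uses the standard American Soundex algorithm as a stand-in; it is not QualityStage's own implementation, and NYSIIS is not shown. The `NICKNAMES` table and the function names are assumptions made for this example (the three entries mirror name transformations that appear in this session).

```python
# Hypothetical nickname table; entries mirror examples from this session.
NICKNAMES = {"MIKE": "MICHAEL", "BOB": "ROBERT", "DEBBIE": "DEBORAH"}

# Standard American Soundex consonant groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def standardize_first_name(name):
    """Map a nickname to its standard form; unknown names pass through."""
    key = name.strip().upper()
    return NICKNAMES.get(key, key)


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(result) + "000")[:4]
```

With this sketch, `soundex("MARK")` and `soundex("MARC")` produce the same code, which is the property that makes phonetic codes useful when names are spelled inconsistently across sources.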
QualityStage Standardize
Highly flexible pattern recognition language
Field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
Customizable classification and standardization tables
Utilizes results from data investigation
QualityStage Standardize Example
Name Standardization Example
Input Name “Bucketed” Name Information after INTEGRITY Standardization Name Type Gdr Prefix First Middle Last Suffix Gen NYSIIS Match First Additional Name
CHESTER FINANCIAL /INC O CHESTER FINANCIAL INC CASAR
MIGUEL A /DEJESUS-VAZQUEZ I M MIGUEL A DEJESUS-VAZQUEZ DAJAS MIGUEL
DEBBIE KOTIN /INSDORF ESQ. I F DEBBIE KOTIN INSDORF ESQ INSDARF DEBORAH
DURAND,RAYMOND J. I M RAYMOND J DURAND DARAN RAYMOND
JOHN FRANCIS /ECKSTEIN IV I M JOHN FRANCIS ECKSTEIN IV ECSAN JOHN
BOB T /HSIEH I M BOB T HSIEH HSAH ROBERT
MOST REV. VINCENT D. /BREEN I M MOST REV VINCENT D BREEN BRAN VINCENT
MISS DOROTHY /MEAGHER I F MISS DOROTHY MEAGHER MAGAR DOROTHY
MINISTER ROXANN D /ROBINSON I F MINISTER ROXANN D ROBINSON RABANSAN ROXANN
LUCIEN D /MOCOMBE MD I M LUCIEN D MOCOMBE MD MACANB LUCIEN
FRANK /MCCORD III I M FRANK MCCORD III MCAD FRANK
JOHN L /HANCOCK 111 I M JOHN L HANCOCK III HANCAC JOHN
ONLINE BANKING /TEST1015 ONLINE BANKING TEST1015
ON LINE BANKING /TEST #1120 ON LINE BANKING TEST 1120
ITA5 /TEST ITA5 TEST
RABBI JEROME M /BLUM I M RABBI JEROME M BLUM BLAN JEROME
FRANCES /WILLIAMS JONES I M FRANCES WILLIAMS JONES JAN FRANCES
QualityStage Standardize Example
Address Standardization Examples
Input Address, followed by “Bucketed” Address Information after INTEGRITY Standardization
Output columns: Address, House, House Suffix, Pre-Dir., Street Name, Street Type, Suffix Dir., Unit Type, Unit Value, Floor Value, Rte Value, Box Value, NYSIIS Street Name, Additional Address Info
326 W 17 ST 326 W 17TH ST 17T
200 E.27TH STREET APT. 10H 200 E 27TH ST APT 10H 27T
168 FIRST AVE. 168 1ST AVE 1
35 PIERREPONT STREET APT.#3-B 35 PIERREPONT ST APT 3B PARAPAN
76-D LA BONNE VIE II DR 76 D LA BONNE VIE II DR LABANAVY
1560 BROADWAY SUITE 416 1560 BROADWAY STE 416 BRADWY
50 FAIRVIEW DRIVE SOUTH 50 FAIRVIEW DR S FARV
247 DOVER GRN 247 DOVER GRN DAVAR
3530 HENRY HUDSON PKWY E APT 8D 3530 HENRY HUDSON PKWY E APT 8D
2951 W 33 ST APT 3C 2951 W 33RD ST APT 3C 33D
425 E 8TH ST (2ND FLOOR) 425 E 8TH ST 2 8T
305 WEST 98TH ST APT #4AN 305 W 98TH ST APT 4AN 98T
37-06 100 STREET /FIRST FL 37 06 100TH ST 1 100T
ONE FIFTH AVENUE 1 5TH AVE 5T
1 5TH AVE APT 15G 1 5TH AVE APT 15G 5T
P O BOX 2257 666 ANDERSON AVE 666 ANDERSON AVE 2257 ANDARSAN
Match
“Conditioned data and QualityStage’s matching engine link the previously unlinkable.”
Match Construction: Reliability of input data defines a match result.
Statistical Analysis & Match Scoring: Linkage probability determined on a sliding scale by field level comparison.
Report Generation: All business rules applied have easy to understand report structure.
What is Match
Identifying all records on one file that correspond to similar records on another file
Identifying duplicate records in one file
Building relationships between records in multiple files
Performing statistical and probabilistic matching
Calculating a score based on the probability of a match
How to Match
Single file (Unduplication) or two file (Geomatch)
Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison …)
Generation of composite weights from multiple fields
Use of probabilistic or statistical algorithms
Application of match cutoffs or thresholds to identify automatic and clerical match levels
Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)
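The composite-weight and cutoff ideas can be sketched in a few lines, assuming the usual probabilistic record linkage formulation in which each field has an m probability (the field agrees given a true match) and a u probability (the field agrees by chance). The field names, m/u values, cutoffs and default-value list below are all hypothetical; QualityStage's actual comparison algorithms and weight calculations are not reproduced here.

```python
import math

# Hypothetical per-field probabilities: (m, u).
FIELD_PROBS = {
    "ssn": (0.95, 0.001),
    "last_name": (0.90, 0.01),
    "zip": (0.85, 0.05),
}

# Hypothetical override: default or missing values carry no information.
DEFAULT_VALUES = {"", "000000000", "999999999"}


def field_weight(m, u, agrees):
    """Agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def composite_weight(rec_a, rec_b):
    """Sum the field weights; fields holding default values contribute 0."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a = rec_a.get(field, "")
        b = rec_b.get(field, "")
        if a in DEFAULT_VALUES or b in DEFAULT_VALUES:
            continue
        total += field_weight(m, u, a == b)
    return total


def classify(weight, match_cutoff=12.0, clerical_cutoff=5.0):
    """Apply two cutoffs: automatic match, clerical review, or nonmatch."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"
```

In this sketch a pair that agrees on every field lands above the match cutoff, while a pair whose SSN is a default value but whose other fields agree falls into the clerical band, which is the kind of case the override bullet describes.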
QualityStage Match
Over 25 match comparison algorithms providing a full spectrum of fuzzy matching functions
Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
Field-by-field comparisons for agreement or disagreement
Assignment of weights or penalties
Overrides for unique data conditions
Score results to determine the probability of matched records
Thresholds for final match determination
Ability to measure informational content of data
QualityStage Match Examples
Type Wgt SSN Input Name Input Address Input City St Zip Title Sal. Maiden Name DOB
XA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 SCUNNAIMINA 19640110
DA 29.73 092-52-1195 JEROME /LOFFREDO PO BOX 40206 BROOKLYN NY 11204-0206 19640100
XA 29.73 999-99-9999 ANGEL A /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 0 19250101
DA 29.73 000-00-0000 ANGEL /MOUMDJIAN PO BOX 16 ALPINE NJ 07620-0016 X 19250101
XA 29.73 058-09-8019 HARRY W /BOGARDS PO BOX 845 PORT WASHINGTON NY 11050-0202 JR VAN GURP 19120920
DA 7.16 058-09-8019 HARRY /BOGAARDS P O BOX 845 PORT WASHINGT NY 110500202 0
XA 19.29 261-60-5676 ADRIAN /GARCIA ROCKEFELLER CENTER P O BOX 1062 NEW YORK NY 10020 19300908
DA 19.29 000-00-0000 ADRIAN /GARCIA P O BOX 1062 ROCKEFELLER CNTR NEW YORK NY 10185 0
XA 62.78 050-36-6598 GLORIA P /LEONNELL 1655 FLATBUSH AVE APT B302 BROOKLYN NY 11210-3271 19460410
DA 33.09 050-36-6598 GLORIA P /LEONNELL-WILLIAMS 1655 FLATBUSH AVE BROOKLYN NY 11210-3276 HILL 19460410
XA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 62.78 053-52-8625 WILLIAM /LOCKLEY 10516 FLATLANDS 9TH ST BROOKLYN NY 11236-4624 BROWN 19571111
DA 44.08 000-00-0000 WILLIAM /LOCKLEY 105-16 FLATLANDS 9TH ST BROOKLYN NY 112364624 0
XA 54.42 414-76-9969 MARY /RICHARDSON 651 E 14TH ST NEW YORK NY 10009-3119 19451222
DA 24.73 414-76-9969 MARY P /RICHARDSON GRAY 651 E 14TH ST APT 10G NEW YORK NY 10009-3125 ROBINSON 19451222
What is Survive
Creation of best-of-breed “surviving” data based on record or field level information
Development of cross-reference file of related keys
Production of load exception reports
Creating output formats:
– Relational table with primary and foreign keys
– Transactions to update databases
– Cross-reference files, synonym tables
Why survive
Provide consolidated view of data
Provide consolidated view containing the “best-of-breed” data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys
How to survive
Highly flexible rules
Record or field level survivorship decisions
Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
Rules can incorporate multiple tests
QualityStage features
– Point-and-click (GUI-based) creation of business rules to determine best-of-breed “surviving” data
– Performed at record or field level
QualityStage Survive Examples
Example 1: The longest populated Middle and Last Name
Matched:
First Name  Middle Name  Last Name
MARI                     LEMELSON-LAPPNER
MARI        S            LEMELSON
Survived:
First Name  Middle Name  Last Name
MARI        S            LEMELSON-LAPPNER
Example 2: The longest populated Middle Name, Date of Birth, and SSN
Matched:
First Name  Middle Name  Last Name  DOB       SSN
DENISE                   TRIANO     19580211  98524173
DENISE      F            TRIANO
Survived:
First Name  Middle Name  Last Name  DOB       SSN
DENISE      F            TRIANO     19580211  98524173
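The “longest populated value” rule used in these examples can be sketched as a field-level survivorship function. This is only an illustration of the rule, not the QualityStage Survive stage itself; the function name, the dictionary record layout and the field keys are assumptions made for the example.

```python
def survive_longest(records, fields):
    """Build one surviving record by taking, for each field,
    the longest populated value found among the matched records."""
    survivor = {}
    for field in fields:
        # Treat missing or None values as empty strings.
        values = [(rec.get(field) or "").strip() for rec in records]
        # max() keeps the first value when several share the longest length.
        survivor[field] = max(values, key=len, default="")
    return survivor


# Matched records from Example 2 (field keys are hypothetical).
matched = [
    {"first": "DENISE", "middle": "", "last": "TRIANO",
     "dob": "19580211", "ssn": "98524173"},
    {"first": "DENISE", "middle": "F", "last": "TRIANO",
     "dob": "", "ssn": ""},
]
survived = survive_longest(matched, ["first", "middle", "last", "dob", "ssn"])
```

Rules based on recency, source or frequency would replace the `key=len` choice with a different ranking, and several tests could be combined in one key function.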
Data Re-engineering Methods
I. Investigation (analyze free form fields)
II. Conditioning
III. Matching
IV. Survivorship
Data Quality Assessment (DQA)
Data Re-Engineering (DRE)
Exercise 1-1: Course Project
Course business case: WINN Insurance CRM project
See QualityStage Essentials Exercises, page 4
Course Project Design
[Flow diagram. Inputs: AUTOHOME, LIFE. Boxes shown: Investigate: Assess Data Quality; Standardize Country; Add Unique Key; Append Data to a common format; Investigate Conditioned Results; Apply User Overrides; Identify Duplicate Customer Records; Survive the Best Customer Record; Condition Name, Address and Area; Select US Data for further processing]
Module Summary
Five common data quality contaminants
1. Different standards
2. Missing and default values
3. Spillover and buried information
4. Anomalies
5. No consolidated view
Approaches to Data Quality
Data re-engineering methods
Introduction to QualityStage
Why QualityStage
Probabilistic record linkage results in highest level of accurate, complete and justifiable match rates
Most flexible parsing/standardization capabilities
Handles complex free-form data
Ability to verify 200+ country addresses allows for global support
Transparent parallelism exploits multiple CPUs which provides unmatched performance and scalability
Bi-directional meta data exchange ensures users understand data
Productivity, connectivity and interoperability via tight integration with DataStage and RTI Services
QualityStage Architecture
[Architecture diagram: QualityStage Designer (Windows), “BUILD ONCE, RUN ANYWHERE”, connected over TCP/IP (FTP) to the QualityStage Server Platforms: OS/390, Windows & NT Server, UNIX]
QualityStage Designer
Designer Client
– GUI for designing projects
– Windows NT, 2000, XP
– Enter meta data
– Define Stages
– Build Jobs
– Standardization Rules
Designer Repository
Designer - Toolbar
NEW: Project, Data File definition, Data Field definition, Stage, or Job
CUT, COPY, PASTE: Items listed on the right pane of the work area
RUN: The job selected on the right pane
DISPLAY: Change display of right pane to Large icons, Small icons or show Details
Designer - Rule Sets
Pre-defined rules for parsing and standardizing:
– Name
– Address
– Area (City, State and Zip)
Multi-national address processing
Validate structure:
– Tax ID
– US Phone
– Date
– Email
Append ISO country codes
Pre-process or filter name, address and area
Rule sets are stored locally with the Designer (separate from the repository)
Designer Rule Set Options
The name and location are defined in the Designer
– File, Designer Options, Standardize Process Definition Dictionary
QualityStage Server
Deployment modes
– Batch
– Real-time
– Real-time via API
Master Projects Directory
– Project information is deployed to the server
– Project work files are stored on the server in project libraries
Directory Structure
Designer
QualityStage Designer: C:\Ascential\QualityStageDesigner70
Designer Repository: C:\Ascential\QualityStageDesigner70\QualityStageDesigner.mdb
Rule Sets: C:\Ascential\QualityStageDesigner70
Server
QualityStage Server: C:\Ascential\QualityStageServer70
Master Projects Directory: C:\Projects
Sample Project Directory: C:\Projects\Quality
Sample Project Results: C:\Projects\Quality\Data
Master Projects Directory
Master projects directory resides on the server
Multiple users can share the same Master Projects and Project directory
All project libraries are stored under the Master Projects directory
Project Libraries
Project libraries are stored under the Master Projects directory
Project Library Description
Ipe_env.sh QualityStage Environment shell
Controls Stage and job control members
Data Location of input and output files
DIC Stage and job dictionary
IPICFG Environment configuration
Logs Location of job run logs
Scripts Job scripts – dependent on the server type
Temp Temp work space
QualityStage Licensed Stages
QualityStage WAVES
Postal Certification Solutions
– CASS
– SERP
GeoLocator
Exercise 2-1: Configure QualityStage
Configure the Designer for the development server
Run profile
Designer Options
Server – Master Projects directory
Designer Options
Starting the QualityStage Server
During the course
Development environment
Run Profile
One or multiple profiles
Defines for the Designer the server component location and access
Required:
– Host Type
– Host Server Path
– Master Project Directory
Optional:
– Alternate Locale
– Local Report Data Location
Run Profile: Adv Project Settings
Location of the input and output data files
Location of the control members for each stage and job
Server temporary work location
Logs for each stage and job
Scripts to execute jobs
Run Profile: FTP Settings
If you are connecting to a remote server then you need the login ID and password for the server.
QualityStage Designer Options
Local working temp directory on your local PC
Location of the rule sets
Default location for importing projects
Preferred editor for reviewing rule sets and result file
Module Summary
QualityStage Components
– Architecture
– Communication: Designer and Server use TCP/IP (FTP) to communicate
Configuration
– User Profile
– Designer Options
– Starting the Server
Projects
– Projects are defined in the Designer
– To run and execute jobs, QualityStage Server must be running
– Project libraries are stored on the server
Developing with QualityStage
Module Objectives
Introduce the concepts, components and methods for developing projects in QualityStage
After this module you will be able to:
– Define data files and field definitions
– Build Stages and design Jobs
– Deploy and run Jobs
– Locate and review results
Application Components
QualityStage Application
Project Components
Stages
Jobs
Data File Definitions
Meta data
File Name Requirements
Stages
• Abbreviate
• Build
• CASS
• Collapse
• Format Convert
• Investigate
• Match
• Multinational Standardize
• Parse
• Program
• Select
• SERP
• Sort
• Standardize
• Survive
• Transfer
• Unijoin
• WAVES
• Z4changes
** Licensed Stages – additional licensing required
What is a Job?
A job is an executable QualityStage program
Jobs can be run interactively or in batch mode
In this course, jobs will be run interactively under the control of QualityStage Designer
Job Development Overview
Designer
– Import or enter file definitions and meta data defining your sources and targets
– Add stages defining the process or task
– Deploy the job
Server
– Run the job
– Review results
Job Development Process
1. Define data files
– Enter or import meta data
2. Define and build stages
3. Define job
4. Deploy the job
– Move project definitions to project libraries on the server
5. Run the job
6. Review results
Executing a Job: Deploy and Run
[Diagram: QualityStage Designer (Windows), Deploy & Run, Job Script, QualityStage Server]
Deploy: Transfer project information to the server
RUN: Execute the job script on the server
QualityStage Job Run Modes
FILE MODE: Process each record through a job before passing all the records to the next job
DATA STREAM: Process each record and then pass it immediately on to the next job
Exercise 3-1: Deploy and Run
1. Open the demo project Quality
2. Select a job
3. Select the Run button on the toolbar
4. Uncheck the Deploy box
5. Choose “Execute File Mode”
6. Choose “Run from Start to End”
7. Review project libraries on the server
Data File Formats and Definitions
Data File Names
– One to eight characters
– No spaces or extensions
– File names are uppercase and case-sensitive
Data File Location
– Data folder in project library
Formats
– QualityStage processes fixed record length sequential files
– Alphanumeric characters
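The data file naming rules can be checked before a definition is entered. The function below is a minimal sketch of those three rules only; its name is hypothetical, and it reads “no extensions” as “no dot in the name”, which is an assumption rather than a documented QualityStage check.

```python
def is_valid_data_file_name(name):
    """Check a data file name against the rules on this slide:
    one to eight characters, no spaces, no extension, uppercase."""
    if not 1 <= len(name) <= 8:
        return False
    if any(ch.isspace() for ch in name):
        return False
    if "." in name:  # assumption: an extension always involves a dot
        return False
    return name == name.upper()
```

Under this sketch, names such as AUTOHOME and LIFE pass, while a lowercase name or one carrying an extension is rejected.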
Exercise 3-2: Define a Project
1. Choose the New icon from the toolbar
2. Choose Project
3. Project Name: WinnCRM
4. Project Description: Winn Insurance CRM Project
5. Choose OK
Defining Meta Data
Data field definitions can be entered or imported into the Designer
Importing options include:
– COBOL copybooks
– ODBC enabled
– MetaStage Metabroker
– Visual Warehouse
Exercise 3-3: Define a Data File
1. Left pane, select Data File Definitions
2. Right pane, right-click, select New File
3. Filename AUTOHOME
4. File: Auto and Home Policies
5. Choose OK
Exercise 3-4: Data Field Definitions
1. Left pane, select Data File Definitions
2. Left pane, select AUTOHOME
3. Right pane, right click, select New Field
4. Complete field information
Lab 3-5: Copy Data File and Field Definitions
1. Left pane, select Data File Definitions
2. Right pane, select AUTOHOME
3. Right-click, select COPY
4. Left pane, select Data File Definitions
5. Right pane, right-click, select PASTE
6. Name File: LIFE
7. Choose OK
Module Summary
Data file definitions
Data file format
Meta data
Jobs and Stages
Run and Deploy
Project Libraries
Investigate and Data Quality Assessment
Module Objectives
Describe how the Investigate stage is used to assess data quality in the project life cycle
Identify the three types of Investigate stage
– Character Discrete Investigate
– Character Concatenate Investigate
– Word Investigate
Design Investigate stages and run Investigate jobs
Review and analyze Investigate results
Project Planning & Requirements
Identify Objectives
Data Assessment
Define Development Plan
Define Business Requirements
Define Data Requirements
Requirements
Planning
Application Design Plan
Data Re-engineering Methods
I. Investigation (analyze free form fields)
II. Conditioning
III. Matching
IV. Survivorship
Data Quality Assessment (DQA)
Data Re-Engineering (DRE)
High-Level DFD
[Data flow diagram. Inputs: AUTOHOME, LIFE. Boxes shown: Investigate: Assess Data Quality; Standardize Country; Add Unique Key; Append Data to a common format; Apply User Overrides; Identify Duplicate Customer Records; Survive the Best Customer Record; Reject NON US Data; Pre-Process US Data; Select US Data for further processing; Condition Name, Address and Area; Investigate Conditioned Results]
Data Assessment
Verify the domain
Review each field and verify the data matches the meta data
Identify data formats, missing and default values
Identify data anomalies
– Format
– Structure
– Content
Discover “unwritten” business rules
Identify data preparation requirements
Investigate Stage
Features
– Analyze free-form and single domain fields
– Provide frequency distributions of distinct values and patterns
Investigate methods
– Character Discrete
– Character Concatenate
– Word
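A frequency distribution of distinct values is simple to picture in code. The sketch below only illustrates the idea and is not the Investigate stage or its report format; the function name and the tuple layout of the rows are assumptions.

```python
from collections import Counter


def frequency_report(values):
    """Return (value, count, percent) rows for the distinct values,
    sorted by descending count and then by value."""
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    rows = [(value, n, round(100.0 * n / total, 3))
            for value, n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical sample of a date-of-birth field with blanks and a default.
sample = ["", "", "", "19440225", "00000000", "19440225"]
report = frequency_report(sample)
```

Blank and default values such as 00000000 surface at the top of a report like this, which is how an investigation exposes them.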
Investigate Methods
Method: Why
Character Discrete: Analyzing field values, formats, and domains
Character Concatenate: Cross-field correlation, checking logic relationships between fields
Word Investigation: Identifying free-form fields that may require parsing and discovery of key words for classification
Investigate Terminology
Tokens: Individual units of data
Field Masks: Options that represent the data. Options: Character (C), Type (T), Skipped (X)
Character Mask  Usage
C               For viewing the actual character values of the data
T               For viewing the pattern of the data
X               For ignoring characters
Field Mask Examples
Token           Mask            Result
02116           CCCCC           02116
02116           CCCXX           021
01832-4480      TTTTTTTTTT      nnnnn-nnnn
XJ2 6EM         TTTTTTT         aanbnaa
(617) 338-0300  CCCCCCCCCCCCCC  (617) 338-0300
617-338-0300    TTTTTTTTTTTT    nnn-nnn-nnnn
6173380300      CCCXXXXXXXXX    617
(617)3380300    CCCXXXXXXXXX    (61
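The Token / Mask / Result examples for field masks can be reproduced with a small function. This is a sketch inferred from those examples, not QualityStage code: it assumes that under the T mask a digit becomes `n`, a letter becomes `a`, a blank becomes `b`, and any other character is kept as itself, and that positions beyond the shorter of the value and the mask are dropped. The function names are hypothetical.

```python
def type_char(ch):
    """Map one character to its type symbol (T mask)."""
    if ch.isdigit():
        return "n"
    if ch.isalpha():
        return "a"
    if ch == " ":
        return "b"
    return ch  # punctuation is shown as the character itself


def apply_mask(value, mask):
    """Apply a field mask position by position:
    C keeps the character, T shows its type, X skips it."""
    out = []
    for ch, m in zip(value, mask):
        if m == "C":
            out.append(ch)
        elif m == "T":
            out.append(type_char(ch))
        elif m == "X":
            continue
        else:
            raise ValueError("unknown mask character: " + m)
    return "".join(out)
```

Mixing mask characters in one mask, for example `CCCTTTT`, would show the first three characters literally and the rest as a pattern.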
Character Discrete: Field Mask (C)haracter
Usage: Domain quality
View the contents of each field to verify that the data values match the field labels
Investigate Stage:
Generates reports for frequency and pattern references
Report naming conventions:
jobp.FRQ – Results sorted by frequency, descending order
jobp.SRT – Results sorted by field mask, ascending order
job.PAT – Pattern reference file
Character Discrete - Character Results
Columns: Field Name, FRQ Count, FRQ %, Field Mask, Sample “Example”
DOB      00000908  45.309%                [X]|
DOB      00000005   0.250%  00000000      [X]| 00000000
DOB      00000004   0.200%  19440225      [X]| 19440225
DOB      00000004   0.200%  19440609      [X]| 19440609
DOB      00000004   0.200%  19460212      [X]| 19460212
POLNUMB  00000001   0.050%  014669402     [X]| 014669402
POLNUMB  00000208   11.00%  617-338-0300  [X]| 617-338-0300
POLNUMB  00000001   0.050%  AM07B002470   [X]| AM07B002470
POLNUMB  00000001   0.050%  AM07B002736   [X]| AM07B002736
[X] indicates a new set of example records
Character Discrete: Field Mask (T)ype
Usage: Data formats (patterns)
View the format of fields that you suspect may follow or conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.
Generates reports for frequency and pattern references
Report naming conventions:
jobp.FRQ – Results sorted by frequency, descending order
jobp.SRT – Results sorted by field mask, ascending order
job.PAT – Pattern reference file
Exercise 4-1: Character Discrete Investigate
1. Create Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Lab 4-1: Character Discrete Investigate – Type T
1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Character Concatenate
Usage: Identify Field Relationships
Investigate one or more fields to uncover any relationship between the field values.
QualityStage Toolkit
Uses combinations of character masks
Generates reports for frequency and pattern references
Report naming conventions:
jobp.FRQ – Results sorted by frequency, descending order
jobp.SRT – Results sorted by field mask, ascending order
job.PAT – Pattern reference file
Character Concatenate Results
DOB and DOD Fields
Columns: FRQ Count, FRQ %, Field Mask, Sample / “Example”
00000908  45.309%  bbbbbbbbbbbbbbbb  [X] |
00000020   2.009%  bbbbnnnnbbbbbbbb  [X] | 1904
00001096  54.691%  nnnnnnnnbbbbbbbb  [X] | 06011944
[X] indicates a new set of example records
Exercise 4-2: Character Concatenate
1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose field(s) and mask options
5. Stage and run job
6. Review report results
Word Investigate
Usage: Pattern free-form fields and lexical analysis
To view the pattern of the data within a freeform text field and parse it into individual tokens
QualityStage
– Apply rule sets to free-form fields
– Discover parsing requirements
– Pattern data
– Generates reports for word frequency, pattern frequency distributions, and word classification
Word Investigation Results
Pattern Reports
^D?T 639 N MILLS AVE
^D?T 306 W MAIN ST
^D?T 3142 W CENTRAL AVE
^?T 843 HEARD AVE
Word Frequency Reports
0000000869 ST
0000000791 RD
0000000622 STE
0000000566 AVE
Word Classification Reports
ABBOTT ABBOTT ? ;0000000001
ABERCON ABERCON ? ;0000000001
ABERCORN ABERCORN ? ;0000000007
ABERDEEN ABERDEEN ? ;0000000001
Rule Sets
Rules for parsing, classifying, and organizing data
Rule Set Domains
– Country processing
– Pre-processing
– Domain Processing: Name (Business and Personal); Street Address; Area (Locality, City, State and Zip/Postal codes)
– Multinational Address Processing
Parsing
Parse free-form data with the SEPLIST and the STRIPLIST
SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern
The SEPLIST is always applied first
Parsing Example
Example: 120 Main St. N.W. (¬ is taken here to denote a space)
SEPLIST "¬." STRIPLIST "¬." gives 5 tokens: 120 | Main | St | N | W
SEPLIST "¬" STRIPLIST "¬." gives 4 tokens: 120 | Main | St | NW
SEPLIST "¬." STRIPLIST "¬" gives 8 tokens: 120 | Main | St | . | N | . | W | .
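The interaction of the two lists can be illustrated with a minimal sketch. It is not the QualityStage parser; the function name `parse` is hypothetical, and a plain space stands in for the ¬ symbol.

```python
def parse(text: str, seplist: str, striplist: str) -> list[str]:
    """Split text into tokens using a SEPLIST, then drop STRIPLIST characters."""
    tokens = []
    current = []
    # Step 1: the SEPLIST is applied first. Each separator ends the current
    # token and becomes a token itself.
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    # Step 2: characters in the STRIPLIST are removed; tokens left empty vanish.
    stripped = ("".join(c for c in tok if c not in striplist) for tok in tokens)
    return [tok for tok in stripped if tok]
```

For example, `parse("120 Main St. N.W.", " .", " ")` keeps each period as its own token, while adding the period to the STRIPLIST removes it from the result.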
Data Typing: Classifying Tokens
Identify and type the token in terms of its business meaning and value
MASK KEY:
N – Numeric token
A – Alpha token
M – Mixed token
120 Main Street Apt 6C
N A A A M
PATTERN KEY:
^ – Numeric token
? – Unclassified alpha token
@, <, > – Mixed token
T – Street Type
U – Unit Type
120 Main Street Apt 6C
^ ? T U >
Example: Word Investigate
21 WINGATE STREET APARTMENT 601
1. Parse
2. Classify known words and assign default tags: ^ ? T U ^
3. Produce reports based on patterns and tokens (pattern, word frequency, and word classification reports)
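The classification step can be sketched as follows. The small `CLASSIFICATION` table is a hypothetical stand-in for a rule set's classification table, and the fallback rules follow the pattern key in this section (^, ?, >, <, @); real rule sets contain far more entries and logic.

```python
# Hypothetical classification entries: token -> (standard value, class)
CLASSIFICATION = {
    "STREET": ("ST", "T"),
    "ST": ("ST", "T"),
    "AVE": ("AVE", "T"),
    "APT": ("APT", "U"),
    "APARTMENT": ("APT", "U"),
    "N": ("N", "D"),
    "W": ("W", "D"),
}


def classify(token: str) -> str:
    """Return the class of one token: a table class or a default tag."""
    word = token.upper()
    if word in CLASSIFICATION:
        return CLASSIFICATION[word][1]
    if word.isdigit():
        return "^"  # numeric token
    if word.isalpha():
        return "?"  # unclassified alpha
    if word[0].isdigit():
        return ">"  # mixed token with leading numeric, e.g. 6C
    if word[-1].isdigit():
        return "<"  # mixed token with trailing numeric, e.g. A6
    return "@"      # any other mixed token


def pattern(text: str) -> str:
    """Build the pattern for a free-form field, collapsing runs of '?'."""
    classes = []
    for token in text.split():
        cls = classify(token)
        if cls == "?" and classes and classes[-1] == "?":
            continue  # '?' stands for one or more consecutive unknown alphas
        classes.append(cls)
    return "".join(classes)
```

Running `pattern` over every record and counting the results gives the data for a pattern report; counting the tokens themselves gives a word frequency report.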
Lab 4-3: Word Investigation Address and Area
1. Add Investigate job
2. Identify the type of investigation
3. Select input file
4. Choose rule set and field(s)
5. Choose Advanced Options
6. Stage and run job
7. Review report results
Data Quality Assessment
Review and analyze each field for the following information:
How often is the field populated?
What are the anomalies and out-of-range values? How often does each one occur?
How many unique values were found?
What is the distribution of the data or patterns?
Use Investigate results to:
Update business requirements
Define development plan and application design
Quiz
What is domain integrity?
What is the difference between a Type C and a Type T field mask?
When might you use a Type X field mask?
Where can you find the Investigate reports?
Module Summary
DRE Methodology: Data Quality Assessment
Character discrete, concatenate and word investigation
Field Masks
  Character (C)
  Type (T)
  Ignore (X)
Parsing – SEPLIST, STRIPLIST
Data Classification
Patterns
Data Preparation
Format of data file
Unique record identifier
Common record layout
High-Level DFD
Sources: AUTOHOME, LIFE
Investigate: Assess Data Quality
Add Unique Key; Append Data to a common format
Standardize Country
Select US Data for further processing
Reject NON US Data
Pre-Process US Data
Condition Name, Address and Area
Investigate Conditioned Results
Apply User Overrides
Identify Duplicate Customer Records
Survive the Best Customer Record
Data File Format
Preferred data file format for QualityStage is:
  Fixed record length
  Fixed-fielded data
  Sequential file with terminated records
  Alphanumeric data
QualityStage provides the following features for working with other file formats:
  ODBC enabled for pulling/pushing data from/to a table
  Unterminated and variable length
  Fixed-length unterminated
The Transfer (GTF) stage is used to read in the various formats and output a fixed-record-length terminated file
Unique Record Key
Every record should start the QualityStage process with a unique record key
This key can be created in QualityStage or by other tools like DataStage
The QualityStage Investigate Stage will help validate if a unique key exists
This unique key provides developers with a way to audit each record as it passes through the QualityStage application
The Transfer Stage can be used to create a new key field and populate the new field with a unique value
Common Data Format
Fields identified for processing should be moved forward from each source and appended into a single new source file
Allows for efficiently processing all data in one stream using one set of rules
In QualityStage, appending data files is accomplished with the Transfer (GTF) stage
Transfer Stage (GTF)
Transforms data file formats to fixed length flat files
Adds new fields
  Assign literal values such as a source indicator
  Generate and assign a sequential value
Reformatting record layouts
  Dropping fields
Format field data
  Case formatting
  Right/left justification
  Right/left fill
Concatenate fields
Append data files
Add a Record Key
Input Data File (field positions: 1,25 26,25 52,20 73,2 76,5)
639 N MILLS AVE ORLANDO FL 32803
4275 OWENS RD, STE 536 EVANS GA 30809
ERIN OFFICE PARK DUBLIN GA 31021
600 E OGLETHORPE HWY HINESVILLE GA 31313
Transfer Stage (GTF): Populate Record Key field
Output Data File (field positions: 1,12 13,25 38,25 64,20 85,2 88,5)
FW000000001 639 N MILLS AVE ORLANDO FL 32803
FW000000002 4275 OWENS RD, STE 536 EVANS GA 30809
FW000000003 ERIN OFFICE PARK DUBLIN GA 31021
FW000000004 600 E OGLETHORPE HWY HINESVILLE GA 31313
Record Key Best Practices
Add a unique record identifier in the QualityStage process or prior to entering QualityStage processing
Create a 12 byte field
  The first 2 bytes indicate the source
  Positions 3 through 11 store a sequential number
  Position 12 is intentionally left blank for training, providing a space between the record key and the data
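The 12-byte key layout described above can be sketched outside QualityStage as follows. The function names and the source codes "AH" and "LF" are hypothetical; in the course this work is done by Transfer stages, not by code.

```python
def make_record_key(source: str, sequence: int) -> str:
    """Build a 12-byte key: 2-byte source, 9-digit sequence, 1 blank."""
    if len(source) != 2:
        raise ValueError("source indicator must be exactly 2 characters")
    if not 1 <= sequence <= 999_999_999:
        raise ValueError("sequence must fit in 9 digits")
    return f"{source}{sequence:09d} "


def add_keys(records, source, start=1):
    """Prefix each fixed-width record with its record key."""
    return [make_record_key(source, n) + rec for n, rec in enumerate(records, start)]


def append_sources(sources):
    """Key each source and append all of them into one combined list.

    sources is a list of (source code, records) pairs; the first pair creates
    the combined output and later pairs are appended to it.
    """
    combined = []
    for source, records in sources:
        combined.extend(add_keys(records, source))
    return combined
```

Because the source indicator is part of the key, the sequence number can restart at 1 for each source and the combined file still has unique keys.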
Append Data Files
The Transfer stage can read one input file and produce one output file
To append data, you will need to define a Transfer stage for each file you want to append
Be careful of the order – the first Transfer stage does not generally append – only subsequent Transfer stages referencing the same output file should append data
Diagram: AUTOHOME passes through Transfer Stage 1 and LIFE passes through Transfer Stage 2 (append options selected); both write to the COMBINED output file
Exercise 5-1: Add a Record Key and Append Data Files
1. Read in each source of data
2. Define a new output file with a common format
3. Create Transfer Stage 1
4. Create new Record Key field
5. Populate the Record Key field
6. Add AUTOHOME Data to new COMBINED output file
Input: AUTOHOME
Output: COMBINED
Stage name: AHKEY
Stage type: Transfer
Job Name: Append
Lab 5-1: Append LIFE to COMBINED Output
1. Create transfer stage
2. Define new record key field
3. Populate the record key field
4. Append LIFE to AUTOHOME in the COMBINED output file
Input: LIFE
Output: COMBINED
Stage name: LFKEY
Stage type: Transfer
Job Name: Append
Module Summary
QualityStage requires files to have fixed-length, terminated records.
The Transfer stage can be used to:
Convert file formats to fixed record length
Add new fields and populate them with literal values or sequential numbers
Append data files
Format fields
Reformat record layout
Standardize
Module Objectives
Describe the Standardize Stage in the Data Re-engineering Methodology
Identify Rule Sets
Apply the Standardize Stage
Interpret Standardize results
Investigate unhandled data and patterns
Project Lifecycle: Development
Review Data Flow Diagram
Construct Application
Review & Refine
Standardize Data
Find Duplicate Candidate (Match)
Survive Best of Breed (Survive)
Development {Unit Test
Data Re-engineering Methods
Phases: I Investigation, II Standardize, III Match, IV Survive
Data Quality Assessment (DQA)
Data Re-Engineering (DRE)
Analyze free form fields
Standardize
Transformation
  Parsing free form fields
  Comparison threshold for classifying like words
  Bucketing data tokens
Standardization
  Applying standard values and standard formats
Phonetic Coding for use in Matching
  NYSIIS
  Soundex
Standardize Example
Input File (Address Line 1, Address Line 2):
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST # 20
16200 VENTURA BOULEVARD SUITE 201
C/O JOSEPH C REIFF 12 WESTERN AVE
1705 W St PHILADELPHIA
1655 PONCE DE LEON AVENUE 15TH FLOOR
Result File (columns: House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value, NYSIIS, Soundex, City):
1721 W ELFINDALE AVE UNIT 20
1721 W ELFINDALE AVE 20
16200 VENTURA BLVD
12 WESTERN AVE
1705 W ST PHILADELPHIA
1655 PONCE DE LEON AVE FLOOR 15
Standardize Process
21 WINGATE STREET APARTMENT 601
1. Parse
2. Classify and assign default tags: ^ ? T U ^
3. Process patterns and bucket data
Output File
House Number | Street Name | Street Type | Unit Type | Unit
21 | WINGATE | ST | APT | 601
Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type
Standardize Stage
Uses rule sets for:
Country processing
Pre-domain processing
– USPREP
Domain processing
– USADDR
– USAREA
– USNAME
Multi-national Address
WAVES
Types of Rule Sets
Country Identifier: COUNTRY
Domain Pre-processor: USPREP
Domain Specific: USNAME
Domain Specific: USADDR
Domain Specific: USAREA
Example: Country Identifier
Input Record
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET
Output Record
US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET
Example: Domain Pre-Processor
Input Record
Field 1: JIM HARRIS (781) 322-2426
Field 2: 92 DEVIR STREET MALDEN MA 02148
Output Record
Name Domain: JIM HARRIS
Address Domain: 92 DEVIR STREET
Area Domain: MALDEN MA 02148
Other Domain: (781) 322-2426
Example: Domain-Specific
Input Record
100 SUMMER STREET 15TH FLOOR
Output Record
House Number: 100
Street Name: SUMMER
Street Suffix Type: ST
Floor Type: FL
Floor Value: 15
Address Type: S
NYSIIS of Street Name: SANAR
Reverse Soundex of Street Name: R520
Input Pattern: ^+T>U
Rule Sets
Rule sets contain logic for:
Parsing
Classifying
Processing data by pattern and bucketing data
Three required files
  Classification Table
  Dictionary File
  Pattern Action File
Optional Lookup Tables
Rule Set Files
Classification Table (.CLS): Contains standard abbreviations that identify and classify key words.
Pattern Action File (.PAT): Contains a series of patterns and programming commands to condition the data
Dictionary File (.DCT): Defines the output file fields to store the parsed and conditioned data
Rule Set Description (.PRC): Description file for the Rule Set
Lookup Tables (.CLS): Optional conversion and lookup tables for converting and returning standardized values
Override Tables (.CLS): Tables for storing overrides entered into the Designer GUI
Classification Table
Contains the words for classification, standardized versions of words, and data class
Data class (data tag) is assigned to each data token
Default classes are the same across all rule sets
User-defined classes are assigned in the classification table
  Users may modify, add or delete these classes
  User-defined classes are a single letter
Default Classes
Class Description
^ A single numeric
+ A single unclassified alpha (word)
? One or more consecutive unclassified alphas
@ Complex mixed token, e.g., ½, O’Connell
> Leading numeric, e.g., 6A
< Trailing numeric, e.g., A6
0 (zero) Null class
User-defined Classes
Class Description
USNAME
G Generational, e.g., Senior, I, II
P Prefix, e.g. Dr., Mr., Miss
USADDR
T Street Type
D Directional
B Box Type
USAREA
S State Abbreviation
Comparison Threshold
May be used in the Classification table
Used to efficiently make entries into the classification table
Helps overcome spelling and data entry errors
Not required
Threshold uses a logical string comparator
Threshold level | Meaning
900 | Exact match
850 | Almost certainly the same
800 | Most likely equivalent
750 | Most likely not the same
700 | Almost certainly not the same
Classification Table Example
;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B
Dictionary File
Defines the field definitions for the output file
When data is moved to these output fields it is called “bucketing” the data
The order that the fields are listed in the dictionary file defines the order the fields are written to the output file
Dictionary file entries are similar to field definitions
Dictionary File Example
;;QualityStage v7.0
\FORMAT\ SORT=N
;-------------------------------------------------------------------------------
; USADDR Dictionary File
;-------------------------------------------------------------------------------
; Total Dictionary Length = 415
;-------------------------------------------------------------------------------
; Business Intelligence Fields
;-------------------------------------------------------------------------------
HN C 10 S HouseNumber ;0001-0010
HS C 10 S HouseNumberSuffix ;0011-0020
PD C 3 S StreetPrefixDirectional ;0021-0023
PT C 20 S StreetPrefixType ;0024-0043
SN C 25 S StreetName ;0044-0068
ST C 5 S StreetSuffixType ;0069-0073
SQ C 5 S StreetSuffixQualifier ;0074-0078
SD C 3 S StreetSuffixDirectional ;0079-0081
RT C 3 S RuralRouteType ;0082-0084
RV C 10 S RuralRouteValue ;0085-0094
BT C 7 S BoxType ;0095-0101
BV C 10 S BoxValue ;0102-0111
Pattern Action File
The Pattern-Action file contains the rules for standardization; that is, the actions to execute with a given pattern of tokens
Records are processed from the top down
Written in Pattern Action Language
Complex parsing can be coded in this file
Pattern Action File Example
Street Address: 10 Hollow Oak Road
Pattern: ^ ? T
Pattern Action Language:
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}
Result: {HN} 10, {SN} Hollow Oak, {ST} Rd
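The effect of the three actions above can be imitated in a short sketch. This is Python rather than Pattern Action Language, it handles only the single pattern ^ ? T, and the `STANDARD_TYPES` table and function names are hypothetical stand-ins for a rule set's classification table.

```python
# Hypothetical street-type entries: token -> standard abbreviation
STANDARD_TYPES = {
    "ROAD": "RD", "RD": "RD",
    "STREET": "ST", "ST": "ST",
    "AVENUE": "AVE", "AVE": "AVE",
}


def operands(address: str):
    """Group tokens into (class, tokens) operands.

    Consecutive unclassified words share one '?' operand, so operand [2] of
    the pattern ^ ? T can hold a multi-word street name. Mixed tokens are
    treated as '?' here for simplicity.
    """
    ops = []
    for tok in address.split():
        if tok.isdigit():
            cls = "^"
        elif tok.upper() in STANDARD_TYPES:
            cls = "T"
        else:
            cls = "?"
        if cls == "?" and ops and ops[-1][0] == "?":
            ops[-1][1].append(tok)
        else:
            ops.append((cls, [tok]))
    return ops


def standardize(address: str):
    """Bucket an address that matches ^ ? T; return None for other patterns."""
    ops = operands(address)
    if "".join(cls for cls, _ in ops) != "^?T":
        return None  # would be left as unhandled in this sketch
    return {
        "HN": ops[0][1][0],                          # like COPY [1] {HN}
        "SN": " ".join(ops[1][1]),                   # like COPY_S [2] {SN}
        "ST": STANDARD_TYPES[ops[2][1][0].upper()],  # like COPY_A [3] {ST}
    }
```

The dictionary keys HN, SN and ST play the role of the dictionary file fields that the data is bucketed into.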
Optional Lookup Tables
Called from the Pattern Action File
Rule sets may contain lookup tables such as:
Common First Names and Enhanced First Names
  Barb & Barbara
  Ted & Edward
Gender based on name
State abbreviations
Common city abbreviations
  NYC = New York City
  LA = Los Angeles
Standardize Process
Files used: Classification Table, Pattern Action File, Dictionary File
21 WINGATE STREET APARTMENT 601
1. Parse
2. Classify and assign default tags: ^ ? T U ^
3. Process patterns and bucket data
House Number | Street Name | Street Type | Unit Type | Unit
21 | WINGATE | ST | APT | 601
Standardizing International Data
Two methods
Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
  Uses out-of-the-box, included functionality/rules
Method 2: Use Multinational Standardize and WAVES
  Requires purchase of WAVES database
Making WAVES
When working with files containing multinational addresses, QualityStage provides the following tools:
Multinational Address Standardization stage, which standardizes address files at city and street levels
  Functionality available out-of-the-box
WAVES (Worldwide Address Verification and Enhancement System) stage, which standardizes, corrects, verifies and enhances addresses against a country-specific postal reference file
  Requires purchase of WAVES database
What the WAVES Stage Does
After the original input files have been standardized, the WAVES stage performs these main functions:
Corrects – Corrects defects in the input data due to typographical or spelling errors
Verifies – Using probabilistic matching, the WAVES stage tries to match corrected address records against addresses in a country-specific postal reference file
Enhances – Assigns the postal record data to the appropriate fields in the output file, thereby substituting any erroneous and missing input data with verified postal data
Indicates verification result – Shows whether a record has been successfully verified by the WAVES stage
  Overall degree of verification also indicated
WAVES Input File Requirements
Fixed-fielded, fixed record length data files
  Total line length cannot exceed 4096
  Address data must occur within first 3072
Each record must contain
  Country indicator
    Full spelling
    Abbreviation
    2- or 3-byte ISO country code
    Mismatch of country indicator to country- and street-level formats results in data not being standardized and output as unhandled
    – For example, identifier says record is German and address format is that of France
  Unique record identifier (record key)
Use preprocessor to remove any non-address data from the address fields
  c/o
  Attn:
  Department
Multinational Standardize (MNS) stage automatically used as part of WAVES stage processing
WAVES Output
City-level verification
  Correct, enhance and verify city field
  Correct, enhance and verify neighborhood/locality field
  Correct, enhance and verify state/province field
  Verify postal code field (but not correct it)
  Indicate if record has been verified (and to what degree)
Street-level verification
  Correct, enhance and verify the street info field
  Correct and verify postal code
  Indicate the match weight, which shows the degree of certainty of the probabilistic match between the input and reference data
About the Verification Process
Use ISO codes, which are applied during standardization, as critical match fields on all city and street level verification attempts
Try to verify that the city, state/province, and postal code are correct based upon the available information in the record
  For example, if no state/province is available, uses postal code to impute the missing information
If the postal code conflicts with the city, WAVES copies the city and province fields from the postal record, but does not change the postal code since WAVES cannot verify which is the correct data
Modifying Standardization Behavior
MNS rules (used by WAVES) can be modified using the override functionality in QualityStage Designer
Country Rule Set
Country rule set appends the two-byte ISO country code
Input to the country rule set includes:
  Street Address
  City or locality
  State
  Zip or Postal code
  Country field (if it exists)
Output:
  Two-byte ISO country code
  Flag identifying explicit or default decision
Exercise 6-1: Country Standardize
1. Define the output file
2. Create the job
3. Define the job details
  Select Country rule set
  Identify fields to be conditioned
  Enter metadata label
4. Run the job
5. Review results
Input: COMBINED
Output: CNTRYOUT
Stage name: CNTRYSTAN
Stage type: Standardize
Rule set: COUNTRY
Job Name: STAN
Selecting US Data
The QualityStage Select Stage provides the capability of selecting and/or rejecting records based on a set of values for a field
Selecting or splitting data requiring compound or complex logic may require multiple select stages or a custom rule set
Select Stage
Select Stage accepts one input file and may output multiple files
  Accept – records that meet the criteria
  Reject – records that do not meet the criteria
  Split – both the Accept and Reject file are created
Input and output files have the same layout
Select allows you to choose a field and create a list of values
  If a record is equivalent to a value on the list then the record is accepted, else it is rejected
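The accept/reject logic described above amounts to a membership test on one field. A minimal sketch, assuming records are dictionaries and using a hypothetical field name `ISO_CODE` for the country code added by the country rule set:

```python
def select(records, field, accept_values):
    """Split records into (accepted, rejected) by one field's value.

    A record is accepted when its field value is on the list of values;
    otherwise it is rejected. Both outputs keep the input layout.
    """
    accepted, rejected = [], []
    for rec in records:
        if rec[field] in accept_values:
            accepted.append(rec)
        else:
            rejected.append(rec)
    return accepted, rejected


# Example: split US from non-US data (sample values are illustrative only)
records = [
    {"ISO_CODE": "US", "DATA": "100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111"},
    {"ISO_CODE": "GB", "DATA": "28 GROSVENOR STREET LONDON W1X 9FE"},
]
us_data, non_us_data = select(records, "ISO_CODE", {"US"})
```

Compound conditions on several fields do not fit this single-field test, which is why the slide suggests chaining multiple Select stages or writing a custom rule set.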
Exercise 6-2: Split US From Non-US Data
1. Create output files
  One for Accept records
  One for Reject records
2. Create Select stage
3. Add to STAN job
4. Run Select stage
5. Review results
Input: CNTRYOUT
Output: USDATA (Accept), NONUSDATA (Reject)
Stage name: SPLIT
Stage type: Select
Job Name: STAN
Domain Pre-Processor Rule Sets
Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
  For example, if the city, state and zip are found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
The pre-processor rule set prepares the data for processing by domain-specific rule sets
Exercise 6-3: US Prep Rule Set
1. Define the output file
2. Create the job
3. Define the job details
  Select USPREP rule set
  Identify fields to be conditioned
  Enter metadata labels
4. Run the job
5. Review results
Input: USDATA
Output: PREPOUT
Stage name: PREPDATA
Stage type: Standardize
Rule set: USPREP
Job Name: STAN
Domain Rule Sets
Domain rule sets expect only data for that domain as the input
Domain rule sets that come with QualityStage are:
  Name
  Street address
  Area (city, state and zip)
USNAME Rule Set
The USNAME rule set works on both personal names and organization names for US data
Data is parsed into name components
Phonetic codings of the First Name and Primary Name are created for matching
USADDR Rule Set
This rule set is applied to street address fields
The “Address Type” flag identifies different types of addresses
  ‘S’ Street address
  ‘B’ Box address
  ‘R’ Rural route address
Phonetic coding of the Street Name is created for matching
USAREA Rule Set
This rule set is applied to city, state and postal code fields
Data is parsed into city name, state abbreviation, zip code and zip plus four
Phonetic coding of the city name is created for matching
Exercise 6-4: Domain Standardize
1. Define the output file
2. Create the job
3. Define the job details
  Select USNAME, USADDR, USAREA rule sets
  Identify fields to be conditioned
4. Run the job
5. Review results
Input: PREPOUT
Output: STANOUT
Stage name: STANALL
Stage type: Standardize
Rule set: USNAME, USADDR, USAREA
Job Name: STAN
Standardize Results
Business Intelligence fields
  Parsed from the original data, they may be used in matching and generally they are moved to the target system
Matching fields
  Generally these fields are created to help during the match process and are dropped after successful matching
Reporting fields
  Specifically created to help review results of Standardize and recognize handled and unhandled data
Business Intelligence Fields
Intelligent data parsed and bucketed from the input free-form field
USNAME Examples
• Title
• First Name
• Middle Name
• Primary Name
• Generational
USADDR Examples
• House Number
• Directional
• Street Name
• Unit Types
• Box Types
• Unit Values
• Building Names
USAREA Examples
• City
• State
• Zip5
• Zip4
Standardize Matching Fields
Phonetic coding
  NYSIIS
  Reverse NYSIIS
  Soundex
  Reverse Soundex
Hash keys
  First 2 characters of the first five words
Packed keys
  Data concatenated, or packed
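Two of these matching fields are simple enough to sketch. The code below uses the standard American Soundex algorithm and treats Reverse Soundex as Soundex of the reversed word; whether QualityStage computes its fields in exactly this way is an assumption, as are the function names and the unpadded hash key format.

```python
# Standard American Soundex digit groups
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (result + "000")[:4]


def reverse_soundex(word: str) -> str:
    """Soundex of the word spelled backwards."""
    return soundex(word[::-1])


def hash_key(text: str) -> str:
    """First 2 characters of each of the first five words, concatenated."""
    return "".join(word[:2] for word in text.upper().split()[:5])
```

With these definitions `reverse_soundex("SUMMER")` evaluates to R520, which agrees with the Reverse Soundex value shown for the street name SUMMER in the domain-specific example earlier in this module.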
Standardize Reporting Fields
Unhandled Pattern: The pattern generated for tokens not processed by the selected rule set.
Unhandled Data: The remaining tokens not processed by the selected rule set.
Input Pattern: The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data: The tokens not processed by the rule set because they represent a data exception.
Best Practice: Investigate Handled Data
Review the business intelligence fields to ensure accurate bucketing of the data
Build a Character Discrete Investigation for each field and review the contents and the format
Build a Character Concatenate Investigation to review:
  Unhandled Patterns
  Unhandled Data
  Input Pattern
  Input Fields
Exercise 6-5: Investigate NAME Unhandled Patterns and Data
Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5
Field Name | Field Description | Type
UPUSNAM | Unhandled Pattern | C
UDUSNAM | Unhandled Data | X
IPUSNAM | Input Pattern | X
NAME | Original Name field data | X
Exercise 6-6: Investigate Address/Area Unhandled Patterns
Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5
Field Name | Field Description | Type
UPUSADD | Unhandled Pattern | C
UDUSADD | Unhandled Data | X
IPUSADD | Input Pattern | X
ADDR1 | Address Line 1 | X
ADDR2 | Address Line 2 | X
Module Summary
QualityStage comes with pre-defined rule sets that are highly flexible and customizable
Support multi-national address processing
Country rule sets
Pre-processing rule sets for mixed-domain challenges
Domain rule sets
Custom rule sets
Rule Set Overrides
Module Objectives
Identify the location of the User Override functionality
Describe the different types of User Overrides
Apply User Overrides
Test User Overrides
Locate the User Override tables
Customizing Rule Sets
A rule set may require modification if some input data is:
  Not processed
  Incorrectly processed
QualityStage tools:
  User Overrides
  Rules Analyzer
User Overrides
Provides the user the ability to modify the rule sets
The following types of rule sets can be modified using User Overrides:
  Domain Pre-processor rule sets
  Domain rule sets own Standardize rules
  Validation rule sets
  Multinational Address Standardize rule sets
There are five types of user overrides relating to: classifications, patterns, and text strings
User overrides are
  GUI driven
  Stored in separate lookup tables
User Classification Override
Recognized as a keyword and classified
Additional words
  New abbreviation, variation
  Misspelling of a word
User Classifications may override or add:
  Original values (Token values)
  Standard value
  Class
Columns: Token Value, Standard Value, Class
Example: Classification Override
Unhandled Data
Input Pattern: +,+  Original Data: HOCHREITER , CAROLYNNE
Override
Add CAROLYNNE as a valid first name to the classification table
Token Value: Carolynne  Standard Value: Carolynne  Class: F
Corrected Pattern
Input Pattern: +,F  Original Data: HOCHREITER , CAROLYNNE
Text Overrides
Allow the user to specify overrides based on an entire text string
Use this override for special cases and specific handling of a string of text
Input Text Overrides
  Applied to the original text string
Unhandled Text Overrides
  Applied to the Unhandled Data field
Example: Input Text Overrides
Input Text
Unhandled Pattern | Input Text
++ | ZACHARIA GELLMAN
++ | TOMMOTHY CABBOTT
++ | REIFF FUNERAL
Override
Input Text Override: REIFF FUNERAL - Move text string to the Primary Name field
Results
Input Pattern | Primary Name
+ + | REIFF FUNERAL
Pattern Overrides
Allow the user to specify overrides based on an entire pattern
Use this override when most or all records should be processed with identical logic
Input Pattern Overrides: applied to the original text string
Unhandled Pattern Overrides: applied to the Unhandled Data field
Unhandled Pattern Overrides
Unhandled Pattern
  Unhandled Pattern | Input Text
  +, +              | HAYWARD, WINSLOW
  +, +              | ESHAGHIAN , JOUBI
  +, +              | BOULDER, CORONA
Unhandled Pattern Override
  +, +  Move + to Primary Name (comma provides context)
        Move + to First Name
Results
  Unhandled Pattern | First Name | Primary Name
  +, +              | WINSLOW    | HAYWARD
  +, +              | JOUBI      | ESHAGHIAN
  +, +              | CORONA     | BOULDER
User Override Precedence
User Classification: recognize words to classify
Input Text: modify logic based on the input string
Input Pattern: modify logic based on the input pattern
Unhandled Text: modify logic based on the Unhandled Data string
Unhandled Pattern: modify logic based on the unhandled pattern
Rule Set Precedence
UNHANDLED TEXT
INPUT PATTERN
INPUT TEXT
USER CLASSIFICATION
UNHANDLED PATTERN
CLASSIFICATION TABLE
PATTERN ACTION FILE
Rule Set Override Process
1. Enter override
2. Apply override
3. Test override with the Rules Analyzer
4. Repeat steps 1 through 3 for all desired overrides
Exercise 7-1: Name Rule Set User Override
Review the unhandled NAME patterns in the INUPNAMp.frq report
Apply NAME overrides
Test NAME overrides
Re-run the STAN Job to re-produce the new output file with the overrides applied
Exercise 7-2: Address and Area Overrides
Review the Investigation reports of unhandled Address and Area data
Apply User Overrides to unhandled data
Test the override
Re-run the STAN Job to re-produce the new output file with the overrides applied
Module Summary
There are five types of user overrides
User overrides can be applied to:
  The classification table
  Input text
  Input patterns
  Unhandled text
  Unhandled patterns
Overrides are applied in a specific order
The Standardize Rules Analyzer can be used to test and review user overrides
Match
Module Objectives
Describe where Match fits in the Data Re-engineering Methodology
Describe QualityStage Match concepts
Define the types of matching algorithms
Describe the importance of blocking
Apply multiple match passes to increase efficiency/efficacy
Interpret and improve match results
Data Re-engineering Methods
[Figure: the four phases I Investigation, II Standardize, III Match, IV Survive. Other labels: Analyze free form fields; Data Quality Assessment (DQA); Data Re-Engineering (DRE)]
High-Level DFD
[Figure: high-level data flow diagram. Inputs: AUTO, HOME, LIFE. Step labels: Investigate / Assess Data Quality; Standardize Country; Select US Data for further processing; Reject Non-US Data; Pre-Process US Data; Condition Name, Address and Area; Investigate Conditioned Results; Apply User Overrides; Add Unique Key / Append Data to a common format; Identify Duplicate Customer Records; Survive the Best Customer Record]
Match Stage
Statistically-based method for determining matches
Over 24 match comparison algorithms providing a full spectrum of fuzzy matching functions
Ability to measure informational content of data
Identify duplicate entities within one or more files
Array matching
Match wizards and templates
Critical field settings
What Constitutes a Good Match?
Which of the following record pairs is a match? And how do you know?
  Pair 1: W HOLDEN     12 MAIN ST
          W HOLDEN     12 MAINE ST
  Pair 2: W HOLDEN     128 MAIN PL    02111  12/8/62
          W HOLDEN     128 MAINE PL   02110  12/8/62
  Pair 3: WM HOLDEN    128A MAIN SQ   02111  12/8/62  338-0824
          WILL HOLDEN  128A MAINE SQ  02110  12/8/62  338-0824
Do you compare all the shared or common fields?
Do you give partial credit?
Are some fields (or some values) more important to you than others? Why?
Do more fields increase your confidence? By how much? What is enough?
The Value of Information Content
Information content measures the significance of one field over another (Discriminating Value)
  A Gender Code contributes less information than a Tax-Id Number
Information content also measures the significance of one value in a field over another (Frequency)
  In a First-Name Field, JOHN contributes less information than DWEZEL
Significance is determined by a value’s reliability and its ability to discriminate; both can be calculated from your data
Distribution of Weights
The weighted score is a relative measure of the probability of a match
Thresholds defined can be used for automated processing
[Figure: histogram of the number of pairs (y-axis, 0 to 4000) against the weight of comparisons (x-axis, -20 to 40, from less confidence to more confidence); regions labelled Non-Matches, Grey area, and Matches]
WILLIAM J HOLDEN     128 MAIN ST    02111  12/8/62
WILLAIM JOHN HOLDEN  128 MAINE AVE  02110  12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40
Weights
Measures the information content of a data value
Each field contributes to the confidence (probability) of a match
Types of Weights
If a field matches, the agreement weight is used
  Agreement weight is a positive value
If a field doesn’t match, the disagreement weight is used
  Disagreement weight is a negative value
Partial weight is assigned for non-exact or “fuzzy” matches
Missing values have a default weight of zero
Weights for all field comparisons are summed to form a composite weight
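As a small sketch of the summing rule, the Python below adds per-field weights into a composite weight, using the per-field values from the WILLIAM J HOLDEN example. The function name `composite_weight` is hypothetical, and representing a missing value as `None` is an assumption made only for this illustration.

```python
def composite_weight(field_weights):
    """Sum per-field weights; a missing comparison (None) contributes zero."""
    return sum(0 if w is None else w for w in field_weights)


# Per-field weights from the WILLIAM J HOLDEN example.
holden_weights = [1, 1, 17, 2, 4, -1, 7, 9]
print(composite_weight(holden_weights))

# A record pair where one field is missing on either side (illustrative values).
print(composite_weight([5, None, -2]))
```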
Matching Terminology
Informational Content: measures the significance of one field value over another
Weight: measures the informational content of a data value
Composite Weight: measures the confidence of a match
Match Cutoffs: distinguish matches from non-matches
False Positives: records with a score above the high cutoff that really aren’t a match
False Negatives: records below the low cutoff that really are a match
Measuring the Conditions of Uncertainty
Reliability of the data in a given field
  Estimated as the probability that the field agrees given the record pair is a match
Probability of a random agreement of values
  Estimated as the probability the field agrees given the record pair is not a match
Reliability (M-Probability)
Approximated as 1 - error rate for the given field
The higher the m-probability, the larger (more negative) the disagreement weight will be when the field does not match, since the data is considered reliable
Chance Agreement (U-Probability)
The u-probability can be approximated as the probability that a field agrees at random (by chance)
QualityStage uses a frequency analysis to determine the probability of chance agreement for all values
Rare values bring more weight to a match
Calculating Weights
Agreement weight is estimated as: log2(m/u)
Disagreement weight is estimated as: log2((1-m)/(1-u))
Example:
  M (m-prob) = 0.9
  U (u-prob) = 0.01
  Agreement weight: log2(0.9/0.01) = 6.49
  Disagreement weight: log2((1-0.9)/(1-0.01)) = -3.31
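The two formulas above can be checked with a short Python sketch. The function names are hypothetical; only the formulas and the example m and u values come from the slide.

```python
import math


def agreement_weight(m, u):
    """log2(m/u): positive when a field is far more likely to agree for true matches."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """log2((1-m)/(1-u)): negative when the field is reliable (m close to 1)."""
    return math.log2((1 - m) / (1 - u))


m, u = 0.9, 0.01
print(round(agreement_weight(m, u), 2))
print(round(disagreement_weight(m, u), 2))
```

Raising m while holding u fixed makes the disagreement weight more negative, and lowering u (a rarer value) raises the agreement weight, which is why rare values bring more weight to a match.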
Blocking
Grouping together like records that have a high probability of producing matches
Only “like” records are compared to each other, making the match more efficient and computationally feasible
Records in a “block” match exactly on one or more blocking fields
Blocking Example: Sample Data
Block on NYSIIS of Last Name
NYSIIS LNAME | NAME                   | ADDRESS            | ZIP
YANG         | YUNG , WAYNE D         | 9000 SHEPARD DRIVE | 78753
GARAS        | GEROSA, FRAN X         | 29 AARONS CT       | 06877
YANG         | YOUNG , JONATHAN A     | 1767 TOBEY ROAD    | 30341
GARAS        | GERISA, FRANCIS        | 29 AARONS CT       | 06877
GARAS        | GEROSA, FRANCIS XAVIER | 29 AARONS COURT    | 06877
MATAC        | MARCUS MATIC           | 100 SUMMER STREET  | 02111
GARAS        | GEROSA, MARY           | 29 AARONS CT       | 06877
JANCAN       | RENEE JENKINS          | 100 SUMMER STREET  | 02111
YANG         | YOUNG THERESA C        | 1767 TOBEY ROAD    | 30341
Blocking Example
NYSIIS | NAME                   | ADDRESS            | ZIP
YANG   | YOUNG , WAYNE D        | 9000 SHEPARD DRIVE | 78753
YANG   | YOUNG , JONATHAN A     | 4220 BELLE PARK DR | 77072
YANG   | YOUNG THERESA C        | 1767 TOBEY ROAD    | 30341
GARAS  | GEROSA, FRAN X         | 29 AARONS CT       | 06877
GARAS  | GEROSA, FRANCIS XAVIER | 29 AARONS COURT    | 06877
GARAS  | GEROSA, MARY           | 29 AARONS CT       | 06877
GARAS  | GARISA, FRANCIS        | 29 AARONS CT       | 06877
MATAC  | MARCUS MATIC           | 100 SUMMER STREET  | 02111
JANCAN | RENEE JENKINS          | 100 SUMMER STREET  | 02111
Blocks with only one record are considered residuals
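The grouping in the blocking example can be sketched in Python as follows. This is an illustration only: the NYSIIS codes are copied from the slide rather than computed, and the helper names `build_blocks` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# (blocking key, name) pairs taken from the blocking example; the NYSIIS
# codes are copied from the slide, not computed here.
records = [
    ("YANG", "YOUNG , WAYNE D"),
    ("YANG", "YOUNG , JONATHAN A"),
    ("YANG", "YOUNG THERESA C"),
    ("GARAS", "GEROSA, FRAN X"),
    ("GARAS", "GEROSA, FRANCIS XAVIER"),
    ("GARAS", "GEROSA, MARY"),
    ("GARAS", "GARISA, FRANCIS"),
    ("MATAC", "MARCUS MATIC"),
    ("JANCAN", "RENEE JENKINS"),
]


def build_blocks(rows):
    """Group records that agree exactly on the blocking key."""
    blocks = defaultdict(list)
    for key, name in rows:
        blocks[key].append(name)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records inside the same block are compared with each other."""
    return [pair for names in blocks.values() for pair in combinations(names, 2)]


blocks = build_blocks(records)
residuals = sorted(key for key, names in blocks.items() if len(names) == 1)
pairs = candidate_pairs(blocks)
all_pairs = list(combinations([name for _, name in records], 2))
print(residuals, len(pairs), len(all_pairs))
```

With these nine records, blocking reduces the comparisons from every possible pair to only the pairs inside the YANG and GARAS blocks, which is the efficiency gain that blocking is meant to provide.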
Balance Scope and Accuracy
Balance the scope and accuracy to compare a reasonable number of “like” records
Accuracy: the quality of the candidate records
Scope: the number of records
Blocking Strategy
Choose fields with reliable data
Choose fields with a good distribution of values
Combinations of fields may be used
Examples of Blocking Strategies
Zip code for matching addresses
NYSIIS of last name for matching individuals
Brand name for matching products
Combination of zip code and NYSIIS of street name for matching addresses
Combination of NYSIIS of last name and first letter of first name for matching individuals
Blocking Summary
Blocking groups together “like” records
Matching is more efficient for small block sizes
  Blocks should be between 100 and 200 records
Blocking fields must match exactly for a candidate set to be created/evaluated
Match Types
Unduplication
  Identifies duplicate candidates in one file
Match (Two File)
  One-to-one correspondence
  For every record on File A we expect to find a match to one record on File B
Geomatch (Two File)
  Many-to-one correspondence
  More than one record on File A can match to the same record on File B
Comparing Data Values
Different comparisons for different data
Over 24 comparison methods
Most common:
  CHAR - (character comparison) compares character by character, left to right
  UNCERT - (character uncertainty) tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters
  CNT_DIFF - counts keying errors in numeric fields; you set a tolerance threshold
  NAME_UNCERT - can be used to compare character values; if the strings are different lengths, the shorter of the two lengths is used
Exercise 8-1: Undup Match
1. Define the output file
2. Define the Match Stage
3. Define the pass
   • Choose blocking fields
4. Choose fields to compare and comparison method
5. Build the Match Extract
6. Create the Pass
Match Output Files
Match Extract: contains the raw match results, including the WEIGHT, the TYPE of match, and the SETID
Match Report: includes matched records and summary statistics
Match Statistics Report: contains the histogram, tables of weights and summary statistics
Match Extract
SETID | TYPE | PASS | WEIGHT | ALL_OF_THE_DATA (rest of fields)
393   | XA   | 1    | 55.32  | MICHAEL F DOHERTY
393   | DA   | 1    | 41.36  | MICHAEL F DOUGHERTY
468   | XA   | 1    | 50.40  | EUGENE B BOROWITZ
468   | DA   | 1    | 24.01  | BOROWITZ FAMILY TRUST
468   | DA   | 1    | 47.26  | GENE BOROWITZ
520   | XA   | 1    | 52.75  | FRAN X GEROSA
520   | DA   | 1    | 40.95  | FRANCIS XAVIER GEROSA
520   | DA   | 1    | 52.75  | FRANCIS X GEROSA
520   | DA   | 1    | 41.22  | FRANK X GEROSA
1035  | RA   | 1    |        | DARRYL F LINDBERG
Custom Extract Specification
This is an example of a common match extract specification
It should match the output file defined in the previous exercise
The data is moved to the output file according to these commands:
  MOVE @SET
  MOVE " "
  MOVE @TYPE
  MOVE @PASS
  MOVE " "
  MOVE @WGT
  MOVEALL OF A
  MOVE " "
Output positions labelled on the slide: 1-9, 10, 11-12, 13, 14, 15-21, 22, 23 (callouts: SetID, TYPE, PASS)
Exercise 8-2: Custom Match Extract
1. Select Extract Type
2. Select Output File
3. Enter Extract commands
Match Improvement Strategy
1. Set critical values for important fields
2. Review calculated weights
   • Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes
Critical Fields
Used to identify fields that must agree in order for records to be linked
  Critical: field values must agree exactly or the records cannot be linked (considered a match)
  Critical Missing OK: field values must agree exactly on values not considered “missing values”
QualityStage feature: VARTYPE
Weight Overrides
Allows you to adjust the agreement and/or disagreement weights for specific situations
  Add to calculated weight
  Replace weight
Exercise 8-3: Critical Vartypes
1. Modify the Stage
2. Modify the Pass
3. Add additional Match fields
4. Re-run the Match Job
5. Review Results
Cutoffs
There are two cutoffs:
  Match cutoff (high cutoff)
  Clerical cutoff (low cutoff)
Records with a weight equal to or above the Match cutoff are considered matches
Records with a weight below the low cutoff are not matches
Records with a weight greater than or equal to the low cutoff and less than the high cutoff are considered clerical records for manual review
Cutoffs can be set at the same value, eliminating clerical records
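The cutoff rules can be written as a small decision function. The sketch below is illustrative: the function name `classify_pair` and the cutoff values 27 and 20 are hypothetical and are not taken from any QualityStage default.

```python
def classify_pair(weight, match_cutoff, clerical_cutoff):
    """Apply the two cutoffs to a composite weight."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical (low) cutoff must not exceed match (high) cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"   # held for manual review
    return "nonmatch"


# Hypothetical cutoffs chosen only for illustration.
MATCH_CUTOFF, CLERICAL_CUTOFF = 27.0, 20.0
for w in (38.65, 27.82, 25.81, 14.45):
    print(w, classify_pair(w, MATCH_CUTOFF, CLERICAL_CUTOFF))

# Equal cutoffs leave no room for clerical records.
print(classify_pair(20.0, 20.0, 20.0), classify_pair(19.99, 20.0, 20.0))
```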
Setting the Match Cut-off
Weights | Data fields
27.82   | PO BOX 930202               (Definite Match)
27.82   | PO BOX 930202
27.82   | PO BOX 930202
38.65   | 35 COLLIER RD NW STE 610    (Definite Match)
38.65   | 35 COLLIER RD NW STE 610
25.81   | 928 S 1ST ST                (Questionable Match)
14.45   | S 1ST ST
Exercise 8-4: Set Match Cutoffs
1. Modify the Match Stage
2. Modify Pass 1
3. Set Cutoffs
4. Re-run the Match Job
5. Review Results
Multiple Match Passes
Additional passes are helpful in overcoming data errors and missing values in block fields
You should always create at least two match passes
Change blocking strategies for each pass
Example: Multiple Match Passes
Pass 1 blocked on street name
Pass 2 found additional matched records in which the street name was different but the names were the same
Pass | Weights | Data fields
1    | 26.31   | JASON BIRCH 1350 WALTON WAY 30901
1    | 26.31   | JASON BIRSH 1350 WALTON WAY 30901
1    | 20.42   | JOHN SMITH 2047 PRINCE AVE 30604
1    | 10.83   | MARY SMITH 2047 PRINCE AVE 30604
1    | RES A   | JOHN SMITH P.O. BOX 123 30604
2    | 20.42   | JOHN SMITH 2047 PRINCE AVE 30604
2    | 10.19   | JOHN SMITH P.O. BOX 123 30604
Exercise 8-5: Add Match Pass 2
1. Modify the Match Stage
2. Add a new Pass
3. Choose Block Fields
4. Choose Match Fields
5. Run Job
6. Review Results
Module Summary
Three types of matches:
  Undup
  Match
  Geomatch
Block to group together like records
  Only like records are compared, adding computational efficiency
Over 24 match comparisons
Critical fields
Match cutoffs
Multiple passes
Survive
Module Objectives
Describe where Survive is in the Data Re-engineering Methodology
Identify Survive techniques
Describe implementation options
Define Survive rules
Build Survive stage
Data Re-engineering Methods
[Figure: the four phases I Investigation, II Standardize, III Match, IV Survive. Other labels: Analyze free form fields; Data Quality Assessment (DQA); Data Re-Engineering (DRE)]
High-Level DFD
[Figure: high-level data flow diagram. Inputs: AUTO, HOME, LIFE. Step labels: Investigate / Assess Data Quality; Standardize Country; Select US Data for further processing; Reject Non-US Data; Pre-Process US Data; Condition Name, Address and Area; Investigate Conditioned Results; Apply User Overrides; Add Unique Key / Append Data to a common format; Identify Duplicate Customer Records; Survive the Best Customer Record]
Survive Stage
Point-and-click creation of business rules to determine “surviving” data – user decides how to survive data
Performed at record or field level – very flexible
Creates a single, consolidated record containing the “best-of-breed” data
Cross-populates best available data
Creates a cross-reference key
Provides consolidated view of the data
Survive Example
Survive Input (Match Output)
Group | Legacy | First   | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit | No.
1     | D150   | Bob     |        | Dixon   | 1500 | SE   | ROSS CLARK | CIR  |      |
1     | A1367  | Robert  |        | Dickson | 1500 |      | ROSS CLARK | CIR  |      |
23    | D689   | William | A      | Obrian  | 5901 | SW   | 74TH       | ST   | STE  | 202
23    | A436   | Billy   | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   |      |
23    | D352   | William |        | Obrian  | 5901 |      | 74         | ST   | #    | 202
Survived Consolidated Output
Group | Legacy | First   | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit | No.
1     | D150   | Robert  |        | Dickson | 1500 | SE   | ROSS CLARK | CIR  |      |
23    | D689   | William | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   | STE  | 202
Cross-Reference File
Group | Legacy
1     | D150
1     | A1367
23    | D689
23    | A436
23    | D352
Survive Rules
A rule contains a condition and a set of target fields
  When the condition is met the field becomes a candidate for the “best”
  All records in a group are tested against the condition
  The “best” populates the target fields
Multiple targets are permitted for the same rule
Survive Rules
Custom Rule
  Build your own logical expression
  Comparison (=, !=, <, >, <=, >=)
  Logical (and, or, not)
Indicate the current and best records with the following notation:
  c.field indicates the current
  b.field indicates the best
Parentheses ( ) can be used for grouping complex conditions
String literals are enclosed in double quotation marks, such as "MARS".
A semicolon (;) terminates a rule.
Building Survive Rules
Survive Rules Definition screen lets you easily build, delete and manage survivor rules
Survive Techniques
Pre-defined Techniques
  Source
  Recency
  Frequency
  Most complete (longest string)
User-specified logic
Target Fields
Fields you want to write to the output file
Populated based on meeting the conditions of the survivor rule(s)
Fields not listed as targets are excluded from the output file
May have multiple targets for each rule
Example: Complex Survive Rule
The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.
The prefix b. indicates the current “best” record
The prefix c. indicates the current record being tested against the survivor rule
FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;
In this rule, FIELD3 (before the colon) is the TARGET and the expression after the colon is the CONDITION
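To make the condition concrete, here is a rough Python rendering of the example rule. It is a sketch, not the Survive stage's evaluation engine: the function names and sample records are hypothetical, Python's `str.strip` stands in for TRIM and `len` for SIZEOF, and the sketch only lists which records qualify as candidates without modelling how the stage picks among several candidates.

```python
def rule_condition(current):
    """(SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)"""
    return len(current["FIELD3"].strip()) >= 5 and len(current["FIELD1"].strip()) > 0


def candidates_for_target(group, target="FIELD3"):
    """Test every record in a matched group and collect qualifying target values."""
    return [record[target] for record in group if rule_condition(record)]


# Hypothetical matched group, invented for this illustration.
group = [
    {"FIELD1": "A",   "FIELD3": "ABC  "},    # FIELD3 too short after trimming
    {"FIELD1": "   ", "FIELD3": "ABCDEF"},   # FIELD1 empty after trimming
    {"FIELD1": "X",   "FIELD3": "ABCDE"},    # satisfies both parts of the condition
]
print(candidates_for_target(group))
```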
Exercise 9-1: Survive the Best Customer Record
1. Define the output file
2. Define Survive stage
3. Choose target fields
4. Define Survive rules
5. Deploy and run
6. Review results
Module Summary
Consolidate or survive the best record by choosing the best record or best field from multiple records
Use pre-defined techniques or build your own
May use multiple rules