Getting to the Single Source of Truth...
Data Imperative
“Getting to the Single Source of Truth”
CS Lee
27th August 2015
Data Sources
Historical data
• System Replacements
• System Migrations – Merger & Acquisition
• Data Feeds – Banks, Partnerships
New data types
• Social media
• Public emails
Common data problems
• Lack of information standards – Different formats & structures across different systems
• Data surprises in individual fields – Data misplaced in the database
• Information buried in free-form fields – Descriptions, addresses
• Data myopia – Lack of consistent identifiers inhibits a single view
• The redundancy nightmare – Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name | Tax ID | Telephone
J Smith DBA Lime Cons. | 228-02-1975 | 6173380300
Williams & Co. C/O Bill | 025-37-1888 | 415-392-2000
1st Natl Provident | 34-2671434 | 3380321
HP 15 State St. | 508-466-1200 | Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868-A HEX BOLT .25” - DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) - DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. Bostno MA 04106
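Data surprises like those in the Name / Tax ID / Telephone sample above can be surfaced by checking the shape of each value against what its field should hold. The following is a minimal sketch, assuming US-style formats (3-2-4 and 2-7 digit tax IDs, 7 or 10 digit phone numbers); the patterns, field names and function names are illustrative assumptions, not part of the deck.

```python
import re

# Illustrative shape patterns (assumed US-style formats, not from the deck).
PATTERNS = [
    ("ssn", re.compile(r"\d{3}-\d{2}-\d{4}")),
    ("ein", re.compile(r"\d{2}-\d{7}")),
    ("phone", re.compile(r"(\d{3}-\d{3}-\d{4}|\d{10}|\d{7})")),
]

def classify(value):
    """Return the name of the first pattern the whole value matches, else 'other'."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(value.strip()):
            return name
    return "other"

def surprises(rows, expected):
    """Yield (row index, field, value, found shape) where the shape is unexpected."""
    for i, row in enumerate(rows):
        for field, allowed in expected.items():
            found_shape = classify(row[field])
            if found_shape not in allowed:
                yield i, field, row[field], found_shape

# Tax ID and Telephone values copied from the sample rows above.
rows = [
    {"tax_id": "228-02-1975", "telephone": "6173380300"},
    {"tax_id": "025-37-1888", "telephone": "415-392-2000"},
    {"tax_id": "34-2671434", "telephone": "3380321"},
    {"tax_id": "508-466-1200", "telephone": "Orlando"},
]
expected = {"tax_id": {"ssn", "ein"}, "telephone": {"phone"}}
found = list(surprises(rows, expected))
```

Only the last row is flagged: its Tax ID field holds a phone-shaped value and its Telephone field holds a city name.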
Data Quality Dependency
CRM, AFM
ScoreCard, Behavior Trending
KPIs etc
Core Business Transaction Data
• Consistency
• Completeness
• Accuracy
• Uniformity
Use a Data Quality Methodology
Knowing, Cleansing, Retaining
Profiling
• Obtains 100% visibility into actual data condition
• Analyzes single-domain as well as free-form fields
• Generates frequency counts of unique values
• Uncovers trends and discrepancies
Standardization
• Parses free-form fields
• Standardizes data
• Incorporates business or industry standards
• Applies phonetic coding to key words
Matching
• Matches identical or near-identical entities within one or more files using proven techniques
• Creates a consolidated view of an entity
• Establishes cross-references
Survivorship
• Creates a “Best of Breed” representation of the data
• Cross-populates multiple data files with best-of-breed values
• Resolves conflicting data values based on user-defined business rules
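Survivorship, as described in this methodology, resolves conflicting values within a group of matched records by applying user-defined rules per field. A minimal sketch follows; the two rules (`longest`, `most_frequent`), the field names and the already standardized sample values are assumptions made for illustration.

```python
from collections import Counter

def most_frequent(values):
    """Pick the most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

def longest(values):
    """Pick the longest value, assuming longer means more complete."""
    return max(values, key=len)

def survive(group, rules, default=most_frequent):
    """Build one best-of-breed record from a group of matched records."""
    fields = {field for record in group for field in record}
    best = {}
    for field in sorted(fields):
        # Empty values never survive.
        values = [record[field] for record in group if record.get(field)]
        if values:
            best[field] = rules.get(field, default)(values)
    return best

# Hypothetical matched group, loosely based on the "Roberts" duplicates.
group = [
    {"name": "Kate A. Roberts", "state": "MA", "zip": "02116"},
    {"name": "Catherine Roberts", "state": "MA", "zip": "02116"},
    {"name": "Mrs. K. Roberts", "state": "", "zip": "02116"},
]
best = survive(group, rules={"name": longest})
```

Keeping the rules in a plain dictionary means the business can change which value wins for a field without touching the merge logic.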
Tested & Proven Methodology
1. Data Extraction & Metadata Capture
2. Data Audit & Profiling
3. Data Cleansing
4. Data Mapping Alignment & Conversion
5. Data Validation
6. Test Migration
7. Data Load
Process flow
• Source 1, Source 2 and Source 3 are extracted as flat files into the Staging Area
• Data Cleansing produces cleansed, consolidated and merged data, plus rejected or unmatched data
• Rejected and unmatched data will be sent for Clerical Review; review and fine-tune the cleansing rules
• Mapping Criteria bring the cleansed, consolidated and merged data to a Common Target Format
• Populate and load the cleansed / merged data into the Staging DB
• UAT and Test Case: pass or fail
• On a failed UAT and Test Case, re-evaluate the Mapping Criteria and repeat step 4 onwards
Master Data
- Asset type
- Addresses
- Region
- State
- Daerah (district)
- PostCode
- City
- Tmn / Bdr (Taman / Bandar: housing estate / town)
- Jln / Lrg (Jalan / Lorong: road / lane)
- Rates
- etc
Comprehensive Data Profiling Process
Sample of Data Profiling outputs
Discovering Quality of Data
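The profiling outputs referred to here are typically frequency counts of values and of value patterns per field. A minimal sketch of such an output follows, assuming one column is passed in as a list of strings; the function names and the sample postcodes are made up for illustration.

```python
from collections import Counter
import re

def shape(value):
    """Reduce a value to its pattern: letters become A, digits become 9."""
    value = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"\d", "9", value)

def profile(values):
    """Return completeness, distinct count, and value/pattern frequencies."""
    filled = [v for v in values if v and v.strip()]
    return {
        "count": len(values),
        "completeness": len(filled) / len(values) if values else 0.0,
        "distinct": len(set(filled)),
        "top_values": Counter(filled).most_common(3),
        "patterns": Counter(shape(v) for v in filled).most_common(),
    }

# Illustrative postcode column: one blank, one with a letter O typed for a zero.
postcodes = ["50100", "50200", "50100", "", "5O250", "50100"]
report = profile(postcodes)
```

The pattern counts expose the odd `9A999` entry immediately, even though it would be easy to miss when scanning the raw values.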
CASE: Fire Risk Accumulation
Ideal situation
• Knowing location of insured assets
• Accumulate Sum Insured
• Monitor against Risk Thresholds
CASE: Fire Risk Accumulation - tokenized
Key data
• Insured Asset Address
Columns: ADDRESS | RESIDUAL_ADD | BUILDING | ROAD | GARDEN | POSTCODE | CITY | STATE (each original address below is followed by its tokenized form)
LOT XXX, KOMPLEKS CAYMAN, 08000 SUNGAI PETANI, KEDAH
LOT XXX KOMPLEKS CAYMAN 08000 SUNGAI PETANI KEDAH
XXX, KOMPLEKS PKNP, KG. TEKEK, PULAU TIOMAN, 26800 ROMPIN, PAHANG DARUL MAKMUR
XXX,,KG TEKEK,PULAU TIOMAN KOMPLEKS PKNP 26800 ROMPIN PAHANG
LOT NO TXXX7 TYPE THE GRAND PHASE CP6C 41200 KLANG SELANGOR
LOT NO TXXX7 TYPE THE GRAND PHASE CP6C 41200 KLANG SELANGOR
XXX46 1ST FLOOR THE MINES SHOPPING FAIR 43300 SERI KEMBANGAN, SELANGOR
XXX46 1ST FLOOR THE MINES SHOPPING FAIR 43300 SERI KEMBANGAN SELANGOR
LOT NO.LXXX7 THE MINES SHOPPING FAIR JALAN DULANG, MINES RESORT CITY SERI KEMBANGAN
LOT NO LXXX7 THE MINES SHOPPING FAIR JALAN DULANG,MINES RESORT CITY 43300 SERI KEMBANGAN SELANGOR
UNIT NO XXX TYPE A-2 BLOCK A THE REEF 48000 RAWANG SELANGOR
UNIT NO XXX TYPE A-2 BLK A THE REEF 48000 RAWANG SELANGOR
SUITE XXX01, 11TH FLOOR, WISMA HANGSAM, 1, JALAN HANG LEKIR, 50000 KUALA LUMPUR.
STE XXX01,11TH FLOOR,,1, WISMA HANGSAM JALAN HANG LEKIR 50000 KUALA LUMPUR WP KUALA LUMPUR
BLOCK XXX3-7 MENARA CITY ONE JALAN MUNSHI ABDULLAH KUALA LUMPUR
BLK XXX3-7 MENARA CITY ONE JALAN MUNSHI ABDULLAH 50000 KUALA LUMPUR WP KUALA LUMPUR
NO XXX6-8 & XXX7-8 MENARA CITY ONE JLN MUNSHI ABDULLAH 50100 WILAYAH PERSEKUTUAN K.LUMPUR.
NO XXX6-8 & XXX7-8 MENARA CITY ONE JLN MUNSHI ABDULLAH WILAYAH PERSEKUTUAN K LUMPUR 50100 WP KUALA LUMPUR
XXX2-03 MENARA CITY ONE NO 3 JALAN MUNSHI ABDULLAH 50100 KUALA LUMPUR WILAYAH PERSEKUTUAN
XXX2-03 NO 3 MENARA CITY ONE JALAN MUNSHI ABDULLAH 50100 KUALA LUMPUR WP KUALA LUMPUR
XXX3 PLAZA SEE HOY CHAN, JALAN RAJA CHULAN 50200 KUALA LUMPUR
XXX 3 , PLAZA SEE HOY CHAN JALAN RAJA CHULAN 50200 KUALA LUMPUR WP KUALA LUMPUR
XXX7-3 MENARA ANTARA, NO 11 JALAN BUKIT CEYLON 50200 KUALA LUMPUR
XXX7-3 ,NO 11 MENARA ANTARA JALAN BUKIT CEYLON 50200 KUALA LUMPUR WP KUALA LUMPUR
CP58, SUITE XXX05-06 18TH FLOOR CENTRAL PLAZA 34 JALAN SULTAN ISMAIL 50250 KUALA LUMPUR
CP58,STE XXX05-06 18TH FLOOR CENTRAL PLAZA 34 JALAN SULTAN ISMAIL 50250 KUALA LUMPUR WP KUALA LUMPUR
LOT XXX9, GROUND FLOOR THE MALL NO 100 JALAN PUTRA 50350 KUALA LUMPUR
LOT XXX9,GROUND FLOOR NO 100 THE MALL JALAN PUTRA 50350 KUALA LUMPUR WP KUALA LUMPUR
XXXH FOOLR MENARA TH SELBORN NO.153 JALAN TUN RAZAK 50400 KUALA LUMPUR
XXXH FOOLR NO 153 MENARA TH SELBORN JALAN TUN RAZAK 50400 KUALA LUMPUR WP KUALA LUMPUR
NO 27-7 MENARA PERMATA DAMANSARA NO 685 JALAN DAMANSARA 60000 KUALA LUMPUR
NO 27-7 NO 685 MENARA PERMATA DAMANSARA JALAN DAMANSARA 60000 KUALA LUMPUR WP KUALA LUMPUR
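The tokenized rows above split each free-form address into components. A much simplified sketch of that idea follows, assuming a 5-digit postcode followed by the city and a small keyword list for roads; `ROAD_WORDS`, `STANDARD_FORMS` and `tokenize_address` are hypothetical names, and this is not the rule set that produced the slide.

```python
import re

ROAD_WORDS = ("JALAN", "JLN", "LORONG", "LRG")   # assumed road keywords
STANDARD_FORMS = {"SUITE": "STE", "BLOCK": "BLK"}  # abbreviations seen above

def tokenize_address(raw):
    """Split a free-form address into residual, road, postcode and city."""
    text = re.sub(r"[.,]+", " ", raw.upper())
    words = [STANDARD_FORMS.get(w, w) for w in text.split()]
    parsed = {"residual": "", "road": "", "postcode": "", "city": ""}

    # Postcode: first standalone 5-digit token; the city is assumed to follow it.
    for i, word in enumerate(words):
        if re.fullmatch(r"\d{5}", word):
            parsed["postcode"] = word
            parsed["city"] = " ".join(words[i + 1:])
            words = words[:i]
            break

    # Road: from the first road keyword to the end of what is left.
    for i, word in enumerate(words):
        if word in ROAD_WORDS:
            parsed["road"] = " ".join(words[i:])
            words = words[:i]
            break

    parsed["residual"] = " ".join(words)
    return parsed

example = tokenize_address(
    "XXX3 PLAZA SEE HOY CHAN, JALAN RAJA CHULAN 50200 KUALA LUMPUR"
)
```

Real addresses need far more rules than this: building keywords, state lookups, and handling for other 5-digit numbers such as land title references.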
CASE: Fire Risk Accumulation
Structured and complete addresses
• Grouping by address
• Accumulating by location
• Decision Support
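Once addresses are structured, grouping and accumulating reduce to a keyed sum that is compared against a threshold. A minimal sketch follows, assuming each policy record carries tokenized `building` and `postcode` fields and a `sum_insured` amount; the amounts and the threshold are invented purely for illustration.

```python
from collections import defaultdict

def accumulate(policies, key_fields=("building", "postcode")):
    """Sum the sum insured per location key."""
    totals = defaultdict(float)
    for policy in policies:
        key = tuple(policy[field] for field in key_fields)
        totals[key] += policy["sum_insured"]
    return dict(totals)

def over_threshold(totals, threshold):
    """Return only the locations whose accumulated exposure exceeds the threshold."""
    return {key: total for key, total in totals.items() if total > threshold}

# Hypothetical policies; amounts are illustrative, not from the deck.
policies = [
    {"building": "MENARA CITY ONE", "postcode": "50100", "sum_insured": 2_000_000},
    {"building": "MENARA CITY ONE", "postcode": "50100", "sum_insured": 3_500_000},
    {"building": "PLAZA SEE HOY CHAN", "postcode": "50200", "sum_insured": 1_000_000},
]
totals = accumulate(policies)
alerts = over_threshold(totals, threshold=5_000_000)
```

The grouping only works if the same building is always tokenized the same way, which is why standardization comes before accumulation.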
CASE: Fire Risk Accumulation - partial
Key data
• Insured Asset Address
ADDRESS RESIDUAL_ADD BUILDING ROAD GARDEN POSTCODE CITY STATE
UNIT NO XXX TAMAN SRI PUTRA P T NO 337 IN THE 41000 PEKAN OF KLANG SELANGOR
UNIT NO XXX THE PEKAN OF TAMAN SRI PUTRA P T NO 337 IN 41000 KLANG SELANGOR
NO.X XXTH MILE, OFF THE FEDERAL HIGHWAY 47300 PETALING JAYA, SELANGOR DARUL EHSAN
NO X XXTH MILE,OFF THE FEDERAL HWY 47300 PETALING SELANGOR
PARCEL LOT NO RS/GXXX RETAIL SHOP(GROUND FLOOR) PLAZA SRI MUDA SEKSYEN 25 SHAH ALAM SELANGOR
PARCEL LOT NO RS/GXXX RETAIL SHOP GROUND FLOOR SEKSYEN 25 PLAZA SRI MUDA 40000 SHAH ALAM SELANGOR
PARCEL LOT NO RS/GXXX RETAIL SHOP(GROUND FLOOR) (PLAZA SRI MUDA SEKSYEN 25 SHAH ALAM SELANGOR)
PARCEL LOT NO RS/GXXX RETAIL SHOP GROUND FLOOR SEKSYEN 25 PLAZA SRI MUDA 40000 SHAH ALAM SELANGOR
XX, PLAZA PUCHONG JALAN PUCHONG MESRA 1 58200 KUALA LUMPUR W.P KUALA LUMPUR
XX, PLAZA PUCHONG JALAN PUCHONG MESRA 1 KUALA LUMPUR W P 58200 KUALA LUMPUR WP KUALA LUMPUR
NO.BXXXX PLAZA MOUNT KIARA NO.2 JALAN KIARA KUALA LUMPUR GM 6147 LOT 56054 MK BATU 50480 KL
NO BXXXX NO 2 PLAZA MT KIARA JALAN KIARA KUALA LUMPUR GM 6147 LOT 56054 MK BATU 50480 KUALA LUMPUR WP KUALA LUMPUR
THE LEGENDS GOLF & COUNTRY RESORT, LOT XXXX, KEBUN SEDENAK, P.O. BOX 11, KULAI , JOHOR
LOT XXXX,KEBUN SEDENAK,P O BOX 11 THE LEGENDS GOLF& COUNTRY RESORT, 81000 KULAI JOHOR
XXX & XXX FLOOR, KOMPLEKS YAYASAN BELIA SEDUNIA (WYF COMPLEX), LEBOH AYER KEROH, 75450 MELAKA
XXX& XXX FLOOR,, KOMPLEKS YAYASAN BELIA SEDUNIA WYF COMPLEX LEBUH 75450 AYER KEROH MELAKA
CASE: Fire Risk Accumulation – data acquisition
Key data
• Insured Asset Address
ADDRESS RESIDUAL_ADD BUILDING ROAD GARDEN POSTCODE CITY STATE
HS(D) 95160 PT8007 MUKIM OF RASAH DISTRICT OF SEREMBAN 70000 NEGERI SEMBILAN
HS D 95160 PT8007 MUKIM OF RASAH DISTRICT OF 70000 SEREMBAN NEGERI SEMBILAN
H.S (D) 36900 PT 32027 MK KAJANG LOT XX BLK E KWS PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG
H S D 36900 PT 32027 MK KAJANG LOT XX BLK E KWS PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG SELANGOR
H.S.(D) 36901 PT 32028 MK KAJANG LOT XX BLK E KAW PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG
H S D 36901 PT 32028 MK KAJANG LOT XX BLK E KAW PERUSAHAAN BKT ANGKAT SG CHUA 43000 KAJANG SELANGOR
HS(D)70271 PT NO PARCEL 66XXX HS(D)70271 PT1580 41000 KLANG SELANGOR DARUL EHSAN
HS D 70271 PT NO PARCEL 66XXX HS D 70271 PT1580 41000 KLANG SELANGOR
CASE: Fire Risk Accumulation – data enhancement
Geo-coding locations
Latitude 37.775837
Longitude -122.39557
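With latitude and longitude attached to each insured asset, exposure can also be accumulated by distance rather than by address text. A minimal sketch follows, assuming great-circle (haversine) distance on a spherical Earth; the asset records, amounts and radius are hypothetical, and only the centre point is the coordinate pair shown above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def exposure_within(assets, lat, lon, radius_km):
    """Total sum insured of assets within radius_km of the given point."""
    return sum(
        asset["sum_insured"]
        for asset in assets
        if haversine_km(lat, lon, asset["lat"], asset["lon"]) <= radius_km
    )

# Hypothetical geo-coded assets; amounts are illustrative only.
assets = [
    {"lat": 37.776, "lon": -122.395, "sum_insured": 2_000_000},
    {"lat": 37.8044, "lon": -122.2712, "sum_insured": 4_000_000},
]
total = exposure_within(assets, 37.775837, -122.39557, radius_km=1.0)
```

Here only the first asset falls inside the 1 km radius, so only its sum insured is counted.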
COMMERCIAL BREAK
Established 2003
Data Warehouse
• ETL
• Data Quality Assurance
• Data Cleansing
Data Re-Engineering
Retain the Best Information
[Figure: Data Weights Histogram. y-axis: # of Pairs (0 to 4000); x-axis ticks from -50 to 60; series: UnMatched, Matched]
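A weights histogram like this one comes from scoring every candidate record pair and separating matched from unmatched pairs by cutoff. The deck does not say how its weights are computed; the sketch below assumes a probabilistic (Fellegi-Sunter style) scheme, and the m/u probabilities, field names and cutoffs are hypothetical.

```python
import math

# Hypothetical probabilities per field:
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "street": (0.85, 0.02),
}

def pair_weight(a, b, probs=FIELD_PROBS):
    """Sum agreement and disagreement weights over all compared fields."""
    total = 0.0
    for field, (m, u) in probs.items():
        if a.get(field) == b.get(field):
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return total

def classify_pair(weight, lower=0.0, upper=10.0):
    """Assumed cutoffs: above upper is a match, below lower is a non-match."""
    if weight >= upper:
        return "matched"
    if weight <= lower:
        return "unmatched"
    return "clerical review"

a = {"name": "IBM", "postcode": "01456", "street": "187 N PARK ST"}
b = {"name": "IBM", "postcode": "04156", "street": "187 N PARK ST"}
c = {"name": "INTER-NATION CONSULTS", "postcode": "02341", "street": "15 MAIN ST"}
same, close, different = pair_weight(a, a), pair_weight(a, b), pair_weight(a, c)
```

Pairs that land between the two cutoffs are the ones a reviewer would look at by hand.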
Input File (Address Line 1, Address Line 2):
639 N MILLS AVENUE ORLANDO, FLA 32803
306 W MAIN STR, CUMMING, GA 30130
3142 WEST CENTRAL AV TOLEDO OH 43606
843 HEARD AVE AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 EVANS GA 30809
1775 RUSSELL CIRCLE MILLIS MASSACH USETTS 02038
Result File:
House # | Dir | Str. Name | Type | Unit | No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE | 536 | ON | EVANS | E152 | GA | 30809 |
1775 | | RUSSELL | CIR | | | RASAL | MILLIS | L260 | MA | 02038 |
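The SOUNDEX column in the result file is a phonetic code computed on the city name, so that spelling variants of the same word share one key. Below is a minimal sketch of standard American Soundex; it is not necessarily the variant the cleansing tool uses, and the NYSIIS column is not covered here.

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

codes = {city: soundex(city) for city in
         ["ORLANDO", "CUMMING", "TOLEDO", "AUGUSTA", "EVANS"]}
```

This reproduces the codes shown for ORLANDO, CUMMING, TOLEDO, AUGUSTA and EVANS. It yields M420 for MILLIS rather than the L260 on the slide, so that row may reflect a different variant or a slip in the slide.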
Data Re-Engineering
Data Cleansing
• Data Discovery
• Data Standardization
• Data Matching
• Data Survival
Data Migration
• Data Extraction
• Data Cleansing
• Data Conversion
• Data Loading
Professional Services
• Data Warehouse Infrastructure
• Data Warehouse Design & Build
• Enterprise Data Integration
• Data Cleansing
[Diagram labels: BI, Data Repository, Financial Cubes, HR Cubes]
Data Quality, Cleansing & ETL
• TMB Subscriber Profile Cleansing; CRM source data cleansing and filtering
• DBKL source data cleansing / migration
• CIMB 1Platform Data Warehouse migration
• BIMB Core Banking System Data Cleansing/Re-Org
• CIMB Aviva EDW ETL and Cleansing
• Celcom Data Quality Profiling for DWH
Data Quality, Cleansing & ETL: Local and Regional Clients
• CIMB – EDW Platform migration
• CELCOM (IBM) – Data Quality Profiling for EDW
• DBKL (IAC) – SAP Data Cleansing & Migration (MY)
• CIMB/Aviva (ACT) – Data Quality Assessment, Cleansing, ETL (MY)
• CIMB Bank (IBM) – ETL, DataStage Enterprise upgrade (MY)
• Maxis (IBM) – ETL, Dealer Incentive Analysis (MY)
• CIMB Bank – ETL, Data Profiling and Data Cleansing (MY)
• Bank Islam – Data Cleansing (MY)
• Telekom Malaysia, Data Cleansing – Customer Segmentation, Group Marketing (MY)
• Hutchinson Indonesia (ACW, Aus) – Data Warehouse – Call Behavior (Ind)
• Telekom Malaysia, Data Cleansing (Accenture) – iCARE (MY)
• Telekom Malaysia, Data Mart – Call Usage (MY)
• LHDN (IBM) – Data Profiling and Cleansing (MY)
• General Hospital, CDC (IBM) – Data Cleansing (MY)
• Brunei Prime Minister Office, CRM (BWN)
• Bernas, HR/Payroll (MY)
• Maxis, B.I. (MY)
• Thai Farmer Bank (IBM), ETL infra (TH)
• Bank of Thailand (IBM), ETL infra (TH)
(Partial client list, in reverse chronological order)
Thank You