Census Bureau DRIS Date: 01/16/2007. 2Index Data Modeling Data Modeling Current Datafile Current...


Page 1: Census Bureau DRIS

Date: 01/16/2007

Page 2: Index

- Data Modeling
- Current Datafile
- Current Dataload
- Data Overlook
- Two Approaches
- First Approach
  - Data Distribution
  - Advantages
  - Disadvantages

Page 3: Index (continued)

- Second Approach
  - Basic Modeling
  - Advantages
  - Advance Work
  - Care Needed
- Our Recommendation
- Tasks

Page 4: Data modeling

- Conversion of data from Legacy (Fortran) to RDBMS (Oracle)
- Hardware/software
  - Sun V890/E12K, OS Solaris 5.7, 5.8, 5.9, 5.10
  - Database - Oracle 10g
  - Oracle Designer / ERwin

Page 5: Current datafile

[Diagram. Labels: Big datafile; Geo; Census; Base Data; Legacy process; Data modeling; Oracle db; Reports; Data Feeds; Data updates; PL/SQL, Shell, C, ETL tool]

Page 6: Current Dataload

- UCNM data
- Fortran format
- One big file w/ 180 M records
- Record length is 1543 bytes
- Most of the fields are varchar2
- Many fields are blank/no data
- Performance too poor in Oracle
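A fixed-length record like the one described above can be sliced into fields by offset. The following is only a minimal sketch: the field names, offsets, and widths in LAYOUT are hypothetical placeholders (the slides do not give the real layout), and it assumes ASCII text with no record separators between records.

```python
from typing import Dict, List, Tuple

RECORD_LENGTH = 1543  # bytes per record, from the slide

# Hypothetical layout: (field name, start offset, length).
# The real layout is not given in the slides.
LAYOUT: List[Tuple[str, int, int]] = [
    ("MAFID", 0, 9),
    ("STATE", 9, 2),
    ("STREET_NAME", 11, 100),
]


def parse_record(record: bytes) -> Dict[str, str]:
    """Slice one fixed-length record into stripped text fields."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} bytes, got {len(record)}")
    return {
        name: record[start:start + length].decode("ascii").strip()
        for name, start, length in LAYOUT
    }


def blank_fields(fields: Dict[str, str]) -> List[str]:
    """Return the names of fields that carry no data."""
    return [name for name, value in fields.items() if not value]


# One sample record: a MAFID, a state code, and blanks for everything else.
sample = b"123456789" + b"CA" + b" " * (RECORD_LENGTH - 11)
parsed = parse_record(sample)
print(parsed["MAFID"], parsed["STATE"], blank_fields(parsed))
```

Counting blank fields this way over a sample of records is one way to see how much of each 1543-byte record is padding.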

Page 7: Data overlook (approx)

State                   Records   Size
State of NY             20 M      31 G
State of CA             34 M      52 G
State of TX             25 M      38 G
District of Columbia    500 K     750 M
Delaware                1 M       1.5 G
Connecticut             1 M       1.5 G
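These approximate sizes are consistent with multiplying each record count by the 1543-byte record length given on the Current Dataload slide. A short check, assuming decimal gigabytes (1 GB = 10^9 bytes) and treating the slide's record counts as round numbers:

```python
RECORD_LENGTH = 1543  # bytes per record, from the Current Dataload slide

# Approximate record counts as listed on the slide.
record_counts = {
    "NY": 20_000_000,
    "CA": 34_000_000,
    "TX": 25_000_000,
    "DC": 500_000,
    "DE": 1_000_000,
    "CT": 1_000_000,
}


def raw_size_gb(records: int) -> float:
    """Raw data size in decimal gigabytes for a given record count."""
    return records * RECORD_LENGTH / 1e9


for state, count in record_counts.items():
    print(f"{state}: {raw_size_gb(count):.2f} GB")
```

By the same arithmetic, the full 180 M record file comes to roughly 278 GB of raw data.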

Page 8: Two approaches

First Approach
- Break datafile on the basis of data
  - E.g. RO level (12)
  - State level (54-56), including DC, Puerto Rico etc.

Second Approach
- Break datafile into multiple tables with change in field definitions using relational model

Page 9: First approach

Break datafile on the basis of data

[Diagram: Current datafile split into Table_CA, Table_NY, Table_XX, Table_YY, ... Table_54]
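The split shown in this diagram can be sketched as routing each record to a per-state bucket. This is an illustration only: the position of the state code (STATE_FIELD) and the sample records are hypothetical, and in-memory lists stand in for the Oracle tables.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical position of a 2-character state code within each record.
STATE_FIELD = slice(9, 11)


def partition_by_state(records: Iterable[str]) -> Dict[str, List[str]]:
    """Group records into one bucket per state, named like the slide's tables."""
    tables: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        tables["Table_" + record[STATE_FIELD]].append(record)
    return dict(tables)


records = [
    "000000001CA",
    "000000002CA",
    "000000003NY",
    "000000004DC",
    "000000005CA",
]
tables = partition_by_state(records)
counts = {name: len(rows) for name, rows in tables.items()}
print(counts)  # prints {'Table_CA': 3, 'Table_NY': 1, 'Table_DC': 1}
```

Even in this tiny sample the buckets come out uneven, which is the distribution problem this approach runs into at full scale.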

Page 10: Data distribution

- Uneven data distribution
- Big data tables will be 30+ G
- Small data tables will be around 1 G or less

Page 11: Advantages

- State level queries will be faster than they are currently
- If the data is separated by RO, the data will be more evenly distributed, with fewer tables (close to 12 instead of 54-56)

Page 12: Disadvantages

- Too many tables
- Many fields are empty and varchar2(100)
- No normalization technique is used
- Existing queries need to be changed a lot
- For small tables, queries will run fast, but for big tables there will be a lot of overhead
- Operational tables will be the same in number
- Too complicated to run queries; may confuse users while joining main and operational tables

Page 13: Second approach

Break datafile into a few relational tables with change in field definitions

[Diagram: Current datafile split into Table1, Table2, Table3, Table4, linked by MAFID]

Page 14: Basic Modeling

- Database design: logical and physical
- Relations will be defined based on a primary key
  - In this case, it will be MAFID, which is unique
- varchar2(100) fields will be converted to smaller fields, say varchar2(60) or smaller, based on actual field lengths
- All fields will be mapped to at least one of the fields in the new tables
- Data will be inserted into multiple small tables
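The modeling points above can be sketched as a small schema keyed by MAFID. This is an illustration under stated assumptions, not the proposed design: SQLite (via Python's sqlite3 module) stands in for Oracle 10g, and the table names, column names, and the choice to skip blank values in the second table are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only when enabled
conn.executescript("""
CREATE TABLE table1 (
    mafid TEXT PRIMARY KEY,
    state TEXT NOT NULL
);
CREATE TABLE table2 (
    mafid       TEXT PRIMARY KEY REFERENCES table1(mafid),
    street_name TEXT
);
""")


def load_record(connection, rec):
    """Insert one flat record into the small tables, skipping blank fields."""
    connection.execute(
        "INSERT INTO table1 (mafid, state) VALUES (?, ?)",
        (rec["MAFID"], rec["STATE"]),
    )
    if rec.get("STREET_NAME"):  # blank values get no row in table2
        connection.execute(
            "INSERT INTO table2 (mafid, street_name) VALUES (?, ?)",
            (rec["MAFID"], rec["STREET_NAME"]),
        )


load_record(conn, {"MAFID": "000000001", "STATE": "CA", "STREET_NAME": "MAIN ST"})
load_record(conn, {"MAFID": "000000002", "STATE": "NY", "STREET_NAME": ""})

# Rebuild the flat view by joining on the shared MAFID key.
rows = conn.execute("""
    SELECT t1.mafid, t1.state, t2.street_name
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.mafid = t1.mafid
    ORDER BY t1.mafid
""").fetchall()
print(rows)
```

The LEFT JOIN returns every MAFID even when a sparse field has no row, so no record disappears from the rebuilt view.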

Page 15: Advantages

- Faster
  - Queries
  - Updates
  - Deletes
  - Additions
- Less maintenance
- Same approach can be used for transactional/operational data

Page 16: Advance work

- Identify each and every field of UNM data
- Check/Define field lengths of each field
- Map every field to new table field
- Can some fields be merged together?
  - If yes, identify those
- Define tables and relationships
- Break and load data into these tables
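The field-length check in the list above can be sketched as a pass over sample records that tracks the longest non-blank value seen for each field. The layout and sample values here are hypothetical; the real field definitions would come from the data itself.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: (field name, start offset, declared length).
LAYOUT: List[Tuple[str, int, int]] = [
    ("MAFID", 0, 9),
    ("STREET_NAME", 9, 100),
]


def max_field_lengths(records: Iterable[str]) -> Dict[str, int]:
    """Return the longest stripped value observed for each field."""
    widths = {name: 0 for name, _, _ in LAYOUT}
    for record in records:
        for name, start, length in LAYOUT:
            value = record[start:start + length].strip()
            widths[name] = max(widths[name], len(value))
    return widths


records = [
    "000000001" + "MAIN ST".ljust(100),
    "000000002" + "".ljust(100),
    "000000003" + "MARTIN LUTHER KING JR BLVD".ljust(100),
]
widths = max_field_lengths(records)
print(widths)  # prints {'MAFID': 9, 'STREET_NAME': 26}
```

Observed maxima like these give a lower bound for each new column width; how much headroom to add on top is a separate design decision.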

Page 17: Care needed

- Current datafile will be broken into multiple datafiles for data processing
- Load the datafiles into tables one by one
- Make sure that all datafiles are loaded into the multiple tables
- Make sure no data is missing from the base table
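One way to guard against lost data while splitting is to cut the file strictly on record boundaries and reconcile record counts afterwards. A minimal sketch, assuming fixed 1543-byte records with no separators; an in-memory stream stands in for the real datafile, and the chunk size is arbitrary.

```python
import io
from typing import BinaryIO, Iterator, List

RECORD_LENGTH = 1543  # bytes per record, from the Current Dataload slide


def split_records(stream: BinaryIO, records_per_chunk: int) -> Iterator[List[bytes]]:
    """Yield lists of whole records, records_per_chunk at a time."""
    chunk: List[bytes] = []
    while True:
        record = stream.read(RECORD_LENGTH)
        if not record:
            break
        if len(record) != RECORD_LENGTH:
            raise ValueError("truncated record at end of file")
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, chunk
        yield chunk


# Seven dummy records split into chunks of three.
source = io.BytesIO(b"x" * RECORD_LENGTH * 7)
chunks = list(split_records(source, 3))
chunk_sizes = [len(chunk) for chunk in chunks]
total_loaded = sum(chunk_sizes)
print(chunk_sizes, total_loaded)  # prints [3, 3, 1] 7
```

Comparing total_loaded against the source record count after every load is a cheap check that no datafile was skipped.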

Page 18: Our Recommendation

** Second Approach **

Why?
- Data distribution will be uniform
- Less-wanted data is moved to separate tables
  - This will reduce overhead on queries and updates
- Existing queries can be used with little modification
- Less maintenance
- Additional data, such as from RPS, can be easily uploaded using the same queries

Page 19: Tasks

- Design database using data modeling tool / Oracle Designer / ERwin etc.
- Create test data from original datafile
- Load test data into database tables
- Create test scripts to check data consistency
- Check indexes for required queries
- Test old data vs. new data
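The old-versus-new comparison could take the form of a per-MAFID digest check between the original records and the rows rebuilt from the new tables. This is one possible shape for such a test script, not a prescribed one; the rows are represented as plain dictionaries and the sample values are made up.

```python
import hashlib
from typing import Dict, List, Tuple


def row_digest(fields: Dict[str, str]) -> str:
    """Hash a row in a canonical field order so equal rows hash equally."""
    canonical = "|".join(f"{key}={fields[key]}" for key in sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_rows(
    old_rows: List[Dict[str, str]],
    new_rows: List[Dict[str, str]],
    key: str = "MAFID",
) -> Tuple[List[str], List[str], List[str]]:
    """Return keys that are missing, extra, or changed in the new data."""
    old = {row[key]: row_digest(row) for row in old_rows}
    new = {row[key]: row_digest(row) for row in new_rows}
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return missing, extra, changed


old_rows = [
    {"MAFID": "1", "STATE": "CA"},
    {"MAFID": "2", "STATE": "NY"},
    {"MAFID": "3", "STATE": "TX"},
]
new_rows = [
    {"MAFID": "1", "STATE": "CA"},
    {"MAFID": "2", "STATE": "NJ"},
    {"MAFID": "4", "STATE": "DE"},
]
missing, extra, changed = compare_rows(old_rows, new_rows)
print(missing, extra, changed)  # prints ['3'] ['4'] ['2']
```

A clean migration would produce three empty lists.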

Page 20: Tasks (continued)

- Break data into small files
- Load full data into tables
- Unit test on data for consistency
- Run queries on the database
- If needed, fine tune database
- Use same approach for transactional data like RPS data

Page 21: THE END