Census Bureau
DRIS
Date: 01/16/2007
2
Index
Data Modeling
Current Datafile
Current Dataload
Data Overlook
Two Approaches
First Approach
Data Distribution
Advantages
Disadvantages
3
Second Approach
Basic Modeling
Advantages
Advance Work
Care Needed
Our Recommendation
Tasks
4
Data modeling
Conversion of data from legacy (Fortran) to RDBMS (Oracle)
Hardware/software:
Sun V890/E12K, OS Solaris 5.7, 5.8, 5.9, 5.10
Database: Oracle 10g
Oracle Designer / Erwin
5
Current datafile
[Diagram. Labels: Big datafile; Geo; Census; Base Data; Legacy process; Data modeling; Oracle DB; Reports; Data Feeds; Data updates; PL/SQL, Shell, C, ETL tool]
6
Current Dataload
UCNM data
Fortran format
One big file with 180 M records
Record length is 1543 bytes
Most of the fields are varchar2
Many fields are blank/no data
Performance is too poor in Oracle
7
Data overlook (approx.)
State                   Records   Size
State of NY             20 M      31 G
State of CA             34 M      52 G
State of TX             25 M      38 G
District of Columbia    500 K     750 M
Delaware                1 M       1.5 G
Connecticut             1 M       1.5 G
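The sizes on this slide are roughly the record count multiplied by the 1543-byte record length given on the Current Dataload slide. A minimal Python sketch of that arithmetic follows. The function name is illustrative, and decimal units (1 G = 10^9 bytes) are an assumption, since the slides do not say which units they use.

```python
RECORD_LENGTH_BYTES = 1543  # record length stated on the Current Dataload slide


def estimated_size_gb(record_count: int) -> float:
    """Raw size in decimal gigabytes (1 G = 10**9 bytes, an assumption)."""
    return record_count * RECORD_LENGTH_BYTES / 1e9


# Approximate record counts taken from the slide.
approx_records = {
    "NY": 20_000_000,
    "CA": 34_000_000,
    "TX": 25_000_000,
    "DC": 500_000,
    "DE": 1_000_000,
    "CT": 1_000_000,
}

for state, count in approx_records.items():
    print(f"{state}: {estimated_size_gb(count):.2f} G")

# Whole file: 180 M records.
print(f"Total: {estimated_size_gb(180_000_000):.1f} G")
```

For 20 M records this gives about 30.9 G, in line with the 31 G shown for NY. The full 180 M record file comes to roughly 278 G of raw data before any Oracle storage overhead.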
8
Two approaches
First Approach
Break the datafile on the basis of data, e.g.
RO level (12)
State level (54-56), including DC, Puerto Rico, etc.
Second Approach
Break the datafile into multiple tables, with changes in field definitions, using the relational model
9
First approach
Break datafile on the basis of data
[Diagram: Current datafile split into Table_CA, Table_NY, Table_XX, Table_YY, Table_54]
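As a rough illustration of the first approach, the Python sketch below routes fixed-width records into per-state buckets named like the tables in the diagram. The position of the state code within the record (STATE_SLICE) and the sample records are hypothetical; the slides do not describe the record layout.

```python
from collections import defaultdict

# Hypothetical layout: assume a 2-character state code at the start of each
# record. The real layout of the datafile is not given in the slides.
STATE_SLICE = slice(0, 2)


def split_by_state(records):
    """Group records into per-state buckets keyed by a table name."""
    tables = defaultdict(list)
    for record in records:
        state = record[STATE_SLICE]
        tables[f"Table_{state}"].append(record)
    return dict(tables)


# Made-up sample records, for illustration only.
sample = ["CA0001", "NY0002", "CA0003"]
tables = split_by_state(sample)
print({name: len(rows) for name, rows in tables.items()})
```

Splitting by RO instead of by state would only change the key used for the bucket name.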
10
Data distribution
Uneven data distribution
Big data tables will be 30+ G
Small data tables will be around 1 G or less
11
Advantages
State-level queries will be faster than they are currently
If the data is separated by RO, it will be more evenly distributed, with fewer tables (close to 12 instead of 54-56)
12
Disadvantages
Too many tables
Many fields are empty and varchar2(100)
No normalization technique is used
Existing queries need to be changed a lot
For small tables, queries will run fast, but for big tables there will be a lot of overhead
Operational tables will be the same in number
Too complicated to run queries; may confuse users when joining main and operational tables
13
Second approach
Break datafile into a few relational tables with changes in field definitions
[Diagram: Current datafile split into Table1, Table2, Table3, Table4, each labeled with MAFID]
14
Basic Modeling
Database design (logical and physical)
Relations will be defined based on a primary key
In this case, it will be MAFID, which is unique
varchar2(100) fields will be converted to smaller fields, say varchar2(60) or smaller, based on actual field lengths
All fields will be mapped to at least one field in the new tables
Data will be inserted into multiple small tables
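A minimal sketch of this modeling step, using Python's built-in sqlite3 module as a stand-in for Oracle 10g. The table and column names (address, geo, street, tract), the column widths, and the integer type for MAFID are hypothetical; only the MAFID key itself comes from the slides. Blank optional fields are simply not inserted into the secondary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        mafid  INTEGER PRIMARY KEY,      -- unique key (type is an assumption)
        street VARCHAR(60) NOT NULL      -- narrowed from varchar2(100)
    );
    CREATE TABLE geo (
        mafid INTEGER PRIMARY KEY REFERENCES address(mafid),
        tract VARCHAR(11) NOT NULL       -- hypothetical optional field
    );
""")

# Wide legacy-style records; a blank tract stands for "no data".
wide_records = [
    (1, "12 MAIN ST", "000100"),
    (2, "7 OAK AVE", ""),
]

for mafid, street, tract in wide_records:
    conn.execute("INSERT INTO address VALUES (?, ?)", (mafid, street))
    if tract.strip():  # store the optional field only when it has data
        conn.execute("INSERT INTO geo VALUES (?, ?)", (mafid, tract))
conn.commit()

# Rebuild the wide view by joining on MAFID.
rows = conn.execute("""
    SELECT a.mafid, a.street, g.tract
    FROM address a LEFT JOIN geo g ON g.mafid = a.mafid
    ORDER BY a.mafid
""").fetchall()
print(rows)
```

The left join returns every MAFID from the main table, with the optional column empty where no row was stored, so the original wide record can still be rebuilt.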
15
Advantages
Faster queries, updates, deletes, and additions
Less maintenance
Same approach can be used for transactional/operational data
16
Advance work
Identify each and every field of the UNM data
Check/define the length of each field
Map every field to a new table field
Can some fields be merged together? If yes, identify those
Define tables and relationships
Break and load data into these tables
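The field-length check can be sketched in Python as below: for each field of a fixed-width record, track the longest non-blank value and how often the field is blank. The FIELDS layout and the sample records are hypothetical, since the slides do not list the actual fields.

```python
# Hypothetical layout: field name -> (start, end) offsets in the record.
FIELDS = {
    "mafid": (0, 9),
    "street": (9, 29),
    "unit": (29, 39),
}


def profile_fields(records):
    """Return {field: (max_trimmed_length, blank_count)}."""
    stats = {name: [0, 0] for name in FIELDS}
    for record in records:
        for name, (start, end) in FIELDS.items():
            value = record[start:end].strip()
            if value:
                stats[name][0] = max(stats[name][0], len(value))
            else:
                stats[name][1] += 1
    return {name: tuple(s) for name, s in stats.items()}


# Made-up fixed-width records matching the hypothetical layout.
sample = [
    "000000001" + "12 MAIN ST".ljust(20) + "APT 4".ljust(10),
    "000000002" + "7 OAK AVE".ljust(20) + " " * 10,
]
print(profile_fields(sample))
```

The maximum lengths suggest how far each varchar2 column can be narrowed, and a high blank count marks a field as a candidate for a separate table.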
17
Care needed
Current datafile will be broken into multiple datafiles for data processing
Load the datafiles into tables one by one
Make sure that all datafiles are loaded into the multiple tables
Make sure no data is missing from the base table
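One way to guard against lost records is to compare the data before and after splitting. The Python sketch below breaks a list of records into chunks and checks that the chunks, taken in order, reproduce the original. The chunk size, function names, and sample records are illustrative, not part of the proposal.

```python
def split_into_chunks(records, chunk_size):
    """Yield consecutive chunks of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def verify_no_loss(records, chunks):
    """True if the chunks, taken in order, reproduce the original records."""
    rebuilt = [record for chunk in chunks for record in chunk]
    return rebuilt == list(records)


records = [f"REC{i:04d}" for i in range(10)]
chunks = list(split_into_chunks(records, chunk_size=4))
print([len(c) for c in chunks])
print(verify_no_loss(records, chunks))
```

For a file of 180 M records the comparison would need to be streamed from disk, for example by comparing record counts or checksums per chunk, rather than holding everything in memory as this sketch does.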
18
Our Recommendation
** Second Approach **
Why?
Data distribution will be uniform
Less-needed data is moved to separate tables
This will reduce overhead on queries and updates
Existing queries can be used with little modification
Less maintenance
Additional data, such as from RPS, can be easily loaded using the same queries
19
Tasks
Design the database using a data modeling tool (Oracle Designer, Erwin, etc.)
Create test data from the original datafile
Load test data into database tables
Create test scripts to check data consistency
Check indexes for required queries
Test old data vs. new data
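The old-versus-new test could look something like the Python sketch below, which compares legacy records with records rebuilt from the new tables, both keyed by MAFID. The function name and the sample data are hypothetical.

```python
def compare_old_new(old_rows, new_rows):
    """Compare two {mafid: record} mappings.

    Returns (missing_in_new, extra_in_new, mismatched) as sorted lists.
    """
    old_keys, new_keys = set(old_rows), set(new_rows)
    missing = sorted(old_keys - new_keys)
    extra = sorted(new_keys - old_keys)
    mismatched = sorted(
        k for k in old_keys & new_keys if old_rows[k] != new_rows[k]
    )
    return missing, extra, mismatched


# Made-up data: MAFID 3 was never loaded and MAFID 2 changed during the load.
old = {1: ("12 MAIN ST", "000100"), 2: ("7 OAK AVE", ""), 3: ("9 ELM RD", "")}
new = {1: ("12 MAIN ST", "000100"), 2: ("7 OAK AVENUE", "")}
print(compare_old_new(old, new))
```

An empty result in all three lists would indicate that the test data loaded consistently.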
20
Continued…
Break data into small files
Load full data into tables
Unit test on data for consistency
Run queries on the database
If needed, fine-tune the database
Use the same approach for transactional data such as RPS data
21
THE END