
Census Bureau

DRIS

Date: 01/23/2007

Oracle conversion example: Initial approaches for UCM source data

Index

Data Modeling
Current Datafile
Current Dataload
Data Overview
Two Approaches
First Approach
Data Distribution
Advantages
Disadvantages
Second Approach
Basic Modeling
Advantages
Advance Work
Care needed
Our Recommendation
Tasks
Next steps
Questions?

Data modeling

Conversion of data from legacy (Fortran) to RDBMS (Oracle)

Hardware/software: Sun E6900, OS Solaris 5.10, 12 CPUs, 96 GB RAM
Database: Oracle 10g
Modeling tools: Oracle Designer / Erwin

Current datafile

[Diagram: one big datafile (Geo, Census, Base Data) feeds the legacy process; after data modeling, the data lands in the Oracle DB, which drives reports, data feeds, and data updates via PL/SQL, shell, C, and an ETL tool.]

Current Dataload

UCM data
Fortran format
One big file with 180 M records
Record length is 1543 bytes
Most of the fields are varchar2
Many fields are blank/no data
Performance in Oracle is inadequate without a schema redesign to leverage RDBMS capabilities (see the load sketch below)
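
As a rough illustration of how the fixed-length file could be exposed to Oracle 10g before any redesign, the sketch below defines an external table over it. The directory path, field names, and positions are assumptions for illustration, not the actual UCM record layout.

    -- Minimal sketch, assuming a server-side directory object and made-up
    -- field positions; the real layout must come from the UCM data dictionary.
    CREATE OR REPLACE DIRECTORY ucm_dir AS '/data/ucm';

    CREATE TABLE ucm_ext (
      mafid     VARCHAR2(12),
      state_cd  VARCHAR2(2),
      filler    VARCHAR2(100)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY ucm_dir
      ACCESS PARAMETERS (
        RECORDS FIXED 1543                        -- 1543-byte records
        FIELDS (
          mafid    POSITION(1:12)   CHAR(12),
          state_cd POSITION(13:14)  CHAR(2),
          filler   POSITION(15:114) CHAR(100)     -- stand-in for the rest
        )
      )
      LOCATION ('ucm_data.dat')
    )
    REJECT LIMIT UNLIMITED;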

Data Overview (approx)

State of NY:           20 M records, 31 GB
State of CA:           34 M records, 52 GB
State of TX:           25 M records, 38 GB
District of Columbia:  500 K records, 750 MB
Delaware:              1 M records,  1.5 GB
Connecticut:           1 M records,  1.5 GB

Two approaches

First Approach
Break the datafile on the basis of data, e.g.:
RO level (12)
State level (54-56, including DC, Puerto Rico, etc.)

Second Approach
Break the datafile into multiple tables, with changes to field definitions, using a relational model

First approach: break datafile on the basis of data

[Diagram: the current datafile is split into one table per state: Table_CA, Table_NY, Table_XX, Table_YY, ... Table_54.]
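
A sketch of how this per-state split could be scripted, reusing the assumed ucm_ext staging table and state_cd field from the load sketch above:

    -- Sketch only: one CREATE TABLE AS SELECT per state, repeated (or
    -- generated dynamically) for each of the 54-56 state-level codes.
    CREATE TABLE table_ca AS
      SELECT * FROM ucm_ext WHERE state_cd = '06';   -- '06' = CA FIPS code

    CREATE TABLE table_ny AS
      SELECT * FROM ucm_ext WHERE state_cd = '36';   -- '36' = NY FIPS code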

Data distribution

Uneven data distribution
Big data tables will be 30+ GB
Small data tables will be under 1 GB

Advantages of this kind of segmenting/partitioning:

State-level queries will be faster than they currently are
If the data is separated by RO, the data will be more evenly distributed across fewer tables (close to 12 instead of 54-56)

Disadvantages

Too many tables
Many fields are empty and defined as varchar2(100)
No normalization technique is used
Existing queries need to be changed a lot
Queries will run fast against the small tables, but the big tables will carry a lot of overhead
Operational tables will be the same in number
Queries become too complicated and may confuse users when joining main and operational tables

Second approach: break datafile into a few relational tables with changes in field definitions

[Diagram: the current datafile is split into Table1, Table2, Table3, and Table4, all linked through the shared MAFID key.]

Basic Modeling

Database design, logical and physical
Relations will be defined based on a primary key; in this case it will be MAFID, which is unique
varchar2(100) fields could be converted to smaller fields based on actual field lengths
All fields will be mapped to at least one of the fields in the new tables
Data will be inserted into multiple efficient tables based on the updated data model, using relational database design principles (a schema sketch follows this list)
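
A minimal sketch of what a MAFID-keyed relational schema could look like; the table names, columns, and sizes are illustrative assumptions, not the actual UCM model.

    -- Hypothetical parent table: one row per MAFID.
    CREATE TABLE maf_unit (
      mafid     NUMBER(12)  PRIMARY KEY,
      state_cd  VARCHAR2(2) NOT NULL,
      county_cd VARCHAR2(3)
    );

    -- Hypothetical child table: geography attributes in right-sized columns
    -- instead of blanket varchar2(100) fields.
    CREATE TABLE maf_geo (
      mafid    NUMBER(12) NOT NULL REFERENCES maf_unit (mafid),
      tract_cd VARCHAR2(6),
      block_cd VARCHAR2(4),
      CONSTRAINT maf_geo_pk PRIMARY KEY (mafid)
    );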

Advantages

Faster queries, updates, deletes, and additions
Less maintenance
The same approach can be used for transactional/operational data

Advance work

Identify each and every field of the UCM data
Check/define the field lengths of each field (see the profiling sketch below)
Map every field to the new schema
Can some fields be merged together?
Identify and remove duplicate data elements in the model
Define tables and relationships and create the new schema
Break and load the data into these tables
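
Actual field lengths can be measured against the staging data before the new columns are sized. A sketch, again assuming the ucm_ext external table, with tract_cd as a made-up stand-in for any of the record's fields:

    -- Profile one hypothetical field: longest non-blank value actually
    -- stored, plus how many rows leave the field blank.
    SELECT MAX(LENGTH(TRIM(tract_cd)))      AS max_used_len,
           COUNT(*) - COUNT(TRIM(tract_cd)) AS blank_rows,
           COUNT(*)                         AS total_rows
    FROM   ucm_ext;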

Care needed

The current datafile will be broken into multiple datafiles for data processing
Load the datafiles into the tables one by one
Test and demonstrate the completeness of the new model
Craft a comparison to prove that the source and the new schema properly include all Census data (a reconciliation sketch follows this list)
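
One simple form that comparison could take, sketched with the same assumed names (ucm_ext as the source, maf_unit as the new schema): reconcile row counts, then check that no MAFID was dropped.

    -- Row-count reconciliation between source and new schema.
    SELECT (SELECT COUNT(*) FROM ucm_ext)  AS source_rows,
           (SELECT COUNT(*) FROM maf_unit) AS new_rows
    FROM   dual;

    -- Any MAFID present in the source but missing from the new schema?
    -- (TO_NUMBER aligns the source's character field with the numeric key.)
    SELECT TO_NUMBER(mafid) FROM ucm_ext
    MINUS
    SELECT mafid FROM maf_unit;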

Our Recommendation: ** Second Approach **

Why?
Data distribution will be uniform
Less unwanted data is moved into the separate tables
This will reduce the overhead that updates impose on queries
Existing queries can be used with little modification
Ongoing data maintenance will be more efficient in the RDBMS
Additional data like RPS can be easily uploaded using the same queries

Tasks

Design the database using a data modeling tool (Oracle Designer, Erwin, etc.)
Create test data from the original datafile
Load the test data into database tables
Create test scripts to check data consistency
Check indexes for the required queries (see the plan-check sketch below)
Test old data vs. new data
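
Whether an index covers a required query can be verified with EXPLAIN PLAN; the index, table, and predicate below are illustrative assumptions:

    -- Hypothetical supporting index for state-level lookups.
    CREATE INDEX maf_unit_state_ix ON maf_unit (state_cd);

    -- Explain a representative query and display the chosen plan.
    EXPLAIN PLAN FOR
      SELECT COUNT(*) FROM maf_unit WHERE state_cd = '06';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);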

Continued…

Break the data into small files
Load the full data into the tables
Unit test the data for consistency
Run queries on the database
If needed, fine-tune the database
Use the same approach for transactional data like RPS data

Next steps…

Continued collaboration with the Census team to improve domain understanding for new team members
Access to Oracle database tools on the team’s workstations
Access to an operational Oracle instance to begin development of the approach

Questions?