Manage Dimension Tables in InfoSphere Information Server DataStage



7/3/2016 Manage dimension tables in InfoSphere Information Server DataStage

http://www.ibm.com/developerworks/data/tutorials/dm0903datastageslowlychanging/ 1/17


Information Server DataStage® Version 8.0 introduced the Slowly Changing Dimension (SCD) stage. This tutorial provides step-by-step instructions on how to use the SCD stage for processing dimension table changes. It also shows you how to use the output of the stage to update an associated fact table. The tutorial includes a fully operational download.

Brian Caufield is a software architect in IBM Silicon Valley Lab. Brian has been working in the DataStage development organization for 10 years and was involved in the design of the Slowly Changing Dimension Stage.

12 March 2009

Before you start

The Slowly Changing Dimension stage was added in the 8.0 release of InfoSphere Information Server DataStage. It is designed specifically to support the types of activities required to populate and maintain records in star schema data models, specifically dimension table data. The Slowly Changing Dimension stage encapsulates all of the dimension maintenance logic: finding existing records, generating surrogate keys, checking for changes, and deciding what action to take when changes occur. In addition, you can associate dimension record surrogate key values with source records, which eliminates the need for additional lookups in later processing.

About this tutorial

This tutorial is designed to introduce you to using the Slowly Changing Dimension stage on the Information Server DataStage parallel canvas. The tutorial uses a simplified example scenario that focuses on Slowly Changing Dimension functionality. Actual business scenarios may require different approaches to the job design used in this tutorial's example. The volume of data processed in the tutorial is intentionally small to make it easier to understand the processing that is taking place.

The material in the SCD_Tutorial.zip file in the Download section is built to run on a Windows platform with a DB2 database. You can modify the material to run on a different platform or to use a different database.

Objectives

In this tutorial, you will learn how to design a job that uses the Slowly Changing Dimension stage to perform updating and loading of dimension and fact tables. After completion, you will be able to configure the SCD stage for history-tracking changes and in-place changes, and use the output of the stage to update an associated fact table.

Prerequisites

This tutorial is written for DataStage developers who are familiar with the DataStage Parallel Edition design canvas. You will also benefit if you already have knowledge of star schema design concepts (including fact and dimension tables), the use of surrogate keys, and the usual methodology for updating dimension tables.

System requirements

To create the job in this tutorial, you need an Information Server DataStage 8.x installation that is licensed to use the parallel engine. You also need a DataStage Designer client and access to a DataStage project where you can create, import, compile, and run DataStage jobs.

To use the sample scripts in the SCD_Tutorial.zip download, your Information Server must be installed on a Windows® OS with access to a DB2 database. However, you can also modify the scripts to work on other operating systems and with a different database.

Star schemas and Slowly Changing Dimensions

Star schemas are a method of data modeling in which the data being measured, called the facts, is stored in one table, called the fact table. Business objects are the entities that are involved in the events being measured. Business objects consist of identifying information and attributes that describe the object. These objects are stored in tables called dimension tables. The facts in the fact table are linked to the business objects in the associated dimension tables using foreign keys.

Figure 1. Example Star Schema

Because fact tables record the measurements generated from business events, they tend to grow rapidly. Dimension tables, on the other hand, tend to grow or change less frequently. In the example used in this tutorial, the fact table records information about sales transactions. Every transaction results in a new row in the fact table. The product dimension in the example only grows when a new product is introduced, or if information about an existing product is changed.

You typically handle changes to attribute information in one of two ways:

Overwrite: The existing row in the dimension table is updated to contain the new attribute values; the old values are no longer available. This is commonly referred to as a Type 1 change.

Tracking History: The existing row in the dimension table is modified to indicate that it is no longer current (that is, it has been expired), and a new row is inserted with the current attribute values. This is commonly referred to as a Type 2 change.
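The two kinds of change can be sketched in a few lines of Python. This is only an illustration on in-memory rows, not how the SCD stage is implemented; the function names, the row layout, and the sample values (such as the business key P-1) are hypothetical.

```python
def overwrite(rows, business_key, new_values):
    """Type 1: update the matching row in place; the old values are lost."""
    for row in rows:
        if row["business_key"] == business_key:
            row.update(new_values)


def track_history(rows, business_key, new_values, new_surrogate_key, change_date):
    """Type 2: expire the current row, then insert a new current row."""
    for row in rows:
        if row["business_key"] == business_key and row["current"]:
            row["current"] = False
            row["expiration_date"] = change_date
    rows.append({"surrogate_key": new_surrogate_key, "business_key": business_key,
                 "current": True, "effective_date": change_date,
                 "expiration_date": None, **new_values})


# Hypothetical dimension with one current row.
dimension = [{"surrogate_key": 1, "business_key": "P-1", "descr": "spoon",
              "current": True, "effective_date": "2004-01-01",
              "expiration_date": None}]
track_history(dimension, "P-1", {"descr": "fork"}, 2, "2009-03-12")
```

After the call, the dimension holds two rows for the same business key: the expired original and the new current row. That is why a history-tracking dimension needs a key other than the business key.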

Surrogate Keys

Surrogate Keys are values that are generated specifically for the purpose of uniquely identifying dimension table rows. The primary reasons you would use a surrogate key rather than the usual business key of the object in the dimension table are:

When tracking history in the dimension table, there will be multiple rows in the dimension table for the same business key. Therefore, it is not possible to use the business key as the primary key.

Typical fields that are used as business keys generally don't change, but situations can arise where they do change. For example, US citizens can be assigned a new social security number, or account numbers may be reassigned after a merger.

Surrogate keys provide a way for the dimension table to have a reliable, unique, and never-changing primary key.

Tutorial scenario

The scenario used for this tutorial has one fact table and two dimension tables that will be updated. The source file contains sales transaction records. The information in the source file is used to update the fact and dimension tables.

Figure 2. Scenario schemas

Source data

The source data file is named SaleDetail.dat and is contained in the SCD_Tutorial.zip download. It contains five records that, when processed, apply changes to the fact and dimension tables. Table 1 shows the contents of the file.

Table 1. Source data

StoreId  StoreName  StoreMgr    ProdSKU     ProdBrand  ProdDescr      SaleAmt   SaleUnits
A1111    Stuff      Washington  1111111111  Bob's      Red box        00436.14  13
A1112    MoreStuff  Adams       2222222222  Squeaky    Blue Chair     00456.56  14
A1113    Stuffy's   Jefferson   3333333333  Sunshine   Yellow Duckie  00203.38  7
A1114    McStuff    Madison     4444444444  AAAAA      fork           00308.87  2
A1115    Stuff Jr.  Monroe      5555555555  Best       lawn mower     00024.40  11

Product dimension

The product dimension is a table in the target database. Initially this table contains records for three products. When the source data is processed, the table is updated to contain new product records, and to track the history of changed product information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 2.

Table 2. Initial product dimension data

ProdSK  SKU         Brand     Descr          Curr  EffDate   ExpDate
1       3333333333  Sunshine  Yellow Duckie  Y     20040101  20991231
2       4444444444  AAAAA     spoon          Y     20040101  20991231
10      5555555555  AAAAA     grass cutter   Y     20040101  20991231

Store dimension

The store dimension is a table in the target database. Initially this table contains records for three stores. When the source data is processed, the table is updated to contain new store records, and to overwrite changed store information. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates and populates this table with the data shown in Table 3.

Table 3. Initial store dimension data

StoreSK  ID     Name       Mgr
1        A1113  Stuffy's   Jefferson
2        A1114  McStuff    Adams
5        A1115  Lil Stuff  Monroe

Fact table

The fact table is a table in the target database. Initially this table contains no records. When the source data is processed, the table is updated with the sales facts and references to the corresponding dimension records. The Setup.bat file in the SCD_Tutorial.zip download contains a script that creates the table as shown in Table 4.

Table 4. Initial Fact table data

ProdSK StoreSK SaleAmt SaleUnits

Setting up the tutorial

To set up the tutorial, save the SCD_Tutorial.zip file from the Download section to your local file system and follow these steps:

1. Check if you already have the following directory structure: C:\IBM\Demo\DataStage. If not, create it.
2. Extract the contents of SCD_Tutorial.zip into C:\IBM\Demo\DataStage. Be sure to select the option in your extraction program that indicates you want to use the folder or directory names when extracting. You should end up with the directory C:\IBM\Demo\DataStage\SCD, which contains several files and an empty subdirectory named SKG.
3. Run C:\IBM\Demo\DataStage\SCD\setup.bat.
4. In the DataStage Administrator client, set the environment variable APT_DB2INSTANCE_HOME to the location where the db2nodes.cfg file exists. Typically this is C:\IBM\SQLLIB\DB2. This configures the project to access DB2 as a source or target for the DB2 Enterprise Stage.
5. Using the DataStage Designer client, import C:\IBM\Demo\DataStage\SCD\SCD_Tutorial.dsx into your DataStage project.

Verify the state of the database

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory. This displays the contents of the product and store dimensions as well as the fact table. Review the output to verify that the tables have been initialized properly.

Resetting the tutorial

Once the tutorial has been run the first time, the contents of the database will have changed. Therefore, subsequent runs would see different behavior. If you want to reset the database tables back to their initial state, run the zReset executable shortcut in the C:\IBM\Demo\DataStage\SCD directory.

Initializing the surrogate keys

The tutorial uses surrogate key generators that use state files to record the key values that have been used. This ensures that unique values are always generated. Because the dimension tables are created with data in them, you need to make the surrogate key generators aware of what values have already been used.

Compile and run the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job to initialize the state files. The job reads the product dimension table and the store dimension table, then creates and updates the respective surrogate key generator state files.
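The idea of seeding a key generator with the values already present in a dimension table can be sketched as follows. This is a simplified stand-in, not the DataStage state file format: the JSON layout, the function names, and the rule of continuing after the highest used value are all assumptions made for illustration.

```python
import json
from pathlib import Path


def create_and_update_state(state_path, used_keys):
    """Create the state file if needed and record the highest key already used."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {"last_key": 0}
    state["last_key"] = max([state["last_key"], *used_keys])
    path.write_text(json.dumps(state))
    return state


def next_key(state_path):
    """Hand out the next unused key and persist it so it is never reused."""
    path = Path(state_path)
    state = json.loads(path.read_text())
    state["last_key"] += 1
    path.write_text(json.dumps(state))
    return state["last_key"]
```

Seeding such a file with the ProdSK values from Table 2 (1, 2, and 10) would make the first generated key 11 in this sketch, so new rows cannot collide with rows created by the setup script.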

Building the Slowly Changing Dimensions job

In this step you build a job that reads the SaleDetail.dat source file, updates the product and store dimensions, and inserts records into the fact table. For reference, a completed version of the job named Demo\DataStage\Slowly Changing Dimensions\SCD_All is included in the download.

Draw the job design as illustrated below in Figure 3.

Figure 3. Job design

The primary flow of records is from left to right in the job design. The source records are read from SaleDetail, passed to the first SCD stage to process the Product dimension, then passed to the next SCD stage to process the store dimension, and finally to the fact table. No records are added or removed on this flow of data. Every record read from the source is inserted into the fact table. As part of the processing in the SCD stages, the surrogate key values that are associated with the source records are obtained from the dimension table and added to the data being passed to the fact table.

Looking at the job design from top to bottom, the product and store dimension tables are reference sources to the SCD stages. These tables are used to initialize the lookup cache. Only records that are considered current are stored in the lookup cache. Any historical records in the dimension tables are automatically filtered out during initial processing. The SCD stage uses the data values from the primary input link to look up rows in the cache and check for changes. If any changes are required to the dimension table, they are written to the secondary output link of the SCD stage, which is called the dimension update link. Target database stages are connected to the dimension update link to apply the changes to the actual dimension table in the database.

Each record on the primary input link of the SCD stage will go out on the primary output link, and may produce zero, one, or two records on the dimension update link. The number of records produced depends on what, if any, action needs to be taken on the dimension table.

Zero records

Unchanged records require no action to the dimension table, so no records are written on the dimension update link.

One record

New records and overwriting updates (Type 1) require a one-row change to the dimension table. The change is either an insert or an update. One record is written on the dimension update link to reflect these types of changes.

Two records

Changed records that are tracking history (Type 2) require a two-row change to the dimension table. The existing record must be updated to reflect that it is no longer current, and a new record must be inserted for the new set of values. Two records are written to the dimension update link to reflect these changes.

Configuring the stages

Now that you have built the high-level job design, you are ready to perform the next set of steps, in which you:

Configure the individual stages to access the source data.

Process the dimension tables.

Update the fact table.

Configure the primary source stage

The source stage must be configured to read the SaleDetail.dat file. Complete the following steps to configure the SaleDetail sequential file stage:

1. On the Output|Properties tab, set the File property to C:\IBM\Demo\DataStage\SCD\SaleDetail.dat.
2. On the Output|Format tab, add the Record delimiter string property and set it to DOS Format.
3. On the Output|Format tab, remove the Final delimiter property.
4. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SaleDetail table definition onto the output link.

Figure 4. Source stage

The source stage should now be configured to read the SaleDetail.dat file. Use View Data to confirm that the data is being read from the file properly.

Configure the stages to process the Product dimension

Three stages are used to process the Product dimension. Reading the job design from top to bottom:

The first stage specifies how to read the data from the dimension table.

The SCD stage determines what changes need to be made to the dimension table and those changes are written to the dimension update link.

The dimension update link is connected to the dimension update target stage, which specifies how to update the actual database table with the data produced by the SCD stage.

Fast Path control

The SCD stage has two input links and two output links. This results in a high number of property link-tab combinations. Use the Fast Path control to move directly to the tabs that are required to configure the stage.

Configure the Product dimension source stage

Complete the following steps to configure the Product dimension DB2 Enterprise stage:

1. On the Output|Properties tab, set the Read method property to Table.
2. On the Output|Properties tab, set the Table property to SCD.ProdDim.
3. On the Output|Properties tab, set the Use Default Database and Use Default Server properties to False.
4. On the Output|Properties tab, set the Database property to SCDDemo.
5. On the Output|Properties tab, set the Server property to DB2.
6. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SCD.ProdDim table definition onto the output link.

Figure 5. Product dimension source

The stage should now be configured to read the SCD.ProdDim table. Use View Data to confirm that the data is being read from the database properly.

Configure the Product dimension SCD stage

The Fast Path control of the SCD stage editor lets you navigate directly to the tabs that require input in order to complete the stage configuration. The control is in the lower left corner of the editor. Use the arrow buttons to move forward or backward through the tabs.

Open the product dimension SCD stage editor and use the Fast Path control to set the properties as shown:

Fast Path page 1: Setting the output link

By default, the first output link connected to the stage is used as the primary output link. Look at the link name that is displayed in the Select output link property. Use the drop-down list to select the output link that is leading to the next SCD stage. This is the primary output of the stage. The other link automatically becomes the dimension update link.

Figure 6. Product dimension SCD stage, Fast Path page 1


Fast Path page 2: Define the lookup condition and purpose codes

The first task on this page is to define what the various columns of the dimension table are used for. This information is used in a number of ways in the SCD processing. The choices for purpose codes are:

Surrogate Key: This column is the primary key of the dimension table and is populated with a surrogate key value.

Business Key: This column is the identifier of the business objects that the dimension table is representing, but is not the primary key of the dimension table. This column is typically used as a lookup column and corresponds to a key or some other field of the source data that identifies the associated business object. The lookup is used to find the dimension table row that corresponds to a source data row.

Type 2: Check this column for a change in value. If the value has changed, perform a history-tracking change to the dimension table.

Type 1: Check this column for a change in value. If the value has changed, perform an overwriting change to the dimension table.

Current Indicator: This column is used as a flag to indicate whether it is the most current record for a particular business key.

Effective Date: This column is used to specify when a record first became the most current record, that is, when it became the active record.

Expiration Date: This column is used to specify the ending date of when a record was the active record. For currently active records, this value is typically a future date or NULL.

SK Chain: This column is used to store the surrogate key of the previous or next record in the history for a particular business key.

(blank): This column is not used for anything with respect to SCD processing. Data for this field is inserted into the table when a new row is inserted, but this column will not be checked for changes against the source data.

Set purpose codes for the columns as shown below in Figure 7. Because this dimension table is tracking history, it contains columns to track whether a row is current and the date range for when it was current.

Click on the ProdSKU source field and drag it to the SKU dimension column to create the lookup condition.

Figure 7. Product dimension SCD stage, Fast Path page 2

Although this tab looks similar to a mapping tab, it is actually defining the lookup keys from the source record to the dimension record. Any source column can be associated with any one dimension column. This creates an equality lookup condition between those columns. If more than one source column is associated with a dimension column, then those equality conditions are AND'ed together. In this manner, multi-column lookup keys can be used.
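As a rough illustration of how such a lookup behaves, the following Python sketch builds a cache of current dimension rows and looks up a source row using one or more equality conditions. The function names and the representation of the lookup condition as (source column, dimension column) pairs are assumptions made for this example; the column names and the Curr value of Y come from Tables 1 and 2.

```python
def build_lookup_cache(dimension_rows, key_pairs, current_column=None, current_value="Y"):
    """Index dimension rows by lookup key, keeping only rows flagged as current."""
    cache = {}
    for row in dimension_rows:
        if current_column is not None and row[current_column] != current_value:
            continue  # historical rows are filtered out of the cache
        cache[tuple(row[dim_col] for _, dim_col in key_pairs)] = row
    return cache


def lookup(cache, source_row, key_pairs):
    """Every (source column, dimension column) pair must match: the conditions are ANDed."""
    return cache.get(tuple(source_row[src_col] for src_col, _ in key_pairs))


# Lookup condition from this page: source ProdSKU equals dimension SKU.
KEY_PAIRS = [("ProdSKU", "SKU")]
prod_dim = [
    {"ProdSK": 1, "SKU": "3333333333", "Curr": "Y"},
    {"ProdSK": 2, "SKU": "4444444444", "Curr": "Y"},
    {"ProdSK": 10, "SKU": "5555555555", "Curr": "Y"},
]
cache = build_lookup_cache(prod_dim, KEY_PAIRS, current_column="Curr")
match = lookup(cache, {"ProdSKU": "4444444444"}, KEY_PAIRS)
```

A multi-column key is expressed by adding more pairs to the list; the tuple key then requires all of them to be equal.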

Fast Path page 3: Configuring the surrogate key generator

Surrogate key generation capabilities are integrated into the SCD stage. This tab specifies how surrogate keys are generated for this stage. Surrogate key generation can use DataStage's file-based surrogate generation, or use DB2 or Oracle database sequence object based generation. This tutorial uses the file-based method.

Set the Source name property to C:\IBM\Demo\DataStage\SCD\SKG\ProdDim as shown in Figure 8. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job. Leave the defaults for the other properties unchanged.

Figure 8. Product dimension SCD stage, Fast Path page 3


Fast Path page 4: Defining the slowly changing dimension behavior and derivations

The DimUpdate tab is used to define several critical elements of SCD processing. The Derivation column is used to specify how to map elements of a source row to elements of the dimension table. The Expire column is used to specify what values need to change if an existing record needs to be expired. Expire expressions are only enabled when there are Type 2 columns specified, and are only available for Current Indicator and Expiration Date columns.

If no matching record is found when the lookup is performed, the derivation expressions are applied and a record is written on the dimension update link to indicate a new record needs to be added to the dimension table. If a matching record is found, the derivation expressions are applied to the source columns, and then the results are compared to the corresponding columns of the dimension table. Columns specified as Type 2 are compared first. If there is a change, two records are written on the dimension update link. The first record is an update record, to expire the matched row. The Expire expressions are used to calculate the values for the update row. The second record is a new record that contains all of the new values for all columns. If no Type 2 columns have changed, the Type 1 columns are compared. If there are any changes, one record is written on the dimension update link that indicates an update to the dimension table. The derivation expressions are used to calculate the values for the update record.
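The decision logic described above can be sketched in Python as follows. This is an illustration of the described behavior, not the stage's implementation. The function names, the "action" field, and the choice to put only the Type 1 columns in an overwrite record are assumptions, and the purpose codes used in the example (Brand as Type 2, Descr as Type 1) are hypothetical because the actual settings appear only in the figure.

```python
from itertools import count


def columns_changed(derived, match, purpose, code):
    """True if any column with the given purpose code differs from the dimension row."""
    return any(derived[col] != match[col]
               for col, col_code in purpose.items() if col_code == code)


def dimension_updates(source, match, derivations, expire, purpose, sk_column, next_sk):
    """Return the records to write on the dimension update link for one source row."""
    derived = {col: fn(source) for col, fn in derivations.items()}
    if match is None:
        # No matching row was found by the lookup: add a new dimension row.
        return [{"action": "insert", sk_column: next_sk(), **derived}]
    if columns_changed(derived, match, purpose, "Type 2"):
        # History-tracking change: expire the matched row, then insert the new values.
        expired = {col: fn(source) for col, fn in expire.items()}
        return [{"action": "update", sk_column: match[sk_column], **expired},
                {"action": "insert", sk_column: next_sk(), **derived}]
    if columns_changed(derived, match, purpose, "Type 1"):
        # Overwriting change: one update record (Type 1 columns only, by assumption).
        changed = {col: derived[col] for col, code in purpose.items() if code == "Type 1"}
        return [{"action": "update", sk_column: match[sk_column], **changed}]
    return []  # unchanged: nothing is written


# Hypothetical configuration for illustration only.
purpose = {"Brand": "Type 2", "Descr": "Type 1"}
derivations = {"SKU": lambda s: s["ProdSKU"], "Brand": lambda s: s["ProdBrand"],
               "Descr": lambda s: s["ProdDescr"], "Curr": lambda s: "Y"}
expire = {"Curr": lambda s: "N"}
next_sk = count(11).__next__
current_row = {"ProdSK": 2, "SKU": "4444444444", "Brand": "AAAAA",
               "Descr": "spoon", "Curr": "Y"}
source_row = {"ProdSKU": "4444444444", "ProdBrand": "AAAAA", "ProdDescr": "fork"}
records = dimension_updates(source_row, current_row, derivations, expire,
                            purpose, "ProdSK", next_sk)
```

With these hypothetical purpose codes, the changed description yields a single update record; a changed brand would instead yield an expire record followed by an insert with a new surrogate key.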

Set the Derivation expressions and the Expire expressions as shown below in Figure 9.

Figure 9. Product dimension SCD stage, Fast Path page 4

Note that you are specifying these properties on the dimension update link. The output columns for this link were automatically propagated with their purpose codes from the dimension input link. The SCD stage only does this when the set of columns on the dimension update link is empty. It is possible to load a set of columns directly on the dimension update link, but they must exactly match those specified on the dimension input link.

Fast Path page 5: Selecting the columns for Output Link

The Output Map tab is used to define what columns will leave this stage on the primary output link. This tab operates much like the Mapping tab of other stages. The only difference is that you can select columns from the primary input link and columns from the reference link to output. The columns coming from the primary source have the same values they entered the stage with. The columns coming from the reference link represent the values from the dimension table that correspond to the source row. Note that because the SCD processing has been done by the stage, every record from the primary source data will have a corresponding record in the dimension.

Select the columns for output as shown below in Figure 10. The output link is initially empty. Create and map the output columns by dragging and dropping from the source to the target. Because the product dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the primary key associated with the source row is appended because that is the value that is required to be inserted into the fact table.

Figure 10. Product dimension SCD stage, Fast Path page 5

The stage is now configured to perform the dimension maintenance on the Product dimension table.
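The effect of the output mapping can be pictured as a simple projection. The sketch below is illustrative only: the function name is hypothetical, and the exact set of columns kept is an assumption because the mapping itself appears only in the figure.

```python
def map_output(source_row, dimension_row, keep_source, keep_dimension):
    """Build the primary output record from selected source and dimension columns."""
    out = {col: source_row[col] for col in keep_source}
    out.update({col: dimension_row[col] for col in keep_dimension})
    return out


# Third row of Table 1 and its matching row from Table 2.
source_row = {"StoreId": "A1113", "StoreName": "Stuffy's", "StoreMgr": "Jefferson",
              "ProdSKU": "3333333333", "ProdBrand": "Sunshine",
              "ProdDescr": "Yellow Duckie", "SaleAmt": "00203.38", "SaleUnits": 7}
dimension_row = {"ProdSK": 1, "SKU": "3333333333", "Brand": "Sunshine",
                 "Descr": "Yellow Duckie"}

# Assumed selection: drop the product attributes, append the product surrogate key.
output_row = map_output(
    source_row, dimension_row,
    keep_source=["StoreId", "StoreName", "StoreMgr", "SaleAmt", "SaleUnits"],
    keep_dimension=["ProdSK"],
)
```

The store columns stay on the link because the next SCD stage still needs them to process the store dimension.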

Configure the Product dimension target stage

This stage processes the dimension update link records produced by the product dimension SCD stage to update the actual dimension table in the database. Because incoming records represent both inserts and updates to the table, an Upsert write method must be used. Auto-generated update and insert statements take the purpose codes specified in the SCD stage into account to generate the correct update statement for this usage.

Complete the following steps to configure the Product dimension update DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Upsert.
2. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.
3. On the Input|Properties tab, set the Table property to SCD.ProdDim.
4. On the Input|Properties tab, set the Use Default Database and Use Default Server to False.
5. On the Input|Properties tab, set the Database property to SCDDemo.
6. On the Input|Properties tab, set the Server property to DB2.

Figure 11. Product dimension target

The stage is now configured to write to the SCD.ProdDim dimension table.
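To make the idea of an upsert concrete, here is a minimal Python sketch that uses the standard sqlite3 module in place of DB2. It is not the SQL that the DB2 Enterprise stage generates: the function name, the update-first ordering, and the trimmed table definition are assumptions for illustration, and the table and column names are interpolated directly, so they must come from trusted input.

```python
import sqlite3


def upsert(conn, table, key_column, record):
    """Try an UPDATE keyed on key_column; if no row matched, INSERT the record."""
    cols = [c for c in record if c != key_column]
    set_clause = ", ".join(f"{c} = ?" for c in cols)
    cur = conn.execute(
        f"UPDATE {table} SET {set_clause} WHERE {key_column} = ?",
        [record[c] for c in cols] + [record[key_column]],
    )
    if cur.rowcount == 0:
        names = ", ".join(record)
        marks = ", ".join("?" for _ in record)
        conn.execute(f"INSERT INTO {table} ({names}) VALUES ({marks})",
                     list(record.values()))


conn = sqlite3.connect(":memory:")
# Trimmed stand-in for the product dimension; column types are assumed.
conn.execute("CREATE TABLE ProdDim (ProdSK INTEGER PRIMARY KEY, SKU TEXT, Curr TEXT)")
upsert(conn, "ProdDim", "ProdSK", {"ProdSK": 1, "SKU": "3333333333", "Curr": "Y"})  # inserts
upsert(conn, "ProdDim", "ProdSK", {"ProdSK": 1, "Curr": "N"})  # updates (expires) the row
```

Keying the statement on the surrogate key is what lets an expire record touch exactly one historical row, while a record carrying a new surrogate key falls through to the insert.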

Configure the stages to process the Store dimension


Configure the Store dimension source stage

Complete the following steps to configure the Store dimension DB2 Enterprise stage:

1. On the Output|Properties tab, set the Read Method property to Table.
2. On the Output|Properties tab, set the Table property to SCD.StoreDim.
3. On the Output|Properties tab, set the Use Default Database and Use Default Server to False.
4. On the Output|Properties tab, set the Database property to SCDDemo.
5. On the Output|Properties tab, set the Server property to DB2.
6. Load the Demo\DataStage\Slowly Changing Dimensions\TableDefs\SCD.StoreDim table definition onto the output link.

Figure 12. Store dimension source stage

The stage should now be configured to read the SCD.StoreDim table. Use View Data to confirm that the data is being read from the database properly.

Configure the Store dimension SCD stage

Open the store dimension SCD stage editor and use the Fast Path control to set the properties as shown:

Fast Path page 1: Setting the Output Link

Use the Select output link drop-down list to select the link leading to the fact table. This is the primary output of the stage. The other link automatically becomes the dimension update link.

Figure 13. Store dimension SCD stage, Fast Path page 1

Fast Path page 2: Define the lookup condition and purpose codes

Set purpose codes for the columns as shown below in Figure 14. Because this dimension table is not tracking history, it does not contain columns to track whether a row is current or not. The Name column has a blank purpose code, which indicates that this column will not be checked for changes.

Click on the StoreId source field and drag it to the dimension column Id to create the lookup condition.

Figure 14. Store dimension SCD stage, Fast Path page 2


Fast Path page 3: Configuring the surrogate key generator

Set the file path property to C:\IBM\Demo\DataStage\SCD\SKG\StoreDim as shown in Figure 15. This is the surrogate key state file you created by running the Demo\DataStage\Slowly Changing Dimensions\Surrogate Key Generation\CreateAndUpdate_File job. Leave the defaults for the other properties.

Figure 15. Store dimension SCD stage, Fast Path page 3
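The idea behind a surrogate key state file can be sketched in a few lines of Python: the last key handed out is persisted, so a later run continues the sequence instead of starting over. The class name, the plain-text file layout, and the temporary path below are hypothetical; the state file that DataStage maintains has its own internal format.

```python
import os
import tempfile

class FileBackedKeyGenerator:
    """Hands out increasing integer keys and remembers the last one in a file."""

    def __init__(self, path):
        self.path = path

    def next_key(self):
        last = 0
        if os.path.exists(self.path):
            with open(self.path) as f:
                last = int(f.read().strip() or 0)
        new_key = last + 1
        # Persist the new high-water mark before returning it.
        with open(self.path, "w") as f:
            f.write(str(new_key))
        return new_key

# Hypothetical state file location used only for this example.
state_path = os.path.join(tempfile.mkdtemp(), "StoreDim.state")

generator = FileBackedKeyGenerator(state_path)
first = generator.next_key()
second = generator.next_key()

# A fresh generator over the same file picks up where the last one stopped.
third = FileBackedKeyGenerator(state_path).next_key()
```

Because the state lives outside the job, keys stay unique across runs as long as every run points at the same file.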

Fast Path page 4: Defining the slowly changing dimension behavior and derivations

Set the Derivation expressions as shown below in Figure 16. Because the Name column has no purpose code, the SCD stage does not check this column for changes when a matching dimension record is found on the lookup. Because there are no Type2 columns in this dimension table, the Expire expression is not enabled for any column.

Figure 16. Store dimension SCD stage, Fast Path page 4
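The behavior configured on these pages can be approximated in plain Python: look up the dimension record by business key, overwrite a changed Type1 column in place, ignore columns that have no purpose code, and assign a new surrogate key when no match is found. This is only a sketch. It assumes that Mgr is the Type1 column and that the source fields are named Name and Mgr (only StoreId is named in the tutorial); the function name and the earlier manager value are hypothetical.

```python
import itertools

def process_store_row(source_row, dimension, next_key):
    """Return (surrogate_key, action) for one source row.

    dimension maps the business key (Id) to the dimension record.
    """
    existing = dimension.get(source_row["StoreId"])  # lookup: StoreId = Id
    if existing is None:
        new_rec = {
            "StoreSK": next_key(),
            "Id": source_row["StoreId"],
            "Name": source_row["Name"],
            "Mgr": source_row["Mgr"],
        }
        dimension[new_rec["Id"]] = new_rec
        return new_rec["StoreSK"], "insert"
    # Only the assumed Type1 column is compared; Name has no purpose code.
    if existing["Mgr"] != source_row["Mgr"]:
        existing["Mgr"] = source_row["Mgr"]  # overwrite in place, same key
        return existing["StoreSK"], "update"
    return existing["StoreSK"], "none"

dimension = {
    "A1113": {"StoreSK": 1, "Id": "A1113", "Name": "Stuffy's", "Mgr": "Jefferson"},
    # "Smith" is a made-up earlier manager, used to trigger a Type1 change.
    "A1114": {"StoreSK": 2, "Id": "A1114", "Name": "McStuff", "Mgr": "Smith"},
}
key_sequence = itertools.count(3)

source_rows = [
    {"StoreId": "A1114", "Name": "McStuff", "Mgr": "Madison"},
    {"StoreId": "A1111", "Name": "Stuff", "Mgr": "Washington"},
    {"StoreId": "A1113", "Name": "Stuffys", "Mgr": "Jefferson"},  # Name differs
]
results = [
    (row["StoreId"],) + process_store_row(row, dimension, lambda: next(key_sequence))
    for row in source_rows
]
```

In this run, A1114 is updated under its existing key, A1111 is inserted with a new key, and the differing Name for A1113 is ignored because that column is not checked.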

Fast Path page 5: Selecting the columns for Output Link

Select the columns for output as shown below in Figure 17. Because the store dimension has now been processed, the source columns that contain those attributes are no longer needed. Instead, the surrogate key associated with the source row is appended because that is the value that is required to be inserted into the fact table.

Figure 17. Store dimension SCD stage, Fast Path page 5


The stage is now configured to perform the dimension maintenance on the store dimension table.

Configure the Store dimension target stage

This stage processes the dimension update records produced by the Store dimension SCD stage to update the actual dimension table in the database.

Complete the following steps to configure the Store dimension target DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Upsert.
2. On the Input|Properties tab, set the Upsert Mode property to Auto-generated Update and Insert.
3. On the Input|Properties tab, set the Table property to SCD.StoreDim.
4. On the Input|Properties tab, set the Use Default Database and Use Default Server properties to False.
5. On the Input|Properties tab, set the Database property to SCDDemo.
6. On the Input|Properties tab, set the Server property to DB2.

Figure 18. Store dimension target stage

The stage is now configured to write to the SCD.StoreDim dimension table.

Configure the Fact table target stage

This stage processes the source records that have been passed through the primary output links to update the actual fact table in the database. At this point, the original input source records have been processed so that the only columns on this link are the measurements (SaleAmt and SaleUnits) and the surrogate key values for the associated Product and Store.

Complete the following steps to configure the Fact table target DB2 Enterprise stage:

1. On the Input|Properties tab, set the Write Method property to Write.
2. On the Input|Properties tab, set the Write Mode property to Append.
3. On the Input|Properties tab, set the Table property to SCD.Facttbl.
4. On the Input|Properties tab, set the Use Default Database and Use Default Server properties to False.
5. On the Input|Properties tab, set the Database property to SCDDemo.
6. On the Input|Properties tab, set the Server property to DB2.

Figure 19. Fact table target stage

The stage is now configured to write to the SCD.Facttbl fact table.

Final steps

You have now completed the job design and are ready to compile. Click the Compile button to start the compile.

Note that SCD stage processing makes use of the transform operator, so for the job to compile successfully, the C++ compiler settings for the project must be correct. The Resources section contains a link to an article in the information center for IBM Information Server with details on configuring your environment correctly for your C++ compiler; the Information Server Configuration Guide covers the same topic. If any compile errors occur, check your job and stages against the settings specified in the tutorial and make any necessary changes.

Running the tutorial

With the job compiled, you are now ready to run it.

Run the Results executable shortcut in the C:\IBM\Demo\DataStage\SCD directory to see the initial contents of the database tables. The Results shortcut displays the contents of the product dimension, the store dimension, and the fact table.

Run the job by clicking the Run button in the DataStage Designer.

After the job finishes successfully, run the Results shortcut again to see the changes that were made to the database tables.

Summary of changes to database tables

The contents of the database tables should now appear as follows:

The product dimension has two updated records and four new records. Two of the new records are new objects to the dimension table, and two existing records had Type2 changes, resulting in the two updates and the other two new records.

| Change | ProdSK | SKU | Brand | Descr | Curr | EffDate | ExpDate |
|---|---|---|---|---|---|---|---|
| No Change | 1 | 3333333333 | Sunshine | Yellow Duckie | Y | 20040101 | 20991231 |
| Expired (Type2) | 2 | 4444444444 | AAAAA | spoon | N | 20040101 | Today's date |
| Expired (Type2) | 10 | 5555555555 | AAAAA | grass cutter | N | 20040101 | Today's date |
| New Record | 3 | 1111111111 | Bob's | Red Box | Y | Today's date | 20991231 |
| New Record | 4 | 2222222222 | Squeaky | Blue Chair | Y | Today's date | 20991231 |
| New Record (Type2) | 5 | 4444444444 | AAAAA | fork | Y | Today's date | 20991231 |
| New Record (Type2) | 6 | 5555555555 | Best | lawn mower | Y | Today's date | 20991231 |
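The Type2 rows in the product dimension follow a fixed pattern: the current record is expired as of today, and a new current record is added under a new surrogate key. The Python sketch below reproduces that pattern for SKU 4444444444 using the values from the product dimension results. The function name and the list-of-dictionaries representation are hypothetical, and the logic is a simplified stand-in for what the SCD stage does internally.

```python
from datetime import date

FAR_FUTURE = "20991231"  # expiration date used for current records

def apply_type2_change(dimension, sku, changed_attrs, new_sk, today):
    """Expire the current record for sku and append a new current record."""
    # Raises StopIteration if the SKU has no current record (fine for a sketch).
    current = next(r for r in dimension if r["SKU"] == sku and r["Curr"] == "Y")
    current["Curr"] = "N"
    current["ExpDate"] = today

    # The new record copies the old one, then applies the changed attributes.
    new_rec = dict(current, **changed_attrs)
    new_rec.update(ProdSK=new_sk, Curr="Y", EffDate=today, ExpDate=FAR_FUTURE)
    dimension.append(new_rec)
    return new_rec

today = date.today().strftime("%Y%m%d")
dimension = [
    {"ProdSK": 2, "SKU": "4444444444", "Brand": "AAAAA", "Descr": "spoon",
     "Curr": "Y", "EffDate": "20040101", "ExpDate": FAR_FUTURE},
]

new_rec = apply_type2_change(dimension, "4444444444", {"Descr": "fork"}, 5, today)
```

After the call, the record with ProdSK 2 is marked N with today's date as its expiration date, and the record with ProdSK 5 is the current version of the same SKU.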

The store dimension has one updated record and two new records. The updated record had a Type1 change, and the two new records are new objects to the dimension table.

| Change | StoreSK | ID | Name | Mgr |
|---|---|---|---|---|
| No Change | 1 | A1113 | Stuffy's | Jefferson |
| Update | 2 | A1114 | McStuff | Madison |
| No Change | 5 | A1115 | Lil Stuff | Monroe |
| New Record | 3 | A1111 | Stuff | Washington |
| New Record | 4 | A1112 | MoreStuff | Adams |

The fact table has five new records, one for each source record processed. The surrogate key values in this table correspond to the current records in the dimension tables.

| ProdSK | StoreSK | SaleAmt | SaleUnits |
|---|---|---|---|
| 3 | 3 | 436.14 | 13 |
| 4 | 4 | 456.56 | 14 |
| 1 | 1 | 203.38 | 7 |
| 5 | 2 | 308.87 | 2 |
| 6 | 5 | 24.40 | 11 |
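One way to confirm that the fact rows point at current dimension records is a small consistency check. The Python sketch below hard-codes the key values from the three result listings above; it is an illustrative check only, not part of the tutorial download.

```python
# (ProdSK, Curr) pairs from the product dimension results.
product_dim = [(1, "Y"), (2, "N"), (10, "N"), (3, "Y"), (4, "Y"), (5, "Y"), (6, "Y")]

# StoreSK values from the store dimension results (no history is tracked there).
store_sks = {1, 2, 5, 3, 4}

# (ProdSK, StoreSK, SaleAmt, SaleUnits) rows from the fact table results.
facts = [
    (3, 3, 436.14, 13),
    (4, 4, 456.56, 14),
    (1, 1, 203.38, 7),
    (5, 2, 308.87, 2),
    (6, 5, 24.40, 11),
]

current_product_sks = {sk for sk, curr in product_dim if curr == "Y"}

# Collect any fact row that references an expired or unknown dimension key.
orphans = [
    row for row in facts
    if row[0] not in current_product_sks or row[1] not in store_sks
]
```

For the data shown, orphans is empty: none of the fact rows reference the expired product keys 2 and 10.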

The contents of the dimension tables have now changed. If you were to run the job again, what results would you expect to see? Hint: The dimension tables and the source file are now in sync.

This completes the Slowly Changing Dimensions tutorial. To reset the database tables to their original state, run the zReset executable shortcut.

Conclusion

You can use the Slowly Changing Dimension stage to greatly reduce the time you spend creating jobs for processing star schemas. In this tutorial you have learned how to configure the Slowly Changing Dimension stage to process history-tracking changes and in-place changes to dimension tables. You have also seen how you can reduce fact table processing by augmenting the source data with associated dimension table surrogate keys that eliminate the need for an additional lookup.

Download

| Description | Name | Size |
|---|---|---|
| Supporting scripts and DS jobs for this tutorial | SCD_Tutorial.zip | 16KB |


Resources

Learn

In the InfoSphere area on developerWorks, get the resources you need to advance your InfoSphere product skills.

C++ compiler for job development topic in the information center for IBM Information Server.

Browse the technology bookstore for books on these and other technical topics.

Get products and technologies

Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Participate in the discussion forum.

Check out developerWorks blogs and get involved in the developerWorks community.
