IBM InfoSphere DataStage and DB2 PureXML,Part 1-Integrate XML Operational Data Into a Data Warehouse

© Copyright IBM Corporation 2009 TrademarksIBM InfoSphere DataStage and DB2 pureXML, Part 1: IntegrateXML operational data into a data warehouse

of 28

IBM InfoSphere DataStage and DB2 pureXML, Part 1:Integrate XML operational data into a data warehouseBranislav Barnak ([email protected])Software ArchitectIBM

Amir Bar-or ([email protected])Senior Architect and ManagerIBM

Cynthia M. SaraccoSenior Solutions ArchitectIBM

Paul Stanley ([email protected])Senior ArchitectIBM

20 August 2009

Learn how to integrate business-critical XML data into your data warehouse using IBM®InfoSphere™ DataStage 8.1 and DB2® 9.5 pureXML®. This two-part article series providesstep-by-step instructions for using pureXML as both a source and target data source forInfoSphere DataStage jobs.

View more content in this series

It's no secret: XML is everywhere. Web-based applications, service-oriented architectures(SOA), and industry-specific data exchange formats all rely heavily on XML to represent criticalbusiness data. This growing use of XML is prompting many firms to explore how to integrate XMLoperational data into their data warehouses and business intelligence environments.

In this article series, learn how IBM DB2 pureXML and InfoSphere DataStage can be usedtogether to manage operational XML data, transform XML data as needed, and populate arelational-only or hybrid relational/XML data warehouse. (To keep things simple, subsequentsections of this article series will refer to "pureXML" and "DataStage" instead of the full productnames.) Step-by-step instructions guide you through two common integration scenarios.

In the first part of this article series, explore how to build a data warehouse that contains bothrelational and XML data using pureXML data as input. Additionally, learn important tips about

http://www.ibm.com/legal/copytrade.shtml

http://www.ibm.com/developerworks/ibm/trademarks/

mailto:[email protected]



http://www.ibm.com/developerworks/views/data/libraryview.jsp?search_by=purexml+infosphere+series

developerWorks® ibm.com/developerWorks/

IBM InfoSphere DataStage and DB2 pureXML, Part 1: IntegrateXML operational data into a data warehouse

of 28

using the DataStage DB2 Connector. In the second part of this article series, you'll learn how totransform flat file data into XML and store the data in a data warehouse with pureXML.

Introducing key DataStage and pureXML technologiesIf you're not familiar with DataStage or pureXML, you'll find brief introductions to each in thefollowing sections. However, prior knowledge of data warehousing, ETL (extract-transform-load)technology, and XML will help when reading this article series.

DataStage overviewIBM InfoSphere DataStage enables firms to extract, transform, and load (ETL) data from avariety of sources into a data warehouse. Built-in support for multi-processor hardware enablesDataStage to provide high levels of scalability and work with large volumes of data efficiently.Various "connectors" support a wide range of source and target data formats, including popularIBM and OEM database management systems, ODBC data sources, third-party applications, real-time messages generated by messaging queuing software and Web services, and popular fileformats.

DataStage provides these capabilities—and others—through a number of software components.The scenarios in this article series use a subset of these. Specifically, the article scenarios useDataStage Designer to construct ETL jobs. Each job consists of multiple "stages," and each stageperforms a given task. Examples of such tasks include reading information from a data source,transforming input data using built-in functions, converting data from one type to another, andso on. The article examples define stages that involve the DB2 Connector, XML operations, andvarious processing operations. Two DataStage technologies are critical for the scenarios: the DB2Connector and the XML Pack 2.0. This article describes these shortly.

After using DataStage Designer to create and compile your job, you'll use DataStage Director toexecute your work.

DataStage DB2 ConnectorThis article relies on the DB2 Connector for all DB2 database access. This is the main DB2connectivity option in InfoSphere Information Server and is based on the Common ConnectorFramework. The connector contains a superset of features found in the legacy DataStageconnectivity components and offers improved function and performance.

The DB2 Connector is available for Information Server 8.1 as a separate connector patchinstallation. It supports local and remote connections to DB2 9.1 and 9.5 servers. (DB2 9.7 supportis planned for a future Information Server release.) The connector requires DB2 Client 9.1 FixPack 6 (or later) or DB2 9.5 Fix Pack 3 (or later).

Key features of the DB2 Connector include:

• Support for XML and LOB (large object) data types of any size. Two transfer mechanismsare available: passing by value (inline) and by reference. When passing LOB or XML datainline, the maximum size is determined by the DataStage engine. When data is passed by


ibm.com/developerWorks/ developerWorks®


of 28

reference, the downstream (or receiving) stage uses the reference to get the data by invokingthe source connector directly. The downside of this method is that the LOB data can onlybe moved from one connector to another and cannot be transformed in the job (as only thereference is passed through the links). The connector requires that the ArraySize property tobe set to 1 when reading or writing XML columns.

• Ability to execute standard SQL statements such as SELECT, INSERT, UPDATE, and DELETE,combinations of these statements, user-defined SQL, as well as bulk load.

• Support for parallel (multi-process) execution.• Support for DB2 DPF (Database Partitioning Feature). The connector can work with

partitioned databases in parallel or sequential mode. When running in parallel, a separateDataStage process is allocated for each DB2 partition. The connector can read and write inparallel and can load partitioned databases in parallel.

• Support for guaranteed data delivery through XA (two-phase commit) transactions whenused together with the Distributed Transaction Stage (DTS). DTS provides a mechanismfor multiple DB2 databases to be updated as part of one XA transaction. Note that DTS isavailable with a separate patch for DataStage 8.1.

• Support for metadata import through the Common Connector Import Wizard available in theDataStage Designer.

DB2 Connector usage contextsThe connector can be used in a source, target, or lookup context. In this article series, you use theDB2 Connector in source and target contexts.

The source context refers to the case when there is an output link coming out of the connectorstage. In this context, the connector is used to retrieve data.

The target context refers to the case when there is an input link going into the connector stage,which means that data is being passed to the connector. In this context, the connector is usedto execute INSERT, UPDATE, DELETE, UPSERT (combination of INSERT/UPDATE/DELETE), user-definedSQL, or bulk load statements.

The lookup context refers to the case when the connector is used in conjunction with the lookupstage. In this context, the data is both passed to and retrieved from the connector. In such ascenario, the connector executes a SELECT statement with a WHERE clause.

DB2 Connector XML supportThe scenarios in this article series focus on the DB2 Connector's support for XML data. As such,it's worthwhile to learn more about XML support in the DB2 Connector.

DataStage representation of XML columnsDataStage 8.1 uses NVarChar or LongNVarChar data types to represent XML data. Duringmetadata import, the user can select which type to use for XML columns. The difference is thatXML columns represented as NVarChar columns are used to transfer data inline and are subjectto the DataStage engine's size requirements, while the long data type is used to treat the XMLcolumns as large objects (LOBs) and allows for them to be passed by reference.



of 28

The connector supports:

• XQueries• XQueries with embedded SQL• SQL/XML queries• Update and delete operations with XML columns

Source context: Reading XML columnsThe connector can read a whole XML document or a fragment. The fragment can be a sequenceor an atomic value. When an XQuery is used, the connector does not support passing the XMLresult by reference due to difficulties in constructing a LOB locater.

The following examples (Listings 1 - 6) contain several common SQL queries and XQueries thatthe connector supports. These queries are based on samples included in popular DB2 pureXMLintroductory articles. (See the Resources section for details.) Note that DB2 Connector support forSQL, SQL/XML, and XQuery isn't limited to these examples.

Listing 1. Using simple SQL to retrieve full XML documentsselect xml_col from table

If the XML column was represented with the LongVarChar data type in the DataStage job, theconnector would give the user an option to select whether this column will be passed to the nextstage as a reference or inline.

Listing 2. Using XQuery to retrieve full XML documentsxquery db2-fn:xmlcolumn('TABLE.COL_NAME')

Note: The result for Listing 2 cannot be passed by reference.

Listing 3. Using XQuery to extract XML document fragmentsxquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client[Address/zip="10011"or Address/city="San Jose"]/Address


Listing 4. Using an XQuery FLWOR expression to extract XML documentfragmentsxqueryfor $y in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Clientwhere $y/Address/zip="10011" or $y/Address/city="San Jose"return $y/email[1]/text()


Listing 5. Using XQuery with embedded SQL to extract full XML documentsxquery db2-fn:sqlquery('select xml_col from table')



of 28


Listing 6. Using a SQL/XML query to extract XML document fragmentsselect xmlquery('$c/Client/Address'passing contactinfo as "c")from clientswhere status = 'Gold'and xmlexists('$c/Client/email' passing contactinfo as "c")

In Listing 6, the connector can either pass the result inline or by reference.

Prefix for expression columnsWhen the connector executes a query, it tries to match the column names the query returns withthe column names specified on the output link of the connector stage. However, XQuery columnnames are represented with a number. The connector prefixes these with a string that can beconfigured by setting the connector source context property Prefix for Expression Columns. Thedefault prefix used is EXPR. The XQuery column names then become EXPR1, EXPR2, and so on. Thisis important to note as the connector would look for those columns in the list of link columns.

Target context: Inserting/updating/deleting XML columnsAs mentioned earlier, the DB2 Connector can act as a target for incoming data and can write thisdata to the appropriate DB2 database. The following examples (Listings 7 - 11) illustrate how theDB2 Connector can insert, update, or delete information in XML columns.

Some of these examples incorporate a reserved DataStage keyword—ORCHESTRATE—that is usedto pass parameters to the statement. The column name specified after the ORCHESTRATE keywordneeds to match a column in the result set associated with the appropriate DataStage link.

Listing 7. Inserting an XML columninsert into table values(ORCHESTRATE.col1)

The column name col1 indicates the column as specified on the input link. This column can beeither NVarChar or LongNVarChar and can be passed inline or by reference.

Listing 8. Updating full XML document using a simple SQL update statementupdate clients setcontactinfo=( xmlparse(document ‘<email>[email protected] </email> ' ) )where id = 3227

Listing 9. Updating part of an XML document value using a parameterupdate xmlcustomerset info = xmlquery('copy $new := $INFOmodify do replace value of $new/customerinfo/phone with $zreturn $new' passing cast(ORCHESTRATE.col2 as varchar(15)) as "z")where cid = ORCHESTRATE.col1

Listing 9 uses two parameters. The value of col2 replaces the existing /customerinfo/phone data inthe XML document; the value of col1 restricts the rows affected by the update.



of 28

Listing 10. Deleting an XML record based on internal XML data filterdelete from clientswhere xmlexists ('$c/Client/Address[zip="95116"]'passing clients.contactinfo as "c")

Listing 11. Deleting an XML record using a parameterdelete from clientswhere xmlexists ('$c/Client/Address[zip=$z]'passing clients.contactinfo as "c", cast(ORCHESTRATE.col2 as varchar(15)) as z)

Listing 11 uses the value of col2 as a parameter to restrict DELETE operations based on the value ofan XML element (the zip or zip code element of a client's address).

DataStage XML PackThis article series relies on the DataStage XML Pack 2.0 to provide key functions for bothscenarios. The XML Pack consists of three stages: XML Input, XML Output, and XML Transformer.This series uses the XML Input and XML Output stages. The XML Transformer stage, which isbased on XSLT, isn't used in the article scenarios.

XML Input stageXML Input converts XML data to "flat" relational tables. With XML Input, you can:

• Extract data using standard XPath expressions• Use XML Schemas and optionally validate XML data• Support complex transformations using custom stylesheets• Support multiple output links using different data sets• Handle errors

Figure 1 illustrates a simple transformation done by the XML Input stage:

Figure 1. The XML Input stage transforms hierarchical XML data into "flat"tables

XML Output stageXML Output converts tabular data, such as relational tables, to XML hierarchical structures. WithXML Output, you can:

• Generate XML using a subset of standard XPath expressions



of 28

• Generate multiple documents from the same input.• Support XML namespaces• Configure XML document generation in a variety of ways, such as one document per input

row• Generate output documents on disk or through an output link• Handle errors

Figure 2 illustrates a simple transformation done by XML Output:

Figure 2. The XML Output stage transforms "flat," tabular structures into XMLhierarchies

DB2 pureXML overviewDB2 provides firms with a common application programming interface and database managementplatform for data modeled in tables as well as XML hierarchies. Figure 3 illustrates how DB2enables firms to manage XML messages and documents alongside traditional relational data. WithpureXML, there is no need to "shred" or map business-critical XML data into a relational format.Instead, XML data can be stored intact in its native hierarchical format. Applications can readilyaccess the XML data they need as well as integrate XML and relational data with little effort.

Figure 3. DB2's architecture offers relational and XML data managementsupport



of 28

The scenarios in this article series rely on DB2's pureXML capabilities, which include optimizedstorage of XML data in its native hierarchical format and support for querying XML data in SQL orXQuery languages.

Sample development and runtime environmentTo create the sample scenarios described in this article series, we used the following software:

• Information Server 8.1, which includes the DataStage Designer and DataStage Directorcomponents. In addition, we installed the DataStage 8.1 DB2 Connector, which is contained ina patch for DataStage 8.1 called "Connectors Rollup Patch 1." Because one of the scenariosuses a Transformer Stage, we also installed a prerequisite C++ compiler. (We used MicrosoftVisual Studio .NET 2003.)

• DB2 9.5 Enterprise Server Edition.

Both DataStage and DB2 were installed on the same Windows system.

The sample operational data includes information about financial portfolios. We derived thisdata from a subset of the open source Transaction Processing over XML (TPOX) benchmark.Specifically, we used data representing information about customer accounts and their holdings (orinvestments). You'll learn more about the operational data and the data warehouse design as youstep through each scenario. (For more information about TPOX, see the Resources section.)

Using pureXML as input to the warehouseThe scenario described in this article uses operational XML data stored in DB2 pureXML as inputto the ETL job. As shown in Figure 4, you want DataStage to extract data stored in one DB2 XMLcolumn and map this data to two tables in a DB2 data warehouse. One of these tables containsonly traditional relational columns, while the second contains relational columns and one XMLcolumn.

Figure 4. Operational data stored in DB2 pureXML serves as input to the datawarehouse

This data model represents a common scenario in which portions of XML data are often"shredded" into a relational structure. Such portions represent data that business users mightfrequently analyze and query. Many business intelligence tools are optimized to support relationalstructures, so shredding frequently queried XML data into relational columns can be quite effective.However, business needs can vary over time, so it can be difficult for administrators to determine



of 28

which relational columns should be created. Maintaining full XML data in the data warehouseallows users to immediately access important business data that wasn't previously shredded into arelational format.

To simplify the sample scenarios in this article, we use a single DB2 database (named "TPOX") tostore both the operational data and the warehouse data. Of course, in a production environment,operational data and warehouse data would be managed in separate databases, usually onseparate servers.

Since we use DB2 to manage both operational data and warehouse data, we could have electedto use built-in DB2 technologies to perform much of the ETL work. However, offloading thiswork to DataStage minimizes impact on DB2 operations, which is a common goal in productionenvironments. In addition, DataStage provides a number of transformation and cleansing functionsbeyond those found in DB2. Finally, many firms need to populate their data warehouses with datafrom heterogeneous data sources, and DataStage provides critical services to help them do so.Such capabilities are well-documented elsewhere and are beyond the scope of this article series.

Design overviewThe sample scenario in this article stores operational XML data in the TPOXADMIN.ACCOUNTtable, which serves as the source table to DataStage for this scenario. The ACCOUNT tablecontains one relational column (ID) and one XML column (INFO). Listing 12 shows how easy it isto create this table:

Listing 12. SQL to create the source ACCOUNT tablecreate table tpoxadmin.account(id int, info xml)

The INFO column contains details about the account, including its title, opening date, workingbalance, portfolio holdings, and other information. Listing 13 shows a portion of one XMLdocument stored in the ACCOUNT table. The DB2 script accompanying this article contains thefull set of XML account records (see Download).

Listing 13. Portion of one XML document stored in the ACCOUNT table<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc"> <Category>6</Category> <AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle> <ShortTitle>Lapidot EUR</ShortTitle> <Mnemonic>LapidotEUR</Mnemonic> <Currency>EUR</Currency> <CurrencyMarket>3</CurrencyMarket> <OpeningDate>1999-02-20</OpeningDate> <AccountOfficer>Soraya Lagarias</AccountOfficer> <LastUpdate>2004-02-10T22:33:58</LastUpdate> <Balance> <OnlineActualBal>896882</OnlineActualBal> <OnlineClearedBal>337676</OnlineClearedBal> <WorkingBalance>430147</WorkingBalance> </Balance> . . . <Holdings>



of 28

<Position> <Symbol>ZION</Symbol> <Name>Zions Bancorporation</Name> <Type>Stock</Type> <Quantity>1927.719</Quantity> </Position> . . . </Holdings>. . .</Account>

For testing simplicity, the target data warehousing database is also configured to be TPOX.Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to twotables—the DWADMIN.ACCT table that contains information about the overall account, andthe DWADMIN.HOLDINGS table that contains information about various investments (portfolioholdings) of a given account. Listing 14 shows how these tables are defined:

Listing 14. Definitions for target data warehouse tables-- DWADMIN.ACCT contains general information about accounts.create table dwadmin.acct ( id int primary key not null, title varchar(100), currency char(3), workingbalance int, totalholdings int, holdingtypes int, officer varchar(50), datechanged date, timechanged time, fullrecord xml)

-- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given accountcreate table dwadmin.holdings ( id int references dwadmin.acct on delete cascade, symbol varchar(10), type varchar(25), quantity decimal(12,2))

To understand how the XML source data (in the INFO column of TPOXADMIN.ACCOUNT) ismapped to the various columns of the data warehouse tables, see Table 1. (The ID columns ofboth data warehouse tables are populated from the values of the ID column in the operation.)

Table 1. XML source data for each column in the data warehouse tables

DWADMIN.ACCT DWADMIN.HOLDINGS

Column name Data source (XPATH or...) Column name Data source(XPATH expression)

title /Account/AccountTitle symbol /Account/Holdings/Position/Symbol

currency /Account/Currency type /Account/Holdings/Position/Type

workingbalance /Account/WorkingBalance quantity /Account/Holdings/Position/Quantity

officer /Account/AccountOffice



of 28

datachanged extract from /Account/LastUpdated

timechanged extract from /Account/LastUpdated

fullrecord entire XML document as originallystored in DB2 source table

As you might expect, there are several ways to build a DataStage job for this scenario. This articleemploys an incremental development approach. In particular, the initial steps guide you throughcreating a portion of the overall DataStage job that extracts, transforms, and loads data into theDWADMIN.ACCT table. Once this portion is complete and tested, the article guides you throughenhancing the job to extract, transform, and load XML data into the DWADMIN.HOLDINGS table.First, though, you need to create the appropriate DB2 tables to support this scenario.

Prepare the DB2 database environmentThe article download contains a script that creates the necessary operational (source) table anddata warehouse (target) tables. It also populates the operational table with sample XML accountinformation. Download and unzip the DSsetup.zip file, and then open a DB2 command window andissue the following command:

Listing 15. Invoking the DB2 script accompanying this article series db2 –td@ -vf DSsetup.db2

Note: This script is designed to support a DB2 9.5 server running on Windows.

Create a parallel job and outline its contentsTo begin, define a new DataStage parallel job with placeholders for various stages (worktasks) that need to be included. Later, you'll edit each stage to specify appropriate properties tosuccessfully execute the job.

1. Launch DataStage Designer.2. To create a new parallel job, right-click on the Jobs folder in the Repository pane, and select

New > Parallel job.3. Add two DB2 Connector stages to the parallel job. One connector will represent the

TPOXADMIN.ACCOUNT source table, and the other will represent the DWADMIN.ACCTtarget table.

a. Select the Database tab in the Palette pane.b. Drag the DB2 Connector stage to the parallel job.c. Repeat the previous step to create a second DB2 Connector stage.d. Place these two connectors at opposite sides of the parallel job canvas.

4. Add an XML Input stage to the job. (This stage will transform the hierarchical XML data aboutaccounts into a tabular result structure.)

a. Select the Real Time tab in the Palette pane.b. Locate the XML Input stage and drag it to the parallel job.c. Place the icon near the first DB2 Connector stage.



of 28

5. Add a Transformer stage to the job. (This stage will split a single XML element value into twovalues that will populate two different columns in the target table.)

a. Select the Processing tab in the Palette pane.b. Locate the Transformer stage and drag this icon to the parallel job pane.c. Place the icon between the XML Input stage and the final DB2 Connector.

6. Link the stages together.a. To link the first DB2 Connector to the XML Input, hold down the right mouse button, click

on the DB2 Connector, and drag the mouse to the XML Input stage. An arrow betweenthe two stages will appear.

b. Link the XML Input to the Transformer stage.c. Link the Transformer stage to the final DB2 Connector.

7. If desired, customize your job with descriptive names for each stage and link using standardDataStage facilities. (Consult the Resources section for links to DataStage tutorials anddocumentation, if needed.)

8. Verify that your parallel job design is similar to Figure 5, which shows the various stageslinked together, as described in Step 6:

Figure 5. DataStage job skeleton for the first part of the integrationscenario

(View a larger version of Figure 5.)9. To save your work, click on File > Save from the main toolbar, and specify a name of your

choice for the job.

You now have a parallel job skeleton for the first part of your DataStage job.

Import table definitions

The DB2 TPOX database already contains source and target tables, so you need to importmetadata about these tables into DataStage.

1. From the main toolbar, select Import > Table Definitions > Start Connector Import Wizard.2. A new window appears for "Connector metadata import – Connector selection." Select DB2

Connector (Variant 9.1), and then click on Next.

sidefile1.html



of 28

3. Enter the appropriate details for the connection, including the instance type (DB2), databasename (TPOX), and a valid user ID and password.

4. Click on the Test connection option in the upper right corner of this window to verify that youcan connect to the DB2 TPOX database.

5. After a successful connection, click on Next and then on OK.6. Accept the default values for the data source location. These include a host name of DB2 and

a database name of TPOX (DB2).7. Click on Next.8. Select the TPOXADMIN schema from the Filter drop-down list, verify that the include tables

option is checked, and then click on Next.9. Select the ACCOUNT table from the list of available tables for this schema.

10. Leave all options unchecked, including the "XML Columns as LOBs" option. Your DataStagejob needs to process and transform the XML, so it will treat it as string data (rather than as anunstructured large object or LOB).

11. Verify that the TPOXADMIN.ACCOUNT table is slated for import, and click on Import.12. A pop-up window appears, prompting you to select a folder for the metadata import. Select

Table Definitions, and click on OK.13. Repeat the prior steps to import the two data warehouse target tables.

• In Step 8, select the DWADMIN schema, instead of the TPOXADMIN schema.• In Step 9, select the ACCT and HOLDINGS tables. (You'll only use the DWADMIN.ACCT

table definition initially, but it will save time if you import the DWADMIN.HOLDINGS tabledefinition now as well.)

14. To confirm that you successfully imported metadata for all necessary tables, expand the TableDefinitions folder in the Repository pane in the upper left corner and confirm that there areentries for TPOXADMIN.ACCOUNT, DWADMIN.ACCT, and DWADMIN.HOLDINGS.

15. Save your work.

You're now ready to edit each stage of the job.

Edit the DB2 source table stage

The TPOXADMIN.ACCOUNT table contains operational information needed for the warehouse.The properties of the source DB2 Connector stage need to be edited to reflect appropriateinformation about this input table.

1. Double-click on the DB2 Connector that represents the input source.2. A window appears. In the Properties tab, specify the appropriate connection information. This

includes the instance type (DB2), database name (TPOX), and a valid user ID and password,as shown in Figure 6:



of 28

Figure 6. Setting properties for the DB2 source table

3. Click on Test in the upper right corner of the pane to verify that you can successfully connectto the database.

4. Scroll down to the Usage section of the Properties tab, and specify the following settings (asalso shown in Figure 6):

• Generate SQL: Yes• Table name: TPOXADMIN.ACCOUNT• Array size: 1

5. Click on View Data in the right-hand side of the Usage row to verify that you can successfullyquery the table, then click on OK to save your settings.

6. Click on the Columns tab, and select Load at the bottom of the pane.7. A window appears with table definitions. Select the TPOXADMIN.ACCOUNT table, and click

on OK.8. A window appears with the columns of the table. Accept the default in which all columns of

the table are selected. (Note that the INFO column, which was created in DB2 as an XMLcolumn, appears here with an SQL type of NVarChar, which represents a Unicode string. Thisis fine.)

9. Click on OK.



of 28

10. Specify an appropriate length for the INFO column. For this sample data, a length of 5000bytes is sufficient.

11. Click on OK.12. Save your work.

The DB2 Connector stage for the source table is ready. In the next step, you'll customize the DB2Connector representing the data warehouse target table.

Edit the DB2 target table stage

The DWADMIN.ACCT table is one of two data warehouse tables to be populated from the inputsource. As mentioned earlier, this table contains several relational columns for specific informationabout financial accounts as well as one pureXML column with full account data in XML format. Theproperties of the target DB2 Connector stage need to be edited to reflect appropriate informationabout this table.

1. To open the DB2 Connector that represents the target table, double-click on its icon.2. In the Properties tab, specify the appropriate connection information. This includes the

instance type (DB2), database name (TPOX), and a valid user ID and password.3. Test that you can successfully connect to the database.4. In the Usage section of the Properties tab, specify the following settings, as shown in Figure

7:• Write mode: Insert• Generate SQL: Yes• Table name: DWADMIN.ACCT• Array size: 1• Table action: Truncate



of 28

Figure 7. Setting properties for one of the DB2 target tables

5. Click on View Data to verify that you can successfully query the table. (The first time you runyour job, this table will be empty.)

6. Click on OK to save these settings.7. Click on the Columns tab, and select Load.8. Select the DWADMIN.ACCT table, and click on OK.9. Accept the default in which all columns of the table are selected. (Note that the

FULLRECORD column, which was created in DB2 as an XML column, appears here with anSQL type of NVarChar.)

10. Click on OK.11. Specify an appropriate length for the FULLRECORD column. For the sample data, a length of

5000 bytes is sufficient.12. Click on OK.13. Save your work.

With the DB2 source and target stages defined, it's time to work on the stages that process thedata.



of 28

Edit the XML Input stageYou need to customize the XML Input stage to extract XML data from the INFO column of theTPOXADMIN.ACCOUNT table and generate a tabular result set, which will ultimately be used topopulate the DWADMIN.ACCT table.

1. To open the XML Input stage, double-click on its icon.2. Click on the Input tab.3. A second set of tabs appears. Click on the XML Source sub-tab (if needed).4. Specify the XML source column. Use the drop-down tab to select the INFO column.5. Specify XML document (rather than URL/File path) as the Column content.6. Click on the Column sub-tab.7. Verify that the ID and INFO columns are listed.8. Set the data type for the INFO column to VarBinary, and set the column length to 5000 bytes.

(Using VarBinary enables the XML Input stage to handle the data in its native UTF-8 formatand avoids unnecessary character set conversions.)

9. Click on the main Output tab.10. A new set of tabs appears. Click on the Transformation Settings sub-tab.11. Verify that the Repetition element required setting is checked. This is needed to indicate

that you intend to extract multiple records from a single record that contains XML. The XMLInput stage will loop through the repetition element and produce an output record for eachoccurrence of that element. In a following step, you'll see how to indicate the path to therepetition element.

12. Select the include namespace declaration check box.13. The XML source data contains a namespace, so you need to declare that here. Enter the

following:xmlns:ns= "http://tpox-benchmark.com/custacc"

14. Click on the Columns sub-tab, and select Load.15. A list of table definitions appears. Select the DWADMIN.ACCT table.16. Accept the default to load all columns.17. Verify that the Ensure all Char columns use Unicode check box is selected, then click OK.18. Modify the output structure from this job stage as follows (to match the content of Figure 8):

a. Specify a length of 5000 bytes for the FULLRECORD column.b. Change the data type of the TIMECHANGED column to NVarChar with a length of

20. Treating the data as a string will enable you to take advantage of certain built-inDataStage transformation functions, as you'll see shortly.

c. Specify the appropriate XPATH expression that DataStage should use to extractinformation from the input XML data. Do this for every column except the ID column(which is extracted from a relational column in the source TPOXADMIN.ACCOUNTtable). The appropriate description settings include:

• TITLE: /ns:Account/ns:AccountTitle/text()• CURRENCY: /ns:Account/ns:Currency/text()• WORKINGBALANCE: /ns:Account/ns:Balance/ns:WorkingBalance/text()• OFFICER: /ns:Account/ns:AccountOfficer/text()• DATECHANGED: /ns:Account/ns:LastUpdate/text()• TIMECHANGED: /ns:Account/ns:LastUpdate/text()



of 28

• FULLRECORD: /ns:Accountd. Identify the TITLE column as a key. To do so, change the Key value for TITLE to Yes.

This instructs DataStage to use this XML element value as the repetition elementidentifier. For every occurrence of ns:AccountTitle, the stage will produce an outputrecord. In other words, it will produce a record for each Account since every accountcontains an AccountTitle element. In this scenario, other columns could also serve thispurpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns,since all of them are mandatory elements of the Account element. We selected TITLE asthe key column for convenience.

Figure 8. Column definitions for output results from the XML Input stage

(View a larger version of Figure 8.)Note: VarChar types with a Unicode extension are equivalent to NVarChar types to the DB2Connector.

19. Click on OK and save your work.

You've now programmed the XML Input stage to extract data from a DB2 pureXML column andprepare it for mapping into a DB2 target data warehouse table that is largely relational. Next, you'llcustomize the Transformer stage so that the data will be modified as needed before it's loaded intothe target table.

Edit the Transformer stageThe Transformer stage needs to extract time information from an input string containing timestampdata in order to correctly populate the TIMECHANGED column in the target DWADMIN.ACCTtable. DataStage provides a number of built-in functions for manipulating date/time data,performing data type conversions, and so on. You need to edit the Transformer stage to invokeappropriate functions for your data conversion.

1. To open the Transformer stage, double-click on its icon.2. Click on the tool bar's Column Auto-Match icon to instruct DataStage to automatically map

columns by name between the input and output links. (You defined the result set for thisstage's input link in "Edit the XML Input stage" section. You defined the result set for thisstep's output link in the "Edit the DB2 target table stage" section.)

sidefile2.html



of 28

3. Click OK.4. Modify the Derivation setting for the TIMECHANGED column of the output link to transform

the data as needed. Recall that the input string contains a full timestamp with date and timeinformation, and you want to populate the TIMECHANGED column in the DB2 target tablewith only a time value.

a. Highlight the appropriate derivation setting, right-click, and select Edit Derivation.b. A blank pane appears. Use the built-in wizards to select appropriate transformation

function calls, or enter the following code: TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED,"%yyyy-%mm-%ddT%hh:%nn:%ss"))

In case you're curious, the inner function call converts the input string into a timestampthat complies with a specific format. The outer function takes this timestamp andconverts it to a time value. For details on these functions or the Transformer stage, seethe Resources section.

5. Verify that your transformation appears similar to Figure 9, which illustrates the mappingbetween the input and output links (generated in Step 2) as well as the derivation that youedited in Step 4:

Figure 9. Modified properties for the Transformer stage

(View a larger version of Figure 9.)6. Click on OK.

You've now completed the modifications necessary for the initial portion of your DataStage job. It'stime to compile and test your work.

sidefile3.html



of 28

Compile and test your work

Follow the standard DataStage procedures to compile, validate, and run your job. Consult theResources section for links to DataStage documentation, if needed. Here's a quick summary:

1. Press the F7 key, or select the Compile icon from the toolbar. Ensure that the job compiledwithout error.

2. Press Ctrl+F5, or select the Run icon from the toolbar.3. Set Warnings to no limit, and click on Run.4. DataStage will attempt to execute your job. Launch the DataStage and QualityStage Director

to inspect the results of your job.5. If needed, switch to the Status page and highlight your job. To do so select View > Status, or

click on the Status icon in the toolbar.6. Switch to the Log page to review details about the job's execution. To do so, select View >

Log, or click on the Log icon on the toolbar. The job should execute successfully, althoughwarning messages may appear.

At this point, your can inspect the data in the target table using standard DB2 tools. Or, if youprefer, you can view the data from the DataStage Designer by doing the following:

1. Double-click on the DB2 Connector that represents the target table.2. Click on the View Data link to display the data in the target table.

Add more stages to populate the final warehouse table

To complete the scenario, you need to extract the appropriate information from the XML Inputsource and load this data into the DWADMIN.HOLDINGS table. Doing so requires adding two newstages to the existing job.

1. Select a Copy stage from the Processing section of the palette, and place the Copy stagejust below the existing XML Input stage.

2. Place a new DB2 Connector stage on the palette to the right of the Copy stage.3. Link the XML input stage to the new Copy stage, and link the new Copy stage to the new DB2

Connector stage.4. If desired, rename your new links and stages in a more descriptive manner.5. Verify that your job design appears similar to Figure 10, which shows the various stages

linked together, as described in Step 3:



of 28

Figure 10. Revised DataStage job for this scenario

(View a larger version of Figure 10.)

Edit the new DB2 Connector stage:

1. Specify the appropriate connection information, including the instance type (DB2), databasename (TPOX), and a valid user ID and password.

2. Test the connection.3. Specify the appropriate usage information:

• Write Mode: Insert• Generate SQL: Yes• Table name: DWADMIN.HOLDINGS• Table action: Truncate• Array Size: 1

4. Select View Data to verify that the specifications are valid.5. Click on the Column tab.6. Load information from the table definition for DWADMIN.HOLDINGS. Import all columns.7. Set the Key value of the ID column to Yes.8. Specify the XPATH expressions that DataStage should use to extract information from the

input XML data (shown in Figure 11). Do this for every column except the ID column. Theappropriate description settings include:

• SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()• TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()• QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()

sidefile4.html



of 28

Figure 11. Edited column values of the DB2 Connector for theDWADMIN.HOLDINGS table

(View a larger version of Figure 11.)

Edit the existing XML Input stage:

1. Click on the Output tab and use the drop-down menu beneath the Output Name box toidentify the appropriate DataStage link to modify. (In Figure 10, HoldingInfo is the appropriatelink, as it controls the data flow from the XML Input stage to the Copy stage that you justadded to the job.)

2. Click on the Transformation Settings tab.3. Verify that the Repetition element required setting is checked.4. Select the include namespace declaration check box.5. Enter the following XML namespace definition in the white box:

xmlns:ns="http://tpox-benchmark.com/custacc"

6. Click on the Columns sub-tab.7. Load information for the DWADMIN.HOLDINGS table. Select all columns.8. Specify the XPath expressions that DataStage should use to extract information from the

input XML data (shown in Figure 12). Do this for every column except the ID column. Theappropriate description settings include:

• SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()• TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()• QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()

sidefile5.html



of 28

Figure 12. Column definitions for output from XML Input stage to newCopy stage

(View a larger version of Figure 12.)9. Set the Key value for the SYMBOL column to Yes. This causes the stage to produce an

output record for each occurrence of the Symbol element. Since Symbol occurs in everyPosition element, the stage produces a record for each Position element.

10. Click on OK.

Edit the new Copy stage. (This stage passes information between the XML Input and DB2Connector stages, each of which associates different meanings with Key values. The XML Inputstage uses a Key column to identify the repetition element, which determines how the XML data isgenerated. When you edited the XML Input stage in the previous section, the work you performedin Steps 3 and 9 identified the Key value and repetition element. The DB2 Connector, of course,uses Key to identify a column that is a database key. The Copy stage allows data to be passedbetween these stages without introducing any operational conflicts due to the way each stagetreats key values. To do so, its input and output links identify different columns as keys.)

1. From the Output tab, select Mappings.2. Click on Auto-Match, and select OK.3. If desired, inspect the Key definitions for the columns on the input and output links.4. Click on the Input tab, and then click on the Columns sub-tab.5. Verify that the SYMBOL column is identified as the Key.6. Click on the Output tab, and then click on the Columns sub-tab.7. Verify that the ID column is identified as the Key.8. Click on OK to finish editing this stage.

Test your revised job:

1. If you haven't already done so, save your job.2. Compile the job.3. Verify that there are no errors.4. Run the job.

sidefile6.html



of 28

5. Use DataStage Director to inspect the log and verify that the job finished successfully.6. Optionally, inspect the data in the target table by selecting View Data in the stage editor for

the target DB2 stage.

Summary

Increased use of XML as a preferred format for data exchange is prompting data architectsand administrators to evaluate options for integrating business-critical XML data into their datawarehouses. In this first installment of this two-part series, you learned how IBM InfoSphereDataStage can extract and transform XML data managed by DB2 pureXML. In addition, youexplored how DataStage can load this data into two tables: one with traditional SQL data types,and one that features both relational and XML columns.

The second part of this article series explores another important scenario: using DataStage to readinformation from a flat file, convert the data into an XML format, and load this XML data into a datawarehouse that contains a table with a DB2 pureXML column.

Acknowledgments

Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on thisarticle.




of 28

Downloads

Description Name SizeSample scripts and data DSsetup.zip 141KB

http://www.ibm.com/developerworks/apps/download/index.jsp?contentid=421809&filename=DSsetup.zip&method=http&locale=



of 28

Resources

Learn

• IBM InfoSphere DataStage: Get an overview of IBM InfoSphere DataStage.• "Parallel Job Tutorial" (IBM, 2008): Learn more about creating parallel jobs (publication

SC18-9889-01).• IBM InfoSphere DataStage documentation (IBM Information Server Information Center): Get

more details about DataStage capabilities and how to use them.• DB2 pureXML wiki: Find a comprehensive set of links to demos, free downloads, technical

papers, and more.• DB2 pureXML Cookbook (IBM Press, August 2009): Explore this comprehensive guide to

DB2 pureXML technology for all supported platforms.• "Query XML Data with SQL" (developerWorks, March 2006): Learn how to query data stored

in XML columns using SQL and SQL/XML, and explore many of the sample queries includedin this article.

• "Query XML Data with XQuery" (developerWorks, April 2006): Learn how to query datastored in XML columns using XQuery, and explore many of the sample queries included inthis article.

• "Enhance business insight and scalability of XML data with new DB2 9.7 pureXMLfeatures" (developerWorks, April 2009): Get an overview of the latest DB2 pureXML features.

• Transaction Processing over XML (TPOX) benchmark: Learn more about the TransactionProcessing over XML (TPOX).

• developerWorks Information Management zone: Learn more about Information Management.Find technical documentation, how-to articles, education, downloads, product information,and more.

• Stay current with developerWorks technical events and webcasts.

Get products and technologies

• DB2 Express-C 9.7: Download a free version of DB2 Express database server for thecommunity that includes pureXML.

• DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7.• Build your next development project with IBM trial software, available for download directly

from developerWorks.

Discuss

• Participate in the discussion forum for this content.• Participate in the DB2 pureXML forum or one of the DataStage forums.• Participate in developerWorks blogs and get involved in the My developerWorks community;

with your personal profile and custom home page, you can tailor developerWorks to yourinterests and interact with other developerWorks users.

http://www.ibm.com/software/data/infosphere/datastage/

http://publibfp.boulder.ibm.com/epubs/pdf/c1898891.pdf

http://publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.roadmap.doc/topics/ds_roadmap.html

http://www.ibm.com/developerworks/wikis/display/db2xml/Home

https://www.ibm.com/developerworks/wikis/display/db2xml/DB2+pureXML+Cookbook

http://www.ibm.com/developerworks/data/library/techarticle/dm-0603saracco2/

http://www.ibm.com/developerworks/data/library/techarticle/dm-0604saracco/

http://www.ibm.com/developerworks/data/library/techarticle/dm-0904db297purexml/

http://www.ibm.com/developerworks/data/library/techarticle/dm-0904db297purexml/

http://tpox.sourceforge.net/

http://www.ibm.com/developerworks/data/

http://www.ibm.com/developerworks/offers/techbriefings/

http://www.ibm.com/developerworks/downloads/im/udbexp/learn.html

http://www.ibm.com/developerworks/downloads/im/udb/learn.html

http://www.ibm.com/developerworks/downloads/

http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423

http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423

http://www.ibm.com/developerworks/data/community/#iis

http://www.ibm.com/developerworks/blogs/

http://www.ibm.com/developerworks/community



of 28

About the authors

Branislav Barnak

Branislav Barnak is a software architect at IBM located in Boca Raton, FL. He workson IBM InfoSphere DataStage connectivity and is focusing on IBM technologies suchas DB2, WebSphere MQ, mainframe data storage systems, and others. Branislav hasbeen involved with the development of connectivity technologies for more than 10years. He has worked on several connectivity frameworks interfacing with databaseand messaging products such as Oracle, PeopleSoft, Teradata, DB2, WebSphereMQ, and others. He holds a BS degree in Electrical Engineering from the University ofBelgrade, Serbia.

Amir Bar-or

Amir Bar-or is a senior architect and manager in the Enterprise InformationManagement Group at the Massachusetts Laboratory. He has more than 10 years ofexperience in database research and development as a researcher in HP Labs andthen in IBM SWG. Today, he is leading the development of DataStage future XMLcapabilities.

Cynthia M. Saracco

Cynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratorywho specializes in emerging technologies and database management topics. Shehas 23 years of software industry experience, has written 3 books and more than 60technical papers, and holds 7 patents.

Paul Stanley

Paul Stanley is a senior architect in the Enterprise Information Management Groupin Boca Raton, FL. He has been architecting and managing the development ofconnectivity components for WebSphere Transformation Extender and InfoSphereDataStage for more than 12 years.



of 28

© Copyright IBM Corporation 2009(www.ibm.com/legal/copytrade.shtml)Trademarks(www.ibm.com/developerworks/ibm/trademarks/)

http://www.ibm.com/legal/copytrade.shtml

http://www.ibm.com/developerworks/ibm/trademarks/

IBM InfoSphere DataStage and DB2 PureXML,Part 1-Integrate XML Operational Data Into a Data Warehouse

Documents

Transcript of IBM InfoSphere DataStage and DB2 PureXML,Part 1-Integrate XML Operational Data Into a Data Warehouse