DataStage Family Whitepaper

DataStage® Product Family Architectural Overview

Transcript of DataStage Family Whitepaper

Page 1: DataStage Family Whitepaper

DataStage® Product Family Architectural Overview

Page 2: DataStage Family Whitepaper

I. Executive Summary

In today's information age, organizations have grown beyond the question "Should we build a data warehouse?" to "Can we consolidate our global business data such that it is reliable and makes sense?" With many organizations implementing e-business, customer relationship management (CRM), enterprise applications and decision support systems, the process of collecting, cleansing, transforming, and preparing business data for analysis has become an enormous integration challenge. A solid data integration environment, by definition, will attract more data consumers, since the ultimate goal of business intelligence is to discover new sales or cross-selling opportunities as well as to reduce operating costs. However, more users translate into more requirements in terms of performance, features, and overall support, as well as different business intelligence tools. Very quickly, developers find themselves having to incorporate more complex data sources, targets, applications and meta data repositories into an existing data environment.

How an organization chooses to integrate its business data has an impact on costs in terms of time, money, and the resources required to expand and maintain the data environment. If an organization chooses the wrong tool for their environment, they may find that they have to write custom code or manually enter data in order to integrate it with the existing infrastructure and toolset. Writing custom code is often the simplest approach at first, but quickly exhausts the organization's ability to keep up with demand. Change is very difficult to manage when the transformation rules exist buried in procedural code or when meta data has to be manually entered again and again. Purchasing a full solution from an RDBMS vendor can be an attractive initial choice, but it requires a commitment to their database and tool suite.

Ideally, the best approach should allow flexibility in the data integration architecture so that the organization can easily add other best-of-breed tools and applications over time. The most flexible and cost-effective choice over the long term is to equip the development team with a toolset powerful enough to do the job as well as allow them to maintain and expand the integrated data environment.

That choice is the DataStage Product Family.

With the DataStage Product Family, developers can:

• Provide data quality and reliability for accurate business analysis

• Collect, integrate, and transform data from diverse sources such as the web, enterprise applications (SAP, PeopleSoft) and mainframes, with structures ranging from simple to highly complex, without overloading available resources

• Integrate meta data across the data environment to maintain consistent analytic interpretations

• Capitalize on mainframe power for extracting and loading legacy data as well as maximize network performance

Providing developers with power features is only part of the equation. Ascential Software has long supported developers in their efforts to meet tight deadlines, fulfill user requests, and maximize technical resources. DataStage offers a development environment that:

• Works intuitively thereby reducing the learning curve and maximizing development resources

• Reduces the development cycle by providing hundreds of pre-built transformations as well as encouraging code re-use via APIs

• Helps developers verify their code with a built-in debugger, thereby increasing application reliability as well as reducing the amount of time developers spend fixing errors and bugs

• Allows developers to quickly "globalize" their integration applications by supporting single-byte character sets and multi-byte encoding

• Enhances developer productivity by adhering to industry standards and using certified application interfaces

Page 3: DataStage Family Whitepaper

Another part of the equation involves the users of integrated data. Ascential Software is committed to maximizing the productivity of the business users and decision-makers. An integrated data environment built with DataStage offers data users:

• Quality management which gives them confidence that their integrated data is reliable and consistent

• Meta data management that eliminates the time-consuming, tedious re-keying of meta data as well as provides accurate interpretation of integrated data

• A complete integrated data environment that provides a global snapshot of the company’s operations

• The flexibility to leverage the company's Internet architecture for a single point of access as well as the freedom to use the business intelligence tool of choice

Ascential's productivity commitment stands on the premise that data integration must have integrity. To illustrate this point, suppose company XYZ uses Erwin and Business Objects to derive their business intelligence. A developer decides that it is necessary to change an object in Erwin, but making that change means that a query run in Business Objects will no longer function as constructed. This is a common situation that leaves the business user and the developer puzzled and frustrated. Who, or rather what tool, is responsible for identifying or reconciling these differences? Ascential Software's DataStage family addresses these issues and makes data integration transparent to the user. In fact, it is the only product today designed to act as an "overseer" for all data integration and business intelligence processes.

In the end, building a data storage container for business analysis is not enough in today's fast-changing business climate. Organizations that recognize the true value in integrating all of their business data, along with the attendant complexities of that process, have a better chance of withstanding and perhaps anticipating significant business shifts. Individual data silos will not help an organization pinpoint potential operating savings or missed cross-selling opportunities. Choosing the right toolset for your data environment can make the difference between being first to market and capturing large market share, or just struggling to survive.

Page 4: DataStage Family Whitepaper

Table of Contents

I. EXECUTIVE SUMMARY 2

II. INTRODUCTION 6

GETTING READY FOR ANALYSIS 6

THE PITFALLS 7

NOT ALL TOOLS ARE CREATED EQUAL 7

III. THE DATASTAGE PRODUCT FAMILY 8

DATASTAGE XE 8

Data Quality Assurance 8

Meta Data Management 9

Data Integration 9

DATASTAGE XE/390 10

DATASTAGE XE PORTAL EDITION 10

DATASTAGE FOR ENTERPRISE APPLICATIONS 10

IV. THE ASCENTIAL APPROACH 11

DEVELOPMENT ADVANTAGES 11

Application Design 11

Debugging 12

Handling Multiple Sources and Targets 12

Native Support for Popular Relational Systems 12

Integrating Complex Flat Files 13

Applying Transformations 13

Merge Stage 13

Graphical Job Sequencer 13

ADVANCED MAINTENANCE TOOLS 14

Object Reuse and Sharing 14

Version Control 14

Capturing Changed Data 15

National Language Support 15

EXPEDITING DATA TRANSFER 16

Named Pipe Stage 16

FTP Stage 16

USING EXTERNAL PROCESSES 16

The Plug-In API 16

Invoking Specialized Tools 16

Page 5: DataStage Family Whitepaper

PERFORMANCE AND SCALABILITY 17

Exploiting Multiple CPUs 17

Linked Transformations 17

In-memory Hash Tables 18

Sorting and Aggregation 18

Bulk Loaders 18

SUPPORT FOR COMPLEX DATA TYPES 18

XMLPack 18

ClickPACK 19

Legacy Sources 19

MQSeries 20

INTEGRATING ENTERPRISE APPLICATIONS 21

DataStage Extract PACK for SAP R/3 22

DataStage Load PACK for SAP BW 22

DataStage PACK for Siebel 22

DataStage PACK for PeopleSoft 22

V. NATIVE MAINFRAME DATA INTEGRATION 23

VI. END-TO-END META DATA MANAGEMENT 24

META DATA INTEGRATION 25

Semantic Meta Data Integration 25

Bidirectional Meta Data Translation 26

Integration of Design and Event Meta Data 26

META DATA ANALYSIS 26

Cross Tool Impact Analysis 26

Data Lineage 27

Built-In Customizable Query and Reporting 27

META DATA SHARING, REUSE, AND DELIVERY 27

Publish and Subscribe Model for Meta Data Reuse 27

Automatic Change Notification 28

On-Line Documentation 28

Automatic Meta Data Publishing to the DataStage XE Portal Edition 28

VII. EMBEDDED DATA QUALITY ASSURANCE 28

VIII. USING THE WEB TO ACCESS THE DATA INTEGRATION ENVIRONMENT 29

IX. CONCLUSION 30

X. APPENDIX A 32

PLATFORMS 31

STAGES 31

METABROKERS 33

Sybase PowerDesigner 33

CHANGED DATA CAPTURE SUPPORT ENVIRONMENTS 33

Page 6: DataStage Family Whitepaper

II. Introduction

In today's information age, organizations have grown beyond the question "Should we build a data warehouse?" to "Can we consolidate our global business data such that it is reliable and makes sense, and what tools or toolsets are available to help us?" With many organizations implementing e-business, CRM, enterprise applications and decision support systems, the process of collecting, cleansing, transforming, and preparing business data for analysis has become an enormous integration challenge. Unfortunately, the headlong rush to get these systems up and running has resulted in unforeseen problems and inefficiencies in managing and interpreting the data. How were these problems created and, better yet, how can they be avoided?

In the beginning, the organizational requirement for the data environment was to provide easy and efficient transfer of usable data, ready for analysis, from the transaction systems to the data consumers. This requirement has expanded to:

• Merging single data silos into a reliable, consistent "big picture"

• Developing and maintaining the infrastructure's meta data such that developers, administrators, and data consumers understand the data and all its inter-relationships

• Leveraging the Internet and intranet infrastructure so that administrators and data consumers can access systems and data, anytime and anywhere

• Allowing the data consumers to use their business intelligence (BI) tool of choice

In order to meet all these requirements, developers have had three options for cleansing and integrating enterprise data such that it is reliable and consistently interpreted. They could:

• Write custom code

• Purchase a packaged solution from an RDBMS vendor

• Equip the infrastructure and data designers with powerful, "best of breed" tools

Choosing a development option without considering the organization's data requirements or the complex issues involving data integration is the first pitfall. From there the problems start to build, because organizations grow, data users become more sophisticated, business climates change, and systems and architectures evolve. Before building or updating their data environment, developers and managers need to ask:

• How will the development options and data environment architecture affect overall performance?

• How will the environment handle data complexities such as merging, translating and maintaining meta data?

• How will we correct invalid or inaccurate data?

• How will we integrate data and meta data especially when it comes from obscure sources?

• How will we manage the data environment as the organization expands and changes?

Simply building a storage container for the data is not enough. Ultimately, the integrated environment should have reliable data that is "analysis ready".

Getting Ready for Analysis

Exactly how does the stored transaction data become suitable for mining, query, and reporting? The answer is that it must be taken through five major stages. First, all of the organization's business data must be collected from its various sources, consolidated, and then normalized in order for the data consumer to see "the big picture". Simply loading the data into silos will not support a holistic view across the enterprise and, in fact, will create more of a "legacy" data environment.

Second, since the data coming from online systems is usually full of inaccuracies, it has to be "cleansed" as it is moved to the integrated data environment. Company mergers and acquisitions that bring diverse operational systems together further compound the problem. The data needs to be transformed in order to aggregate detail, expand acronyms, identify redundancies, calculate fields, change data structures, and impose standard formats for currency, dates, names, and so forth. In fact, data quality is a key factor in determining the accuracy and reliability of the integrated data environment as a whole.


Page 7: DataStage Family Whitepaper


Third, the details about the data also need to be consolidated into classifications and associations that turn them into meaningful, accessible information. The standard data definitions, or meta data, must accompany the data in order for users to recognize the full value of the information. Realizing the data's full potential means that the users must have a clear understanding of what the data actually means, where it originated, when it was extracted, when it was last refreshed, and how it has been transformed.

Fourth, the "cleansed" integrated data and meta data should be accessible to the data users from an Internet portal, an internal company network or a direct system connection. Ideally, they also should be able to mine the data and update the meta data with any BI tool they choose. In the end, report generation should support the organization's decision-making process and not hinder or confuse it.

Avoiding the Pitfalls

The fifth and final stage concerns maintenance. How an organization chooses to integrate its business data and meta data has an impact on costs in terms of time, money, and the resources required to expand and maintain the data environment. If an organization chooses the wrong tool for their environment, they may find that they have to write custom code or manually enter data in order to integrate it with the existing infrastructure and toolset. Writing custom code is often the simplest approach at first, but quickly exhausts the organization's ability to keep up with demand. Change is very difficult to manage when the transformation rules exist buried in procedural code or when meta data has to be manually entered again and again. Purchasing a full solution from an RDBMS vendor can be an attractive initial choice, but it requires a commitment to their database and tool suite.

Ideally, the best approach should allow flexibility in the integrated data architecture so that the organization can easily add other best-of-breed tools over time. The most flexible and cost-effective choice over the long term is to equip the development team with a toolset powerful enough to do the job as well as allow them to maintain and expand the data environment.

Not All Tools are Created Equal

The successful toolset for constructing the data environment will have a relentless focus on performance, minimize the tradeoff between integration and flexibility, and have a clearly demonstrable return on investment. In order to accomplish these goals, designers and developers need a toolset that maximizes their performance by:

• Handling architecture complexity in terms of integrating distributed and non-distributed systems, synchronization, and mainframe and client/server sources

• Providing the choice to harness the power of mainframe systems for 'heavy-duty' extract, transform, load processing

• Dealing effectively with transformation complexity such as structured and unstructured data sources, integrating web data, and changed data

• Providing extensive built-in transformation functions

• Permitting the re-use of existing code through APIs, thereby eliminating redundancy and retesting of established business rules

• Offering a full debugging environment for testing and exit points for accessing specialized external or third-party tools

• Closing the meta data gap between design and business intelligence tools

• Supplying an easy deployment mechanism for completed transformation applications

• Making data quality assurance an inherent part of the integrated data environment

In addition, the toolset must offer high performance for moving data at run-time. It should pick the data up once and put it down once, performing all manipulations in memory and using parallel processing wherever possible to increase speed. Since no tool stands alone, it must have an integration architecture that provides maximum meta data sharing with other tools without imposing additional requirements upon them. Indeed, each tool should be able to retain its own meta data structure yet play cooperatively in the environment.

Page 8: DataStage Family Whitepaper

And what about the web? Today's development team must be able to integrate well-defined, structured enterprise data with somewhat unstructured e-business data. Ideally, the toolset should accommodate the meta data associated with the "e-data" as well as provide integration with the organization's business intelligence (BI) toolset. Without this integration, the organization runs the risk of misinterpreting the effect of e-commerce on their bottom line. Further, most global enterprises have taken advantage of the Internet infrastructure to provide business users with information access anytime, anywhere. Following suit, the integrated data environment needs to be accessible to users through their web browser or personalized portal.

Finally, the toolset's financial return must be clearly measurable. Companies using a toolset that fulfills these requirements will be able to design and run real-world integrated data environments in a fraction of the average time.

III. The DataStage Product Family

Is there a toolset today that can fulfill all the requirements? The answer is yes. The DataStage® Product Family from Ascential Software is the industry's most comprehensive set of data integration products. It enables companies worldwide to create and manage scalable, complex data integration infrastructures for successful implementation of business initiatives such as CRM analytics, decision support, enterprise applications and e-business.

Since each enterprise has its own data infrastructure and management requirements, Ascential Software offers different configurations of the DataStage Product Family:

• DataStage XE — the core product package that includes data quality, meta data management, and data integration

• DataStage XE/390 — for full mainframe integration with the existing data environment

• DataStage XE Portal Edition — for web-enabled data integration development, deployment, administration, and access, delivering a complete, personalized view of business intelligence and information assets across the enterprise

• DataStage Connectivity for Enterprise Applications — for integrating specific enterprise systems such as SAP, PeopleSoft and Siebel

DataStage XE

The foundation of the DataStage Product Family is DataStage XE. DataStage XE consists of three main components: data quality assurance, meta data management, and data integration. It enables an organization to consolidate, collect and centralize high volumes of data from various, complex data structures. DataStage XE incorporates the industry's premier meta data management solution, which enables businesses to better understand the business rules behind their information. Further, DataStage XE validates enterprise information, ensuring that it is complete, accurate and retains its superior quality over time. Using DataStage XE, companies derive more value from their enterprise data, which allows them to reduce project costs and make smarter decisions, faster.

DATA QUALITY ASSURANCE

DataStage XE's data quality assurance component ensures that data integrity is maintained in the integrated environment.


Figure 1: The DataStage Product Family (DataStage XE, DataStage Connectivity for Enterprise Applications, and services, positioned by value and enterprise scope)

Page 9: DataStage Family Whitepaper


Figure 2: DataStage XE Architecture Overview. The combination of data integration, meta data management and data quality assurance makes DataStage XE the industry's most complete data integration platform.

It gives development teams and business users the ability to audit, monitor, and certify data quality at key points throughout the data integration lifecycle. They can identify a wide range of data quality problems and business rule violations that can inhibit data integration efforts, as well as generate data quality metrics for projecting financial returns.

By improving the quality of the data going into the transformation process, organizations improve the data quality of the target. The end result is validated data and information for making smart business decisions, and a reliable, repeatable, and accurate process for making sure that quality information is constantly used throughout the enterprise.

META DATA MANAGEMENT

Most data integration environments are created using a wide variety of tools that typically cannot share meta data. As a result, business users are unable to understand and leverage enterprise data because the contextual information, or meta data, required is unavailable or unintelligible. Based on patented technology, DataStage XE's meta data management module offers broad support for sharing meta data among third-party data environments. This level of sharing is made possible in great part by Ascential Software's MetaBrokers™. MetaBrokers translate the form, meaning, and context of the meta data generated by a tool. This translation provides a basis for interpreting your strategic information by promoting a common understanding of the meaning and relevancy of the integrated data.

DATA INTEGRATION

DataStage XE optimizes the data collection, transformation and consolidation process with a client/server data integration architecture that supports team development and maximizes the use of hardware resources. The client is an integrated set of graphical tools that permit multiple designers to share server resources and design and manage individual transformation mappings. The client tools are:

• Designer — Provides the design platform via a "drag and drop" interface to create a visual representation of the data flow and transformations in the creation of ‘jobs’ that execute the data integration tasks

• Manager — Catalogs and organizes the building blocks of every project, including objects such as table definitions, centrally coded transformation routines, and meta data connections

• Director — Provides interactive control for starting, stopping, and monitoring the jobs

The server is the actual data integration workhorse, where parallel processes are controlled at runtime to send data between multiple disparate sources and targets. The server can be installed in either NT or UNIX environments and tuned to take advantage of investments in multiple CPUs and real memory. Using the many productivity features included in DataStage XE, an organization can reduce the learning curve, simplify administration and maximize development resources, thereby shrinking the development and maintenance cycle for data integration applications.

Page 10: DataStage Family Whitepaper

DataStage XE/390

DataStage XE/390 is the industry's only data integration product to offer full support for client-server and mainframe environments in a single cross-platform environment. By allowing users to determine where to execute jobs—at the source, the target, or both—DataStage XE/390 provides data integration professionals with capabilities unmatched by competitive offerings. No other tool provides the same quick learning curve and common interface to build data integration processes that will run on both the mainframe and the server. This capability gives the enterprise greater operational efficiency.

Furthermore, by extending the powerful meta data management and data quality assurance capabilities to the industry-leading OS/390 mainframe environment, users are assured that the processed data is reliable and solid for business analysis.

DataStage XE Portal Edition

Building upon DataStage XE's functionality, DataStage XE Portal Edition organizes and delivers business information to users across an enterprise by providing a complete, personalized view of business intelligence and information assets. Specifically, the Portal Edition provides web-enabled data integration development, deployment, administration and usage features. Administrators have the flexibility to "command and control" the data integration through any web browser.

For the business user, Ascential Software offers the first portal product that integrates enterprise meta data. With its unique Business Information Directory (BID), DataStage XE Portal Edition links, categorizes, personalizes, and distributes a wide variety of data and content from across the enterprise: applications, business intelligence reports, presentations, financial information, text documents, graphs, maps, and other images.

The Portal Edition gives end users a single point of web access for business intelligence. Users get a full view of the enterprise by pulling together all the key information for a particular subject area. For example, the business user can view a report generated by Business Objects without ever leaving their web browser. Furthermore, the business user can also access the report's meta data from the web browser, thereby increasing the chances of interpreting the data accurately. The end result is web access to an integrated data environment as well as remote monitoring for developers, administrators, and users, anytime and anywhere.

DataStage Connectivity for Enterprise Applications

Business analysts from many organizations depend upon critical data from enterprise application (EA) systems such as SAP, PeopleSoft and Siebel. However, to gain a complete view of the business it is often necessary to integrate data from enterprise applications with other enterprise data sources.

Integrating data from enterprise applications is a challenging and daunting task for developers. These systems usually consist of proprietary applications and complex data structures. Integrating them often requires writing low-level API code to read, transform and then output the data in the other vendor's format. Writing this code is difficult, time-consuming and costly. Keeping the interfaces current as the packaged application changes presents yet another challenge.

To ease these challenges, Ascential Software has developed a set of packaged application connectivity kits (PACKs) to collect, consolidate and centralize information from various enterprise applications to enhance business analysis using open target environments. These PACKs are the DataStage Extract PACK for SAP R/3, DataStage Load PACK for SAP BW, DataStage PACK for Siebel and the DataStage PACK for PeopleSoft.


Page 11: DataStage Family Whitepaper

IV. The Ascential Approach

DataStage XE's value proposition stems from Ascential's underlying technology and its approach to integrating data and meta data. The product family is a reflection of the company's extensive experience with mission critical applications and leading technology. Supporting such systems requires an intimate understanding of operating environments, performance techniques, and development methodologies. Combining this understanding with leading technology in areas such as memory management, user interface design, code generation, data transfer, and external process calls gives developers an edge in meeting data user requirements quickly and efficiently. The sections that follow detail key elements that make Ascential's approach to data integration unique.

Development Advantages

With a premium on IT resources, DataStage XE offers developers specific advantages in building and integrating data environments. Starting with designing and building applications through integrating data sources, transforming data, running jobs, and debugging, DataStage XE provides short cuts and automation designed to streamline the development process and get integration applications up and running reliably.

APPLICATION DESIGN

The graphical palette of the DataStage Designer is the starting place for data integration job design. It allows developers to easily diagram the movement of data through their environment at a high level. Following a workflow style of thinking, designers select source, target, and process icons and drop them onto a "drafting table" template that appears initially as an empty grid. These icons, called "stages", exist for each classic source/target combination. The Designer connects these representative icons via arrows called "links" that illustrate the flow of data and meta data when execution begins. Users can annotate or write notes onto the design canvas as they create DataStage jobs, enabling comments, labels or other explanations to be added to the job design. Further, DataStage uses the graphical metaphor to build table lookups, keyed relationships, sorting and aggregate processing.

Designing jobs in this fashion leads to faster completion of the overall application and simpler adoption of logic by other members of the development team, thereby providing easier long-term maintenance. Developers drag and drop icons that represent data sources, data targets, and intermediate processes into the diagram. These intermediate processes include sorting, aggregating, and mapping individual columns. The Designer also supports drag-and-drop of objects from the Manager to the Designer palette. Further, developers can add details such as table names and sort keys via this same graphical drag-and-drop interface. DataStage pictorially illustrates lookup relationships by connecting primary and foreign keys, and it makes function calls by simply selecting routine names from a drop-down menu after clicking the right mouse button. Developers can select user-written routines from a centrally located drop-down list. Sites may import pre-existing ActiveX routines or develop their own in the integrated editor/test facility. At run time DataStage does all of the underlying work required to move rows and columns through the job, turning the graphical images into reality.


Figure 3: DataStage Job Design

Page 12: DataStage Family Whitepaper

This same metaphor is extended in "drill down" fashion to the individual field level, where column-to-column mappings are created visually. The developer draws lines that connect individual columns of each source table, row, or answer set definition to their corresponding targets. Developers can specify complex Boolean logic, string manipulation, and arithmetic functions within the Designer by using the extensive list of built-in functions listed in a drop-down menu. Developers construct transformation expressions in a point-and-click fashion. If the developer needs to use external user-written routines, they too can be selected from the same menu. Filters and value-based constraints that support the splitting of rows to alternative or duplicate destinations are also designed and illustrated using this model.

The Designer helps take the complexity out of long-term maintenance of data integration applications. Team members can share job logic, and those who inherit a multi-step transformation job will not be discouraged or delayed by the need to wade through hundreds of lines of complex programming logic.

DEBUGGING

The DataStage Debugger is fully integrated into the Designer. The Debugger is a testing facility that increases productivity by allowing data integration developers to view their transformation logic during execution. On a row-by-row basis, developers set break points and actually watch columns of data as they flow through the job. DataStage immediately detects errors in logic or unexpected legacy data values so they can be corrected. This error detection enhances development productivity, especially when working out date conversions or when debugging a complex transformation, such as determining why rows belong to a specific cost center or why certain dimensions in the star schema are coming up blank or zero. Laboring through extensive report output or cryptic log files is unnecessary, because the DataStage Debugger follows the ease-of-use metaphor of the DataStage Designer by graphically identifying user-selected break points where logic can be verified.

HANDLING MULTIPLE SOURCES AND TARGETS

DataStage makes it possible to compare and combine rows from multiple, heterogeneous sources—without writing low-level code. Multiple sources might be flat files whose values are compared in order to isolate changed data, or heterogeneous tables joined for lookup purposes. The Server engine issues the appropriate low-level SQL statements, compares flat files, and handles connectivity details. Most critical for performance, file comparisons are performed "in stream," and rows of data meeting selected join characteristics are continuously fed into the rest of the transformation process.

The DataStage design supports the writing of multiple target tables. In the same job, tables can concurrently:

• Be of mixed origin (different RDBMS or file types)

• Have multiple network destinations (copies of tables sent to different geographic regions)

• Receive data through a variety of loading strategies such as bulk loaders, flat file I/O, and direct SQL

By design, rows of data are selectively divided or duplicated when the developer initially diagrams the job flows. DataStage can optionally replicate rows along multiple paths of logic, as well as split rows both horizontally according to column value and vertically with different columns sent to each target. Further, DataStage can drive separate job flows from the same source. The Server engine, from one graphical diagram, can split single rows while in memory into multiple paths or invoke multiple processes in parallel to perform the work. There are no intermediate holding areas required to meet this objective. The resulting benefit is fewer passes of the data, less I/O, and a clear, concise illustration of real-world data flow requirements.
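To make the splitting behavior concrete, here is a minimal sketch in Python rather than DataStage's graphical notation; the row layout and target names are invented for illustration and assume nothing about the product's internals.

# Conceptual sketch (not DataStage code): splitting rows in a single pass.
# Rows are divided "horizontally" by a column value and "vertically" by
# sending a different column subset to a second target, with no staging area.
rows = [
    {"region": "EMEA", "product": "A100", "qty": 5, "price": 9.99},
    {"region": "APAC", "product": "B200", "qty": 2, "price": 4.50},
]
emea_target, apac_target, pricing_target = [], [], []
for row in rows:
    # Horizontal split: route the whole row by column value.
    (emea_target if row["region"] == "EMEA" else apac_target).append(row)
    # Vertical split: send only selected columns to another target.
    pricing_target.append({"product": row["product"], "price": row["price"]})
print(len(emea_target), len(apac_target), len(pricing_target))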

NATIVE SUPPORT FOR POPULAR RELATIONAL SYSTEMS

DataStage reduces network overhead by communicating with popular databases using their own dedicated application programming interfaces. Relational databases are a key component of just about every decision-support application. They are often the medium for storing the integrated data. In many scenarios, they are also the source of business information. Interacting with these database systems often requires the use of SQL for retrieving data from source structures or for writing results to a target. Even though bulk loaders typically offer greater performance, a data integration routine may need to use dynamic SQL for record sequencing, updates, or other complex row processing. It is critical that data integration tools support the most efficient method for communicating with these databases and the vast array of vendor configurations. ODBC is the most flexible vehicle for providing this type of interaction, but may not always provide the best performance with the least amount of overhead.

12

Page 13: Data Stage Fa My Whitepaper

DataStage supports ODBC and provides drivers for many of the popular RDBMSs. It also concentrates on delivering direct API access, allowing the data integration developer to bypass ODBC and "talk" natively to the source and target structures using direct calls. Wherever possible, DataStage exploits special performance settings offered by vendor APIs and avoids the management overhead associated with the configuration and support of additional layers of software.

A complete listing of Stages is contained in Appendix A.

INTEGRATING COMPLEX FLAT FILES

DataStage users are often confronted with the need to extract data from flat files with complex internal structures. These legacy file formats, generally designed and constructed using COBOL applications on the mainframe, contain repeating groups (OCCURS) of information, record types, and mixed (character and binary) data types. The Complex Flat File Stage enables users to easily integrate these file types into their data movement applications. The Complex Flat File Stage first interprets the meta data stored in COBOL copybooks or other forms of COBOL source code. This step provides DataStage XE with the column names and field offsets required for correct interpretation of the data. At run time, the Complex Flat File Stage performs automatic data type conversions, checks record type identifiers at the time of retrieval, and normalizes repeating "arrays" of information. The processing of the Complex Flat File Stage allows developers to concentrate on business rules without having to worry about items such as the transformation of mainframe PACKED fields.
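As an illustration of the kind of conversion the stage hides, the sketch below unpacks an IBM packed-decimal (COMP-3) field in Python. It is a hedged example only: the byte values and the two-decimal scale are hypothetical, and the stage itself performs this work without any hand-written code.

# Illustrative sketch only: decoding a mainframe COMP-3 (packed decimal) field.
def unpack_comp3(raw: bytes, scale: int = 0) -> float:
    """Decode IBM packed-decimal bytes: two BCD digits per byte,
    with the final nibble holding the sign (0xD = negative)."""
    digits = []
    for byte in raw:
        digits.append((byte >> 4) & 0x0F)
        digits.append(byte & 0x0F)
    sign_nibble = digits.pop()          # last nibble is the sign
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return value / (10 ** scale)

# 0x12 0x34 0x5C encodes +12345; with two implied decimals -> 123.45
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))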

APPLYING TRANSFORMATIONS

DataStage includes an extensive library of functions and routines for column manipulation and transformation. Developers may apply functions to column mappings directly or use them in combination within more complicated transformations and algorithms. After applying functions and routines to the desired column mappings, the developer clicks on the "compile" icon. First, the compiler analyzes the job and then determines how to best manage data flows. Next, it creates object code for each of the individual transformation mappings and uses the most efficient method possible to manipulate bytes of data as they move to their target destinations.

Within DataStage there are more than 120 granular routines available for performing everything from sub-string operations through character conversions and complex arithmetic. The scripting language for transformations is an extended dialect of BASIC that is well suited for data warehousing, as it contains a large number of functions dedicated to string manipulation and data type conversion. Further, this language has a proven library of calls and has been supporting not only data integration applications but also on-line transaction processing systems for more than 15 years. In addition, DataStage includes more than 200 more complex transformations that cover functional areas such as data type conversions, data manipulation, string functions, utilities, row processing, and measure conversions, including distance, time, and weight.
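The flavor of these column-level expressions can be suggested with a short Python sketch; it is not DataStage BASIC, and the function names, column names and date format are invented for illustration.

# Hedged sketch of column mappings: each target column is an expression
# over source columns, mirroring the idea of built-in transformation routines.
from datetime import datetime

def trim_upper(value: str) -> str:
    return value.strip().upper()

def to_iso_date(value: str) -> str:
    # e.g. convert a legacy "DD/MM/YYYY" string to ISO format
    return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")

source_row = {"cust_name": "  acme corp ", "order_date": "07/03/2001"}
target_row = {
    "CUSTOMER": trim_upper(source_row["cust_name"]),
    "ORDER_DT": to_iso_date(source_row["order_date"]),
}
print(target_row)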

MERGE STAGE

Included in DataStage is the Merge Stage. It allows developers to perform complex matching of flat files without having to first load data into relational tables or use cryptic SQL syntax. The Merge Stage supports the comparison of flat files with the option of selecting up to seven different Boolean set matching combinations. More versatile than most SQL joins, the Merge Stage allows developers to easily specify the intersection of keys between the files, as well as the resulting rows they wish to have processed by the DataStage job. Imagine comparing an aging product listing to the results of annual revenue. The Merge Stage lets data users ask questions such as "Which products sold in what territories and in what quantities?" and "Which products have not sold in any territory?" The Merge Stage performs a comparison of flat files, which is often the only way to flag changed or newly inserted records, to answer questions such as "Which keys are present in today's log that were not present in yesterday's log?" When flat files are the common sources, the Merge Stage provides processing capability without having to incur the extra I/O of first loading the structures into the relational system.
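A minimal sketch of keyed set matching between two flat files follows, written in Python purely to illustrate the idea; the file contents, the "sku" key and the three match combinations shown are assumptions, and the Merge Stage itself offers these choices graphically.

# Conceptual sketch of flat-file set matching in the spirit of the Merge Stage:
# compare yesterday's and today's keyed files and pick a Boolean combination.
import csv, io

yesterday = "sku,qty\nA100,5\nB200,2\n"
today = "sku,qty\nB200,3\nC300,7\n"

def load(text):
    return {row["sku"]: row for row in csv.DictReader(io.StringIO(text))}

left, right = load(yesterday), load(today)
both = left.keys() & right.keys()        # keys present in both files
left_only = left.keys() - right.keys()   # e.g. products that stopped selling
right_only = right.keys() - left.keys()  # newly inserted records
print(sorted(both), sorted(left_only), sorted(right_only))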

GRAPHICAL JOB SEQUENCER

A hierarchy of jobs in DataStage is referred to as a "batch" and can execute other jobs and invoke other batches. DataStage provides an intuitive graphical interface as the means of constructing batch jobs or Job Sequences.

13

Page 14: Data Stage Fa My Whitepaper

Using the graphical interface, "stages" which represent activities (e.g., run a job) and "links" which represent control flow between activities (i.e., conditional sequencing) are used to construct batch jobs or job sequences.

Control of job sequencing is important because it supports the construction of a job hierarchy complete with inter-process dependencies. For instance, in one particular data integration application, "AuditRevenue" may be the first job executed. No other jobs can proceed until (and unless) it completes with a zero return code. When "AuditRevenue" finishes, the Command Language signals DataStage to run two different dimension jobs simultaneously. A fourth job, "ProcessFacts", waits on the two dimension jobs, as it uses the resulting tables in its own transformations. These jobs would still function correctly if run in a linear fashion, but the overall run time of the application would be much longer. The two dimension jobs run concurrently and in their own processes; on a server box configured with enough resources they will run on separate CPUs. A series of such sequences can become quite complicated.
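The dependency pattern described above can be sketched in ordinary Python to show the control flow; the dimension job names are placeholders (the text does not name them), and this is standard-library code, not the DataStage Job Sequencer.

# Sketch of the sequencing rules: AuditRevenue first, two dimension jobs in
# parallel, then ProcessFacts once both dimensions have finished cleanly.
from concurrent.futures import ThreadPoolExecutor

def run_job(name: str) -> int:
    print(f"running {name}")
    return 0                      # return code 0 means success

if run_job("AuditRevenue") != 0:
    raise SystemExit("AuditRevenue failed; halting the sequence")

with ThreadPoolExecutor(max_workers=2) as pool:
    dims = [pool.submit(run_job, j) for j in ("LoadProductDim", "LoadRegionDim")]
    results = [f.result() for f in dims]

if all(rc == 0 for rc in results):
    run_job("ProcessFacts")       # waits on both dimension jobs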

Advanced Maintenance

Once up and running, the integrated data environment presents another set of challenges in terms of maintenance and updates. Industry experts say that upwards of 70 percent of the total cost associated with data integration projects lies in maintaining the integrated environment. Ascential's approach takes maintenance challenges into account right from the start. Providing functions such as object reuse, handling version control and capturing changed data helps developers meet ongoing data user requirements quickly and without burdening IT resources.

OBJECT REUSE AND SHARING

Developers can re-use transformation objects developed in DataStage within the same project and across other DataStage Servers. Further, they can easily share objects through various mechanisms, such as containers, export, and integration with DataStage's central control facility. Using containers, developers can group several stages in a data integration process together to make them a single unit. Containers can be local to a job or shared across jobs. Local containers are private to a job and can only be edited within the job in which they were created. Shared containers are independent DataStage objects that can be included in many jobs, with each job referencing the same source. Shared containers are useful for repetitive data integration tasks such as generating test data or capturing job statistics to an external reporting structure.

VERSION CONTROL

In addition, DataStage offers version control that saves the history of all the data integration development. It preserves application components such as table definitions, transformation rules, and source/target column mappings within a two-part numbering scheme. Developers can review older rules and optionally restore entire releases that can then be moved to remote locations. The DataStage family of products allows developers to augment distributed objects locally as they are needed. The objects are moved in a read-only fashion so as to retain their "corporate" identity. Corporate IT staff can review local changes and "version" them at a central location. Version control tracks source code for developers as well as any ASCII files such as external SQL scripts, shell command files and models. Version control also protects local read-only versions of the transformation rules from inadvertent network and machine failures, so one point of failure won't bring down the entire worldwide decision support infrastructure.


Figure 4: Version Control (versioned DataStage XE objects shared among corporate development, remote testing/QA, production sites, branch offices and other locations)

Page 15: DataStage Family Whitepaper

CAPTURING CHANGED DATA

Handling changed data is another item critical to the performance of data integration applications. Decision support applications require updating on a periodic basis. The load time required for such an update is determined by variables such as:

• Frequency of update

• Availability of the target

• Ability to access sources

• Anticipated volume of changed data

• Technical structure of the OLTP database

Network overhead and lengthy load times are unacceptable in volume-intensive environments. The support for Changed Data Capture minimizes the load times required to refresh the target environments.

Changed Data Capture describes the methodologies and tools required to support the acquisition, modification, and migration of recently entered or updated records. Ascential Software realizes the inherent complexities and pitfalls of capturing changed data and has created a multi-tiered strategy to assist developers in building the most effective solution based on logs, triggers, and native replication. There are two significant problems to solve when dealing with changed data:

• The identification of changed records in the operational system

• Applying the changes appropriately to the data integration infrastructure

To assist in the capture of changed data, Ascential Software provides tools that obtain new rows from the logs and transactional systems of mainframe and popular relational databases such as DB2, IMS, and Oracle. In fact, each version of Changed Data Capture is a unique product that reflects the optimal method of capturing changes given the technical structure of the particular database. The Changed Data Capture stage specifically offers properties for timestamp checking and the resolution of codes that indicate the transaction type, such as insert, change, or delete. At the same time, its design goals seek to select methods that leverage the native services of the database architecture, adhere to the database vendor's documented formats and APIs, and minimize the invasive impact on the OLTP system.
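For a rough feel of the two problems listed above, the Python sketch below identifies inserts, updates and deletes by comparing two keyed snapshots and then applies them to a target. It is only an illustration over invented data; the Changed Data Capture products work from database logs, triggers, or replication rather than full snapshots.

# Hedged illustration: identify changed records, then apply them to a target.
previous = {1: ("Alice", "Boston"), 2: ("Bob", "Denver")}
current  = {2: ("Bob", "Seattle"), 3: ("Carol", "Austin")}

inserts = {k: v for k, v in current.items() if k not in previous}
deletes = {k: v for k, v in previous.items() if k not in current}
updates = {k: v for k, v in current.items()
           if k in previous and previous[k] != v}

def apply_changes(target: dict) -> None:
    """Apply the captured changes to a target table (here, a dict)."""
    target.update(inserts)
    target.update(updates)
    for key in deletes:
        target.pop(key, None)

target_table = dict(previous)
apply_changes(target_table)
print(inserts, updates, deletes, target_table)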

See Appendix A for a list of supported environments for Changed Data Capture.

NATIONAL LANGUAGE SUPPORT

In today's world of global conglomerates, multinational corporations and Internet companies, the need for internationalized software solutions for all aspects of business has become of paramount importance. Even one piece of software in a company's IT matrix that does not support international conventions or character sets can cause enormous difficulties.

Ascential has designed the DataStage product family for internationalization from the ground up. The approach that Ascential has adopted is unified and follows industry standards. The cornerstones of the implementation are as follows:

Windows Client Standards — All DataStage client interfaces fully support the 32-bit Windows code page standards in terms of supporting localized versions of Windows NT and Windows 2000. For example, developers using the DataStage Designer in Copenhagen, Denmark; Moscow, Russia; and Tel Aviv, Israel would all function correctly using their various Danish, Russian and Israeli Windows versions.

DataStage Server — DataStage servers are not dependent upon the underlying operating system for their internationalization support. Instead, each server can be enabled to use a single, unified internal character set (Unicode) for all character mapping, and POSIX-based national conventions (locales) for sort, up-casing, down-casing, currency, time (date), collation and numeric representations.

Localization — All DataStage XE components have the necessary client infrastructure in place to enable full localization of the GUI interface into a local language. This includes Help messages, documentation, menus, install routines, and individual components such as the Transformer stage. DataStage is already available as a fully localized Kanji version.


Page 16: DataStage Family Whitepaper

Expediting Data Transfer

Moving data within the integrated environment and across corporate networks can have a large impact on performance and application efficiency. Ascential's approach takes this challenge into account and, as a result, has built in additional stages specifically designed for data transfers.

NAMED PIPE STAGE

The Named Pipes capability makes it easy for developers to transfer data between two or more jobs, or between a job and an external program. A feature of many operating systems, the named pipes mechanism allows the transparent movement of data through memory between critical processes of a data integration application. Named pipes are powerful but sometimes difficult to enable. With DataStage, developers simply select the named pipe option, and DataStage takes care of any lower level operating system details. Further, DataStage can seamlessly share data via the named pipe support of other tools such as specialized bulk loaders. Large jobs can be split into a number of more manageable jobs, or can be shared among a number of developers. At execution, individual job streams run in concurrent but integrated processes, passing rows between one another. Using named pipes optimizes machine resources and reduces design and development time.
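A minimal POSIX-flavored sketch of what the stage automates appears below: two Python processes exchange rows through a named pipe entirely in memory. The pipe path and row values are invented, and the example assumes a Unix-like system (os.mkfifo is not available on Windows); it is not DataStage code.

# Two processes exchanging rows through a named pipe, with no file on disk.
import os, tempfile
from multiprocessing import Process

def producer(path: str) -> None:
    with open(path, "w") as pipe:           # blocks until a reader attaches
        for row in ("1,widget,9.99", "2,gadget,4.50"):
            pipe.write(row + "\n")

if __name__ == "__main__":
    pipe_path = os.path.join(tempfile.mkdtemp(), "rows.pipe")
    os.mkfifo(pipe_path)                    # create the POSIX named pipe
    Process(target=producer, args=(pipe_path,)).start()
    with open(pipe_path) as pipe:           # consumer reads rows as they arrive
        for line in pipe:
            print("received:", line.strip())
    os.remove(pipe_path)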

FTP STAGE

Preparing data for loading into the target environment can involve the use of file transfer protocol to move flat files between machines in the corporate network. This process is generally supported by shell programs and other user-written scripts, and it requires that there be enough disk space on the target machine to retain a duplicate copy of the file. While this process works, it does not efficiently use available storage. DataStage provides FTP support in a stage that uses file transfer protocol but skips the time-consuming I/O step needed when executing a stand-alone command program. While blocks of data are sent across the network, the Server engine pulls out the pre-mapped rows, moving them directly into the transformation process. In addition, FTP can be used to write flat files to remote locations. The FTP Stage provides great savings in time, as no extra disk I/O is incurred before or after job execution.
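The in-stream idea can be sketched with Python's standard ftplib: rows are handed to the transformation logic as blocks arrive, with no staging file. The host, credentials and file name are placeholders, so the sketch only runs against a server you supply; it is not the FTP Stage itself.

# Hedged sketch of in-stream FTP processing: process rows as blocks arrive.
from ftplib import FTP

def process_row(row: str) -> None:
    print("transforming:", row)             # stand-in for the real mappings

buffer = ""

def on_block(block: bytes) -> None:
    global buffer
    buffer += block.decode("ascii", errors="replace")
    while "\n" in buffer:                    # emit complete rows immediately
        row, buffer = buffer.split("\n", 1)
        process_row(row)

with FTP("ftp.example.com") as ftp:          # placeholder host
    ftp.login("user", "password")            # placeholder credentials
    ftp.retrbinary("RETR daily_extract.csv", on_block)
if buffer:
    process_row(buffer)                      # flush any trailing partial row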

Using External Processes

Since each data integration environment is unique, Ascential's approach incorporates flexibility and extensibility so that developers can accommodate special data requirements. The DataStage Product Family allows developers to capitalize on existing integration code, which shortens the development cycle. Further, it gives developers the ability to build custom plug-in stages for special data handling.

THE PLUG-IN API

Each Stage icon in DataStage XE contains a dialog box that has been designed exclusively for the technology addressed by the Stage. For example, when a designer clicks on the Sort stage, the software requests a list of properties for the developer to define, such as sort keys, order, and performance characteristics. These properties appear as the result of DataStage calling a lower-level program that is written using the Plug-In API. Well documented and fully supported, this API gives subject and technical matter experts the facility for creating their own extensions to the product. Building custom Plug-Ins provides the opportunity for integrators, re-sellers, and large MIS shops to extend DataStage for their own proprietary or heavily invested interfaces. Since Plug-In Stages are not dependent on any one release of the DataStage engine, developers can download them from the web and install them separately. Ascential Software frequently develops new Stages and distributes them to existing customers in this fashion.
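Conceptually, a plug-in stage pairs a set of designer-visible properties with row-processing logic. The Python class below is only a loose sketch of that idea under invented names; the real Plug-In API is a documented native interface with its own calling conventions, not this class.

# Conceptual sketch only: a custom stage declaring its properties and
# processing rows. All names here are hypothetical.
class CustomSortStage:
    # Properties the Designer would prompt the developer to fill in.
    properties = {"sort_keys": ["customer_id"], "order": "ascending"}

    def process(self, rows):
        reverse = self.properties["order"] == "descending"
        keys = self.properties["sort_keys"]
        return sorted(rows, key=lambda r: [r[k] for k in keys], reverse=reverse)

stage = CustomSortStage()
print(stage.process([{"customer_id": 2}, {"customer_id": 1}]))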

INVOKING SPECIALIZED TOOLS

Throughout a job, the DataStage developer can define specific exit points. These exit points are used to make calls to the operating system or to invoke existing code and utilities. Powerful cleansing tools such as FirstLogic and Trillium are often called in this manner. These cleansing tools perform data clean-up operations, such as address parsing and company/product name pattern matching. The developer drags and drops the Stage icons that contain the definition of these exit points onto the Designer canvas. DataStage provides unlimited flexibility, as there is essentially no limit to the number of exit points that a user might include in a particular job. Some of the benefits of calling external programs are the ability to perform clean-up operations, issue an email message, or send a page based on some event within the application.


Page 17: DataStage Family Whitepaper

Performance and Scalability

Every aspect of the DataStage Server engine is optimized to move the maximum number of rows per second through each process. By default, jobs are designed to pick up data just once from one or more sources, manipulate it as necessary, and then write it once to the desired target. Consequently, DataStage applications do not require intermediate files or secondary storage locations to perform aggregation or intermediate sorting. Eliminating these steps avoids excessive I/O, which is one of the most common performance bottlenecks. Data does not have to be loaded back into a relational system to be reprocessed via SQL. As a result, DataStage jobs complete in a shorter period of time.

If multiple targets are critical to the application, DataStage will split rows both vertically and horizontally and then send the data through independent paths of logic—each without requiring a temporary staging area. DataStage makes it a priority to do as much as possible in memory and to accomplish complex transformations in the fewest possible passes of the data. Again, the benefit of this approach is fewer input/output operations. The less disk activity performed by the job, the faster the processes will complete. Target systems can be on-line and accessible by end users in a shorter period of time. The following sections describe the core concepts within DataStage that support this methodology.

EXPLOITING MULTIPLE CPUS

Riding the curve of acceleration technologies, DataStage incorporates improvements made in operating systems and hardware. The data integration engine exploits multiple CPU environments by automatically distributing independent job flows across multiple processes. This feature ensures the best use of available resources and speeds up overall processing time for the application. The Server analyzes jobs at the start of execution by highlighting dependent operations and separately flagging independent flows of data. For a given job, the DataStage Engine will potentially spawn many parallel processes at run time. These processes will run concurrently and have their labor divided among multiple processors on a machine with multiple CPUs. Further, a new feature called Multiple Job Instances exploits multiple-processor machines for organizations that need to integrate massive data volumes. Using Multiple Job Instances, the same DataStage job can run across multiple processors simultaneously. Also, the same set of functions can be run on one or all data streams. As a result, DataStage's use of multiple CPUs offers peak performance for time-sensitive organizations.
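The Multiple Job Instances idea can be approximated with a small Python sketch: the same job logic runs as several identical instances, each over its own partition of the data and, resources permitting, on its own CPU. The partitioning scheme and job body are invented for illustration and are not DataStage code.

# Same job logic, several instances, each on its own data stream.
from multiprocessing import Pool

def job_instance(partition: list[int]) -> int:
    # Stand-in for one instance of "Job A" working on its own rows.
    return sum(value * 2 for value in partition)

if __name__ == "__main__":
    data = list(range(1000))
    partitions = [data[i::4] for i in range(4)]      # four independent streams
    with Pool(processes=4) as pool:
        results = pool.map(job_instance, partitions) # instances run in parallel
    print(sum(results))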

LINKED TRANSFORMATIONS

DataStage performs mapping and transformation operations in memory as rows pass through the process. The Server engine exploits available memory to reduce I/O and provide faster job execution. The Transformation Stage provides the user with the ability to graphically illustrate column-to-column mappings, build arithmetic or string manipulation expressions, and apply built-in or user-written functions. Sometimes it is necessary to incorporate dependent or "linked" transformations into a data integration process. The results of intermediate column calculations are often re-used in subsequent statements. The engine processes linked transformations as regular subroutines fired in iterative order. No rows are written to disk. Instead, they flow continuously through one or many transformation stages as DataStage ensures a seamless flow of data and increased overall throughput. In addition, DataStage allows the developer to graphically link transformations, thereby making the order of operations crystal clear to anyone who needs to support the job in the future.
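A hedged way to picture linked transformations is a chain of Python generators: each link derives new columns from the previous one, and nothing is staged to disk between links. The column names and derivations are invented; the real stages are designed graphically.

# Intermediate results flow from one transformation to the next in memory.
def add_net_price(rows):
    for row in rows:
        row["net"] = row["qty"] * row["unit_price"]
        yield row                              # result feeds the next link

def add_tax(rows, rate=0.08):
    for row in rows:
        row["gross"] = round(row["net"] * (1 + rate), 2)  # reuses "net"
        yield row

source = [{"qty": 3, "unit_price": 10.0}, {"qty": 1, "unit_price": 99.0}]
# Chaining the generators mirrors linking Transformer stages in a job.
for row in add_tax(add_net_price(iter(source))):
    print(row)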

[Figure 5 diagram: a DataStage Server running five instances (A1 through A5) of Job A, with callouts for executing multiple instances of the same job, full support in the DataStage APIs, independent job launch and identities, independent monitoring, independent control, and independent logs.]

Figure 5: DataStage products feature the ability to run multiple copies (or instances) of the same job in parallel. Each instance is uniquely identified from the "parent" job and can be managed in the same way as any other DataStage job.



IN-MEMORY HASH TABLES

Built into every DataStage server is the ability to automatically create and load dynamic hash tables, which dramatically increases the performance of lookups. Software algorithms use lookups for comparing keys and bringing back dimension or fact data such as prices and descriptions, or to validate data such as determining if a key exists within a reserved corporate listing. Developers load hash tables directly from source information and then select the "pre-load to memory" option to gain optimal performance. The hash tables can be loaded directly within the same job as the required lookup or at an earlier time and saved for use within many jobs of the application. Performance gains over disk-intensive operations are significant, especially when there are many different fields requiring comparison. Other tools typically perform lookups by issuing PREPAREd SQL statements back into a relational database, a workable but tedious mechanism.
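A hedged sketch of the lookup pattern described above: a plain Python dict stands in for a pre-loaded in-memory hash table, and the SKU and price fields are invented for the example.

    # Sketch of an in-memory lookup: one key comparison per row, no per-row SQL.
    prices = {"SKU-1": 9.99, "SKU-2": 14.50}          # loaded once, kept in memory

    def enrich(rows, lookup):
        for row in rows:
            row["price"] = lookup.get(row["sku"])      # O(1) hashed key lookup
            yield row

    rows = [{"sku": "SKU-1", "qty": 3}, {"sku": "SKU-9", "qty": 1}]
    for row in enrich(rows, prices):
        print(row)                                     # unmatched key -> None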

SORTING AND AGGREGATION

DataStage further enhances performance and reduces I/O with its built-in sorting and aggregation capabilities. The Sort and Aggregation stages work directly on rows as they pass through the engine while other tools rely on SQL and intermediate tables to perform these critical set-oriented transforms. DataStage allows the Designer to place sort functionality where needed in the job flow. This placement may be before or after calculations, before or after aggregation, and if necessary, along the independent paths of logic writing to multiple targets. For example, the application can perform Aggregation immediately following the generation of new key or grouping values and prior to INSERT into the RDBMS. Developers can set properties in the sort stage to fully use available memory. If the data is already in sorted order, the Aggregation stage will deliver rows to the target as each aggregate group is complete. As long as sufficient resources are available, the Sort and Aggregation stages perform their work in memory. Once again, this approach reduces physical I/O and helps ensure that DataStage jobs execute in the shortest timeframe.
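The following illustrative sketch shows the in-stream idea: once rows are sorted, each aggregate group can be emitted as soon as it is complete, with no intermediate table or SQL reload. The region and sales fields are invented for the example.

    # Sketch of in-memory sort followed by streaming aggregation.
    from itertools import groupby
    from operator import itemgetter

    rows = [
        {"region": "EU", "sales": 120.0},
        {"region": "US", "sales": 200.0},
        {"region": "EU", "sales": 80.0},
    ]
    rows.sort(key=itemgetter("region"))                # in-memory "sort stage"
    for region, group in groupby(rows, key=itemgetter("region")):
        total = sum(r["sales"] for r in group)         # aggregate one group
        print(region, total)                           # delivered as it completes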

BULK LOADERS

Ascential Software is committed to supporting the high-speed loading technologies offered by partners and other industry-leading vendors. Bulk loaders or "fast load" utilities permit high-speed insertion of rows into a relational table by typically turning off logging and other transactional housekeeping performed in more volatile on-line applications. DataStage supports a wide variety of such bulk load utilities either by directly calling a vendor’s bulk load API or generating the control and matching data file for batch input processing. DataStage developers simply connect a Bulk Load Stage icon to their jobs and then fill in the performance settings that are appropriate for their particular environment. DataStage manages the writing and re-writing of control files, especially when column names or column orders are adjusted. In the case of bulk API support, DataStage takes care of data conversion, buffering, and all the details of issuing low-level calls. The developer gets the benefit of bulk processing without having to incur the learning curve associated with tricky coding techniques.

A complete listing of Bulk Loader Stages is contained in Appendix A.

Support for Complex Data Types

Some data integration environments have data sources whose structure or source is complex and requires significant processing in order for it to be analyzed by a business intelligence tool. Without this processing, data consumers would not be able to "compare apples to apples". Ascential offers answers to address specific complex data requirements such as XML, web data and MQSeries messages. The MQSeries Stage allows developers to use message-oriented, "near" real-time transactional middleware as sources and targets. This extends DataStage XE’s functionality without adding complexity to the development or the analysis environments.

XML PACK

XML Pack provides support for business data exchange using XML. With XML Pack, users can read XML-formatted text and write out data into an XML format, then exchange both business-to-business and application-to-application data and meta data.

Specifically, the Pack’s XML Writer and XML Reader enable companies to leverage their legacy investments in exchange infrastructures over the Internet without the need to write specific point-to-point interchange programs and interfaces. Using XML Writer, DataStage can read data from legacy applications, ERP systems, or operational databases, transform or integrate the data with other sources, and then write it out to an XML document. The XML Reader allows the reading or import of XML documents into DataStage. Further, DataStage can integrate data residing in the XML document with other data sources, transform it, and then write it out to XML or any other supported data target.
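A minimal sketch of that round trip, using Python’s standard xml.etree.ElementTree; the element and attribute names are invented for illustration and are not XML Pack’s actual schema.

    # Rows out to an XML document ("Writer" direction) and back in ("Reader").
    import xml.etree.ElementTree as ET

    rows = [{"id": "1", "name": "Acme"}, {"id": "2", "name": "Globex"}]

    root = ET.Element("customers")
    for row in rows:
        ET.SubElement(root, "customer", attrib=row)    # one element per row
    document = ET.tostring(root, encoding="unicode")

    parsed = [dict(elem.attrib) for elem in ET.fromstring(document)]
    print(document)
    print(parsed)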

Users can employ XML not only as a data encapsulation format but also as a standard mechanism for describing meta data. By means of an XML Pack Document Type Definition (DTD) for extracting meta data, DataStage supports the exporting and importing of all meta data related to the data integration process. When exporting the meta data into an XML document, DataStage supplies the DTD, thus enabling other tools to understand the structure and contents of the exported document. In addition, XML Pack enables users to export XML meta data from any report, including catalogs, glossaries, and lineage reports, to an XML document. From there users can display the documents in an XML-capable browser or import them into any other application that supports XML.

CLICKPACK

ClickPack, an intrinsic part of DataStage, extracts and integrates data from web server log files and email servers and transforms it as part of a complex data integration task. It speeds up the development time for extracting important visitor behavior information. Specifically, ClickPack provides:

Web server log file readers — The content of web server log files reveals who has visited a web site, where they travel from click to click, and what they search for or download. ClickPack provides two plug-in stages for the DataStage Designer to extract data from these log files and, using its built-in transforms and routines, process the data for integration elsewhere. These stages optimize performance by allowing preprocessing of the log file to filter out unwanted data. ClickPack provides basic log file reading capability using the LogReader plug-in stage, and advanced log reading using the WebSrvLog stage. In the former, the format is determined by tokenizing patterns in each column definition. Tokenizing patterns use Perl regular expressions to specify the format of each data field read from the web log file. The pre-supplied templates enable reading of generic-format web server log files in the following formats: Common Log Format (CLF), Extended Common Log Format (ECLF or NCSA format) and Microsoft IIS log format (W3C format).
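A hedged example of a tokenizing pattern for the Common Log Format: the regular expression below is a generic CLF pattern written for illustration, not the exact template shipped with ClickPack.

    # Parse one Common Log Format (CLF) line into named columns.
    import re

    CLF = re.compile(
        r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    )

    line = '192.0.2.1 - alice [10/Jan/2002:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
    match = CLF.match(line)
    if match:
        print(match.groupdict())   # columns ready for downstream transformation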

The WebSrvLog stage is a more advanced toolset that includes a plug-in stage and sample jobs. These provide building blocks that enable the definition of jobs that provide more performant web log analysis, particularly where the log formats have been customized. Either stage is inserted into the Designer canvas in the same way as all other stages. The WebSrvLog stage represents the directory in which the log files are held.

POP3-compliant email server reader — ClickPack provides a plug-in stage to read email messages from mail servers that support the POP3 protocol. It can separate the messages into columns representing different parts of the message, for example, the From, To, and Subject fields. New ClickPack transforms support both Mail ID parsing and the extraction of tagged head or body parts from messages.
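As an illustration of the same pattern using Python’s standard poplib and email modules (not ClickPack itself), the sketch below reads each message and splits its headers into columns; the host name and credentials are placeholders.

    # Read POP3 messages and map selected headers to columns.
    import poplib
    from email.parser import BytesParser

    conn = poplib.POP3("mail.example.com")        # placeholder host
    conn.user("reader")                           # placeholder credentials
    conn.pass_("secret")

    rows = []
    count, _ = conn.stat()
    for msg_num in range(1, count + 1):
        _, lines, _ = conn.retr(msg_num)
        message = BytesParser().parsebytes(b"\r\n".join(lines))
        rows.append({"From": message["From"],
                     "To": message["To"],
                     "Subject": message["Subject"]})
    conn.quit()
    print(rows)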

ClickPack Utilities — ClickPack provides various transforms, routines, table definitions and sample jobs, which can be used in processing the output from the web log reader stages and the email reader stage.

LEGACY SOURCES

Legacy data sources should be as transparent to the data integration developer as the computer system that they run on. Ascential enables developers to effortlessly and efficiently pull data and the appropriate meta data from legacy sources. Once collected, the data is easily represented graphically.

IMS

Sophisticated gateway technology provides the DataStage transformation engine with SQL-like access to IMS, VSAM, flat files, as well as other potential legacy sources. This connectivity software consists of programs residing locally on the mainframe, usually in one or more long running address spaces, a listener and a data access component. The listener waits for communication from DataStage running on Unix or NT connected to the mainframe via TCP/IP. The Data Access component receives SQL upon connection from the remote application, and translates that SQL into lower level IMS calls for retrieval and answer set processing. Once an answer set is processed, the resulting rows are sent down to the awaiting client application. These resulting rows are sent down a "link" for further processing within the DataStage engine and ultimately placed into appropriate targets or used for other purposes. Specifically, Ascential’s IMS connectivity features:

19

Page 20: Data Stage Fa My Whitepaper

- Meta data import from COBOL FD’s, IMS DBD’s and PSBs as well as PCB’s within PSBs.

- Full support of IMS security rules. RACF and Top-Secret security classes are supported in addition to userid access to datasets that may be allocated. Security logon is checked by the DataStage job at run time as well as when obtaining meta data. The data integration developer can easily switch PSB’s and access multiple IMS databases within the enterprise if necessary. New PSB’s do not have to be generated.

- Exploitation of all performance features. Full translation of SQL WHERE clauses into efficient SSA’s, the screening conditions in IMS. All indexes are supported. In addition, IMS fields specifically mentioned in the DBD itself and non-sensitive fields or "bytes" in the DBD that are then further defined by a COBOL FD or other source are also supported.

- Efficient support for multiple connections. This could be multiple data integration jobs running on the NT or Unix client, or simply one job that is opening up multiple connections for parallel processing purposes. Ascential’s IMS connectivity uses DBCTL, which is the connection methodology developed by IBM for CICS to talk to IMS. No additional address spaces are required and new thin thread connections to IMS are established dynamically when additional jobs or connections are spawned within DataStage.

- Support for all types of IMS databases, including Fast Path, partitioned and tape based HSAM IMS database types.

- Ability to access and join VSAM, QSAM, and/or DB2 tables in the same SQL request.

- Dedicated API connection—not using ODBC.

ADABAS

An option for DataStage called DataStage ADABAS Connect offers developers native access to ADABAS, a powerful mainframe-based data management system that is still used heavily in OLTP systems. DataStage ADABAS Connect provides two important tools for retrieving data locked in ADABAS:

- NatQuery, which automatically generates finely tuned requests using Natural, the standard ADABAS retrieval language

- NatCDC, which further expands the power of Ascential’s Changed Data Capture toolset

Using NatQuery, developers import meta data from ADABAS DDM’s and construct queries by selecting desired columns and establishing selection criteria. NatQuery understands how to retrieve complex ADABAS constructs such as: Periodic-Group fields (PEs), Multiple-Valued fields (MUs), as well as MUs in PEs. Additionally, NatQuery fully exploits all ADABAS key structures, such as descriptors and super-descriptors. NatQuery is a lightweight solution that generates Natural and all associated runtime JCL.

NatCDC is a highly optimized PLOG processing engine that works by providing developers with "point in time" snapshots using the ADABAS Protection Logs. This assures that there is no burden on the operational systems, and simplifies the identification and extraction of new and altered records. NatCDC recognizes specific ADABAS structures within the PLOG, performs automatic decompression, and reads the variable structure of these transaction-specific recovery logs.

MQSERIES

Many enterprises are now turning to message-oriented, "near" real-time transactional middleware as a way of moving data between applications and other source and target configurations. The IBM MQSeries family of products is currently one of the most popular examples of messaging software because it enables enterprises to integrate their business processes.

The DataStage XE MQSeries Stage treats messages as another source or target in the integrated environment. In other words, this plug-in lets DataStage read from and write to MQSeries message queues. The MQSeries Stage can be used as:



• An intermediary between applications, transforming messages as they are sent between programs

• A conduit for the transmission of legacy data to a message queue

• A message queue reader for transmission to a non-messaging target

Message-based communication is powerful because two interacting programs do not have to be running concurrently as they do when operating in a classic conversational mode. Developers can use DataStage to transform and manipulate message contents. The MQSeries Stage treats messages as rows and columns within the DataStage engine like any other data stream. The MQSeries Stage fully supports using MQSeries as a transaction manager, browsing messages under the "syncpoint control" feature of MQ and thus guaranteeing that messages between queues are delivered to their destination.
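A purely conceptual sketch of the intermediary pattern: a plain Python queue stands in for an MQSeries queue (it is not the MQSeries API), and the message fields are invented. The point is simply that a message body can be treated as columns, transformed in flight, and written to an output queue.

    # Stand-in queues illustrating the read/transform/write intermediary pattern.
    import json
    import queue

    inbound, outbound = queue.Queue(), queue.Queue()
    inbound.put(json.dumps({"order_id": 42, "amount": "19.95"}))

    while not inbound.empty():
        row = json.loads(inbound.get())            # message body as columns
        row["amount"] = float(row["amount"])       # transform in flight
        outbound.put(json.dumps(row))              # deliver to the target queue

    print(outbound.get())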

The MQSeries Stage is the first step in providing real-time data integration and business intelligence. With the MQSeries Stage, developers can apply all of the benefits of using a data integration tool to application integration.

Integrating Enterprise Applications

DataStage Connectivity for Enterprise Applications is a set of packaged application kits (PACKs) for native extraction from a wide range of enterprise applications. The Enterprise Application PACKs collect and consolidate information from various enterprise applications to enhance business forecasting and actions resulting from analysis. DataStage Connectivity for Enterprise Applications includes the DataStage Extract PACK for SAP R/3, DataStage Load PACK for SAP BW, DataStage PACK for Siebel and DataStage PACK for PeopleSoft.

Figure 6 depicts the architecture for an integrated data integration environment with multiple enterprise applications including SAP, PeopleSoft, and Siebel.


[Figure 6 diagram: DataStage Servers reach the enterprise applications through DataStage PACKs (the PeopleSoft PACK, the Siebel PACK with its EIM interface, the Extract PACK for SAP R/3 via ABAP and IDoc, and the Load PACK for SAP BW via BAPI) and integrate the data into open targets such as data warehouses and data marts.]

Figure 6: DataStage Connectivity for Enterprise Applications features Connectivity Packs for seamless integration of industry-leading enterprise applications.


DATASTAGE EXTRACT PACK FOR SAP R/3

The DataStage Extract PACK for SAP R/3 enables developers to extract SAP R/3 data and integrate it with data from other enterprise sources, thereby providing an enterprise-wide view of corporate information. Once extracted, DataStage can manipulate the R/3 data by mapping, aggregating and reformatting it for the enterprise data environment. Further, it generates native ABAP code (Advanced Business Application Programming), thereby eliminating manual coding and speeding up deployment of SAP data integration. To optimize environment performance and resources, developers can opt to have the generated ABAP code uploaded to the R/3 system via a remote function call (RFC). Or the developer can manually move the R/3 script into the system using FTP and the R/3 import function. Developers can choose to control job scheduling by using the DataStage Director or SAP’s native scheduling services. The DataStage Extract PACK for SAP R/3 employs SAP’s RFC library and IDocs, the two primary data interchange mechanisms for SAP R/3 access, to conform to SAP Interfacing Standards.

Further, the Extract PACK integrates R/3 meta data and ensures that it is fully shared and re-used throughout the data environment. Using DataStage’s Meta Data object browser, developers and users can access transparent, pool, view, and cluster tables.

With over 15,000 SAP database tables and their complex relationships, the Meta Data Object Browser provides easy navigation through the Info Hierarchies before joining multiple R/3 tables. Overall, this makes the meta data integration process much simpler as well as error-free.

DATASTAGE LOAD PACK FOR SAP BW

The DataStage Load PACK for SAP BW enables an organization to rapidly integrate non-SAP data into SAP BW. The process of loading non-SAP data is often complicated and cumbersome because, depending on the source, the data may need to be cleansed, summarized, and reformatted or converted into ASCII. Further, the meta data of the source files needs to match the meta data of the SAP BW target structure.

The DataStage Load PACK for SAP BW offers automated data transformation and meta data validation. Browsing and selecting SAP BW source system and information structure meta data is simplified. Data integration is automated as the RFC Server listens and responds to SAP BW requests for data. The Load PACK for SAP BW uses high-speed bulk BAPIs (Business Application Programming Interfaces), SAP’s strategic technology, to link components into the business framework. In fact, SAP has included the DataStage Load PACK for SAP BW in their mySAP product offering. The end result is an accurate, enterprise-wide view of the organization’s data environment.

DATASTAGE PACK FOR SIEBEL

CRM systems are a source of strategic business intelligence, but because of complex data structures and proprietary applications, data integration is difficult if not impossible in some cases. DataStage’s PACK for Siebel enables organizations to directly extract and integrate vast amounts of CRM data. The Siebel PACK simplifies the extraction process because it uses Siebel’s Enterprise Integration Manager (EIM), thereby eliminating the need to access the underlying database. The user-friendly graphical interface lets developers define the data extraction properties as well as perform complex transformations with drag-and-drop operations. With the Siebel PACK, users can customize extractions automatically, meaning they can access specific tables, business components, or groups of tables. They can create and validate the EIM configuration and then launch the EIM application from within DataStage.

DataStage’s PACK for Siebel eliminates tedious meta data entry because it extracts the meta data too. It captures the business logic and associated expressions, resulting in easy maintenance as well as faster deployment. DataStage allows developers to integrate customer data residing in Siebel with other ERP and legacy systems, thus providing a complete business view. True enterprise data integration means harnessing valuable customer information that will in turn lead to improved customer satisfaction, leveraged cross-selling and up-selling opportunities, and increased profitability.

DATASTAGE PACK FOR PEOPLESOFT

For organizations that rely on PeopleSoft Enterprise systems, the DataStage PACK for PeopleSoft offers tools for data integration and meta data management. By building a business intelligence infrastructure that includes PeopleSoft data, organizations can make strategic business decisions based on accurate information that is easy to interpret. The PeopleSoft PACK allows developers to view and select PeopleSoft meta data using familiar PeopleSoft terms and menu structures. The developer selects the meta data and DataStage then extracts, transforms, and loads the data into the target environment.

In addition, developers can use any PeopleSoft business process or module as a data source as well as map data elements to target structures in the warehouse. Once in the warehouse, PeopleSoft data can be integrated with mainframe data, relational data or data from other ERP sources. For ongoing maintenance, this PeopleSoft PACK allows developers to capture and share meta data as well as view and manipulate the meta data definitions. All of these features give developers a clear view of the warehouse design, where and when data changed, and how those changes originated.

The PeopleSoft PACK supports a consistent user interface for a short learning curve and multiple platforms (NT, UNIX) and globalization (multi-byte support and localization) for cost-effective flexibility. The end result is a consistent, reliable infrastructure that includes PeopleSoft data without months of waiting or the extra expense of custom development.

V. Native Mainframe Data Integration

DataStage XE/390 is the product family member that generates native COBOL programs for executing data integration tasks in the mainframe environment, providing the greatest possible efficiency. Data integration developers now have the option to harness the power and the investment made in their mainframe systems. The end result is an efficient use of the network environment and reduced data traffic because only the "transformed" data is sent to the target environment.

DataStage XE/390 provides data developers with a capability unmatched by competitive offerings. No other tool provides the same quick learning curve and common interface to build data integration processes that will run on both the Mainframe and the Server.

Specifically, DataStage XE/390 offers developers the ability to:

• Integrate multi-format files

• Utilize the most appropriate join, lookup and aggregation strategy: hash, nested loop, file matching

• Use dynamic or static calls to any language supported by Cobol

• Support external source and target routines

• Use include statements for customized code

• Supply variables to JCL such that the owner of the libraries can be supplied from Job Properties versus being hard-coded

• Analyze and manage all of the meta data in the integrated environment including mainframe-specific meta data such as DCLgens, JCL templates, library specifications

• Integrate seamlessly with DataStage Unix and NT mappings

Once the job design is completed, DataStage XE/390 will generate three files that are moved onto the Mainframe: a COBOL program and two JCL jobs. One JCL job compiles the program and the other runs the COBOL program.

Typically, an organization using mainframes will follow one of the two deployment philosophies:

• Both their transaction system and integrated data will be on the mainframe or

• Minimum processing is done on their mainframe-based transactional systems and the integrated data is deployed on a Unix or NT server

DataStage XE/390 excels in both instances. In the first instance, all processing is kept on the mainframe while data integration processes are created using the DataStage GUI to implement all the complex transformation logic needed. In the second instance, DataStage XE/390 can be used to select only the rows and columns of interest. These rows can have complex transformation logic applied to them before being sent via FTP to the UNIX or NT server running DataStage. Once the file has landed, the DataStage XE server will continue to process and build or load the target environment as required.

VI. End-to-End Meta Data Management

Today, meta data—the key to information sharing—exists in a patchwork of repositories and proprietary meta data stores. The DataStage Product Family integrates meta data among disparate tools in a way that eliminates many of the common barriers and problems associated with tool integration today. Using innovative and patented technology, the DataStage Product Family provides a way to share meta data among software tools that:

• Minimizes the impact on tools, especially their data structures

• Manages the impact of change within shared environments

• Eliminates meta data loss during sharing

• Maximizes the meta data that is available to each tool

• Retains and uses information about the shared environment itself

Typically, there are two common methods to integrate meta data. The theoretical "quickest and easiest" method is to write a bridge from one tool to another. Using this approach, a developer must understand the schema of the source tool and the schema of the target tool before writing code to extract the meta data from the source tool’s format, transform it, and put it into the target tool’s format. If the exchange needs to be bidirectional then the process must also be done in reverse. Unfortunately this approach breaks down as the number of integrated tools increases. Two bridges support the exchange between two tools. It takes six to integrate three tools, twelve to integrate four tools, and so forth. If one of the four integrated tools changes its schema then the developer must change six integration bridges to maintain the integration. Writing and maintaining tool-to-tool bridges is an expensive and resource-consuming proposition over time for any IS organization or software vendor.
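The counts quoted above follow directly from pairwise, directional bridging: each of n tools needs its own bridge to and from each of the other n - 1 tools, so the number of bridges grows quadratically rather than linearly.

    B(n) = n(n - 1):   B(2) = 2,   B(3) = 6,   B(4) = 12,   B(5) = 20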

Another integration method is the common meta data model, which generally takes one of two forms. The first form, characterized as a "least common denominator" model, contains a set of meta data that a group of vendors agree to use for exchange. However, this exchange is limited to those items described in the common model. The more vendors involved in the agreement, the smaller the agreed set tends to become. Alternatively, a vendor can establish their own meta data model as "the standard" and require all other vendors to support that model in order to achieve integration. This method benefits the host vendor’s product, but typically doesn’t comprise all the meta data that is available to be shared within tool suites. Sometimes a common model will have extension capabilities to overcome the shortcomings of this approach. But the extensions tend to be private or tool-specific. They are either not available to be shared, or shared via a custom interface to access this custom portion of the model, i.e. by writing custom bridges to accomplish the sharing.

Ascential takes a different approach to meta data sharing in the DataStage Product Family. The core components of the technology are the Meta Data Integration Hub, a patented transformation process, and the MetaBroker production process. The Meta Data Integration Hub is the integration schema and is vendor neutral in its approach. It is composed of granular, atomic representations of the meta data concepts that comprise the domain of software development. As such, it represents a union of all of the meta data across tools, not a least common denominator subset. To use an analogy from the science of chemistry, it could be considered a "periodic table of elements" for software development. It is this type of model that allows all the meta data of all tools to be expressed and shared.

Software tools express their meta data not at this kind of granular level, but at a more "molecular" level. They "speak" in terms of tables, columns, entities, and relationships, just as we refer to one of our common physical molecules as salt, not as NaCl. In an ideal integration situation the tools do not need to change their meta data representation in any way nor be cognizant of the underlying integration model. The MetaBroker transformation process creates this result. The MetaBroker for a particular tool represents the meta data just as it is expressed in the tool’s schema. It accomplishes the exchange of meta data between tools by automatically decomposing the meta data concepts of one tool into their atomic elements via the Meta Data Integration Hub and recomposing those elements to represent the meta data concepts from the perspective of the receiving tool. In this way, all meta data and their relationships are captured and retained for use by any of the tools in the integrated environment.

Ascential’s MetaBroker production process is a model-based approach that automatically generates the translation process run-time code—a translation engine (.dll) that converts tool objects into Meta Data Integration Hub objects and vice versa for the purpose of sharing meta data. It uses a point-and-click graphical interface to create granular semantic maps from the meta data model of a tool to the Meta Data Integration Hub. This powerful process allows for easy capture, retention, and updating of the meta data rules that are encoded in the MetaBroker, thus providing huge time savings in creating and updating data sharing solutions. The current set of MetaBrokers facilitates meta data exchange between DataStage and popular data modeling and business intelligence tools. See Appendix A for a listing of specific MetaBrokers provided by Ascential Software. Further, the Custom MetaBroker Development facility allows the Ascential Services organization to deliver customer-specified MetaBrokers quickly and reliably.

The DataStage Director extends ease of use at run-time by providing tools that limit resources and allow validation of jobs to check for subtle syntax and connectivity errors. It immediately notifies developers that passwords have changed rather than discovering such failures the following morning when the batch window has closed. The Director also provides a Monitor window that displays real-time information on job execution. The Monitor shows the current number of rows processed and estimates the number of rows per second. Developers often use the Monitor for determining bottlenecks in the migration process.

Meta Data Integration

What does it mean to have truly integrated meta data? Some believe it is as simple as transforming meta data from one format to another. Others think that translating meta data in one direction is sufficient. But to really derive value from meta data, organizations must pursue true integration in terms of semantic integration, bidirectional translation, and event and design integration.

SEMANTIC META DATA INTEGRATION

In order to accomplish true meta data sharing and to reap the benefits it brings to the information of the Enterprise, the integration solution must do more than cast data from one format into another. While this approach is necessary in many cases, it is often not sufficient. Semantic integration ensures that the meaning as well as the format of the meta data is retained and transformed across the shared environment. The Meta Data Integration Hub is a semantic model that unambiguously retains the meaning of tool meta data as well as its technical characteristics, so that meaning is preserved and format is accommodated as multiple tools produce and reuse meta data in the performance of their work. The Meta Data Integration Hub therefore provides both a unified alphabet and vocabulary for the integration environment, one that the MetaBrokers easily transform and express into a variety of different languages, either specific to a particular tool or aligned to industry standards.

[Figure 7 diagram: Tools A, B, and C, each with its own underlying database, exchange meta data through brokers connected to the Integration Hub.]

Figure 7: Ascential’s end-to-end meta data management enables complete sharing of meta data among applications with different underlying meta data stores.

BIDIRECTIONAL META DATA TRANSLATION

Bidirectional meta data exchange is handled by "de-coder" and "en-coder" portions of the MetaBrokers. A de-coder uses an API or parses a pre-defined syntax of a tool in order to import meta data instances into the Meta Data Hub. En-coders retrieve instances from the Meta Data Hub and present them in a pre-defined syntax that can be interpreted by a recipient tool. Most MetaBrokers provide both import and export of meta data. Sometimes, when there are technical limitations of the participating tool, or when best practices advise against bidirectionality, the MetaBroker may be import-only or export-only. For example, most data modeling tool MetaBrokers are built to be import-only, supporting the best practice that data should be defined once in the modeling tool, then pushed out to the various other tools that will re-use it. Meta data changes should be made at the source, and re-published to the other users.

INTEGRATION OF DESIGN AND EVENT META DATA

Equally important to shared data assets is the ability to monitor data integration activity throughout the company. Remote DataStage servers transparently report their operational results back as meta data. From one single point of control, decision support managers can review individual transformation jobs and the tables they are reading and writing. This information provides quick notice of errors in the integration process and advance notification of potential processing bottlenecks. As a result, organizations avoid resource conflicts, maximize their computing resources, and benefit from immediate visibility to problem situations; they waste fewer hours trying to contact field personnel to research problems or facilitate status reporting.

Meta Data Analysis

Once meta data has been integrated, the next step is to analyze it. Through analysis, organizations can determine the effect of changes made in an integrated environment as well as any events that occur against data during the integration process. Further, the analysis can result in reports or an input to other queries. Analyzing the "data about the data" provides organizations with another dimension to their integrated data environment.

CROSS TOOL IMPACT ANALYSIS

By using a "connected to" relationship, which can be automatically or manually set between objects in the Meta Data Integration Hub, developers can traverse all the relationships associated with an object represented in any of the MetaBroker models. This feature makes it possible to assess the impact of changes across the integrated environment even before a change occurs. For example, let us look at the effect of a change to an existing table, one that is used as the schema in multiple places across the integrated environment. The designer will ALTER the Sales-Person TABLE to add a new column, Territory_Key, and this column will be defined as NOT NULL. Because it will be defined as NOT NULL, the SQL statements in all the DataStage jobs that create this table will fail if they don’t incorporate the change. The cross-tool nature of the impact analysis further helps developers and administrators to rapidly assess where this change appears in the company data models, databases, and in all the BI tool repositories that will be used to produce company analytic reports. The nature of the change can be sized and planned accordingly into managed maintenance activities instead of experiencing potentially company-wide information failures.

The Object Connector facility allows the developer or administrator to specify which items in the Meta Data Integration Hub are equivalent to other objects for the purpose of impact analysis, by "connecting" them. The GUI enables users to set object connections globally or individually on an object-by-object basis. Cross-tool impact analysis shows its power again by exposing the location and definitions of standard, corporate, calculated information such as profit or cost-of-goods-sold. Starting from the transformation routine definition, Designers can quickly identify which columns within a stage use the routine, thereby allowing developers to implement any changes to the calculation quickly. End-users have the ability to trace the transformation of a data item on their reports back through the building process that created it. They can now have a complete understanding of how their information is defined and derived.
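Conceptually, impact analysis of this kind is a traversal of "connected to" relationships. The sketch below is illustrative only, not the Meta Data Integration Hub API: a simple adjacency map holds hypothetical objects and links, and a breadth-first walk lists everything reachable from the object about to change.

    # Illustrative impact analysis over a hypothetical "connected to" graph.
    from collections import deque

    connected_to = {
        "SalesPerson.Territory_Key": ["Job:LoadSales", "Model:SalesPerson"],
        "Job:LoadSales": ["Report:QuarterlySales"],
        "Model:SalesPerson": [],
        "Report:QuarterlySales": [],
    }

    def impact_of(start, graph):
        """Return every object reachable from the changed object."""
        seen, todo = set(), deque([start])
        while todo:
            obj = todo.popleft()
            for neighbour in graph.get(obj, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    todo.append(neighbour)
        return sorted(seen)

    print(impact_of("SalesPerson.Territory_Key", connected_to))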



DATA LINEAGE

Data lineage is the reporting of events that occur against data during the data integration process. When a DataStage job runs, it reads and writes data from source systems to target data structures and sends data about these events to the Meta Data Integration Hub automatically. Event meta data connects with its job design meta data. Data lineage reporting captures meta data about the production processes: time and duration of production runs, sources, targets, row counts, read and write activities. It answers questions for the Operations staff: What tables are being read from or written to, and from what databases on whose servers? What directory was used to record the reject file from last night’s run? Or, exactly which version of the FTP source was used for the third load? Data lineage reporting allows managers to have an overall view of the production environment from which they can optimize their resources and correct processing problems.

BUILT-IN CUSTOMIZABLE QUERY AND REPORTING

Advanced meta data reporting and analysis is a logical by-product of this architecture. The MetaBroker production process retains a tremendous amount of knowledge about the shared environment. The data about how tools are mapped to the Meta Data Integration Hub (and therefore indirectly to each other) is a rich source of meta data that allows Ascential to provide advanced meta data management and reporting. The meta data management services in the DataStage family of products provide a powerful, customizable query and reporting facility for navigating and filtering the objects in the Meta Data Integration Hub via the MetaBrokers. The resulting answer set can be used as input to other queries, output to any form of online documentation or embedded in impact analysis templates.

Meta Data Sharing, Reuse, and Delivery

In alignment with Ascential’s goal of maximizing efficiency and productivity, the DataStage Product Family supports enterprise-wide sharing and reuse of meta data. By including features such as a publish and subscribe model, automatic change notification, on-line documentation and web portal publishing, DataStage makes efficient meta data usage quick and easy for data users.

PUBLISH AND SUBSCRIBE MODEL FOR META DATA REUSE

This feature allows data managers and developers to make centrally controlled pieces of the transformation application available, with change notification, to interested users. Once published, sets of standard items, such as table definitions, routines, and customized functions are available for subscription. Subscribers push the published components via MetaBroker export into their own native tool environment. Tool users exploit the best platform for their task and still get the benefit of shared logic and centralized control where standards are concerned. The DataStage Product Family uses this publish and subscribe process to move standard, published definitions among the various tools in the environment.


Figure 8: Cross Tool Impact Analysis diagram that illustrates where the Table Definition "Customer Detail" is used in ERwin, DataStage, and Business Objects


Data modeling, data integration, and business intelligence tools can share meta data seamlessly. This flexible layer of management encourages meta data re-use while preventing accidental overwriting or erasure of critical business logic.

AUTOMATIC CHANGE NOTIFICATION

The meta data management services of the DataStage Product Family capture meta data about the Publish and Subscribe process in order to provide change notification to subscribers. If something changes in a set of standard items that has been published, the publication is automatically updated upon a meta data re-import, and any subscribers are notified of the change by e-mail. A user profile describes each meta data subscriber, associating an e-mail address with either an individual user or a distribution list for a class of users. This mechanism ensures that changes of standard meta data can be reliably propagated throughout the integrated environment to maintain the integrity of the information assets.

ON-LINE DOCUMENTATION

Developers can create on-line documentation for any set of objects in the Meta Data Integration Hub. DataStage can output documentation to a variety of useful formats: .htm and .xml for web-based viewing, .rtf for printed documentation, .txt files, and .csv format that can be imported into spreadsheets or other data tools. Users can select sets of business definitions or technical definitions for documentation. Path diagrams for Impact Analysis and Data Lineage can be automatically output to on-line documentation as well as printed as graphics. The document set can result from a meta data query, an import, or a manual selection process. It is vital to the management of the information asset that meta data is made available to anyone who needs it, tailored to his or her needs.

AUTOMATIC META DATA PUBLISHING TO THE DATASTAGE XE PORTAL EDITION

DataStage XE can automatically publish meta data to a web portal. Users can set up this feature either upon import to the meta data facility, creation of on-line documentation or by meta data script execution. Developers can use the portal to publish a Corporate Glossary of Terms for the Corporation or the Impact Analysis diagram to other development teams. The portal works as the data environment’s control and command center, running production data integration jobs from DataStage and providing the results of the runs to the web.

VII. Embedded Data Quality Assurance

Undoubtedly, the success of the data integration environment relies directly on the quality of the data it contains. Unfortunately, data quality is often confused with data cleansing. Data cleansing consists of data clean-up operations such as address parsing and company/product name pattern matching. The DataStage Product Family offers a structured, repeatable methodology for data quality analysis. Specifically, the data quality assurance in DataStage XE provides a framework for developing a self-contained and reusable Project that consists of business rules, analysis results, measurements, history and reports about a particular source or target environment. This Project becomes the dashboard for developers and administrators to improve data quality and measure those improvements over time.

First, through built-in automation features the developers can quickly discover the values within the data and can answer the question: Is the data complete and valid? From meta data derived in this initial discovery process, the developers can then move to build simple or sophisticated tests known as Filters against specific combinations of data fields and sources to ensure that data complies with business rules and database constraints. They can run a series of Filters as a job stream or, using the Macro feature, can also schedule the job stream within or outside the quality component. The incidents and history of defective data are stored, thereby allowing developers to export this information to other tools, send it using email or post it on intranet sites. The automated meta data facility reviews and updates meta data information for further analysis and for generating DataStage transformations. Throughout, the data quality assurance component of DataStage XE offers Metric capabilities to measure, summarize, represent and display the business cost of data defects. The developer can produce output in a variety of formats, including HTML and XML reports, analysis charts for showing the relative impact and cost of data quality problems, and trend charts that illustrate the trend in data quality improvement over time—important for on-going management of the data environment. The data quality assurance capabilities in the DataStage Product Family outlined below ensure that only the highest quality data influences your organization’s decision-making process.
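A minimal sketch of the Filter idea described above, not the DataStage XE quality component itself: named business rules are applied to every row, failing rows are recorded as incidents, and a simple defect-rate metric is derived from them. The rule and field names are invented for illustration.

    # Hypothetical business-rule filters applied row by row.
    rows = [
        {"customer_id": "C1", "country": "US", "credit_limit": 5000},
        {"customer_id": "",   "country": "US", "credit_limit": 12000},
        {"customer_id": "C3", "country": "ZZ", "credit_limit": 800},
    ]

    filters = {
        "customer_id_present":   lambda r: bool(r["customer_id"]),
        "country_code_valid":    lambda r: r["country"] in {"US", "DE", "JP"},
        "credit_limit_in_range": lambda r: 0 <= r["credit_limit"] <= 10000,
    }

    incidents = [(name, row) for row in rows
                 for name, rule in filters.items() if not rule(row)]
    for name, row in incidents:
        print(f"rule '{name}' failed for {row}")
    print(f"defect rate: {len(incidents) / (len(rows) * len(filters)):.0%}")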

• Automates Data Quality Evaluation — Performs automated domain analysis that determines exactly what is present in the actual data values within a data environment. The evaluation process further classifies data into its semantic or business use—a key element in determining what types of data validation and integrity checking apply. Using this information, DataStage can determine the degree of completeness and validity within these data values.

• Evaluates Structural Integrity — Looking within and across tables and databases, the data quality services of the DataStage Product Family uncover the degree of conformance of the data environment to the implicitly or explicitly defined data structure.

• Ensures Rule Conformance — Enables the testing of the data within and across data environments for adherence to the organization’s rules and standards. DataStage looks at the interrelationships of data elements to determine if they correctly satisfy the true data relationships that the business requires.

• Measures and Presents the Impact of Data Quality Defects — Evaluates the condition of the data and the processes responsible for defective data in an effort to improve either or both. A key feature of the data quality component of the DataStage Product Family is the metric facility that allows the quantification of relative impacts and costs of data quality defects. This information provides a means to prioritize and determine the most effective improvement program.

• Assesses Data Quality Meta Data Integration — Checks the data quality of meta data integration in order to create, incorporate, and retain extensive data quality-related meta data. Data integration processes access and use this meta data, which provides a detailed and complete understanding of the strengths and weaknesses of source data.

• On-going Monitoring — Monitors the data quality improvement process to ensure quality improvement goals are met.

Depending on the data environment, ensuring data quality can range from the very simple to the very complex. For example, companies dealing with global data need to be localization-ready. The DataStage Product Family enables data quality analysis of double-byte data. Others require production mode processing. Further, it allows users to run a server version of the data quality SQL-based analysis in a monitoring environment. In addition, DataStage allows developers to plug in complementary partner products such as Trillium and FirstLogic for name and address scrubbing.

VIII. Using the Web to Access the Data Integration Environment

One of the biggest challenges facing data integration developers is the delivery of data and meta data to the business users. The DataStage product family member that addresses this challenge is the DataStage XE Portal Edition. The DataStage XE Portal Edition offers business users and administrators a customizable web-based workspace and single point of access to the integrated data environment. The Portal Edition offers a personalized "Start Page" that can be linked to favorite folders, internal and external web sites, applications and tools. Users can leverage the web infrastructure just by creating their own virtual folders, subscriptions, and notifications tailored to their specific information interests and needs.

The DataStage XE Portal Edition also enhances user productivity. First, keeping track of important information that has changed because of content modifications, additions or deletions is time-consuming and tedious. The Portal Edition automatically notifies users of important updates via email, a notification web page, or customized alerts. Further, it enhances productivity by allowing managers and authorized users to share information with other co-workers or external business associates such as customers, vendors, partners or suppliers. The Portal Edition integrates with the Windows desktop and best-of-breed business intelligence tools, thereby enabling permission-based direct content publishing from desktop applications such as Microsoft Word, Excel, and PowerPoint. It also automatically links the data environment’s meta data to BI reports. This flexibility enables business users to take advantage of their Internet browser to view Business Intelligence and meta data reports.

A common obstacle to user productivity is searching through and categorizing the large amounts of data in the data environment. The DataStage XE Portal Edition features an XML-based Business Information Directory (BID) that provides search and categorization capabilities across all content sources. The BID uses corporate vocabularies to correctly model the enterprise’s industry-based information, processes, and business flows. Because of the BID’s semantic modeling, non-technical business users can quickly locate specific enterprise data. It also provides users with a way to re-categorize data because of a significant business event such as a merger or acquisition.

In addition to the maintenance and development features of the DataStage XE Portal Edition for managing the complexities of a distributed heterogeneous environment, the Portal Edition offers administrators convenience and flexibility. As part of the organization’s intranet infrastructure, the Portal Edition allows administrators to customize the user interface to support their corporate branding standards. From within the Portal Edition, administrators can handle sensitive tasks including initial setup, BID configuration, user and group administration, LDAP Directory synchronization as well as everyday maintenance such as create, modify, backup and restore. Further, administrators have the option to delegate certain responsibilities to specific users, such as the ability to publish content in folders, set permissions on folders, or set subscriptions to content on behalf of other users. They can also assign permissions to individual users or groups of users. Furthermore, for organizations that have implemented network and Internet security such as secure socket layer (SSL) and digital certificates (X.509), the Portal Edition leverages this technology to secure end-user and administrative access to the data environment.

The DataStage XE Portal Edition integrates easily into any organization’s computing environment without requiring massive retooling. By using industry-standard eXtensible Markup Language (XML), JavaScript, and Java, the Portal Edition delivers an information integration framework and cross-platform portability. Further, while it provides out-of-the-box content integration, the Portal Edition allows developers to add content types, repositories, and resources seamlessly through its extensible SDKs. The software’s underlying design supports GUI customization and true device independence, meaning that users can access the data environment via any browser or wireless device such as a PDA or mobile phone. The Portal Edition preserves the organization’s investment in the existing data infrastructure while simultaneously providing the flexibility to incorporate new technologies.

Refer to Appendix A for a complete list of supported platforms.

IX. Conclusion

Rushing headlong to build a useful data environment has caused many developers and administrators to deal with unanticipated problems. However, a successful integration project will attract more users who will add and change data requirements. More importantly, the business climate will change and systems and architectures will evolve, leaving developers to figure out how they are going to "make it all work" with the resources they have.

[Figure 9 diagram: the DataStage XE Portal Edition server, built around the XML-based Business Information Directory with LDAP/NT authentication and XSLT/Vizlet translation, gathers content from desktop documents (.DOC, .PDF, .PPT, .XLS), BI tools (Cognos, Crystal, Brio, Business Objects), web sites and feeds (Reuters, ZDNet, weather, email, news), OLTP systems and data warehouses (Oracle, Sybase, DB2), and the DataStage data integration, meta data management, and data quality services, and delivers it to browsers, PDAs, cell phones, and future devices.]

Figure 9: DataStage XE Portal Edition

Data integration developers who consider key factors such as performance, network traffic, structured and unstructured data, data accuracy, integrating meta data, and overall data integration maintenance will stay one step ahead. But is there a toolset that can help them deal with these factors without creating long learning curves or diminishing productivity?

The DataStage Product Family can.

With the DataStage Product Family, administrators and developers will have the flexibility to:

• Provide data quality and reliability for accurate business analysis

• Collect, integrate, and transform complex and simple data from diverse and sometimes obscure sources for building and maintaining the data integration infrastructure without overloading available resources

• Integrate meta data across the data environment crucial for maintaining consistent analytic interpretations

• Capitalize on mainframe power for extracting and loading legacy data as well as maximize network performance

Providing developers with power features is only part of the equation. Ascential Software has long supported developers in their efforts to meet tight deadlines, fulfill user requests, and maximize technical resources. DataStage offers a development environment that:

• Works intuitively thereby reducing the learning curve and maximizing development resources

• Reduces the development cycle by providing hundreds of pre-built transformations as well as encouraging code re-use via APIs

• Helps developers verify their code with a built-in debugger thereby increasing application reliability as well as reducing the amount of time developers spend fixing errors and bugs

• Allows developers to quickly "globalize" their integration applications by supporting single-byte character sets and multi-byte encoding

• Offers developers the flexibility to develop data integration applications on one platform and then package and deploy them anywhere across the environment

• Enhances developer productivity by adhering to industry standards and using certified application interfaces

Another part of the equation involves the users of integrated data. Ascential Software is committed to maximizing the productivity of the business users and decision-makers. An integrated data environment built with DataStage offers data users:

• Quality management which gives them confidence that their integrated data is reliable and consistent

• Meta data management that eliminates the time-consuming, tedious re-keying of meta data as well as provides accurate interpretation of integrated data

• A complete integrated data environment that provides a global snapshot of the company’s operations

• The flexibility to leverage the company’s Internet architecture for a single point of access as well as the freedom to use the business intelligence tool of choice

Ascential’s productivity commitment stands on the premise that data integration must have integrity. Ascential Software has constructed the DataStage family to address data integration issues and make data integration transparent to the user. In fact, it is the only product today designed to act as an "overseer" for all data integration and business intelligence processes.

In the end, building a data storage container for business analysis is not enough in today’s fast-changing business climate. Organizations that recognize the true value in integrating all of their business data, along with the attendant complexities of that process, have a better chance of withstanding and perhaps anticipating significant business shifts. Individual data silos will not help an organization pinpoint potential operation savings or missed cross-selling opportunities. Choosing the right toolset for your data environment can make the difference between being first to market and capturing large market share or just struggling to survive.



X. Appendix


Client: Windows NT Version 4.0, Windows 2000

Server: Windows NT Version 4.0, Windows 2000, Sun Solaris, IBM AIX, Compaq Tru64, HP-UX, IBM System/390

Tools: Business Objects 5.1, 5.0; Embarcadero ER/Studio 4.2; Computer Associates ERwin 3.5.2; Cognos Impromptu 6.0, 5.0; ODBC 3.0; Oracle Designer 6i, 2.1.2; Sybase PowerDesigner 8.0, 7.0

Enterprise Applications: PeopleSoft 7.5, 8.0; SAP R/3 3.1G through 4.6C; SAP BW 2.0A through 2.1C; Siebel 2000

Web Server: Apache, Microsoft IIS

CDC: MVS and OS/390 (DB2, DB2 UDB, IMS); Oracle (IBM AIX, HP-UX, Sun Solaris, Windows NT/2000)

Stages

Direct: Oracle OCI, Sybase Open Client, Informix CLI, OLE/DB, ODBC, Teradata, IBM DB2, Universe, Sequential

Process: Transformation, Aggregation, Sort, Merge, Pivot

Bulk Loaders: Oracle OCI, Oracle Express, Informix XPS, Sybase Adaptive Server and Adaptive Server IQ, Microsoft SQL Server 7, BCPLoad, Informix Redbrick, Teradata, IBM UDB

Special Purpose Stages: Web Log Reader, Web Server Log Reader, Container, POP3 Email Reader, Folder, XML Reader, XML Writer, IBM MQSeries, Complex Flat File, Hashed File, Perl Language Plug-in, Named Pipe, FTP

DataStage XE/390 Specific Stages: Delimited Flat File, Fixed-Width Flat File, DB2 Load Ready, Relational, External Source Routine, Lookup, Join, Aggregation, Transformation, FTP, Multi-Format File, External Target Routine


About Ascential Software

Ascential Software Corporation is the leading provider of Information Asset Management solutions to the Global 2000. Customers use Ascential products to turn vast amounts of disparate, unrefined data into reusable information assets that drive business success. Ascential’s unique framework for Information Asset Management enables customers to easily collect, validate, organize, administer and deliver information assets to realize more value from their enterprise data, reduce costs and increase profitability. Headquartered in Westboro, Mass., Ascential has offices worldwide and supports more than 1,800 customers in such industries as telecommunications, insurance, financial services, healthcare, media/entertainment and retail. For more information on Ascential Software, visit http://www.ascentialsoftware.com.

50 Washington Street
Westboro, MA 01581
Tel. 508.366.3888
www.ascentialsoftware.com

Ascential Software Regional Sales Offices

North America: 800 486 9636 / 508 366 3888
Japan: 81 3 5562 4321
Asia: 852 2824 3311
Northern Europe: 44 20 8818 0700
South Africa: 27 11 807 0313
Southern Europe: 33 (0) 1 4696 37 37
Australia/New Zealand: 612 9900 5688
Central & Eastern Europe: 49 89 20707 0
Latin America: 55 11 5188 1000

WP-3002-0102

© 2002 Ascential Software Corporation. All rights reserved. Ascential™ is a trademark of Ascential Software Corporation and may be registered in other jurisdictions. The following are trademarks of Ascential Software Corporation, one or more of which may be registered in the U.S. or other jurisdictions: Axielle™, DataStage®, MetaBrokers™, MetaHub™, XML Pack™, Extract Pack™ and Load PACK™. Other trademarks are the property of the owners of those marks.

Printed in USA 1/02. All information is as of January 2002 and is subject to change.