What’s New in IBM InfoSphere DataStage 8.7

Tony Curcio,

InfoSphere Product Manager

Please Note:

IBM’s statements regarding its plans, directions, and intent are subject to

change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our

general product direction and it should not be relied on in making a

purchasing decision.

The information mentioned regarding potential future products is not a

commitment, promise, or legal obligation to deliver any material, code or

functionality. Information about potential future products may not be

incorporated into any contract. The development, release, and timing of any

future features or functionality described for our products remains at our

sole discretion.

Performance is based on measurements and projections using standard

IBM benchmarks in a controlled environment. The actual throughput or

performance that any user will experience will vary depending upon many

factors, including considerations such as the amount of multiprogramming

in the user's job stream, the I/O configuration, the storage configuration,

and the workload processed. Therefore, no assurance can be given that an

individual user will achieve results similar to those stated here.

Acknowledgements and Disclaimers:

© Copyright IBM Corporation 2011. All rights reserved.

– U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted

by GSA ADP Schedule Contract with IBM Corp.

– IBM, the IBM logo, ibm.com, DataStage and QualityStage are trademarks or registered

trademarks of International Business Machines Corporation in the United States, other

countries, or both. If these and other IBM trademarked terms are marked on their first

occurrence in this information with a trademark symbol (® or ™), these symbols indicate

U.S. registered or common law trademarks owned by IBM at the time this information was

published. Such trademarks may also be registered or common law trademarks in other

countries. A current list of IBM trademarks is available on the Web at “Copyright and

trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all

countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are

provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice

to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is

provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of,

or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the

effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the

applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may

have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these

materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific

sales, revenue growth or other results.

Agenda

Product Review

InfoSphere DataStage 8.7

One Data Integration Platform

Data

Integration

Dozens of prebuilt

transformation

objects and 100s

of functions

Data

Quality

Data validation

& cleansing

applicable for

multiple information

domains

Connectivity

Databases,

Messages,

CDC,

Mainframe

and more

Metadata support Direct access to active metadata for

search, comparison, or impact

analysis

Distributed Transactions Scalable, heterogeneous

information fabric with guaranteed

data delivery (2PC)

Data Masking

Protect

sensitive

information

Business

Rules Drive critical

enterprise logic

from SMEs in

your business

Std Industry

Formats SWIFT, EDI, HL7,

etc... along with

native and scalable

XML and Complex

File support

Parallel Engine

Most scalable data

integration engine

supporting SMP, MPP

and Grid with simple

configuration file

control

Balanced Optimization Maximize DBMS infrastructure by

moving processing to it

Usage modes Traditional scheduled batch or SOA

real-time using web services, RSS,

REST, JMS...

Enterprise

Packs Specific

connectivity for

leading ERP

solutions

One Metadata Store

Maximizes business & IT collaboration

and accelerates data governance efforts

One Set of Design Artifacts

Logic represented by one set of design

objects regardless of deployment styles

One Administration Center

Integration of install, security, auditing,

connectivity, logging reduces TCO

One Design Environment Single design paradigm

advances time to value

As part of InfoSphere Information Server, directly benefits from other aspects of the suite – data profiling, mapping specifications, etc…

InfoSphere Information Server

Q4’10

• Information Server 8.5 GA - all platforms released in one qtr

• Launched Concierge program to assist in customer upgrades

• New Migration and Validation Tooling has reduced time to

upgrade significantly

• Over 1000 customers have moved or are moving to 8.5

• Broad adoption of new features (SCCS, XML, Blueprint Director,

Looping Transformer)

Q1’11 • 8.1 FP2 and 8.5 FP1 Shipped - all suite components on the same

FP schedule going forward to simplify maintenance

• NEW CDC Stage for guaranteed delivery through complex heterogeneous jobs

• NEW Data Masking Pack – reuses Optim SDK to provide data obfuscation for data integration scenarios

Q2’11 • New Workgroup Edition and Data Warehousing offerings

• Released new appliance offering for the IBM Smart Analytic System

Q3’11 • Beta program for our 8.7 release of Information Server

• Great customer and partner participation… THANK YOU !!!!

Q4’11 • Information Server 8.7 GA – all platforms released on same day

• New game-changing capabilities for data integration including Operations Console and Parallel Debugger

• Continue to offer Concierge program to assist customers in planning for their upgrade

InfoSphere Information Server Investment Themes

Engineering theme Description

Performance Linear scalable engine providing the fastest and optimized processing of data, high speed data source communication, optimal performance of features in the product

User Productivity Operational and development productivity for the end users bar none

Governance Provide cutting edge governance features for the largest base of governance customers to ensure compliance and control in their enterprise

End-to-end Integration Best in class integration with strategic sources of data and metadata in the Information Management landscape

Enterprise-ready platform

Robust and complete enterprise platform for all Data Integration projects

Breakthrough Innovation

Innovative ideas to automate, simplify and integrate the data integration process to simplify the hardest problems the customers face

Agenda

Product Review

InfoSphere DataStage 8.7

Announcing IBM InfoSphere Information Server 8.7

Smarter, faster, easier information

integration

Understanding, cleansing, transforming and delivering

trusted information

Four areas of innovation focus:

• Comprehensive information governance capability - Empowering users with a more complete and controlled view of their information

• Enhanced productivity for users and organizations - Reducing integration time and cost

• New and tighter integration - Linking closely with other IBM solutions and solutions from other vendors

• Big data support - Addressing the most challenging data volumes across the enterprise

Kudos from InfoSphere Information Server 8.7 beta participants

Operations Console “The Operations Console is a singularly useful adjunct to the operations and administrative staff who have a responsibility for meeting performance KPIs; it highlights hotspots in overall execution.”

Data Rules Stage “Data Rules stage allows us to incorporate measurement of compliance with data rules within the ETL stream, rather than isolating them in the profiling tool Information Analyzer. This will make it easier to implement such initiatives as our data quality dashboard.”

Performance “The high performance parallel engine has always been the leader in processing large amounts of data and the new features will just re-enforce that.”

Metadata Advancements “The InfoSphere Metadata Asset Manager tool has provided the most productivity-enhancing capabilities in IS 8.7… This will no doubt decrease the time it takes to re-align metadata so that it reflects the current production state.”

Netezza Integration “Great product, provide more options and flexibilities to Netezza and Datastage users. Some unique options implemented for Netezza. And powerful Balanced Optimization options. Very stable release for the Beta software.”

Parallel Debugger “An interactive debugger is an almost essential troubleshooting tool once one has reached the stage where a job runs to completion but does not produce the expected results.”

Business Glossary “The new look and feel of the Business Glossary in particular is much easier to use in 8.7. Labels are a much welcomed addition to help sort and organize related data.”

Performance Enhancements & Back compatibility

Performance

• Design Time Performance

–Significant Performance improvement in Job Open, Save, Compile

etc.

–Expect 8.7 to be faster than 8.5 and 8.1 due to improvements in

Xmeta

• PX Engine performance improvements

– Improved partition/sort insertion algorithm

–Dataset sort information propagated between jobs

–Reduces the number of sorts in many jobs

–XML parsing performance is improved by 3x or more for large XML

files

• Significant performance improvements in the IA Data rules execution

• Major emphasis on back compatibility both in Development and QA

–Almost all behavior changes can be toggled by a flag

–A single technical document lists all compatibility issues for

Information Server

Job Log Now Available in Designer

• Access job log information through View menu

• Window is dockable on any side of the screen or can float

• Linked to the job currently being viewed

• Includes Run/Stop/Reset buttons

• Released in 8.5 FP1 and now in 8.7

User Productivity

Interactive Parallel Job Debugging Features

• New Interactive debugger for the parallel job canvas

• Provides debugging support for running parallel jobs across SMP, MPP, Cluster and GRID deployments

• Support for multiple breakpoints with conditional logic per link and node

• Visualizing data by node and filtering complex schemas

• While paused at a breakpoint

– breakpoints can be added/removed

– row data for the breakpoint link can be examined by node

– job parameter values can be examined.

– stage/job property editors can be opened to examine properties of the job design.

– the running job can be continued or aborted.

User Productivity


• Right click on any link and toggle the breakpoint on/off or choose to edit the breakpoint

• From breakpoint window, establish whether to stop at a particular row number, or based on a particular expression.

• View and edit all breakpoints in the job from this one window.
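The two breakpoint conditions described above (stop at a particular row number, or when an expression holds) can be pictured with a small conceptual sketch. This is plain Python, not DataStage code: `Breakpoint`, `should_pause`, `first_pause` and the sample rows are hypothetical names invented for illustration, and the real debugger's expression syntax is not shown in these slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict  # one row travelling on a link, keyed by column name (illustrative only)


@dataclass
class Breakpoint:
    """Hypothetical model of a link breakpoint: pause on a row number or on an expression."""
    row_number: Optional[int] = None                      # pause when this 1-based row arrives
    expression: Optional[Callable[[Row], bool]] = None    # pause when this returns True

    def should_pause(self, index: int, row: Row) -> bool:
        if self.row_number is not None and index == self.row_number:
            return True
        if self.expression is not None and self.expression(row):
            return True
        return False


def first_pause(rows, bp):
    """Return the 1-based index of the first row that triggers the breakpoint, or None."""
    for index, row in enumerate(rows, start=1):
        if bp.should_pause(index, row):
            return index
    return None


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": 3, "amount": 7}]
by_row = Breakpoint(row_number=3)                            # stop at row 3
by_expr = Breakpoint(expression=lambda r: r["amount"] < 0)   # stop on a negative amount
```

In a parallel job each node would evaluate its own copy of the condition against its own partition of the data, which is why the debugger reports how many parallel processes have hit a breakpoint.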

User Productivity


Status

Ready, Running

Stopped

Active Breakpoints

Number of parallel

processes which

have hit a breakpoint

Data display

One tab for each

node that has an

active breakpoint

Values from

Multiple Rows

Tree control allows

display of data

values from 3 prior

rows

Watch List

Track specific

columns that are

of most interest

User Productivity

Big Data File Stage

End-to-end Integration

• Adds the new Big Data File Stage

so that organizations can make the

most of new Big Data sources

• Support for the Hadoop Distributed

File System (HDFS)

• Mirrors the Sequential File stage

experience so that users will find it

intuitive to get started.

• Provides reading from multiple files

in parallel (either listed specifically

or through file patterns)

• Can mimic the same degree of

partitioning as the Big Data file, or

the engine can dynamically

repartition that data "on the fly"

based on varying business

requirements

Netezza Integrations

• Netezza Connector provides:

– Scalable, high-performance data exchange for DataStage,

QualityStage and Information Analyzer

– Shared metadata across Information Server

– Rich logging and tracing capability

• Operations

– Row Selected SQL operation

– Sparse/Normal lookups

– Utilize UDX functions in user defined SQL

– Create table with distribution key options

– Parallel load/extract via external table and named pipes

– Further optimization of capabilities (including even better

performance)

– Job parameter can be applied to all options

– Supports options to turn on Netezza statistics collection

when inserting

• Support for both the server and parallel

canvas

Netezza Connector Features and Highlights

Read

• Auto generated SQL from columns

• User defined SQL

• Sequential read

• Parallel read with modulus and range partition
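As a rough illustration of what modulus and range partitioning mean for a parallel read, the sketch below assigns integer keys to reader partitions in both ways. It is a generic Python model, not the connector's implementation: the function names, the sample keys and the range boundaries are assumptions made for the example.

```python
from bisect import bisect_right


def modulus_partition(key: int, partitions: int) -> int:
    """Modulus partitioning: the key's remainder selects the partition."""
    return key % partitions


def range_partition(key: int, boundaries: list) -> int:
    """Range partitioning: sorted boundaries split the key space into contiguous ranges."""
    return bisect_right(boundaries, key)


def split(keys, assign, partitions):
    """Group keys into one bucket per partition using the given assignment function."""
    buckets = [[] for _ in range(partitions)]
    for key in keys:
        buckets[assign(key)].append(key)
    return buckets


keys = [5, 42, 100, 150, 199, 200, 731]
by_mod = split(keys, lambda k: modulus_partition(k, 3), 3)
# Boundaries [100, 200] give three ranges: key < 100, 100 <= key < 200, key >= 200.
by_range = split(keys, lambda k: range_partition(k, [100, 200]), 3)
```

Modulus partitioning tends to spread keys evenly regardless of their distribution, while range partitioning keeps neighbouring keys together at the cost of possible skew if the boundaries are poorly chosen.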

Write

• Insert

• Update

• Delete

• Update then insert

• Delete then insert

• Action column

• User defined SQL

Lookup

• Sparse with auto generated SQL or User Defined

• Normal with auto generated SQL or User Defined


InfoSphere DataStage and Netezza System Topology

InfoSphere DataStage Server (Intel® Xeon® E7-4870)

• OS: Red Hat EL 5.3 x86-64

• Processor Type: Intel® Xeon® E7-4870, 40 cores/80 threads

• Processor Speed: 2.4 GHz

• Memory Size: 1 TB RAM

• Disk Space: 2 TB total disk space

• Network Card: Intel® 10 Gigabit CX4

IBM Netezza 1000-12 Appliance (TwinFin-12)

• 12 S-Blades

• 96 CPU cores

• Processor: Intel® Xeon® E5520 2.27 GHz

• Storage Space: 128 TB*

• Network Card: Intel® 10 Gigabit CX4

• 63 writer option enabled

* @4x compression ratio

10G Ethernet

Load Rate = 2.38 TB / hour

Unload Rate = 2.58 TB / hour
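To put the quoted rates next to the 10G Ethernet link, the short calculation below converts them to megabytes per second. It assumes decimal units (1 TB = 1,000,000 MB) and compares against the raw 10 Gb/s line rate with no allowance for protocol overhead; neither assumption is stated on the slide.

```python
def tb_per_hour_to_mb_per_s(tb_per_hour: float) -> float:
    """Convert TB/hour to MB/s, assuming decimal units (1 TB = 1,000,000 MB)."""
    return tb_per_hour * 1_000_000 / 3600


LINK_MB_PER_S = 10_000 / 8  # raw 10 Gb/s line rate expressed in MB/s (1250.0)

load = tb_per_hour_to_mb_per_s(2.38)      # roughly 661 MB/s
unload = tb_per_hour_to_mb_per_s(2.58)    # roughly 717 MB/s

load_share = load / LINK_MB_PER_S         # fraction of the raw link rate used by load
unload_share = unload / LINK_MB_PER_S     # fraction of the raw link rate used by unload
```

Under these assumptions both rates sit a little above half of the raw capacity of a single 10 Gb link.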

Balanced Optimizer for Netezza

• Leverages Netezza highly parallel architecture to push user defined processing to the appliance

• Particularly valuable for homogeneous data integration tasks since it reduces or eliminates network data transfer in many cases

• Provides the same job design as traditional DataStage jobs so there is no recoding required

• Allows operator to execute either traditional DataStage run time or new Balanced Optimizer run time for maximum flexibility

Performance

Balanced Optimization Features

Optimization options

• Push processing to database targets

• Push joins to database targets

• Push processing to database sources

• Push joins to database sources

• Push data reduction processing to database targets

• Push all processing into the target database
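The idea behind pushing processing into the database can be shown with a toy comparison. In the sketch below an in-memory SQLite database stands in for the DBMS; the table, columns and function names are invented for illustration, and the SQL is hand-written rather than anything Balanced Optimization generates.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")  # stand-in for a source or target DBMS
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EU", 10.0), ("EU", -3.0), ("US", 7.5), ("US", 2.5), ("APAC", -1.0)],
)


def engine_side(conn):
    """Pull every row out of the database, then filter and aggregate in the engine."""
    totals = defaultdict(float)
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        if amount > 0:
            totals[region] += amount
    return dict(totals)


def pushed_down(conn):
    """Let the database filter and aggregate; only the summary rows leave it."""
    sql = "SELECT region, SUM(amount) FROM orders WHERE amount > 0 GROUP BY region"
    return dict(conn.execute(sql).fetchall())
```

Both functions return the same totals, but the second moves two summary rows across the connection instead of all five detail rows, which is the network saving the slides refer to.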

Supported stages and connectors

• Aggregator, Copy, Filter, Funnel, Join, Lookup, Sort, RemDup,

Transformer

• DB2, Netezza, Teradata, Oracle connectors

Performance

Perform dual load by asynchronously

loading into multiple Teradata systems

and take advantage of DataStage job

failure and recovery strategies using

two ETL feeders.

Integrate with the TMSM API to send

connector events directly to the TMSM

event master to allow for monitoring

connector ETL processes through the

Teradata Viewpoint Portal.

Support executing dual load jobs with

TMSM integration under the same unit

of work (UOW) to ensure

synchronization of both systems.

Support for Teradata Dual Load End-to-end Integration

• New CDC Stage provides real-time

integration of log-based replicated

data directly into our massively

scalable parallel runtime.

• Couples CDC use cases with any of

Information Server’s data integration,

data quality, data monitoring

components to open new use cases

• Infrastructure supports guaranteed

delivery through complex,

heterogeneous transformation tasks

• New 8.7 feature adds end-to-end

metadata representation through

Information Server

Solve challenging problems in real-time through low-impact database access

Deep Integration with InfoSphere CDC End-to-end Integration

XML Stage Advancements

• Supports XML

optional

elements and

sub-types

• Supports “minimal”

validation in

addition to strict

validation

• Adds auto-chunk

feature for more

performant

handling of large

schemas


Data Rules Stage

• Provides data quality

monitoring and

analysis as part of any

data integration job

• Leverages same

business rules to

validate the data,

before it is persisted to

a data store

• Rules are shared with

Information Analyzer

to allow for multiple

entry points into a data

governance program

Breakthrough Innovation

Data Rules stage

appears under Data

Quality Palette

Appears like any

other stage on

the job canvas

Data Rules Stage Breakthrough

Innovation

List of

Published

Rules

available from

Information

Analyzer or

built in

DataStage

Can build up RuleSets

within RuleStage

Lists set of input links

from stage to choose

as rule variables

Set of Rule variables that

need to be satisfied, with

their bindings

Rules selected for this

stage instance

Data Rules Stage – Opens new paradigms Breakthrough

Innovation

• Integrating various data

integration and quality

components continues to add

to our industry uniqueness.

• In this job, we show a real-

time warehousing scenario

that is using rules shared with

the profiling environment to

test data validity and take

corrective action to both the

DW and OLTP source

Operations Console

• Provides quick answers for operators, developers and other stakeholders as they view and analyze the run-time environment

• Simple, web based, read-only access highly-optimized for answering most frequent questions

• Dashboard style graphs representing current job activity, success/failed jobs status, system health and resource consumption

• Visual cues alert user to potential issues

• Server wide and project specific views based on user privileges

• High level summary and detailed information for both current and historical activity

• Fully documented schema to support integration with other applications

Operational Metadata Analytics across Data Integration, Data Quality, Data Profiling and CDC

Governance

Operations Console – Home Page

High-level view of

all Job Activity over

a configurable time

period

Answers

“What is running?”

Filter display by Engine or Project(s)

Note: Access control prevents display of data

from projects which the current user does not

have access to, as determined by their entitled

projects.

Visual alerts for job

run failures

View recent Job run activity, via

shortcuts:

- Finished

- Finished with warnings

- Failed

Answers

“What jobs may need

investigation?”

Configurable view on

Operating System

resources

Answers

“Are system resources

keeping up with requests?”

Configurable Alert Thresholds Set Alert thresholds for key Operating System

metrics:

- CPU

- Free virtual memory

- Free physical memory
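A threshold check of this kind can be sketched in a few lines. The metric names, limit values and the `alerts` function below are hypothetical and only illustrate the logic: CPU raises an alert when it rises above its limit, while the free-memory metrics raise one when they fall below theirs. How the Operations Console stores or evaluates its thresholds is not described in the slides.

```python
# Hypothetical thresholds: ("max", x) alerts above x, ("min", x) alerts below x.
THRESHOLDS = {
    "cpu_percent": ("max", 90.0),
    "free_virtual_mb": ("min", 1024.0),
    "free_physical_mb": ("min", 512.0),
}


def alerts(sample: dict) -> list:
    """Return the names of the metrics in the sample that breach their thresholds."""
    raised = []
    for metric, (kind, limit) in THRESHOLDS.items():
        value = sample[metric]
        breached = value > limit if kind == "max" else value < limit
        if breached:
            raised.append(metric)
    return raised


busy = {"cpu_percent": 95.0, "free_virtual_mb": 4096.0, "free_physical_mb": 256.0}
healthy = {"cpu_percent": 40.0, "free_virtual_mb": 4096.0, "free_physical_mb": 2048.0}
```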

Status of Engine

Components

Answers

“Are all engine

components in a

proper state?”

If any Engine in the list has an

error, the engine status bar will

display red and display the

error icon.

Answers

“Do any other engines have

issues?”

First click to get the list of

all jobs in any project in

that state.

Answers

“What specifically is in a

failed state?”

Second click gets me the details

of that job run, including both

run summary information as well

as specific log error information.

Answers

“Why did the job fail?”

Operations Console – Repository Page

Browse Project properties,

including default

environment variable

settings

Folder content includes

thumbnails of job

designs for easier

identification/recollection

Job property information

including create/modify

details and latest run

details

Job Run history allows

quick summary view of all

prior executions of the job

and ability to get to details

for these runs.

Answers

“Has the job been

executing consistently?”

Easy access to

the job canvas

view.

Drill into the list of jobs

that are used by or

depend on this job

Run details provides

summary about the job

run along with parameter

information

Drag and drop the

metrics you want to the

graph for overlay

Answers

“What is the runtime

profile of this job?”

Display log details.

Captures control,

warning, fatal and a

minimal set of

informational

messages.

Hot link to the message

reference for information on

troubleshooting.

Answers

“How should I respond to this

issue?”

Operations Console – Link To Message Reference

Compare 2 jobs and

filter down to only

show differences.

Answers

“What changed since

I last ran this job?”

Overlay performance

information to see if

there are obvious deltas

in how resources were

used

Answers

“Has the job profile been

optimized?”

Log compare

places job run info

side by side.

Answers

“What messages

should I focus on?”

Operations Console – Activities Page

Analyze Job run activity

• current and historical

• list of current/recent Job runs

• performance metrics (rows

processed, elapsed time)

• status information

• compare one job vs another

• narrow focus to a particular

timeframe

Answers

“What is running and what are

the most recently completed

jobs?”

Operations Console – Activities Page

Look at resource consumption

across the engine for CPU,

Memory, Processes, Number of

runs, and disk space which is

configurable to specific disk

locations you want to manage.

New Administration Features

Maintenance Mode A new command line tool allows the suite administrator to put Information Server into maintenance mode. When configured in this way, any login attempted by a user who does not have the Suite Administrator role is rejected. This feature simplifies how administrators quiesce the system for activities that require their exclusive access.

SessionAdmin -user <user> -password <pwd> -set-maint-mode [ON|OFF]

SessionAdmin -user <user> -password <pwd> -get-maint-mode

SessionAdmin -authfile <auth file> -kill-user-sessions
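For scripted maintenance windows, the commands above could be assembled programmatically. The sketch below only builds argument lists from the flags shown on this slide; the helper names are hypothetical, and it assumes SessionAdmin is reachable on the PATH, which will depend on the installation.

```python
def maint_mode_args(user: str, password: str, state=None) -> list:
    """Build a SessionAdmin argument list to query (state=None) or set maintenance mode."""
    base = ["SessionAdmin", "-user", user, "-password", password]
    if state is None:
        return base + ["-get-maint-mode"]
    if state not in ("ON", "OFF"):
        raise ValueError("state must be 'ON' or 'OFF'")
    return base + ["-set-maint-mode", state]


def kill_sessions_args(authfile: str) -> list:
    """Build a SessionAdmin argument list that ends user sessions, using a credentials file."""
    return ["SessionAdmin", "-authfile", authfile, "-kill-user-sessions"]


# A possible quiesce sequence: block new logins first, then end existing sessions.
quiesce_plan = [
    maint_mode_args("isadmin", "secret", "ON"),
    kill_sessions_args("/tmp/auth.txt"),
]
```

Each list can be handed to `subprocess.run(args, check=True)`; passing a list rather than a single string avoids shell quoting problems with passwords that contain special characters.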

Command line administration of user and roles Certain organizations standardize the way in which they establish their Information Server environments for new project

work. In order to support this process from beginning to end, we have expanded our command line interfaces to permit user and role assignment to InfoSphere DataStage and InfoSphere QualityStage Projects.

DirectoryAdmin -assign_project_group_roles myProj$myGroup$myRole

DirectoryAdmin -rm_proj_usr_roles myProj$myUserId$myRole1*myRole2*myRole3
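The argument format in these examples joins project, user or group, and roles with `$`, and separates multiple roles with `*`. A small helper can build that string and quote it for a POSIX shell, where an unquoted `$` would otherwise be treated as a variable reference. The helper name is hypothetical, and the format is inferred only from the two examples above.

```python
import shlex


def role_assignment(project: str, principal: str, roles: list) -> str:
    """Join project, user/group id and roles in the project$principal$role1*role2 form."""
    return "$".join([project, principal, "*".join(roles)])


arg = role_assignment("myProj", "myUserId", ["myRole1", "myRole2", "myRole3"])

# Quote the argument so a POSIX shell passes the '$' and '*' characters through literally.
command_line = "DirectoryAdmin -rm_proj_usr_roles " + shlex.quote(arg)
```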

Enterprise ready platform

New Administration Features

IPv6 Support Many organizations are now migrating their networks to the newer, expanded set of Internet Protocol addresses. At

version 8.7, Information Server is fully compatible with IPv6 addresses and can support dual-stack protocol implementations (i.e. mixed IPv4 and IPv6).

APT_USE_IPV4

Set this environment variable to force the network class to use only IPv4.

New backup and restore functionality To prevent the loss of data and to prepare for disaster recovery, organizations need robust backup strategies. Version 8.7 introduces a new administration tool called isrecovery, which provides capabilities to back up and restore the databases, profiles, and directories that are associated with InfoSphere Information Server. The isrecovery tool supports all three server tiers: the services tier, the engine tier, and the metadata repository tier. When you run a backup or recovery, all tiers that are installed on the computer are backed up or restored at the same time.

isrecovery -help | {

-backup -gen-config [-advanced] | -backup -archive directory |

-restore -gen-config [-advanced] | -restore -archive directory }

[optional_parameters] | -restart | -clean


New Security Features

Strongly encrypted credential files for command line utilities A unified means of handling credential arguments to the various command line tools (istool, dsjob, etc.), with the ability to reference an external file that maintains the credentials required for login to Information Server.

dsjob -authfile c:\cred_file.txt -run -paramfile c:\paramfile.txt dstage1 testJob

Strongly encrypted job parameter files for dsjob command Version 8.7 extends existing encryption ability by permitting the dsjob command line “paramfile” option to specify a file that contains strongly encrypted parameter values. This provides encryption-at-rest for job parameters in the same way as the authfile previously described provides it for login credentials.

dsjob -authfile c:\cred_file.txt -run -paramfile c:\paramfile.txt dstage1 testJob

Encryption Algorithm and Customization Information Server provides the AES-128 symmetric-key encryption algorithm which is compliant with many government and industry standards. While this will satisfy most companies’ security requirements, other organizations may wish to use their own standard encryption variant. Information Server 8.7 provides the means to override our out-of-the-box algorithm in such instances.

> /opt/IBM/InformationServer/ASBNode/bin/encrypt.sh 'myPa$$w0rd'

> {iisenc}PvqKLr7z3QOLJCQ4QhbrrA==


# sample credentials file

user=dsadm

password={iisenc}HEf6s6cG+Ee6NdGDQppQNg==

domain=[2002:920:c000:217:9:32:217:32]:9080

server=RemoteServer

# sample parameter file

ftpuser=myftpid

ftppwd={iisenc}Rft6sd35!ERexg67uiPLkmM3err+
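A pre-flight check can confirm that the sensitive entries in such a file really are stored in encrypted form. The sketch below assumes the layout suggested by the samples (one `key=value` pair per line, `#` for comments, and an `{iisenc}` prefix marking encrypted values); the exact file syntax is not spelled out in the slides, and the function names are hypothetical.

```python
ENC_PREFIX = "{iisenc}"  # prefix seen on encrypted values in the samples above


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so '==' padding inside a value is preserved.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def plaintext_secrets(values: dict, secret_keys: list) -> list:
    """Return the secret keys whose values are present but not marked as encrypted."""
    return [k for k in secret_keys
            if k in values and not values[k].startswith(ENC_PREFIX)]


sample = """# sample credentials file
user=dsadm
password={iisenc}HEf6s6cG+Ee6NdGDQppQNg==
domain=[2002:920:c000:217:9:32:217:32]:9080
server=RemoteServer
"""
creds = parse_properties(sample)
```

A deployment script could refuse to run when `plaintext_secrets(creds, ["password"])` returns a non-empty list.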

New Security Features

Mixed LDAP and OS Support via PAM Provides software access to both the actual end-users via accounts defined in their LDAP server as well as to

“functional users” that are defined via the local OS. These functional users represent a role that may be shared by

multiple individuals within the organization or may simply be assigned to a scripted process.

Topic in the publicly available Information Center is…

“Configuring IBM InfoSphere Information Server to use PAM (Linux, UNIX)”

Automated audit trail for all DBMS access with InfoSphere Guardium Provided for satisfying regulatory mandates to protect sensitive information such as financial records and personally identifiable information (PII). Information Server 8.7 connectors can be easily configured to automatically register audit trail entries into InfoSphere Guardium so that the organization has a complete view of the origin of any data integration activity that affects the database.


Metadata Asset Manager

Managed Metadata Import

- Create, manage and delete Import Areas

- Import metadata via Metabrokers/Bridges

- Compare import results with active repository

- Understand impact/results of import

- Publish import to active repository

- Re-import, including implied delete logic

- View history and compare import events

- Support for Logical Models

Metadata Management

- Browse Shared Metadata, View Usage before

Delete, Delete

- Manage duplicate and disconnected

metadata

- Advanced duplicate management / create

relationships

Governance

Smarter Metadata Management

InfoSphere Blueprint Director Versioning, Task Management and Milestone planning

User Productivity

• Visibility of development progress in the blueprint of your project through integration with

task management (Rational Team Concert)

• Milestone planning support in your blueprints by defining milestones for elements and

visualizing evolution along milestones

• Versioning support for blueprints through integration with source code control management

systems (CVS, Rational Team Concert, etc.), incl. comparing changes

Thank You

Tony Curcio IBM Software Group

InfoSphere Product Management

[email protected]

www.ibm.com/software/data/infosphere
