What’s New in IBM InfoSphere DataStage 8.7
Tony Curcio,
InfoSphere Product Manager
Please Note:
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our
general product direction and it should not be relied on in making a
purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our
sole discretion.
Performance is based on measurements and projections using standard
IBM benchmarks in a controlled environment. The actual throughput or
performance that any user will experience will vary depending upon many
factors, including considerations such as the amount of multiprogramming
in the user's job stream, the I/O configuration, the storage configuration,
and the workload processed. Therefore, no assurance can be given that an
individual user will achieve results similar to those stated here.
Acknowledgements and Disclaimers:
© Copyright IBM Corporation 2011. All rights reserved.
– U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted
by GSA ADP Schedule Contract with IBM Corp.
– IBM, the IBM logo, ibm.com, DataStage and QualityStage are trademarks or registered
trademarks of International Business Machines Corporation in the United States, other
countries, or both. If these and other IBM trademarked terms are marked on their first
occurrence in this information with a trademark symbol (® or ™), these symbols indicate
U.S. registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other
countries. A current list of IBM trademarks is available on the Web at “Copyright and
trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all
countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are
provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice
to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is
provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of,
or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the
effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the
applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may
have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these
materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific
sales, revenue growth or other results.
Agenda
Product Review
InfoSphere DataStage 8.7
One Data Integration Platform
• Data Integration: Dozens of prebuilt transformation objects and 100s of functions
• Data Quality: Data validation & cleansing applicable for multiple information domains
• Connectivity: Databases, Messages, CDC, Mainframe and more
• Metadata support: Direct access to active metadata for search, comparison, or impact analysis
• Distributed Transactions: Scalable, heterogeneous information fabric with guaranteed data delivery (2PC)
• Data Masking: Protect sensitive information
• Business Rules: Drive critical enterprise logic from SMEs in your business
• Std Industry Formats: SWIFT, EDI, HL7, etc., along with native and scalable XML and Complex File support
• Parallel Engine: Most scalable data integration engine supporting SMP, MPP and Grid with simple configuration file control
• Balanced Optimization: Maximize DBMS infrastructure by moving processing to it
• Usage modes: Traditional scheduled batch or SOA real-time using web services, RSS, REST, JMS...
• Enterprise Packs: Specific connectivity for leading ERP solutions
• One Metadata Store: Maximizes business & IT collaboration and accelerates data governance efforts
• One Set of Design Artifacts: Logic represented by one set of design objects regardless of deployment styles
• One Administration Center: Integration of install, security, auditing, connectivity, logging reduces TCO
• One Design Environment: Single design paradigm advances time to value
As part of InfoSphere Information Server, directly benefits from other aspects of the suite – data profiling, mapping specifications, etc…
InfoSphere Information Server
Q4’10
• Information Server 8.5 GA - all platforms released in one quarter
• Launched Concierge program to assist in customer upgrades
• New Migration and Validation Tooling has reduced time to upgrade significantly
• Over 1000 customers have moved or are moving to 8.5
• Broad adoption of new features (SCCS, XML, Blueprint Director, Looping Transformer)
Q1’11
• 8.1 FP2 and 8.5 FP1 Shipped - all suite components on the same FP schedule going forward to simplify maintenance
• NEW CDC Stage for guaranteed delivery through complex heterogeneous jobs
• NEW Data Masking Pack – reuses Optim SDK to provide data obfuscation for data integration scenarios
Q2’11
• New Workgroup Edition and Data Warehousing offerings
• Released new appliance offering for the IBM Smart Analytic System
Q3’11
• Beta program for our 8.7 release of Information Server
• Great customer and partner participation… THANK YOU !!!!
Q4’11
• Information Server 8.7 GA – all platforms released on same day
• New game-changing capabilities for data integration including Operations Console and Parallel Debugger
• Continue to offer Concierge program to assist customers in planning for their upgrade
InfoSphere Information Server Investment Themes
Engineering theme: Description
• Performance: Linear scalable engine providing the fastest and optimized processing of data, high speed data source communication, optimal performance of features in the product
• User Productivity: Operational and development productivity for the end users bar none
• Governance: Provide cutting edge governance features for the largest base of governance customers to ensure compliance and control in their enterprise
• End-to-end Integration: Best in class integration with strategic sources of data and metadata in the Information Management landscape
• Enterprise-ready platform: Robust and complete enterprise platform for all Data Integration projects
• Breakthrough Innovation: Innovative ideas to automate, simplify and integrate the data integration process to simplify the hardest problems the customers face
Agenda
Product Review
InfoSphere DataStage 8.7
Announcing IBM InfoSphere Information Server 8.7
Smarter, faster, easier information
integration
Understanding, cleansing, transforming and delivering
trusted information
Four areas of innovation focus:
• Comprehensive information governance capability - Empowering users with a more complete and controlled view of their information
• Enhanced productivity for users and organizations - Reducing integration time and cost
• New and tighter integration - Linking closely with other IBM solutions and solutions from other vendors
• Big data support - Addressing the most challenging data volumes across the enterprise
Kudos from InfoSphere Information Server 8.7 beta participants
Operations Console “The Operations Console is a singularly useful adjunct to the operations and administrative staff who have a responsibility for meeting performance KPIs; it highlights hotspots in overall execution.”
Data Rules Stage “Data Rules stage allows us to incorporate measurement of compliance with data rules within the ETL stream, rather than isolating them in the profiling tool Information Analyzer. This will make it easier to implement such initiatives as our data quality dashboard.”
Performance “The high performance parallel engine has always been the leader in processing large amounts of data and the new features will just re-enforce that.”
Metadata Advancements “The InfoSphere Metadata Asset Manager tool has provided the most productivity-enhancing capabilities in IS 8.7… This will no doubt decrease the time it takes to re-align metadata so that it reflects the current production state.”
Netezza Integration “Great product, provide more options and flexibilities to Netezza and Datastage users. Some unique options implemented for Netezza. And powerful Balanced Optimization options. Very stable release for the Beta software.”
Parallel Debugger “An interactive debugger is an almost essential troubleshooting tool once one has reached the stage where a job runs to completion but does not produce the expected results.”
Business Glossary “The new look and feel of the Business Glossary in particular is much easier to use in 8.7. Labels are a much welcomed addition to help sort and organize related data.”
Performance Enhancements & Back compatibility
Performance
• Design Time Performance
–Significant Performance improvement in Job Open, Save, Compile
etc.
–Expect 8.7 to be faster than 8.5 and 8.1 due to improvements in
Xmeta
• PX Engine performance improvements
– Improved partition/sort insertion algorithm
–Dataset sort information propagated between jobs
–Reduces the number of sorts in many jobs
–XML parsing performance is improved by 3x or more for large XML
files
• Significant performance improvements in the IA Data rules execution
• Major emphasis on back compatibility both in Development and QA
–Almost all behavior changes can be toggled by a flag
–A single technical document lists all compatibility issues for
Information Server
Job Log Now Available in Designer
• Access job log information through View menu
• Window is dockable on any side of the screen or can float
• Linked to the job currently being viewed
• Includes Run/Stop/Reset buttons
• Released in 8.5 FP1 and now in 8.7
User Productivity
Interactive Parallel Job Debugging Features
• New Interactive debugger for the parallel job canvas
• Provides debugging support for running parallel jobs across SMP, MPP, Cluster and GRID deployments
• Support for multiple breakpoints with conditional logic per link and node
• Visualizing data by node and filtering complex schemas
• While paused at a breakpoint
– breakpoints can be added/removed
– row data for the breakpoint link can be examined by node
– job parameter values can be examined.
– stage/job property editors can be opened to examine properties of the job design.
– the running job can be continued or aborted.
User Productivity
Interactive Parallel Job Debugging Features
• Right click on any link and toggle the breakpoint on/off or choose to edit the breakpoint
• From breakpoint window, establish whether to stop at a particular row number, or based on a particular expression.
• View and edit all breakpoints in the job from this one window.
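The two stop conditions described above (a particular row number, or an expression) can be pictured with a small sketch. This is not the DataStage debugger or any of its APIs; `Breakpoint`, `should_stop`, `first_stop`, the link name and the sample rows are all hypothetical, chosen only to illustrate how a per-link conditional breakpoint might decide when to pause.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict  # one row flowing over a link, as column name -> value


@dataclass
class Breakpoint:
    """Hypothetical model of a per-link breakpoint (not a DataStage API)."""
    link: str
    row_number: Optional[int] = None                    # stop at the Nth row (1-based)
    expression: Optional[Callable[[Row], bool]] = None  # stop when this predicate is true

    def should_stop(self, count: int, row: Row) -> bool:
        if self.row_number is not None and count == self.row_number:
            return True
        if self.expression is not None and self.expression(row):
            return True
        return False


def first_stop(bp: Breakpoint, rows: list) -> Optional[int]:
    """Return the 1-based row count at which the breakpoint first fires, or None."""
    for count, row in enumerate(rows, start=1):
        if bp.should_stop(count, row):
            return count
    return None


rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": 3, "amount": 7}]
by_row = Breakpoint(link="lnk_orders", row_number=3)
by_expr = Breakpoint(link="lnk_orders", expression=lambda r: r["amount"] < 0)
print(first_stop(by_row, rows), first_stop(by_expr, rows))
```

In a parallel job each node would evaluate its own row count, which is one reason the debugger shows data per node.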
User Productivity
Interactive Parallel Job Debugging Features
• Status: Ready, Running, Stopped
• Active Breakpoints: Number of parallel processes which have hit a breakpoint
• Data display: One tab for each node that has an active breakpoint
• Values from Multiple Rows: Tree control allows display of data values from 3 prior rows
• Watch List: Track specific columns that are of most interest
User Productivity
Big Data File Stage
End-to-end Integration
• Adds the new Big Data File Stage
so that organizations can make the
most of new Big Data sources
• Support for the Hadoop Distributed
File System (HDFS)
• Mirrors the Sequential File stage
experience so that users will find it
intuitive to get started.
• Provides reading from multiple files
in parallel (either listed specifically
or through file patterns)
• Can mimic the same degree of
partitioning as the Big Data file, or
the engine can dynamically
repartition that data "on the fly"
based on varying business
requirements
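As a rough illustration of reading several files in parallel from a file pattern, the sketch below expands a shell-style pattern and deals the matching files out to reader partitions. It is not the Big Data File stage's implementation; the file names, the round-robin policy and the function names are assumptions made for the example, and real HDFS access is left out entirely.

```python
from fnmatch import fnmatch


def select_files(available, pattern):
    """Keep the file names that match a shell-style pattern, in sorted order."""
    return sorted(name for name in available if fnmatch(name, pattern))


def assign_round_robin(files, partitions):
    """Deal files out to reader partitions in turn, one list per partition."""
    buckets = [[] for _ in range(partitions)]
    for index, name in enumerate(files):
        buckets[index % partitions].append(name)
    return buckets


# Hypothetical directory listing; a real job would list files from HDFS.
available = ["sales_01.csv", "sales_02.csv", "sales_03.csv", "returns_01.csv"]
matched = select_files(available, "sales_*.csv")
plan = assign_round_robin(matched, partitions=2)
print(plan)
```

Changing `partitions` changes the degree of read parallelism without touching the pattern, which is the kind of separation the slide describes.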
Netezza Integrations
End-to-end Integration
• Netezza Connector provides:
– Scalable, high-performance data exchange for DataStage,
QualityStage and Information Analyzer
– Shared metadata across Information Server
– Rich logging and tracing capability
• Operations
– Row Selected SQL operation
– Sparse/Normal lookups
– Utilize UDX functions in user defined SQL
– Create table with distribution key options
– Parallel load/extract via external table and named pipes
– Further optimization of capabilities (including even better
performance)
– Job parameters can be applied to all options
– Supports options to turn on Netezza statistics collection
when inserting
• Support for both the server and parallel
canvas
Netezza Connector Features and Highlights
Read
• Auto generated SQL from columns
• User defined SQL
• Sequential read
• Parallel read with modulus and range partition
Write
• Insert
• Update
• Delete
• Update then insert
• Delete then insert
• Action column
• User defined SQL
Lookup
• Sparse with auto generated SQL or User Defined
• Normal with auto generated SQL or User Defined
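The "modulus and range partition" options for parallel reads can be sketched generically. The code below shows the two textbook schemes on an integer key; it is an illustration under assumed names (`modulus_partition`, `range_partition`, the boundary values) and does not reproduce the SQL that the Netezza Connector actually generates.

```python
from bisect import bisect_right


def modulus_partition(key: int, partitions: int) -> int:
    """Modulus scheme: the reader that owns a row is key mod number of readers."""
    return key % partitions


def range_partition(key: int, boundaries: list) -> int:
    """Range scheme: boundaries are the sorted, exclusive upper bounds of every
    partition except the last, so len(boundaries) + 1 partitions exist."""
    return bisect_right(boundaries, key)


keys = [5, 100, 150, 200, 207]
by_modulus = [modulus_partition(k, 4) for k in keys]
by_range = [range_partition(k, [100, 200]) for k in keys]
print(by_modulus)
print(by_range)
```

Modulus spreads dense keys evenly across readers, while range keeps neighbouring keys together, which can matter when downstream stages expect sorted or grouped input.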
End-to-end Integration
InfoSphere DataStage and Netezza System Topology
InfoSphere DataStage Server (Intel® Xeon® E7-4870)
• OS: Red Hat EL 5.3 x86-64
• Processor Type: Intel® Xeon® E7-4870, 40 cores/80 threads
• Processor Speed: 2.4 GHz
• Memory Size: 1 TB RAM
• Disk Space: 2 TB total disk space
• Network Card: Intel® 10 Gigabit CX4
IBM Netezza 1000-12 Appliance (TwinFin-12)
• 12 S-Blades
• 96 CPU cores
• Processor: Intel® Xeon® E5520 2.27 GHz
• Storage Space: 128 TB* (* at 4x compression ratio)
• Network Card: Intel® 10 Gigabit CX4
• 63 writer option enabled
10G Ethernet
Load Rate = 2.38 TB / hour
Unload Rate = 2.58 TB / hour
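To put the quoted rates next to the 10G Ethernet link, a quick conversion helps. The arithmetic below assumes decimal units (1 TB = 1,000,000 MB) and a nominal 10 Gbit/s line rate with no protocol overhead; the slide does not say which unit convention it uses or whether the figures refer to compressed data, so treat the results as approximate.

```python
def tb_per_hour_to_mb_per_s(tb_per_hour: float) -> float:
    """Convert TB/hour to MB/s, assuming 1 TB = 1,000,000 MB and 3600 s per hour."""
    return tb_per_hour * 1_000_000 / 3600


# Nominal 10 Gbit/s link expressed in MB/s (8 bits per byte, overhead ignored).
LINK_MB_PER_S = 10_000 / 8

load = tb_per_hour_to_mb_per_s(2.38)
unload = tb_per_hour_to_mb_per_s(2.58)

print(f"load:   {load:.0f} MB/s, {load / LINK_MB_PER_S:.0%} of nominal link rate")
print(f"unload: {unload:.0f} MB/s, {unload / LINK_MB_PER_S:.0%} of nominal link rate")
```

Under these assumptions the load and unload rates come to roughly 661 MB/s and 717 MB/s, a little over half of what a single 10 Gbit/s link can nominally carry.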
Balanced Optimizer for Netezza
• Leverages Netezza's highly parallel architecture to push user defined processing to the appliance
• Particularly valuable for homogeneous data integration tasks since it reduces or eliminates network data transfer in many cases
• Provides the same job design as traditional DataStage jobs so there is no recoding required
• Allows operator to execute either traditional DataStage run time or new Balanced Optimizer run time for maximum flexibility
Performance
Balanced Optimization Features
Optimization options
• Push processing to database targets
• Push joins to database targets
• Push processing to database sources
• Push joins to database sources
• Push data reduction processing to database targets
• Push all processing into the target database
Supported stages and connectors
• Aggregator, Copy, Filter, Funnel, Join, Lookup, Sort, RemDup,
Transformer
• DB2, Netezza, Teradata, Oracle connectors
Performance
Support for Teradata Dual Load
End-to-end Integration
• Perform dual load by asynchronously loading into multiple Teradata systems and take advantage of DataStage job failure and recovery strategies using two ETL feeders.
• Integrate with the TMSM API to send connector events directly to the TMSM event master to allow for monitoring connector ETL processes through the Teradata Viewpoint Portal.
• Support executing dual load jobs with TMSM integration under the same unit of work (UOW) to ensure synchronization of both systems.
Deep Integration with InfoSphere CDC
End-to-end Integration
Solve challenging problems in real-time through low-impact database access
• New CDC Stage provides real-time integration of log-based replicated data directly into our massively scalable parallel runtime.
• Couples CDC use cases with any of Information Server’s data integration, data quality, and data monitoring components to open new use cases
• Infrastructure supports guaranteed delivery through complex, heterogeneous transformation tasks
• New 8.7 feature adds end-to-end metadata representation through Information Server
XML Stage Advancements
• Supports XML optional elements and sub-types
• Supports “minimal” in addition to strict validation
• Adds auto-chunk feature for more performant handling of large schemas
End-to-end Integration
Data Rules Stage
• Provides data quality
monitoring and
analysis as part of any
data integration job
• Leverages same
business rules to
validate the data,
before it is persisted to
a data store
• Rules are shared with
Information Analyzer
to allow for multiple
entry points into a data
governance program
Breakthrough Innovation
Data Rules stage
appears under Data
Quality Palette
Appears like any
other stage on
the job canvas
Data Rules Stage
Breakthrough Innovation
• List of Published Rules available from Information Analyzer or built in DataStage
• Can build up RuleSets within RuleStage
• Lists set of input links from stage to choose as rule variables
• Set of Rule variables that need to be satisfied, with their bindings
• Rules selected for this stage instance
Data Rules Stage – Opens new paradigms
Breakthrough Innovation
• Integrating various data
integration and quality
components continues to add
to our industry uniqueness.
• In this job, we show a real-
time warehousing scenario
that is using rules shared with
the profiling environment to
test data validity and take
corrective action to both the
DW and OLTP source
Operations Console
• Provides quick answers for the operators, developers and other stakeholders as they view and analyze the run-time environment
• Simple, web based, read-only access, highly optimized for answering the most frequent questions
• Dashboard style graphs representing current job activity, success/failed jobs status, system health and resource consumption
• Visual cues alert user to potential issues
• Server wide and project specific views based on user privileges
• High level summary and detailed information for both current and historical activity
• Fully documented schema to support integration with other applications
Operational Metadata Analytics across Data Integration, Data Quality, Data Profiling and CDC
Governance
Operations Console – Home Page
High-level view of
all Job Activity over
a configurable time
period
Answers
“What is running?”
Operations Console – Home Page
Filter display by Engine or Project(s)
Note: Access control prevents display of data
from projects which the current user does not
have access to, as determined by their entitled
projects.
Operations Console – Home Page
Visual alerts for job
run failures
View recent Job run activity, via
shortcuts:
- Finished
- Finished with warnings
- Failed
Answers
“What jobs may need
investigation?”
Operations Console – Home Page
Configurable view on
Operating System
resources
Answers
“Are system resources
keeping up with requests?”
Operations Console – Home Page
Configurable Alert Thresholds
Set alert thresholds for key Operating System metrics:
- CPU
- Free virtual memory
- Free physical memory
Operations Console – Home Page
Status of Engine
Components
Answers
“Are all engine
components in a
proper state?”
Operations Console – Home Page
If any Engine in the list has an
error, the engine status bar will
display red and display the
error icon.
Answers
“Do any other engines have
issues?”
Operations Console – Home Page
First click to get the list of
all jobs in any project in
that state.
Answers
“What specifically is in a
failed state?”
Operations Console – Home Page
Second click gets me the details
of that job run, including both
run summary information as well
as specific log error information.
Answers
“Why did the job fail?”
Operations Console – Repository Page
Browse Project properties,
including default
environment variable
settings
Operations Console – Repository Page
Folder content includes
thumbnails of job
designs for easier
identification/recollection
Operations Console – Repository Page
Job property information
including create/modify
details and latest run
details
Job Run history allows
quick summary view of all
prior executions of the job
and ability to get to details
for these runs.
Answers
“Has the job been
executing consistently?”
Operations Console – Repository Page
Easy access to
the job canvas
view.
Operations Console – Repository Page
Drill into the list of jobs
that are used by or
depend on this job
Operations Console – Repository Page
Run details provides
summary about the job
run along with parameter
information
Operations Console – Repository Page
Drag and drop the
metrics you want to the
graph for overlay
Answers
“What is the runtime
profile of this job?”
Operations Console – Repository Page
Display log details.
Captures control,
warning, fatal and a
minimal set of
informational
messages.
Hot link to the message
reference for information on
troubleshooting.
Answers
“How should I respond to this
issue?”
Operations Console – Link To Message Reference
Operations Console – Repository Page
Compare 2 jobs and
filter down to only
show differences.
Answers
“What changed since
I last ran this job?”
Operations Console – Repository Page
Overlay performance
information to see if
there are obvious deltas
in how resources were
used
Answers
“Has the job profile been
optimized?”
Operations Console – Repository Page
Log compare
places job run info
side by side.
Answers
“What messages
should I focus on?”
Operations Console – Activities Page
Analyze Job run activity
• current and historical
• list of current/recent Job runs
• performance metrics (rows
processed, elapsed time)
• status information
• compare one job vs another
• narrow focus to a particular
timeframe
Answers
“What is running and what are
the most recently completed
jobs?”
Operations Console – Activities Page
Look at resource consumption
across the engine for CPU,
Memory, Processes, Number of
runs, and disk space which is
configurable to specific disk
locations you want to manage.
New Administration Features
Maintenance Mode
A new command line tool allows the suite administrator to put Information Server into maintenance mode. When configured in this way, any login attempted by a user who does not have the Suite Administrator role is rejected. This feature simplifies how administrators can quiesce the system for any activities that require their exclusive access to the program.
SessionAdmin -user <user> -password <pwd> -set-maint-mode [ON|OFF]
SessionAdmin -user <user> -password <pwd> -get-maint-mode
SessionAdmin -authfile <auth file> -kill-user-sessions
Command line administration of users and roles
Certain organizations standardize the way in which they establish their Information Server environments for new project work. In order to support this process from beginning to end, we have expanded our command line interfaces to permit user and role assignment to InfoSphere DataStage and InfoSphere QualityStage projects.
DirectoryAdmin -assign_project_group_roles myProj$myGroup$myRole
DirectoryAdmin -rm_proj_usr_roles myProj$myUserId$myRole1*myRole2*myRole3
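The two examples above suggest an argument shape of project, then user or group, then roles, separated by `$`, with multiple roles joined by `*`. The sketch below builds such a string for scripted provisioning. The helper name `project_role_spec` is hypothetical, the format is inferred only from the two sample lines, and any credential options the tool needs are omitted. Because `$` and `*` are special to most Unix shells, the argument is quoted before being placed on a command line.

```python
import shlex


def project_role_spec(project: str, principal: str, roles: list) -> str:
    """Join project, user or group, and roles as project$principal$role1*role2.

    The separators are inferred from the sample commands, not from a
    documented grammar.
    """
    if not roles:
        raise ValueError("at least one role is required")
    for part in [project, principal, *roles]:
        if "$" in part or "*" in part:
            raise ValueError(f"separator character in {part!r}")
    return "$".join([project, principal, "*".join(roles)])


spec = project_role_spec("myProj", "myUserId", ["myRole1", "myRole2", "myRole3"])
# shlex.quote wraps the value in single quotes so the shell leaves $ and * alone.
command = "DirectoryAdmin -rm_proj_usr_roles " + shlex.quote(spec)
print(command)
```

Without the quoting, a POSIX shell would try to expand `$myUserId` as a variable and the role assignment would silently target the wrong names.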
Enterprise ready platform
New Administration Features
IPv6 Support
Many organizations are now migrating their networks to the newer, expanded set of Internet Protocol addresses. At version 8.7, Information Server is fully compatible with IPv6 addresses and can support dual-stack protocol implementations (i.e. mixed IPv4 and IPv6).
APT_USE_IPV4
Set this environment variable to force the network class to use only IPv4 protocols.
New backup and restore functionality
To prevent the loss of data and to prepare for disaster recovery, organizations need robust backup strategies. Version 8.7 introduces a new administration tool called isrecovery, which provides capabilities to back up and restore the databases, profiles, and directories that are associated with InfoSphere Information Server. The isrecovery tool supports all three server tiers: the services tier, the engine tier, and the metadata repository tier. When you run a backup or recovery, all tiers that are installed on the computer are backed up simultaneously.
isrecovery -help | {
-backup -gen-config [-advanced] | -backup -archive directory |
-restore -gen-config [-advanced] | -restore -archive directory }
[optional_parameters] | -restart | -clean
Enterprise ready platform
New Security Features
Strongly encrypted credential files for command line utilities
A unified means for handling credential arguments to the various command line tools (istool, dsjob, etc.), with the ability to reference an external file that maintains the credentials required for login to Information Server.
dsjob -authfile c:\cred_file.txt -run -paramfile c:\paramfile.txt dstage1 testJob
Strongly encrypted job parameter files for dsjob command
Version 8.7 extends the existing encryption ability by permitting the dsjob command line “paramfile” option to specify a file that contains strongly encrypted parameter values. This provides encryption-at-rest for job parameters in the same way that the authfile described above provides it for login credentials.
dsjob -authfile c:\cred_file.txt -run -paramfile c:\paramfile.txt dstage1 testJob
Encryption Algorithm and Customization
Information Server provides the AES-128 symmetric-key encryption algorithm, which is compliant with many government and industry standards. While this will satisfy most companies’ security requirements, other organizations may wish to use their own standard encryption variant. Information Server 8.7 provides the means to override the out-of-the-box algorithm in such instances.
> /opt/IBM/InformationServer/ASBNode/bin/encrypt.sh myPa$$w0rd
> {iisenc}PvqKLr7z3QOLJCQ4QhbrrA==
Enterprise ready platform
# sample credentials file
user=dsadm
password={iisenc}HEf6s6cG+Ee6NdGDQppQNg==
domain=[2002:920:c000:217:9:32:217:32]:9080
server=RemoteServer
# sample parameter file
ftpuser=myftpid
ftppwd={iisenc}Rft6sd35!ERexg67uiPLkmM3err+
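A site might want to check, before deployment, that no credential or parameter file still carries a plain-text secret. The sketch below parses the `key=value` layout shown in the samples and flags sensitive keys whose values lack the `{iisenc}` prefix. It is a hypothetical lint step, not an IBM utility; treating `#` lines as comments and the choice of which keys count as sensitive are assumptions.

```python
ENC_PREFIX = "{iisenc}"


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and (assumed) # comment lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only: encrypted values may end in '=' padding.
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def plaintext_keys(entries: dict, sensitive=("password", "ftppwd")) -> list:
    """Return the sensitive keys whose values are not marked as encrypted."""
    return [k for k in sensitive
            if k in entries and not entries[k].startswith(ENC_PREFIX)]


sample = """# sample credentials file
user=dsadm
password={iisenc}HEf6s6cG+Ee6NdGDQppQNg==
domain=[2002:920:c000:217:9:32:217:32]:9080
server=RemoteServer
"""
entries = parse_properties(sample)
print(plaintext_keys(entries))
print(plaintext_keys(parse_properties("password=secret")))
```

The check only looks at the prefix; it cannot tell whether the text after `{iisenc}` is a valid encrypted value.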
New Security Features
Mixed LDAP and OS Support via PAM
Provides software access both to the actual end users, via accounts defined in their LDAP server, and to “functional users” that are defined via the local OS. These functional users represent a role that may be shared by multiple individuals within the organization or may simply be assigned to a scripted process.
The topic in the publicly available Information Center is “Configuring IBM InfoSphere Information Server to use PAM (Linux, UNIX)”.
Automated audit trail for all DBMS access with InfoSphere Guardium
Provided for satisfying regulatory mandates to protect sensitive information such as financial records and personally identifiable information (PII). Information Server 8.7 connectors can be easily configured to automatically register audit trail entries into InfoSphere Guardium so that the organization has a complete view of the origin of any data integration activity that affects the database.
Enterprise ready platform
Metadata Asset Manager
Managed Metadata Import
- Create, manage and delete Import Areas
- Import metadata via Metabrokers/Bridges
- Compare import results with active repository
- Understand impact/results of import
- Publish import to active repository
- Re-import, including implied delete logic
- View history and compare import events
- Support for Logical Models
Metadata Management
- Browse Shared Metadata, View Usage before
Delete, Delete
- Manage duplicate and disconnected
metadata
- Advanced duplicate management / create
relationships
Governance
Smarter Metadata Management
InfoSphere Blueprint Director Versioning, Task Management and Milestone planning
User Productivity
• Visibility of development progress in the blueprint of your project through integration with
task management (Rational Team Concert)
• Milestone planning support in your blueprints by defining milestones for elements and
visualizing evolution along milestones
• Versioning support for blueprints through integration with source code control management
systems (CVS, Rational Team Concert, etc.), incl. comparing changes
Thank You
Tony Curcio IBM Software Group
InfoSphere Product Management
www.ibm.com/software/data/infosphere