DataStage Enterprise Edition
Proposed Course Agenda
Day 1: Review of EE Concepts, Sequential Access, Best Practices, DBMS as Source
Day 2: EE Architecture, Transforming Data, DBMS as Target, Sorting Data
Day 3: Combining Data, Configuration Files, Extending EE, Meta Data in EE
Day 4: Job Sequencing, Testing and Debugging
The Course Material
Course Manual
Exercise Files and Exercise Guide
Online Help
Using the Course Material
Suggestions for learning: take notes, review previous material, practice, learn from errors
Intro Part 1
Introduction to DataStage EE
What is DataStage?
Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
DataStage Server and Clients
DataStage Administrator
Client Logon
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import meta data into Manager
Build job in Designer
Compile in Designer
Validate, run, and monitor in Director
DataStage Projects
Quiz True or False
DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Intro Part 2
Configuring Projects
Module Objectives
After this module you will be able to: Explain how to create and delete projects Set project properties in Administrator Set EE global properties in Administrator
Project Properties
Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator
Setting Project Properties
To set project properties, log onto Administrator, select your project, and then click Properties
Licensing Tab
Projects General Tab
Environment Variables
Permissions Tab
Tracing Tab
Tunables Tab
Parallel Tab
Intro Part 3
Managing Meta Data
Module Objectives
After this module you will be able to: Describe the DataStage Manager components and functionality Import and export DataStage objects Import metadata for a sequential file
What Is Metadata?
[Diagram: source meta data and target meta data describe the Data flowing through a Transform; both are stored in the Meta Data Repository]
DataStage Manager
Manager Contents
Metadata describing sources and targets: table definitions
DataStage objects: jobs, routines, table definitions, etc.
Import and Export
Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers
Export Procedure
In Manager, click Export > DataStage Components
Select DataStage objects for export
Specify type of export: DSX, XML
Specify file path on client machine
Quiz: True or False?
You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.
Quiz: True or False?
The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Import Procedure
In Manager, click Import > DataStage Components
Select DataStage objects for import
Importing DataStage Objects
Import Options
Exercise
Import DataStage Component (table definition)
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages
Sequential File Import Procedure
In Manager, click Import > Table Definitions > Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select the Manager category
Examine format and column definitions and edit if necessary
Manager Table Definition
Importing Sequential Metadata
Intro Part 4
Designing and Documenting Jobs
Module Objectives
After this module you will be able to: Describe what a DataStage job is List the steps involved in creating a job Describe links and stages Identify the different types of stages Design a simple extraction and load job Compile your job Create parameters to make your job flexible Document your job
What Is a Job?
Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface
Compiles into Orchestrate shell language (OSH)
Job Development Overview
In Manager, import metadata defining sources and targets
In Designer, add stages defining data extractions and loads; add Transformers and other stages to define data transformations; add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
Designer Work Area
Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Job properties
Compile
Tools Palette
Adding Stages and Links
Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another
Sequential File Stage
Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited
Specify column definitions
Specify write action
Job Creation Example Sequence
Brief walkthrough of procedure Presumes meta data already loaded in repository
Designer - Create New Job
Drag Stages and Links Using Palette
Assign Meta Data
Editing a Sequential Source Stage
Editing a Sequential Target
Transformer Stage
Used to define constraints, derivations, and column mappings
A column mapping maps an input column to an output column
In this module we will just define column mappings (no derivations)
Transformer Stage Elements
Create Column Mappings
Creating Stage Variables
Result
Adding Job Parameters
Makes the job more flexible Parameters can be: Used in constraints and derivations Used in directory and file names
Parameter values are determined at run time
Adding Job Documentation
Job Properties Short and long descriptions Shows in Manager
Annotation stage Is a stage on the tool palette Shows on the job GUI (work area)
Job Properties Documentation
Annotation Stage on the Palette
Annotation Stage Properties
Final Job Work Area with Documentation
Compiling a Job
Errors or Successful Message
Intro Part 5
Running Jobs
Module Objectives
After this module you will be able to: Validate your job Use DataStage Director to run your job Set run options Monitor your job's progress View job log messages
Prerequisite to Job Execution
Result from Designer compile
DataStage Director
Can schedule, validate, and run jobs
Can be invoked from DataStage Manager or Designer: Tools > Run Director
Running Your Job
Run Options Parameters and Limits
Director Log View
Message Details are Available
Other Director Functions
Schedule job to run on a particular date/time Clear job log Set Director options Row limits Abort after x warnings
Module 1
DSEE DataStage EE Review
Ascential's Enterprise Data Integration Platform
[Diagram: data flows from ANY SOURCE (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, web services, data warehouse, other apps) to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps) through three steps: DISCOVER (Data Profiling — gather relevant information for target enterprise applications), PREPARE (Data Quality — cleanse, correct, and match input data), and TRANSFORM (Extract, Transform, Load — standardize and enrich data and load to targets), all built on command & control, parallel execution, and meta data management]
Course Objectives
You will learn to: Build DataStage EE jobs using complex logic Utilize parallel processing techniques to increase job performance Build custom stages based on application needs
Course emphasis is: Advanced usage of DataStage EE Application job development Best practices techniques
Course Agenda
Day 1: Review of EE Concepts, Sequential Access, Standards, DBMS Access
Day 2: EE Architecture, Transforming Data, Sorting Data
Day 3: Combining Data, Configuration Files
Day 4: Extending EE, Meta Data Usage, Job Control, Testing
Module Objectives
Provide a background for completing work in the DSEE course
Tasks: review concepts covered in the DSEE Essentials course
Skip this module if you recently completed the DataStage EE essentials modules
Review Topics
DataStage architecture DataStage client review Administrator Manager Designer Director
Parallel processing paradigm
DataStage Enterprise Edition (DSEE)
Client-Server Architecture
[Diagram: the DataStage clients — Designer, Director, Administrator, and Manager — run on Microsoft Windows NT/2000/XP and connect to the Server and Repository, which run on Microsoft Windows NT or UNIX. The server side provides command & control, parallel execution, and meta data management for the discover/extract, prepare/transform/cleanse, and extend/integrate steps, moving data from ANY SOURCE to ANY TARGET (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, web services, data warehouse, other apps)]
Process Flow
Administrator — add/delete projects, set defaults
Manager — import meta data, back up projects
Designer — assemble jobs, compile, and execute
Director — execute jobs, examine job run logs
Administrator Licensing and Timeout
Administrator Project Creation/Removal
Functions specific to a project.
Administrator Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
Administrator Environment Variables
Variables are category specific
OSH is what is run by the EE Framework
DataStage Manager
Export Objects to MetaStage
Push meta data to MetaStage
Designer Workspace
Can execute the job from Designer
DataStage Generated OSH
The EE Framework runs OSH
Director Executing Jobs
Messages from previous run in different color
Stages
Can now customize the Designer's palette: select desired stages and drag to Favorites
Popular Developer Stages
Row generator
Peek
Row Generator
Can build test data
Edit row in column tab
Repeatable property
Peek
Displays field values Will be displayed in job log or sent to a file Skip records option Can control number of records to be displayed
Can be used as stub stage for iterative development (more later)
Why EE is so Effective
Parallel processing paradigm More hardware, faster processing Level of parallelization is determined by a configuration file read at runtime
Emphasis on memory Data read into memory and lookups performed like hash table
Parallel Processing Systems
DataStage EE enables parallel processing = executing your application on multiple CPUs simultaneously
If you add more resources (CPUs, RAM, and disks) you increase system performance
[Diagram: example system containing 6 CPUs (or processing nodes) and disks]
Scalable Systems: Examples
Three main types of scalable systems:
Symmetric Multiprocessors (SMP): shared memory and disk
Clusters: UNIX systems connected via networks
MPP: Massively Parallel Processing
SMP: Shared Everything
Multiple CPUs with a single operating system
Programs communicate using shared memory
All CPUs share system resources (OS, memory with single linear address space, disks, I/O)
When used with Enterprise Edition: data transport uses shared memory; simplified startup
Enterprise Edition treats NUMA (Non-Uniform Memory Access) as plain SMP
Traditional Batch Processing
[Diagram: operational data and archived data land to disk between each of the Transform, Clean, and Load steps on the way to the data warehouse]
Traditional approach to batch processing: write to disk and read from disk before each processing operation
Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O; processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes: disk I/O consumes the processing; terabytes of disk required for temporary staging
Pipeline Multiprocessing
Data pipelining: transform, clean, and load processes execute simultaneously on the same processor; rows move forward through the flow
[Diagram: operational data and archived data flow through Transform, Clean, and Load directly into the data warehouse, with no intermediate disk]
Start a downstream process while an upstream process is still running. This eliminates intermediate storing to disk, which is critical for big data. This also keeps the processors busy. Still has limits on scalability.
Think of a conveyor belt moving the rows from process to process!
Partition Parallelism
Data partitioning: break up big data into partitions
Run one partition on each processor: 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
[Diagram: source data partitioned by last-name ranges A-F, G-M, N-T, and U-Z, each partition feeding its own Transform on Nodes 1-4]
This is exactly how parallel databases work!
Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
[Diagram: partitioned source data flows through pipelined Transform, Clean, and Load stages into the data warehouse]
Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
[Diagram: source data is partitioned by customer last name (A-F, G-M, N-T, U-Z) for Transform, repartitioned by customer zip code for Clean, and repartitioned by credit card number for Load — without landing to disk!]
EE Program Elements
Dataset: a uniform set of rows in the Framework's internal representation. Three flavors:
1. File sets (*.fs): stored on multiple Unix files as flat files
2. Persistent (*.ds): stored on multiple Unix files in Framework format; read and written using the DataSet Stage
3. Virtual (*.v): links, in Framework format, NOT stored on disk
The Framework processes only datasets — hence the possible need for Import
Different datasets typically have different schemas
Convention: "dataset" = Framework data set
Partition: a subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file). All the partitions of a dataset follow the same schema: that of the dataset.
DataStage EE Architecture
DataStage: provides the data integration platform
Orchestrate Framework: provides application scalability
[Diagram: a sequential Orchestrate program (Import from flat files and relational data, Clean 1, Merge, Clean 2, Analyze) plus a configuration file are executed by the Orchestrate Application Framework and Runtime System, which supplies parallel access to data in files and in RDBMS, parallel pipelining, parallelization of operations, inter-node communications, centralized error handling and event logging, and performance visualization]
DataStage Enterprise Edition: best-of-breed scalable data integration platform; no limitations on data volumes or throughput
Introduction to DataStage EE
DSEE: Automatically scales to fit the machine Handles data flow among multiple CPUs and disks
With DSEE you can: Create applications for SMPs, clusters and MPPs Enterprise Edition is architecture-neutral Access relational databases in parallel Execute external applications in parallel Store data across multiple disks and nodes
Job Design vs. Execution
Developer assembles the data flow using the Designer and gets parallel access, propagation, transformation, and load. The design is good for 1 node, 4 nodes, or N nodes. To change the number of nodes, just swap the configuration file. No need to modify or recompile the design.
Partitioners and Collectors
Partitioners distribute rows into partitions and implement data-partition parallelism
Collectors = inverse partitioners
They live on input links of stages running in parallel (partitioners) or sequentially (collectors)
Use a choice of methods
Example Partitioning Icons
[Diagram: partitioner icon shown on a link]
Exercise
Complete exercises 1-1, 1-2, and 1-3
Module 2
DSEE Sequential Access
Module Objectives
You will learn to: Import sequential files into the EE Framework Utilize parallel processing techniques to increase sequential file access Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages Manage partitioned data stored by the Framework
Types of Sequential Data Stages
Sequential Fixed or variable length
File Set Lookup File Set
Data Set
Sequential Stage Introduction
The EE Framework processes only datasets For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations this is performed by import and export OSH operators generated by Sequential or FileSet stages
During import or export DataStage performs format translations into, or out of, the EE internal format
Data is described to the Framework in a schema
How the Sequential Stage Works
Generates Import/Export operators, depending on whether stage is source or target Performs direct C++ file I/O streams
Using the Sequential File Stage
Both import and export of general files (text, binary) are performed by the Sequential File Stage
Importing/Exporting Data
Data import converts the file into the EE internal format; data export converts out of the EE internal format
Working With Flat Files
Sequential File Stage
Normally executes in sequential mode
Can be parallel if reading multiple files (file pattern option)
Can use multiple readers within a node
DSEE needs to know how the file is divided into rows and how each row is divided into columns
Processes Needed to Import Data
Recordization Divides input stream into records Set on the format tab
Columnization Divides the record into columns Default set on the format tab but can be overridden on the columns tab Can be incomplete if using a schema or not even specified in the stage if using RCP
File Format Example
[Diagram: a record of comma-delimited fields ending in a newline record delimiter. With Final Delimiter = end, the last field is followed directly by the record delimiter; with Final Delimiter = comma, a trailing field delimiter precedes the record delimiter]
Sequential File Stage
To set the properties, use stage editor Page (general, input/output) Tabs (format, columns)
Sequential stage link rules: one input link; one output link (not counting the reject link); one reject link
Will reject any records not matching the meta data in the column definitions
Job Design Using Sequential Stages
Stage categories
General Tab Sequential Source
Multiple output links
Show records
Properties Multiple Files
Click to add more files having the same meta data.
Properties - Multiple Readers
Multiple readers option allows you to set number of readers
Format Tab
File into records
Record into columns
Read Methods
Reject Link
Reject mode = Output
Source: all records not matching the meta data (the column definitions)
Target: all records that are rejected for any reason
Meta data: one column, data type = raw
File Set Stage
Can read or write file sets
Files suffixed by .fs
A file set consists of:
1. A descriptor file: contains locations of the raw data files + meta data
2. Individual raw data files
Can be processed in parallel
File Set Stage Example
Descriptor file
File Set Usage
Why use a file set?
2 GB limit on some file systems
Need to distribute data among nodes to prevent overruns
If used in parallel, runs faster than a sequential file
Lookup File Set Stage
Can create file sets Usually used in conjunction with Lookup stages
Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
Data Set
Operating system (Framework) file Suffixed by .ds Referred to by a control file Managed by Data Set Management utility from GUI (Manager, Designer, Director) Represents persistent data Key to good performance in set of linked jobs
Persistent Datasets
Accessed from/to disk with the DataSet Stage. Two parts:
Descriptor file: contains metadata and data location, but NOT the data itself. For example, input.ds might contain: record ( partno: int32; description: string; )
Data file(s): contain the data as multiple Unix files (one per node, e.g. node1:/local/disk1/, node2:/local/disk2/), accessible in parallel
Quiz!
True or False? Everything that has been data-partitioned must be collected in the same job
Data Set Stage
Is the data partitioned?
Engine Data Translation
Occurs on import From sequential files or file sets From RDBMS
Occurs on export From datasets to file sets or sequential files From datasets to RDBMS
Engine is most efficient when processing internally formatted records (i.e., data contained in datasets)
Managing DataSets
GUI (Manager, Designer, Director): Tools > Data Set Management
Alternative method: orchadmin, a Unix command-line utility — list records; remove data sets (will remove all components)
dsrecords
Lists the number of records in a dataset
Data Set Management
Display data
Schema
Data Set Management From Unix
Alternative method of managing file sets and data sets
dsrecords — gives a record count; Unix command-line utility, e.g.:
$ dsrecords myDS.ds
156999 records
orchadmin — manages EE persistent data sets; Unix command-line utility, e.g.:
$ orchadmin rm myDataSet.ds
Exercise
Complete exercises 2-1, 2-2, 2-3, and 2-4.
Module 3
Standards and Techniques
Objectives
Establish standard techniques for DSEE development Will cover: Job documentation Naming conventions for jobs, links, and stages Iterative job design Useful stages for job development Using configuration files for development Using environmental variables Job parameters
Job Presentation
Document using the annotation stage
Job Properties Documentation
Organize jobs into categories
Description shows in DS Manager and MetaStage
Naming conventions
Stages named after the data they access or the function they perform — do NOT leave defaulted stage names like Sequential_File_0
Links named for the data they carry — do NOT leave defaulted link names like DSLink3
Stage and Link Names
Stages and links renamed to data they handle
Create Reusable Job Components
Use Enterprise Edition shared containers when feasible
Container
Use Iterative Job Design
Use copy or peek stage as stub Test job in phases small first, then increasing in complexity Use Peek stage to examine records
Copy or Peek Stage Stub
Copy stage
Transformer Stage Techniques
Suggestions Always include reject link. Always test for null value before using a column in a function. Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later. Be aware of Column and Stage variable Data Types.
Often user does not pay attention to Stage Variable type. Try to maintain the data type as imported.
Avoid type conversions.
The Copy Stage
With 1 link in and 1 link out, the Copy Stage is the ultimate "no-op" (place-holder)
Partitioners, Sort / Remove Duplicates, and Rename / Drop column can be inserted on:
input link (Partitioning tab): partitioners, Sort, Remove Duplicates
output link (Mapping page): Rename, Drop
Sometimes replaces the Transformer
Developing Jobs
1. Keep it simple: jobs with many stages are hard to debug and maintain.
2. Start small and build to the final solution: use View Data, Copy, and Peek; start from the source and work out; develop with a 1-node configuration file.
3. Solve the business problem before the performance problem: don't worry too much about partitioning until the sequential flow works as expected.
4. If you have to write to disk, use a persistent data set.
Final Result
Good Things to Have in each Job
Use job parameters
Some helpful environment variables to add as job parameters:
$APT_DUMP_SCORE — reports the OSH score to the message log
$APT_CONFIG_FILE — establishes runtime parameters for the EE engine, e.g. the degree of parallelization
Setting Job Parameters
Click to add environment variables
DUMP SCORE Output
Setting APT_DUMP_SCORE yields a report showing each partitioner and collector and the mapping of nodes to partitions
Use Multiple Configuration Files
Make a set for 1X, 2X, etc. Use different ones for test versus production. Include as a parameter in each job.
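A sketch of how the switch is typically made (file paths hypothetical): keep one configuration file per degree of parallelism and point $APT_CONFIG_FILE, or the equivalent job parameter, at the one you want:

$ export APT_CONFIG_FILE=/opt/dsee/configs/1node.apt   # development: sequential execution
$ export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt   # test/production: 4-way parallel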
Exercise
Complete exercise 3-1
Module 4
DBMS Access
Objectives
Understand how DSEE reads and writes records to an RDBMS Understand how to handle nulls on DBMS lookup Utilize this knowledge to: Read and write database tables Use database tables to lookup data Use null handling options to clean data
Parallel Database Connectivity
Traditional client-server: each client application has only one connection to the RDBMS; only the RDBMS runs in parallel; suitable only for small data volumes
Enterprise Edition: the parallel server runs the APPLICATIONS; each application (e.g. Sort, Load) has parallel connections to the parallel RDBMS; suitable for large data volumes; higher levels of integration possible
[Diagram: many clients each holding a single connection to a parallel RDBMS, versus EE stages holding parallel connections to the parallel RDBMS]
RDBMS Access: Supported Databases
Enterprise Edition provides high performance / scalable interfaces for:
DB2 Informix Oracle Teradata
RDBMS Access
Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions RDBMS nulls converted to/from nullable field values Support for standard SQL syntax for specifying: field list for SELECT statement filter for WHERE clause
Can write an explicit SQL query to access RDBMS EE supplies additional information in the SQL query
RDBMS Stages
DB2/UDB Enterprise Informix Enterprise Oracle Enterprise Teradata Enterprise
RDBMS Usage
As a source Extract data from table (stream link) Extract as table, generated SQL, or user-defined SQL User-defined can perform joins, access views
Lookup (reference link) Normal lookup is memory-based (all table data read into memory) Can perform one lookup at a time in DBMS (sparse option) Continue/drop/fail options
As a target Inserts Upserts (Inserts and updates) Loader
RDBMS Source Stream Link
Stream link
DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data in columns tab
Exercise
User-defined SQL Exercise 4-1
DBMS Source Reference Link
Reject link
Lookup Reject Link
Output option automatically creates the reject link
Null Handling
Must handle null condition if lookup record is not found and continue option is chosen Can be done in a transformer stage
Lookup Stage Mapping
Link name
Lookup Stage Properties
Reference link
Must have same column name in input and reference links. You will get the results of the lookup in the output column.
DBMS as a Target
DBMS As Target
Write Methods Delete Load Upsert Write (DB2)
Write mode for load method Truncate Create Replace Append
Target Properties
Generated code can be copied
Upsert mode determines options
Checking for Nulls
Use Transformer stage to test for fields with null values (Use IsNull functions) In Transformer, can reject or load default value
Exercise
Complete exercise 4-2
Module 5
Platform Architecture
Objectives
Understand how Enterprise Edition Framework processes data You will be able to: Read and understand OSH Perform troubleshooting
Concepts
The Enterprise Edition platform:
Script language — OSH (generated by the DataStage Parallel Canvas, and run by DataStage Director)
Communication — conductor, section leaders, players
Configuration files (only one active at a time; describes the hardware)
Meta data — schemas/tables
Schema propagation — RCP
EE extensibility — Buildop, Wrapper
Datasets (data in the Framework's internal representation)
DS-EE Stage Elements
EE stages involve a series of processing steps:
[Diagram: an input data set (schema: prov_num:int16; member_num:int8; custid:int32;) passes through a partitioner into the EE stage's business logic — a piece of application logic running against individual records, parallel or sequential — producing an output data set with the same schema]
DSEE Stage Execution
Dual parallelism eliminates bottlenecks! EE delivers parallelism in two ways: pipeline and partition
Block buffering between components eliminates the need for program load balancing and maintains orderly data flow from producer to consumer
Stages Control Partition Parallelism
Execution mode (sequential/parallel) is controlled by the stage
Default = parallel for most Ascential-supplied stages
Parallel stages insert the default partitioner (Auto) on their input links
Sequential stages insert the default collector (Auto) on their input links
Developer can override the default execution mode (parallel/sequential) on the Stage > Advanced tab, and the choice of partitioner/collector on the Input > Partitioning tab
How Parallel Is It?
Degree of parallelism is determined by the configuration file
Total number of logical nodes in the default pool, or a subset if using "constraints"
Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage
OSH
DataStage EE GUI generates OSH scripts Ability to view OSH turned on in Administrator OSH can be viewed in Designer using job properties
The Framework executes OSH What is OSH? Orchestrate shell Has a UNIX command-line interface
OSH Script
An osh script is a quoted string which specifies the operators and connections of a single Orchestrate step
In its simplest form, it is:
osh "op < in.ds > out.ds"
Where: op is an Orchestrate operator, in.ds is the input data set, and out.ds is the output data set
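Operators can also be chained with Unix-style pipes inside the quoted step. A sketch (remdup and copy are operators from the Orchestrate operator library — RemoveDups appears below as an existing operator; the key and dataset names are hypothetical):

$ osh "remdup -key custid < in.ds | copy > out.ds"

This removes rows with duplicate custid values and writes the survivors to a new persistent data set.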
OSH Operators
OSH Operator is an instance of a C++ class inheriting from APT_Operator Developers can create new operators Examples of existing operators: Import Export RemoveDups
Enable Visible OSH in Administrator
Will be enabled for all projects
View OSH in Designer
Operator
Schema
OSH Practice
Exercise 5-1 Instructor demo (optional)
Elements of a Framework Program
Operators
Datasets: sets of rows processed by the Framework — persistent (terminal) *.ds and virtual (internal) *.v; also flat file sets *.fs
Schema: data description (metadata) for datasets and links
Datasets consist of partitioned data and schema
Can be persistent (*.ds) or virtual (*.v, link)
Overcome the 2 GB file limit
[Diagram: what you program in the GUI — a single Operator A — generates the OSH "$ osh operator_A > x.ds"; what gets processed is Operator A running on Nodes 1-4, each partition writing its own data files of x.ds]
Multiple files per partition; each file up to 2 GB (or larger)
Computing Architectures: Definitions
[Diagram: three architectures — dedicated disk (one CPU with its own memory and disk); shared disk and shared memory across multiple CPUs; shared nothing, each CPU with its own memory and disk]
Uniprocessor: PC, workstation, single-processor server
SMP system (Symmetric Multiprocessor): IBM, Sun, HP, Compaq; 2 to 64 processors; shared memory and disk; majority of installations
Clusters and MPP systems: shared nothing; 2 to hundreds of processors; MPP: IBM and NCR Teradata; each node is a uniprocessor or SMP
Job Execution: Orchestrate
[Diagram: a Conductor process (C) on the conductor node, a Section Leader (SL) on each processing node, and Player processes (P) under each Section Leader]
Conductor — the initial DS/EE process: composes the step; creates Section Leader processes (one per node); consolidates messages and outputs them; manages orderly shutdown
Section Leader — forks Player processes (one per stage); manages up/down communication
Players — the actual processes associated with stages; combined players: one process only; send stderr to the SL; establish connections to other players for data flow; clean up upon completion
Communication: SMP — shared memory; MPP — TCP
Working with Configuration Files
You can easily switch between config files:
'1-node' file — for sequential execution, lighter reports; handy for testing
'MedN-nodes' file — aims at a mix of pipeline and data-partitioned parallelism
'BigN-nodes' file — aims at full data-partitioned parallelism
Only one file is active while a step is running
The Framework queries (first) the environment variable $APT_CONFIG_FILE
The number of nodes declared in the config file need not match the number of CPUs
The same configuration file can be used in development and production
Scheduling Nodes, Processes, and CPUs
DS/EE does not know how many CPUs are available, and does not schedule processes
Who knows what?
Nodes = number of logical nodes declared in the config file — known to the user and to Orchestrate
Ops = number of operators (approx. the number of stages) — known to Orchestrate
Processes = number of Unix processes = Nodes * Ops — known to Orchestrate and to the O/S
CPUs = number of available CPUs — known only to the O/S
Who does what? DS/EE creates (Nodes * Ops) Unix processes (e.g., 10 operators on a 4-node configuration yields about 40 processes); the O/S schedules these processes on the CPUs
Configuring DSEE Node Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
Configuring DSEE Disk Pools

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
Re-Partitioning
Parallel-to-parallel flow may incur reshuffling: records may jump between nodes
[Diagram: a partitioner redistributing records between node 1 and node 2]
Partitioning Methods
Auto, Hash, Entire, Range, Range Map
Collectors
Collectors combine partitions of a dataset into a single input stream to a sequential stage
Collectors do NOT synchronize data
[Diagram: data partitions feeding a collector, which feeds the sequential stage]
Partitioning and Repartitioning Are Visible On Job Design
Partitioning and Collecting Icons
Partitioner
Collector
Setting a Node Constraint in the GUI
Reading Messages in Director
Set APT_DUMP_SCORE to true Can be specified as job parameter Messages sent to Director log If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
Messages With APT_DUMP_SCORE = True
Exercise
Complete exercise 5-2
Module 6
Transforming Data
Module Objectives
Understand ways DataStage allows you to transform data Use this understanding to: Create column derivations using user-defined code or system functions Filter records based on business criteria Control data flow based on data conditions
Transformed Data
Transformed data:
An outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
May be comprised of system variables
Frequently uses functions performed on something (i.e., incoming columns)
Functions are divided into categories, e.g.:
Date and time
Mathematical
Logical
Null handling
More
Stages Review
Stages that can transform data:
Transformer — Parallel, and Basic (from the Parallel palette)
Aggregator (discussed in a later module)
Sample stages that do not transform data: Sequential, FileSet, DataSet, DBMS
Transformer Stage Functions
Control data flow Create derivations
Flow Control
Separate records flow down links based on data conditions specified in Transformer stage constraints
The Transformer stage can filter records
Other stages can filter records but do not exhibit advanced flow control:
Sequential can send bad records down a reject link
Lookup can reject records based on lookup failure
Filter can select records based on data value
Rejecting Data
Reject option on the Sequential stage: data does not agree with the meta data; output consists of one column with binary (raw) data type
Reject links from the Lookup stage result from the reject option of the If Not Found property: lookup failed; all columns go down the reject link (no column mapping option)
Reject constraints are controlled from the constraint editor of the Transformer: can control column mapping; use the Other/Log checkbox
Rejecting Data Example
Constraint Other/log option Property Reject Mode = Output If Not Found property
Transformer Stage Properties
Transformer Stage Variables
First of the Transformer stage entities to execute
Execute in order from top to bottom
Can write a program by using one stage variable to point to the results of a previous stage variable
Multi-purpose:
Counters
Hold values from previous rows to make comparisons
Hold derivations to be used in multiple field derivations
Can be used to control execution of constraints
Stage Variables
Show/Hide button
Transforming Data
Derivations Using expressions Using functions
Date/time
Transformer stage issues: sometimes require sorting before the Transformer stage, e.g. when using a stage variable as an accumulator and needing to break on a change of column value
Checking for nulls
Checking for Nulls
Nulls can get introduced into the dataflow because of failed lookups and the way in which you chose to handle this condition Can be handled in constraints, derivations, stage variables, or a combination of these
Transformer - Handling Rejects
Constraint Rejects All expressions are false and reject row is checked
Transformer: Execution Order
Derivations in stage variables are executed first
Constraints are executed before derivations
Column derivations in earlier links are executed before later links
Derivations in higher columns are executed before lower columns
Parallel Palette - Two Transformers
All > Processing > Transformer: the non-Universe transformer; has a specific set of functions; no DS routines available
Parallel > Processing > Basic Transformer: makes server-style transforms available on the parallel palette; can use DS routines
Program in Basic for both transformers
Transformer Functions From Derivation Editor
Date & Time Logical Null Handling Number String Type Conversion
Exercise
Complete exercises 6-1, 6-2, and 6-3
Module 7
Sorting Data
Objectives
Understand DataStage EE sorting options Use this understanding to create sorted list of data to enable functionality within a transformer stage
Sorting Data
Important because:
Some stages require sorted input
Some stages may run faster, e.g. Aggregator
Can be performed:
As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
As a separate Sort stage (more complex sorts)
Sorting Alternatives
Alternative representation of same flow:
Sort Option on Stage Link
Sort Stage
Sort Utility
DataStage (the default) or UNIX
Sort Stage - Outputs
Specifies how the output is derived
Sort Specification Options
Input link property: limited functionality; max memory per partition is 20 MB, then spills to scratch
Sort stage: tunable to use more memory before spilling to scratch
Note: spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
Removing Duplicates
Can be done by Sort stage Use unique option
OR
Remove Duplicates stage Has more sophisticated ways to remove duplicates
Exercise
Complete exercise 7-1
Module 8
Combining Data
Objectives
Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages Use this understanding to create jobs that will Combine data from separate input streams Aggregate data to form summary totals
Combining Data
There are two ways to combine data:
Horizontally: several input links; one output link (+ optional rejects) made of columns from different input links. E.g., Joins, Lookup, Merge
Vertically: one input link; one output link with columns combining values from all input rows. E.g., Aggregator
Join, Lookup & Merge Stages
These "three Stages" combine two or more input links according to values of user-designated "key" column(s). They differ mainly in: Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)
Not all Links are Created Equal Enterprise Edition distinguishes between:- The Primary Input (Framework port 0) - Secondary - in some cases "Reference" (other ports)
Naming convention:Joins Left Right Lookup Source LU Table(s) Merge Master Update(s)
Primary Input: port 0 Secondary Input(s): ports 1,
Tip: Check "Input Ordering" tab to make sure intended Primary is listed first
Join Stage Editor
Link order is immaterial for inner and full outer joins, but VERY important for left/right outer joins, Lookup, and Merge
One of four variants: Inner Left Outer Right Outer Full Outer
Several key columns allowed
1. The Join Stage
Four types: inner, left outer, right outer, full outer
2 sorted input links, 1 output link
"Left outer" on the primary input; "right outer" on the secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
2. The Lookup Stage
Combines one source link with one or more duplicate-free table links: a source input and one or more lookup tables (LUTs)
No pre-sort necessary; allows multiple keys; flexible exception handling for source input rows with no match
[Diagram: the source (port 0) and lookup tables (ports 1, 2) feed the Lookup stage, which produces Output and Reject links]
The Lookup Stage
Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging) On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link On an SMP, no physical duplication of a Lookup Table occurs
The Lookup Stage
Lookup File Set Like a persistent data set only it contains metadata about the key. Useful for staging lookup tables
RDBMS lookup:
NORMAL — loads into an in-memory hash table first
SPARSE — a SELECT for each row; might become a performance bottleneck
3. The Merge Stage
Combines one sorted, duplicate-free master (primary) link with one or more sorted update (secondary) links. Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).
Follows the Master-Update model:
A master row and one or more update rows are merged if they have the same value in user-specified key column(s)
If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored)
Unmatched ("bad") master rows can be either kept or dropped
Unmatched ("bad") update rows in an input link can be captured in a "reject" link
Matched update rows are consumed
The Merge Stage
Allows composite keys; one master and one or more updates; multiple update links
Matched update rows are consumed; unmatched updates can be captured
Lightweight; space/time tradeoff: presorts vs. in-RAM table
[Diagram: Master (port 0) and Update links (ports 1, 2) feed the Merge stage, producing an Output link and one Reject link per update link]
Synopsis: Joins, Lookup, & Merge
(In each row, entries are listed in the order Joins; Lookup; Merge)
Model: RDBMS-style relational; source with in-RAM LU table; master/update(s)
Memory usage: light; heavy; light
Number and names of inputs: exactly 2 (1 left, 1 right); 1 source, N LU tables; 1 master, N update(s)
Mandatory input sort: both inputs; no; all inputs
Duplicates in primary input: OK (x-product); OK; Warning!
Duplicates in secondary input(s): OK (x-product); Warning!; OK only when N = 1
Options on unmatched primary: NONE; [fail] | continue | drop | reject; [keep] | drop
Options on unmatched secondary: NONE; NONE; capture in reject set(s)
On match, secondary entries are: reusable; reusable; consumed
Number of outputs: 1; 1 out (1 reject); 1 out (N rejects)
Captured in reject set(s): nothing (N/A); unmatched primary entries; unmatched secondary entries
The Aggregator Stage
Purpose: perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions: count (nulls/non-nulls), sum, max/min/range
The grouping method (hash table or pre-sort) is a performance issue
Grouping Methods
Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
Doesn't require sorted data
Good when the number of unique groups is small: the running tally for each group's aggregate calculations needs to fit easily into memory — requires about 1 KB of RAM per group
Example: average family income by state requires about 0.05 MB of RAM (50 groups × ~1 KB)
Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out
Requires input sorted by the grouping keys
Can handle unlimited numbers of groups
Example: average daily balance by credit card
Aggregator Functions
Sum Min, max Mean Missing value count Non-missing value count Percent coefficient of variation
Aggregator Properties
Aggregation Types
Aggregation types
Containers
Two varieties Local Shared
Local: simplifies a large, complex diagram
Shared: creates a reusable object that many jobs can include
Creating a Container
Create a job Select (loop) portions to containerize Edit > Construct container > local or shared
Using a Container
Select as though it were a stage
Exercise
Complete exercise 8-1
Module 9
Configuration Files
Objectives
Understand how DataStage EE uses configuration files to determine parallel behavior Use this understanding to Build a EE configuration file for a computer system Change node configurations to support adding resources to processes that need them Create a job that will change resource allocations at the stage level
Configuration File Concepts
Determines the processing nodes and disk space connected to each node
When the system changes, you need only change the configuration file — no need to recompile jobs
When a DataStage job runs, the platform reads the configuration file and automatically scales the application to fit the system
Processing Nodes Are
Locations on which the Framework runs applications
Logical rather than physical constructs
Do not necessarily correspond to the number of CPUs in your system — typically one node for two CPUs
Can define one processing node for multiple physical nodes, or multiple processing nodes for one physical node
Optimizing Parallelism
Degree of parallelism determined by number of nodes defined Parallelism should be optimized, not maximized Increasing parallelism distributes work load but also increases Framework overhead
Hardware influences degree of parallelism possible System hardware partially determines configuration
More Factors to Consider
Communication amongst operators Should be optimized by your configuration Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link
SMP leave some processors for operating system Desirable to equalize partitioning of data Use an experimental approach Start with small data sets Try different parallelism while scaling up data set sizes
Factors Affecting Optimal Degree of Parallelism
CPU intensive applications Benefit from the greatest possible parallelism
Applications that are disk intensive Number of logical nodes equals the number of disk spindles being accessed
Configuration File
Text file containing string data that is passed to the Framework
Sits on the server side; can be displayed and edited
Name and location are found in the environment variable APT_CONFIG_FILE
Components: node, fastname, pools, resource
Node Options
Node name — the name of a processing node used by EE; typically the network name; use the command uname -n to obtain the network name
Fastname — the name of the node as referred to by the fastest network in the system; operators use the physical node name to open connections; NOTE: for SMP, all CPUs share a single connection to the network
Pools — names of the pools to which this node is assigned; used to logically group nodes; can also be used to group resources
Resource — disk, scratchdisk
Sample Configuration File

{
  node "Node1" {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
Disk Pools
Disk pools allocate storage
By default, EE uses the default pool, specified by ""
A named pool such as "bigdata" can be assigned to specific disk resources
Sorting Requirements
Resource pools can also be specified for sorting: the Sort stage looks first for scratchdisk resources in a "sort" pool, and then in the default disk pool
Another Configuration File Example

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
  ...
}
Resource Types
Disk, scratchdisk, DB2, Oracle, saswork, sortwork
Can exist in a pool — groups resources together
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup type
Building a Configuration File
Scoping the hardware:
Is the hardware configuration SMP, cluster, or MPP?
Define each node's structure (an SMP would be a single node):
Number of CPUs
CPU speed
Available memory
Available page/swap space
Connectivity (network/back-panel speed)
Is the machine dedicated to EE? If not, what other applications are running on it? Get a breakdown of the resource usage (vmstat, mpstat, iostat)
Are there other configuration restrictions? E.g. the DB only runs on certain nodes and ETL cannot run on them?
Exercise
Complete exercise 9-1 and 9-2
Module 10
Extending DataStage EE
Objectives
Understand the methods by which you can add functionality to EE Use this understanding to: Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages Build a DataStage EE job that uses the new stage
EE Extensibility Overview
Sometimes it will be to your advantage to leverage EE's extensibility, which includes:
Wrappers
Buildops
Custom stages
When To Leverage EE Extensibility
Types of situations: complex business logic not easily accomplished using standard EE stages; reuse of existing C, C++, Java, COBOL, etc.
Wrappers vs. Buildop vs. Custom
Wrappers are good if you cannot or do not want to modify the application and performance is not critical. Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.
Building Wrapped Stages
You can wrapper a legacy executable — a binary, a Unix command, or a shell script — and turn it into an Enterprise Edition stage capable, among other things, of parallel execution
As long as the legacy executable is:
amenable to data-partition parallelism: no dependencies between rows
pipe-safe: can read rows sequentially; no random access to data
Wrappers (Contd)
Wrappers are treated as a black box
EE has no knowledge of the contents
EE has no means of managing anything that occurs inside the wrapper
EE only knows how to export data to and import data from the wrapper
The user must know at design time the intended behavior of the wrapper and its schema interface
If the wrappered application needs to see all records prior to processing, it cannot run in parallel
LS Example
Can this command be wrappered?
Creating a Wrapper
To create the ls stage
Used in this job ---
Wrapper Starting Point
Creating Wrapped Stages
From Manager: right-click on Stage Type > New Parallel Stage > Wrapped
We will wrapper an existing Unix executable: the ls command
Wrapper - General Page
Name of stage
Unix command to be wrapped
The "Creator" Page
Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.
Wrapper Properties Page
If your stage will have properties appear, complete the Properties page
This will be the name of the property as it appears in your stage
Wrapper - Wrapped Page
Interfaces — input and output columns; these should first be entered into the table definitions meta data (DS Manager); let's do that now
Interface schemas (layout interfaces) describe what columns the stage:
Needs for its inputs (if any)
Creates for its outputs (if any)
Should be created as tables with columns in Manager
Column Definition for Wrapper Interface
How Does the Wrapping Work?
Define the schema for export and import; the schemas become interface schemas of the operator and allow for by-name column access
[Diagram: rows are exported through the input schema to the UNIX executable's stdin (or a named pipe); the executable's stdout (or a named pipe) is imported through the output schema]
QUIZ: Why does export precede import?
Update the Wrapper Interfaces
This wrapper will have no input interface, i.e. no input link. The location will come as a job parameter that is passed to the appropriate stage property. Therefore, only the Output tab entry is needed.
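At runtime the wrapped stage behaves roughly as follows (a sketch; the directory value comes from the hypothetical location job parameter): the Framework executes the command and imports each line of its stdout as one row matching the one-column output interface schema.

$ ls /data/source_files    # each stdout line becomes one output row (a single string column)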
Resulting Job
Wrapped stage
Job Run
Show file from Designer palette
Wrapper Story: Cobol Application
Hardware Environment: IBM SP2, 2 nodes with 4 CPUs per node.
Software: DB2/EEE, COBOL, EE
Original COBOL Application: Extracted source table, performed lookup against table in DB2, and Loaded results to target table. 4 hours 20 minutes sequential execution
Enterprise Edition Solution: Used EE to perform Parallel DB2 Extracts and Loads Used EE to execute COBOL application in Parallel EE Framework handled data transfer between DB2/EEE and COBOL application 30 minutes 8-way parallel execution
Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (unlike the wrapper)
Reasons to use Buildop include:
Speed / performance
Complex business logic that cannot be easily represented using existing stages: lookups across a range of values, surrogate key generation, rolling aggregates
Build once and reuse everywhere within the project — no shared container necessary
Can combine functionality from different stages into one
BuildOps The DataStage programmer encapsulates the business logic
The Enterprise Edition interface called buildop automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary plumbing for a correct and efficient parallel execution. Exploits extensibility of EE Framework
BuildOp Process Overview
From Manager (or Designer), in the Repository pane: right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}
"Build" stages are coded from within Enterprise Edition; "Wrapped" stages wrap existing Unix executables
General Page
Identical to Wrappers, except: under the Build tab, your program!
Logic Tab for Business Logic
Enter business C/C++ logic and arithmetic in the four pages under the Logic tab
The main code section goes in the Per-Record page — it will be applied to all rows
NOTE: code will need to be ANSI C/C++ compliant. If the code does not compile outside of EE, it won't compile within EE either!
Code Sections under Logic Tab
Temporary variables declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
I/O and Transfer
Under the Interface tab: Input, Output & Transfer pages
First line: output 0
Optional renaming of output port from default "out0"
Write row
In-Repository Table Definition
Input page: 'Auto Read' Read next row
'False' setting, not to interfere with Transfer page
I/O and Transfer
First line: Transfer of index 0
Transfer all columns from input to output. If the page is left blank or Auto Transfer = "False" (and RCP = "False"), only columns in the output Table Definition are written
BuildOp Simple Example
Example — sumNoTransfer: add input columns "a" and "b"; ignore other columns that might be present in the input; produce a new "sum" column; do not transfer input columns
[Diagram: input schema a:int32; b:int32 — sumNoTransfer stage — output schema sum:int32]
No Transfer
From Peek:
NO TRANSFER — RCP set to "False" in the stage definition, and the Transfer page left blank or Auto Transfer = "False"
Effects: input columns "a" and "b" are not transferred; only the new column "sum" is transferred
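For this example the entire Per-Record page can be a single statement — a minimal sketch, assuming Auto Read and Auto Write are enabled so the Framework reads each input row and writes each output row around the code:

// Per-Record page: a, b, and sum are the column names from the
// stage's input and output Table Definitions
sum = a + b;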
Compare with transfer ON
Transfer
TRANSFER — RCP set to "True" in the stage definition, or Auto Transfer set to "True"
Effects: the new column "sum" is transferred, as well as input columns "a" and "b" and input column "ignored" (present in the input, but not mentioned in the stage)
Columns vs. Temporary C++ Variables
Columns: DS-EE types; defined in Table Definitions; values refreshed from row to row
Temp C++ variables: C/C++ types; need declaration (in the Definitions or Pre-Loop page); values persistent throughout the "loop" over rows, unless modified in code
Exercise
Complete exercise 10-1 and 10-2
Exercise
Complete exercises 10-3 and 10-4
Custom Stage
Reasons for a custom stage:
Add an EE operator not already in DataStage EE
Build your own operator (using the EE API) and add it to DataStage EE
Use the Custom stage to add the new operator to the EE canvas
Custom Stage
DataStage Manager > select the Stage Types branch > right-click
Custom Stage
Number of input and output links allowed
Name of Orchestrate operator to be used
Custom Stage Properties Tab
The Result
Module 11
Meta Data in DataStage EE
Objectives
Understand how EE uses meta data, particularly schemas and runtime column propagation Use this understanding to: Build schema definition files to be invoked in DataStage jobs Use RCP to manage meta data usage in EE jobs
Establishing Meta Data
Data definitions:
Recordization and columnization
Fields have properties that can be set at the individual field level
Data types in the GUI are translated to types used by EE
Described as properties on the Format/Columns tabs (Outputs or Inputs pages), OR using a schema file (can be full or partial)
Schemas can be imported into Manager and can be pointed to by some job stages (e.g. Sequential)
Data Formatting Record Level
Format tab Meta data described on a record basis Record level properties
Data Formatting Column Level
Defaults for all columns
Column Overrides
Edit row from within the columns tab Set individual column properties
Extended Column Properties
Field and string settings
Extended Properties String Type
Note the ability to convert ASCII to EBCDIC
Editing Columns
Properties depend on the data type
Schema
Alternative way to specify column definitions for data used in EE jobs Written in a plain text file Can be written as a partial record definition Can be imported into the DataStage repository
Creating a Schema
Using a text editor, following the correct syntax for definitions, OR
Import from an existing data set or file set: in DataStage Manager, Import > Table Definitions > Orchestrate Schema Definitions; select the checkbox for a file with .fs or .ds
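For reference, a small schema file in the record syntax shown earlier for data set descriptors (the field names and types here are hypothetical):

record (
  custid: int32;
  name: string[max=30];
  balance: decimal[8,2];
  opened: date;
)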
Importing a Schema
Schema location can be on the server or local work station
Data Types
Date, Decimal, Floating point, Integer, String, Time, Timestamp
Vector, Subrecord, Raw, Tagged
Runtime Column Propagation
DataStage EE is flexible about meta data: it can cope with the situation where meta data isn't fully defined
You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP)
RCP is always on at runtime; when it is off, column mapping is enforced at design and compile time
RCP is off by default
Enable first at the project level (Administrator, project properties)
Enable at the job level (Job Properties, General tab)
Enable at the stage level (link Output Column tab)
Enabling RCP at Project Level
Enabling RCP at Job Level
Enabling RCP at Stage Level
Go to output links columns tab For transformer you can find the output links columns tab by first going to stage properties
Using RCP with Sequential Stages
To utilize runtime column propagation in the Sequential stage you must use the "use schema" option
Stages with this restriction: Sequential, File Set, External Source, External Target
Runtime Column Propagation
When RCP is Disabled DataStage Designer will enforce Stage Input Column to Output Column mappings. At job compile time modify operators are inserted on output links in the generated osh.
Runtime Column Propagation
When RCP is enabled, DataStage Designer will not enforce mapping rules and no Modify operator is inserted at compile time. Danger of a runtime error if incoming column names do not match outgoing column names on a link (names are case-sensitive).
Exercise
Complete exercises 11-1 and 11-2
Module 12
Job Control Using the Job Sequencer
Objectives
Understand how the DataStage job sequencer works Use this understanding to build a control job to run a sequence of DataStage jobs
Job Control Options
Manually write job control Code generated in Basic Use the job control tab on the job properties page Generates basic code which you can modify
Job Sequencer Build a controlling job much the same way you build other jobs Comprised of stages and links No basic coding
Job Sequencer
Build like a regular job
Type: Job Sequence
Has stages and links
The Job Activity stage represents a DataStage job
Links represent the passing of control
Example: a Job Activity stage containing conditional triggers
Job Activity Properties
Job to be executed select from dropdown
Job parameters to be passed
Job Activity Trigger
Trigger appears as a link in the diagram Custom options let you define the code
Options
Use the custom option for conditionals, e.g. execute if the job ran successfully or with warnings only
Can add a wait-for-file condition before executing
Add an Execute Command stage to drop real tables and rename new tables to current tables
Job Activity With Multiple Links
Different links having different triggers
Sequencer Stage
Build job sequencer to control job for the collections application
Can be set to all or any
Notification Stage
Notification
Notification Activity
Sample DataStage log from Mail Notification
Notification Activity Message
E-Mail Message
Exercise
Complete exercise 12-1
Module 13
Testing and Debugging
Objectives
Understand spectrum of tools to perform testing and debugging Use this understanding to troubleshoot a DataStage job
Environment Variables
Parallel Environment Variables
Environment Variables — Stage Specific
Environment Variables
Environment Variables — Compiler
The Director
Typical job log messages:
Environment variables
Configuration file information
Framework info/warning/error messages
Output from the Peek stage
Additional info with "Reporting" environment variables
Tracing/debug output
Must compile job in trace mode Adds overhead
Job-Level Environment Variables
Set in Job Properties, from the menu bar of Designer
Director will prompt you before each run
Troubleshooting
If you get an error during compile, check the following:
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If Buildop errors occur, try buildop from the command line
Some stages may not support RCP — this can cause a column mismatch
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings
Very little integrity checking occurs during compile; you should run Validate from Director
Highlights source of error
Generating Test Data
Row Generator stage can be used; column definitions are data-type dependent
Row Generator plus Lookup stages provide a good way to create robust test data from pattern files