Readme 2654

Partitioning File Sources

When a session uses a file source, you can configure it to read the source with one thread or with multiple

threads. The Integration Service creates one connection to the file source when you configure the session to

read with one thread, and it creates multiple concurrent connections to the file source when you configure

the session to read with multiple threads. Use the following types of partitioned file sources:

- Flat file. You can configure a session to read flat file, XML, or COBOL source files.

- Command. You can configure a session to use an operating system command to generate source data rows or generate a file list.

When connecting to file sources, you must choose the same connection type for all partitions. You may

choose different connection objects as long as each object is of the same type.

To specify single- or multi-threaded reading for flat file sources, configure the source file name property for

partitions 2-n. To configure for single-threaded reading, pass empty data through partitions 2-n. To

configure for multi-threaded reading, leave the source file name blank for partitions 2-n.

Rules and Guidelines for Partitioning File Sources

Use the following rules and guidelines when you configure a file source session with multiple partitions:

- Use pass-through partitioning at the source qualifier.

- Use single- or multi-threaded reading with flat file or COBOL sources.

- Use single-threaded reading with XML sources.

- You cannot use multi-threaded reading if the source files are non-disk files, such as FTP files or WebSphere MQ sources.

- If you use a shift-sensitive code page, use multi-threaded reading if the following conditions are true:

  - The file is fixed-width.

  - The file is not line sequential.

  - You have not enabled user-defined shift state in the source definition.

- To read data from three flat files concurrently, you must specify three partitions at the source qualifier. Accept the default partition type, pass-through.

- If you configure a session for multi-threaded reading, and the Integration Service cannot create multiple threads to a file source, it writes a message to the session log and reads the source with one thread.

- When the Integration Service uses multiple threads to read a source file, it may not read the rows in the file sequentially. If sort order is important, configure the session to read the file with a single thread. For example, sort order may be important if the mapping contains a sorted Joiner transformation and the file source is the sort origin.

- You can also use a combination of direct and indirect files to balance the load.

- Session performance for multi-threaded reading is optimal with large source files. The load may be unbalanced if the amount of input data is small.

- You cannot use a command for a file source if the command generates source data and the session is configured to run on a grid or is configured with the resume from the last checkpoint recovery strategy.
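Several of the guidelines above are simple yes/no constraints, so they can be checked mechanically. The following is a minimal sketch, not part of the product: the `FileSourceSession` class, its field names, and `check_rules` are hypothetical and only model the restrictions stated in this section.

```python
from dataclasses import dataclass


@dataclass
class FileSourceSession:
    """Hypothetical description of a partitioned file source session."""
    source_type: str                      # "flat", "cobol", or "xml"
    multi_threaded: bool                  # True when several threads read one source
    on_disk: bool = True                  # False for FTP files or WebSphere MQ sources
    command_generates_data: bool = False  # source rows come from an OS command
    runs_on_grid: bool = False
    resume_from_checkpoint: bool = False  # resume from the last checkpoint recovery


def check_rules(s: FileSourceSession) -> list[str]:
    """Return the guidelines from this section that the session violates."""
    problems = []
    if s.source_type == "xml" and s.multi_threaded:
        problems.append("XML sources require single-threaded reading")
    if s.multi_threaded and not s.on_disk:
        problems.append("non-disk sources cannot use multi-threaded reading")
    if s.command_generates_data and (s.runs_on_grid or s.resume_from_checkpoint):
        problems.append(
            "a command that generates source data cannot be used on a grid "
            "or with the resume from the last checkpoint recovery strategy"
        )
    return problems
```

For example, `check_rules(FileSourceSession("xml", multi_threaded=True))` returns a one-item list naming the XML restriction, while a multi-threaded flat file session on disk returns an empty list.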

Using One Thread to Read a File Source

When the Integration Service uses one thread to read a file source, it creates one connection to the source.

The Integration Service reads the rows in the file or file list sequentially. You can configure single-threaded

reading for direct or indirect file sources in a session:

- Reading direct files. You can configure the Integration Service to read from one or more direct files. If you configure the session with more than one direct file, the Integration Service creates a concurrent connection to each file. It does not create multiple connections to a file.

- Reading indirect files. When the Integration Service reads an indirect file, it reads the file list and then reads the files in the list sequentially. If the session has more than one file list, the Integration Service reads the file lists concurrently, and it reads the files in each list sequentially.
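The indirect-file behavior described above (read the file list, then read each listed file in order) can be illustrated with a short sketch. This is not the Integration Service implementation; `read_file_list` is a hypothetical helper, and it assumes a file list with one file name per line, with relative names resolved against the directory that contains the list.

```python
from pathlib import Path
from typing import Iterator


def read_file_list(list_path: str) -> Iterator[str]:
    """Yield rows from every file named in a file list, one file at a time."""
    base = Path(list_path).parent
    for name in Path(list_path).read_text().splitlines():
        name = name.strip()
        if not name:
            continue  # skip blank lines in the file list
        # Assumption of this sketch: relative names live next to the file list.
        with open(base / name) as source:
            for row in source:
                yield row.rstrip("\n")
```

Because a single reader walks the list from top to bottom, the rows come out in file-list order, which is the property that makes single-threaded reading suitable when sort order matters.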

Using Multiple Threads to Read a File Source

When the Integration Service uses multiple threads to read a source file, it creates multiple concurrent

connections to the source. The Integration Service may or may not read the rows in a file sequentially. You

can configure multi-threaded reading for direct or indirect file sources in a session:

- Reading direct files. When the Integration Service reads a direct file, it creates multiple reader threads to read the file concurrently. You can configure the Integration Service to read from one or more direct files. For example, if a session reads from two files and you create five partitions, the Integration Service may distribute one file between two partitions and one file between three partitions.

- Reading indirect files. When the Integration Service reads an indirect file, it creates multiple threads to read the file list concurrently. It also creates multiple threads to read the files in the list concurrently. The Integration Service may use more than one thread to read a single file.
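To make the two-files, five-partitions example concrete, here is a minimal sketch of one possible way to spread reader threads across files. The even-split policy and the `distribute_partitions` function are assumptions for illustration only; this section does not state how the Integration Service actually chooses the distribution.

```python
def distribute_partitions(files: list[str], partitions: int) -> dict[str, int]:
    """Spread reader threads across files as evenly as possible (illustrative policy)."""
    if not files or partitions < len(files):
        raise ValueError("need at least one partition per file")
    base, extra = divmod(partitions, len(files))
    # The first `extra` files each receive one additional thread.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(files)}
```

With this policy, `distribute_partitions(["ProductsA.txt", "ProductsB.txt"], 5)` returns `{"ProductsA.txt": 3, "ProductsB.txt": 2}`, which matches the split of three and two partitions described above.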

Configuring for File Partitioning

After you create partition points and configure partitioning information, you can configure source connection

settings and file properties on the Transformations view of the Mapping tab. Click the source instance name

you want to configure under the Sources node. When you click the source instance name for a file source,

the Workflow Manager displays connection and file properties in the session properties. You can configure

the source file names and directories for each source partition. The Workflow Manager generates a file name

and location for each partition. The following table describes the file properties settings for file sources in a

mapping:

Configuring Sessions to Use a Single Thread

To configure a session to read a file with a single thread, pass empty data through partitions 2-n. To pass

empty data, create a file with no data, such as “empty.txt,” and put it in the source file directory. Then, use

“empty.txt” as the source file name.
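The empty-file step above could be scripted as part of session setup. The sketch below is illustrative: `single_thread_file_names` is a hypothetical helper that creates `empty.txt` in the source file directory and returns the source file name to enter for each partition. It does not call any Workflow Manager API.

```python
from pathlib import Path


def single_thread_file_names(source_dir: str, data_file: str, partitions: int) -> list[str]:
    """Create empty.txt in source_dir and return one source file name per partition."""
    # touch() creates the file if it does not exist and leaves an existing file unchanged.
    Path(source_dir, "empty.txt").touch()
    # Partition 1 reads the real file; partitions 2-n pass empty data.
    return [data_file] + ["empty.txt"] * (partitions - 1)
```

For a three-partition session that reads ProductsA.txt, the helper returns `["ProductsA.txt", "empty.txt", "empty.txt"]`.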

Note: You cannot configure single-threaded reading for partitioned sources that use a command to generate

source data.

The following table shows the source file name and values when the Integration Service creates one thread

to read ProductsA.txt. It reads rows in the file sequentially. After it reads the file, it passes the data to three

partitions in the transformation pipeline:

The following table shows the source file name and values when the Integration Service creates two threads.

It creates one thread to read ProductsA.txt, and it creates one thread to read ProductsB.txt. It reads the

files concurrently, and it reads rows in the files sequentially:

If you use FTP to access source files, you can choose a different connection for each direct file.

Configuring Sessions to Use Multiple Threads

To configure a session to read a file with multiple threads, leave the source file name blank for partitions

2-n. The Integration Service uses partitions 2-n to read a portion of the previous partition file or file list. The

Integration Service ignores the directory field of any partition with a blank source file name.

To configure a session to read from a command with multiple threads, enter a command for each partition

or leave the command property blank for partitions 2-n. If you enter a command for each partition, the

Integration Service creates a thread to read the data generated by each command. Otherwise, the

Integration Service uses partitions 2-n to read a portion of the data generated by the command for the first

partition.
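The blank-entry rule for files and commands can be modeled in a few lines. This is a sketch under the assumption that a blank entry always shares the source named by the nearest earlier partition; `threads_per_source` is a hypothetical name, not a product function.

```python
def threads_per_source(names: list[str]) -> dict[str, int]:
    """Count reader threads per source from per-partition file names or commands."""
    counts: dict[str, int] = {}
    current = None
    for name in names:
        if name.strip():
            current = name.strip()  # a new file, file list, or command starts here
        elif current is None:
            raise ValueError("partition 1 must name a source")
        # A blank entry shares the source named by the nearest earlier partition.
        counts[current] = counts.get(current, 0) + 1
    return counts
```

For example, `threads_per_source(["ProductsA.txt", "", "ProductsB.txt"])` returns `{"ProductsA.txt": 2, "ProductsB.txt": 1}`, and `threads_per_source(["CommandA", "", ""])` returns `{"CommandA": 3}`.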

The following table shows the attributes and values when the Integration Service creates three threads to

concurrently read ProductsA.txt:


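The following table shows the attributes and values when the Integration Service creates three threads to

read ProductsA.txt and ProductsB.txt concurrently. Two threads read ProductsA.txt and one thread reads

ProductsB.txt:

The following table shows the attributes and values when the Integration Service creates multiple threads to

concurrently read data piped from the command:

The following table shows the attributes and values when the Integration Service creates three threads to

read data piped from CommandA and CommandB. Two threads read the data piped from CommandA and

one thread reads the data piped from CommandB: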

Configuring Concurrent Read Partitioning

By default, the Integration Service does not preserve the row order when multiple partitions read from a

single file source. To preserve row order when multiple partitions read from a single file source, configure

concurrent read partitioning. You can configure the following options:

- Optimize throughput. The Integration Service does not preserve the row order when multiple partitions read from a single file source. Use this option if the order in which multiple partitions read from a file source is not important.

- Keep the relative input row order. Preserves the sort order of the input rows read by each partition. Use this option if you want to preserve the sort order of the input rows read by each partition. The following table shows an example of how two partitions might read a file source with 10 rows:

  Partition       Rows Read
  Partition #1    1,3,5,8,9
  Partition #2    2,4,6,7,10

- Keep absolute input row order. Preserves the sort order of all input rows read by all partitions. Use this option if you want to preserve the sort order of the input rows each time the session runs. In a pass-through mapping with passive transformations, the rows are written to the target in the same order as the input rows. The following table shows an example of how two partitions read a file source with 10 rows:

  Partition       Rows Read
  Partition #1    1,2,3,4,5
  Partition #2    6,7,8,9,10

Note: By default, the Integration Service uses the Keep absolute input row order option in sessions

configured with the resume from the last checkpoint recovery strategy.
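The difference between the two order-preserving options can be checked against the example tables. The sketch below is illustrative only. `absolute_order_split` assumes that absolute order means equal contiguous blocks, as in the 1-5 and 6-10 example, and `keeps_relative_order` tests the weaker property that each partition sees its own rows in ascending input order. Both function names are hypothetical.

```python
def absolute_order_split(rows: list[int], partitions: int) -> list[list[int]]:
    """Split rows into contiguous blocks, one per partition."""
    size, extra = divmod(len(rows), partitions)
    blocks, start = [], 0
    for i in range(partitions):
        # The first `extra` partitions take one additional row.
        end = start + size + (1 if i < extra else 0)
        blocks.append(rows[start:end])
        start = end
    return blocks


def keeps_relative_order(blocks: list[list[int]]) -> bool:
    """True if every partition sees its own rows in ascending input order."""
    return all(block == sorted(block) for block in blocks)
```

`absolute_order_split(list(range(1, 11)), 2)` returns `[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]`, and `keeps_relative_order([[1, 3, 5, 8, 9], [2, 4, 6, 7, 10]])` returns `True`, so the relative-order example satisfies the per-partition guarantee even though the rows are interleaved across partitions.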
