
Developing a Streaming Pipeline Component for BizTalk Server

Published: February 2010

Author: Yossi Dahan, Sabra Ltd.

Technical Reviewers: Ewan Fairweather, Microsoft; Oleg Gershikov; Paolo Salvatori, Microsoft; Roni Schwarts; Manuel Stern, Microsoft; Tim Wieman, Microsoft

Applies to: BizTalk Server 2009

Summary

This paper shows how to address issues with high memory consumption and latency by taking a streaming approach to pipeline component development.


Contents

Developing a Streaming Pipeline Component for BizTalk Server
Summary
What do I mean by “streaming”?
How does a non-streaming pipeline component work?
A streaming pipeline component
A story of streams
The custom stream explained
Summary
Appendix A - Testing the stream
Appendix B - The Sample
Appendix C – Comparing the memory consumption of both components


Orchestrations are often the first port of call for most BizTalk developers and are the feature most identified with the product. However, many scenarios can be implemented very well using messaging alone, harnessing the power of port configuration and pipelines.

Any serious messaging implementation is bound to involve one or more custom pipeline components, and these are not difficult to develop. However, too often developers ignore the streaming nature of the pipeline and write components that read the entire message into memory.

Whilst this approach does work, and admittedly is somewhat easier to grasp, it has two significant downsides:

Higher memory consumption – the entire message has to be loaded into memory, whatever the message size is. In addition, if the .NET XML DOM is used (via the XmlDocument class, for example) the memory footprint of the component can be considerably bigger than the message size.

Latency – any further processing of the message has to wait until the current component has finished processing the entire message.

Both of these downsides can be addressed by taking a “streaming” approach to pipeline component development, which is what this article attempts to demonstrate.

What do I mean by “streaming”?

To begin any discussion about streaming pipeline components, it is important to understand what “streaming” means in the context of a BizTalk pipeline.

To do that, let’s take a look at what a typical receive location looks like. (Note: Whilst this article will mostly discuss the use of a component in a receive pipeline, the same applies, only as a mirror image, to a send pipeline.)

Take a receive location that uses the File adapter, and a pass-through pipeline for example -

Figure 1 - “Pass-through” receive location

The File adapter opens a file stream and passes it, contained in a BizTalk message but otherwise untouched, to the BizTalk Endpoint Manager (EPM) which hosts the pipeline. Since the pipeline in this case is pass-through, it does not contain any components, and so the stream is read by the endpoint manager, piece by piece, as it writes the message contents to the message box database.


Once the message has been persisted successfully, the EPM calls back the adapter letting it know it has completed processing the message, allowing the adapter to perform any required clean up. In this case the adapter would close the stream (if not already closed) and delete the file from the file system.

How would the story change as you introduce a couple of pipeline components to the pipeline?

Figure 2 - Receive Location with custom components

In this case the EPM passes the BizTalk message, and with it the file stream, to the first pipeline component during the execution of the pipeline, by calling its Execute() method –

public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)

The EPM passes one message in (along with its execution context) and expects to receive a message back (either the same message or a different one.)

If a second pipeline component exists in the pipeline, it is the message returned by the first component, and not the message created by the adapter, that is passed into it. The same is repeated for all the components in the pipeline, with each component receiving as an input the message returned by the previous component, until all the components in the pipeline have been strung together.
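
Conceptually (and much simplified – the real endpoint manager logic is internal to BizTalk Server), the chaining works along these lines; the variable and collection names here are purely illustrative:

// conceptual illustration only – the real EPM logic is internal to BizTalk Server
IBaseMessage currentMessage = messageFromAdapter;      // hypothetical: the message created by the adapter
foreach (IComponent component in pipelineComponents)   // hypothetical: the configured pipeline components
{
    // each component receives the message returned by the previous one
    currentMessage = component.Execute(pipelineContext, currentMessage);
}
// 'currentMessage' is what the EPM eventually reads and persists to the MessageBox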

How does a non-streaming pipeline component work?

A non-streaming pipeline component’s implementation reads the stream provided in the input message into memory, performs whatever manipulation is required in memory (often by loading it into an XmlDocument object or a string, sometimes deserializing it into a typed object), and then – with the processing complete – assigns a modified stream back to the message, which is then returned to the pipeline. Here’s a sample of how such code might look:

1  Stream outStream = new MemoryStream();
2  StreamWriter memoryWriter = new StreamWriter(outStream);
3
4  using(StreamReader sr = new
5      StreamReader(pInMsg.BodyPart.GetOriginalDataStream()))
6  {
7      string record = string.Empty;
8      while ((record = sr.ReadLine()) != null)
9      {
10         processRecord(memoryWriter, record);
11     }
12     memoryWriter.Flush(); //flush writer to ensure writing's done.
13     outStream.Seek(0, SeekOrigin.Begin);
14     pInMsg.BodyPart.Data = outStream;
15     pContext.ResourceTracker.AddResource(memoryWriter);
16
17     return pInMsg;
}

Listing 1 – Non-streaming pipeline component implementation

In lines 1 and 2 a MemoryStream object is instantiated to hold the output message’s stream, as well as a StreamWriter object to write to it.

Line 4 instantiates a StreamReader object over the message’s body part stream (retrieved by calling the BodyPart property’s GetOriginalDataStream() method). This allows the component to read the incoming message’s body part.

In line 8 the input stream is read line by line, and in line 10 each line is passed to a method that will process it; the method also receives the StreamWriter object, allowing it to write the resulting output into the memory stream. The exact implementation of this method is not important, and will vary depending on the component’s purpose. The key point is that somewhere in there, we’re likely to have the call to write to the memory stream, through the stream writer:

memoryWriter.WriteLine(record);
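
For illustration only, a processRecord implementation consistent with the duplicate-removal rule used by the sample later in this article (an even record id marks a “duplicate”) might look like the sketch below; the parsing details are assumptions and the sample’s actual helper may differ:

private void processRecord(StreamWriter writer, string record)
{
    // illustrative rule: treat a record whose numeric first field is even as a "duplicate"
    string idField = record.Split(',')[0];
    int recordId;
    if (int.TryParse(idField, out recordId) && recordId % 2 == 0)
        return; // "duplicate" – do not copy it to the output

    writer.WriteLine(record);
}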

Once the entire incoming stream has been read (the while loop exhausted), the memory stream is flushed (line 12) to ensure all bytes have been committed, and rewound (line 13) so that it can be read again without the reader having to Seek() to its beginning.

This is important because subsequent components, just like my sample above, and – in fact – the EPM itself, are likely to assume the stream received is ready to be read and has not (yet) been consumed. This is an important assumption because not all input streams are seekable (an HTTP request stream, for example).

Then, most importantly, in line 14, the message’s stream is being replaced with the output stream that was just created.

Before returning the message back to the pipeline in line 17, the StreamWriter object that was created is added to the context’s ResourceTracker (line 15). This ensures the .NET Garbage Collector does not dispose of this object and the stream it is using. If it did, the underlying stream would have been closed and disposed of as well (this is the behaviour of the StreamWriter Dispose() method) and would not be available to other components or, indeed, BizTalk Server itself. Equally, it also means that BizTalk Server will ensure this object is disposed of once the processing of the message is complete.

Note: Registering both the writer and its underlying stream with the ResourceTracker may lead to an ObjectDisposedException when these go through the managed disposal process performed after the pipeline execution has completed.


The result of this component is a potentially modified message, returned to the pipeline. But, implemented this way, the rest of the pipeline’s execution is held up while this component performs its processing.

If there are no other pipeline components in the pipeline, this has very little impact. If, however, there is another component in the pipeline, its execution will not be started until the first component has finished processing the entire message. As we will see shortly, by changing the component to work in a streaming fashion, we can allow downstream components to work on portions of the message previous components have already processed, even before processing the entire message.

More crucially, though, in the preceding code, the entire message (the output message in this case, but it could equally be the input message, or both, depending on the implementation) has been loaded into memory (using a memory stream, in this case). If the message is large, memory available to the host will be significantly reduced and, as a consequence, throttling (or worse) can happen, seriously affecting the solution’s throughput. This might be okay with a single message, but as is the case with any BizTalk implementation, one has to consider the implications of running at high throughput. What will happen when 100/1,000/10,000 messages are processed in parallel? High throughput, even of smaller messages, will seriously aggravate this problem.

A streaming pipeline component

By changing the component’s design we can achieve two things:

1. The component will return the message to the pipeline shortly after it receives it and before a single byte is read. This allows subsequent components to do the same, and – if all is done well – all the components will be able to process the message almost simultaneously (not quite, though, as each component has to work on the result of its preceding component).

2. The component will only ever load a small portion of the message into memory at a time; how small the portion depends on the requirements. The sample code provided only ever holds slightly more than a single line from the input file in memory.

So, how can this be achieved?

The approach is to take the input message and replace the stream it uses, initially the one provided by the adapter, with a custom stream (in my sample it is ‘RemoveDuplicatesStream’). As far as the code in the pipeline component itself is concerned, this is pretty much all it has to do (the code is explained next):

public IBaseMessage Execute(IPipelineContext pContext, IBaseMessage pInMsg)
{
1   Stream bodyPartStream = pInMsg.BodyPart.GetOriginalDataStream();
2   RemoveDuplicatesStream newStream = new
3       RemoveDuplicatesStream(bodyPartStream);
4   pInMsg.BodyPart.Data = newStream;
5   pContext.ResourceTracker.AddResource(newStream);
6   return pInMsg;
}

Listing 2 - Streaming pipeline component implementation

In line 1 the stream from the input message is extracted into a local variable.

In line 2 an instance of a custom stream – in this case one that would remove duplicate records from the message – is being instantiated with the input message’s stream passed into its constructor; we will look into the implementation of this stream shortly.

In line 4 the message’s body stream is replaced with the custom stream, which is then added to the resource tracker in line 5 before the message is returned to the pipeline in line 6.

You will note that the component has not actually done any processing yet; so far not a single byte of the stream has been read, and the message has been returned to the pipeline more or less untouched – only its stream is now “wrapped” by the custom stream object.

A story of streams

The main principle of this approach is to implement the scenario so that the required processing is initiated from the custom stream’s Read() method. If all the components in a pipeline did the same, the EPM would eventually receive a message in which the adapter’s stream has been wrapped several times in custom streams, but from which not a single byte has yet been read.

Then, as the EPM reads the stream it received by calling its Read() method, the call gets bubbled down the layers all the way down to the adapter’s stream, with each layer applying whatever logic is required on the read bytes.

Consider a scenario where a pipeline contains three components, each with its own custom stream implementation, each wrapping the received stream, and none reading the stream directly. Looking at the streams involved in this scenario, they would look something like this –

Figure 3 - Custom streams in a pipeline

The first component in the pipeline receives the file stream created by the adapter, and wraps it with its own stream, depicted in orange in the preceding diagram.


The message, now containing the orange stream, is passed to the second pipeline component, which wraps the stream in its own stream – the green stream.

The message, now containing the green stream, is passed to a third component which, again, wraps the received stream with its own custom stream – the blue stream in the preceding diagram.

It is that last message that is handled by the EPM. As the EPM reads the message’s stream (the blue one, created by the last component), the call to the Read() method is bubbled down through the green stream, to the orange stream and, eventually, down to the file stream – the only stream actually containing any bytes. The Read() method in each stream applies its logic to the bytes read from the stream beneath it, so that by the time the EPM receives any bytes from its call to the blue stream’s Read() method, they have been processed by every stream along the way.

It is these processed bytes that get written, by the EPM, to the Message Box Database.

Of course not all the components have to be streaming for this approach to work. Any component that reads the message will simply “break” this chain, replacing the stream used by any subsequent components, and effectively splitting the flow into two separate chains, one before the non-streaming component and one after it; the principle still works, however, within each chain.

Hopefully by now the principles of this approach are quite clear, and the code required to implement a streaming pipeline component is understood; there is really not much more to it than the preceding sample.

The only real variance in the pipeline component’s code from one component to another is usually the description, version and icon properties and, when needed, any other design-time properties, which are then usually passed into the stream. As the majority of the code lies within the custom stream, the pipeline component’s code is very simple and does not change much from one component to another.

The custom stream explained

So, if the key to this approach lies within the custom stream, what does one look like? This is where there is a lot of variance, and the exact approach taken when implementing the stream really depends on the scenario at hand and its requirements.

There are several factors that need to be considered when deciding on the design of the custom stream, for example:

1. What stages of the pipeline will this component need to support? (Will it run before or after the disassembler/assembler? Will it need to support both send and receive pipelines?)

2. What would be the shape of the data flowing through the component? XML? “Flat file”? Will the data be encoded in any way?

3. Is a forward-only implementation sufficient or will I have to support seeking operations? XPath?

4. Is a read-only implementation sufficient or will the data have to be changed?

5. Etc.


The stream in the sample assumes an ASCII flat file, in a specific format (two fields, separated by a comma, with the records separated by a line feed) and its purpose is to filter out “duplicate” records. Normally a database is likely to be used to determine whether a record has been processed before or not, but for the purpose of this sample the component will deem a record as duplicate if its record id is even (the record id is the first field in the file, expected to be a number.)
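
As an illustration of that rule, a duplicate check along the following lines would do (this is a sketch only – it assumes System.Text is imported and that the line is handed over as ASCII bytes; the sample’s actual implementation may differ):

private bool isDuplicate(byte[] line)
{
    // decode the ASCII bytes and take the first comma-separated field as the record id
    string record = Encoding.ASCII.GetString(line).TrimEnd('\r', '\n');
    string idField = record.Split(',')[0];

    int recordId;
    if (!int.TryParse(idField, out recordId))
        return false; // not a well-formed record – let it through untouched

    // the rule described above: an even record id marks the record as a "duplicate"
    return recordId % 2 == 0;
}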

Note: Several very useful streams are shipped with BizTalk Server – VirtualStream, XmlTranslatorStream and ReadOnlySeekableStream, to name but a few. In many cases these can be used instead of creating a custom stream, or as a base class for the custom stream, and should be considered, although some of these classes are not yet fully documented.
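
As a small illustration of how one of these built-in classes (from the Microsoft.BizTalk.Streaming assembly) might be used inside a component’s Execute() method – here, wrapping a non-seekable input so it can later be rewound; check the exact constructor overloads against the documentation for your BizTalk version:

Stream original = pInMsg.BodyPart.GetOriginalDataStream();
if (!original.CanSeek)
{
    // wrap a non-seekable input (an HTTP request stream, for example) so it can be rewound
    ReadOnlySeekableStream seekable = new ReadOnlySeekableStream(original);
    pInMsg.BodyPart.Data = seekable;
    pContext.ResourceTracker.AddResource(seekable);
}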

When implementing the custom stream a class is created to inherit from the .NET Framework’s abstract Stream class, as below -

public sealed class RemoveDuplicatesStream : Stream, IDisposable

Listing 3 – Declaration of the custom stream

The class implementation has to override all the properties and methods of the abstract Stream class, but the only method that normally has any real implementation is the Read() method; all other methods and properties can generally either be passed through to the wrapped stream or throw a NotSupportedException.

A word of caution though – theoretically it is possible for any subsequent component in the pipeline to do whatever it wishes with the stream, so component developers have to think carefully about the implementation of the custom stream, taking into account any sequence of calls to the stream’s methods. However, I believe it is very reasonable to expect behaviour similar to the BizTalk endpoint manager when it comes to reading the stream, and am happy with throwing exceptions from methods I don’t expect to have to support. (I do try to make sure the messages in these exceptions are as clear as possible.)

Before diving into the specific implementation details of the stream in the sample, it is useful to understand how it is going to be used by BizTalk Server.

As the EPM receives the stream, it is likely to check the values of the CanRead and CanSeek properties to determine the capabilities of the stream. CanRead is obviously expected to be set to true; in my implementation CanSeek is set to false. Due to the buffering behaviour of my stream, additional code would have been required to support seeking, which I decided to avoid. Forward only streams should be perfectly acceptable in most pipeline scenarios and, in fact, several of the streams used internally by BizTalk Server itself are not seekable: for example, the ForwardOnlyEventingReadStream.
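
For reference, the boilerplate members of such a forward-only, read-only stream could be written roughly as below. This is a sketch only – the sample’s actual code may pass some of these members through to the wrapped stream instead of throwing:

public override bool CanRead  { get { return true;  } }
public override bool CanSeek  { get { return false; } }
public override bool CanWrite { get { return false; } }

public override long Length
{
    get { throw new NotSupportedException("RemoveDuplicatesStream does not support Length."); }
}

public override long Position
{
    get { throw new NotSupportedException("RemoveDuplicatesStream is forward-only."); }
    set { throw new NotSupportedException("RemoveDuplicatesStream is forward-only."); }
}

public override long Seek(long offset, SeekOrigin origin)
{
    throw new NotSupportedException("RemoveDuplicatesStream is forward-only and cannot seek.");
}

public override void SetLength(long value)
{
    throw new NotSupportedException("RemoveDuplicatesStream is read-only.");
}

public override void Write(byte[] buffer, int offset, int count)
{
    throw new NotSupportedException("RemoveDuplicatesStream is read-only.");
}

public override void Flush()
{
    //nothing to flush in a read-only stream
}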

The EPM will start calling the Read() method providing a buffer for the output, along with the offset required and the count of bytes expected. Here’s the signature of the method –

public override int Read(byte[] buffer, int offset, int count)

Listing 4 - Read() method's signature


A reader of a stream is expected to provide a byte array into which the Read() method will put the bytes read from the source.

The byte array (‘buffer’) can be of any size, and it does not have to be empty; the offset parameter tells the method at which position in the array new bytes should start being written; this does not have to be the beginning of the array (element 0), but can be any position within it.

The count parameter instructs the method how many bytes the caller expects to receive. This should never exceed the array’s length minus the offset, and is effectively the maximum number of bytes the method can provide back to the caller.

The return value of the read method is expected to be the actual number of bytes the method wrote to the byte array; this is expected to be any number between 0 and “count.”

If the return value (“number of bytes read”) equals “count” (the number of bytes the caller asked for), it is likely that there are more bytes to read from the source (as the method could only provide “count” bytes back).

If the return value is less than “count,” the reader can assume there are no more bytes to read from the source. It has to be careful, though, to only read the returned number of bytes from “offset”, as it is more than possible that the buffer contains bytes from previous reads after that point, and there is no requirement for the Read() method to clean those.

This is pretty much the only behaviour Read() needs to exhibit, and it is the only real behaviour our solution will have.
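
To make the contract concrete, here is roughly how a consumer such as the EPM might drain the stream. This is illustrative only – the EPM’s real reading logic is internal to BizTalk Server, and ‘messageBodyStream’ and ‘WriteToDestination’ are hypothetical stand-ins for the message’s body stream and the MessageBox write:

byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = messageBodyStream.Read(buffer, 0, buffer.Length)) > 0)
{
    // only buffer[0 .. bytesRead - 1] contains valid data for this call
    WriteToDestination(buffer, bytesRead);
}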

The read method of the stream in the sample reads the underlying stream one line at a time (by reading the bytes until it reaches a new-line character); it then decides whether the line is a duplicate. Based on that, it will either return it to the caller or ignore it and read the next line.

Had it been possible to ensure that the caller will ask for one line at a time from the read method, the implementation would have been very simple. However, as the size of a line (in bytes) is unknown to the caller (and in most cases is not fixed anyway), there is some added complexity to the stream’s implementation.

Essentially what is required is to buffer bytes for both the read and the write operations.

The basic flow of the component would have been:

- Read line
- Check if line is duplicate
- If duplicate – ignore and read next line
- If not duplicate – write to output

Because we don’t know how many bytes the caller would ask for, we have to handle cases where the number of bytes we need to return is smaller than the number of bytes in any single line. This adds the following steps:

- If we have remaining bytes to write – write them now (up to count)
- Read line
- Check if line is duplicate
- If duplicate – ignore and read again
- If not duplicate: if line is bigger than output – write ‘count’ bytes and keep remaining bytes in memory, otherwise write line to output

We also have to handle the case where the line is actually smaller than the buffer, in which case we could fit more than one line in the output:

- If we have remaining bytes to write – write them now (up to count)
- Read line
- Check if line is duplicate
- If duplicate – ignore and read again
- If not duplicate: if line is bigger than output – write ‘count’ bytes and keep remaining bytes in memory, otherwise write line to output
- If line is smaller than output – read next line and go again

Let’s see how the read method is actually implemented (this code is explained below):

1  public override int Read(byte[] buffer, int offset, int count)
2  {
3      int bytesWritten = 0;
       //first of all - use any 'left over bytes' from previous read
4      bytesWritten += copyLeftOver(ref buffer, ref offset, count);
       //we have to write exactly 'count' bytes for each call to Read
5      while (bytesWritten < count)
6      {
7          int emptyOutputBytesLeft = count - bytesWritten;
           //read the next line; this may be bigger than we have place
8          byte[] line = readLine();
9
10         if (line.Length == 0)
11             break; //there was nothing left to read, can exit loop
12
13         if (isDuplicate(line))
14             continue; //line is a duplicate, move to next line
15         //if we have room for the entire line in the output
16         if (line.Length <= emptyOutputBytesLeft)
17         {
               //copy line onto output buffer
18             Array.Copy(line, 0, buffer, offset, line.Length);
               //increase the count of bytes written
19             bytesWritten += line.Length;
               //increase the pointer at the output offset
20             offset += line.Length;
21             continue; //go again, as we have more bytes to write
22         }
23
           //we don't have enough room for entire line
24         if (line.Length > emptyOutputBytesLeft)
25         {
               //copy line onto output buffer
26             Array.Copy(line, 0, buffer, offset, emptyOutputBytesLeft);
               //increase the count of bytes written
27             bytesWritten += emptyOutputBytesLeft;
               //create left over buffer
28             writeBuffer = new byte[line.Length - emptyOutputBytesLeft];
               //copy left overs
29             Array.Copy(line, emptyOutputBytesLeft, writeBuffer, 0, writeBuffer.Length);
30             break;
31         }
32     }
33     return bytesWritten;
34 }

Listing 5 - implementation of stream's read method

Line 3 declares a variable to keep track of how many bytes have actually been written to the output buffer. As explained earlier, this is important because the method has to know to stop at “count” bytes written, and because this value needs to be returned to the caller, indicating whether further calls to Read() are required.

In line 4 a method is called to copy any “left over” bytes to the output buffer. These are bytes of records that need to be returned to the caller but could not be written on a previous call because the output buffer was already full. The code for this method is shown next; it is essentially all about copying bytes from one array (the in-memory left overs) to another (the output buffer).

private int copyLeftOver(ref byte[] buffer, ref int offset, int count)
{
    //if we have no left over bytes
    if (writeBuffer == null || writeBuffer.Length <= 0)
        return 0;

    //if we have place for the entire 'left over'
    if (count >= writeBuffer.Length)
    {
        //copy left over to output buffer
        Array.Copy(writeBuffer, 0, buffer, offset, writeBuffer.Length);
        int bytesWritten = writeBuffer.Length;

        //set the left overs to null, as we've exhausted it
        writeBuffer = null;
        //advance index for any future writes
        offset += bytesWritten;
        return bytesWritten;
    }

    //if we don't have place for the entire left over
    if (count < writeBuffer.Length)
    {
        //copy as many left over bytes as we can to output buffer
        Array.Copy(writeBuffer, 0, buffer, offset, count);
        offset += count;
        //remove written bytes from left over
        byte[] newLeftOver = new byte[writeBuffer.Length - count];
        Array.Copy(writeBuffer, count, newLeftOver, 0, writeBuffer.Length - count); //copy unwritten bytes
        writeBuffer = newLeftOver;
        return count;
    }

    throw new NotImplementedException();
}

Listing 6 - copyLeftOver implementation

Line 5 starts a while loop that runs as long as there is still place in the output buffer (bytes written is less than count). It is possible that the left overs alone would fill the output buffer (a case which is properly handled by the helper function), in which case the loop will not execute even once.

When it does execute, it first calculates how many available bytes it has in the output buffer (line 7) and then reads an entire line into a byte array. Again a helper function is being used, which calls ReadByte on the underlying stream repeatedly until it reaches a new line character.
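
Such a helper might look roughly like the sketch below, assuming ‘wrappedStream’ is the field holding the stream passed into the constructor and that System.Collections.Generic is imported; the sample’s actual helper may buffer differently:

private byte[] readLine()
{
    // read the wrapped stream one byte at a time until a new-line character
    // (or the end of the stream) is reached
    List<byte> lineBytes = new List<byte>();
    int current;
    while ((current = wrappedStream.ReadByte()) != -1)
    {
        lineBytes.Add((byte)current);
        if (current == '\n')
            break;
    }
    return lineBytes.ToArray();
}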

Line 10 checks that a line was actually read; if the end of the stream has been reached, the loop is exited and control returns to the caller. Line 13 calls the function that checks whether the line at hand is a duplicate or not.

If the line is a duplicate, it simply gets ignored and the while loop runs again; this will simply cause the next line to be read and evaluated and so on.

If the line is not a duplicate it gets copied to the output buffer. Lines 16-22 handle the case where the line can fit in the output buffer whilst lines 24-31 handle the case where it does not. Left over bytes – bytes from the line that cannot be written to the output buffer due to lack of space – need to be kept for the next call.

Both blocks could have been merged, but I felt it is clearer to keep them apart.

The first block is simply a case of copying the bytes from the line array to the output array and advancing the counters; in the second block, where the line is too big to fit in the output, we need to copy just as many bytes as we can fit into the output buffer, keeping the remaining bytes in the ‘left over’ buffer. These will be used in line 4 on the next call to Read().

The while loop will continue iterating until either the output buffer is filled or the end of the underlying stream has been reached.

At this point the method returns, providing the number of bytes written as the return value, but having also populated the buffer with the actual bytes read.

Summary

With the Read() method implementation understood, hopefully the picture now comes together nicely.

In just a few lines of code, we have built a component that does its processing, including modifying the incoming message, without having to load the entire message into memory. It also allows subsequent components in the pipeline to process the message whilst it is still doing its own processing.


There’s no question that developing a component in such a way is slightly more demanding than developing it in a non-streaming fashion. However, I hope this article has managed to demonstrate that it is not at all that difficult, and that with a little bit of effort and good unit tests, it can be achieved in a relatively short time.

I strongly believe that the additional effort pays off easily in terms of lighter code that allows for lower latency, and – more significantly – lower memory consumption in the pipeline and through that, significantly higher throughput of any BizTalk solution.

Appendix A - Testing the stream

Whilst unit testing the pipeline component can be somewhat complicated, as it is designed to be called by the EPM and the parameters passed into it are not easily constructed, testing the stream at the heart of the solution is very straightforward.

As the core of this approach exhibits a standard stream behaviour, unit tests can easily be created to test it, for example a unit test may:

1. Define a memory stream.
2. Populate it with data as needed (to represent the incoming message).
3. Wrap it in the custom stream to be tested.
4. Perform various read operations on the stream and confirm the resulting bytes are as expected.

Here’s an example of a basic unit test for the RemoveDuplicatesStream:

[TestMethod]
public void TwoLinesOneNewLine()
{
    string data = "1,some data\n2,some data";
    RemoveDuplicatesStream stream = CreateStream(data);
    byte[] buffer = new byte[1024];
    int bytesRead = stream.Read(buffer, 0, 1024);
    string result = Encoding.ASCII.GetString(buffer, 0, bytesRead);
    Assert.AreEqual<string>(data, result);
}

Where CreateStream is defined as:

private static RemoveDuplicatesStream CreateStream(string data)
{
    MemoryStream ms = new MemoryStream();
    StreamWriter sw = new StreamWriter(ms);
    sw.Write(data);
    sw.Flush();
    ms.Seek(0, SeekOrigin.Begin);
    RemoveDuplicatesStream stream = new RemoveDuplicatesStream(ms);
    return stream;
}

Listing 7 - Sample Unit Test


Following this approach it is possible to create as many unit tests as needed; generally, tests need to cover two areas:

1. Various variations of input data in the stream
2. The variations of read operations (different buffer lengths, different start positions, etc.) – see the example test sketched after this list
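
For example, a test along these lines (a sketch only; the test name and the 5-byte read size are arbitrary) reuses the CreateStream helper from Listing 7 and exercises the ‘left over’ handling by asking for only a few bytes per call. It uses odd record ids only, so nothing should be filtered out:

[TestMethod]
public void SmallBufferMultipleReads()
{
    string data = "1,some data\n3,some data";
    RemoveDuplicatesStream stream = CreateStream(data);

    byte[] buffer = new byte[1024];
    int position = 0;
    int bytesRead;

    // ask for only 5 bytes per call to exercise the 'left over' handling
    while ((bytesRead = stream.Read(buffer, position, 5)) > 0)
    {
        position += bytesRead;
    }

    string result = Encoding.ASCII.GetString(buffer, 0, position);
    Assert.AreEqual<string>(data, result);
}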

Appendix B - The Sample

The sample provided with this article is designed to demonstrate both the non-streaming and the streaming ways of achieving the solution.

It contains four projects:

1. Schemas: contains the [flat-file] schema of the sample input file
2. Pipelines: contains three pipelines –
   a. ParseInputFile – does not perform any duplicate elimination, used to verify the scenario only
   b. RemoveDuplicates – performs duplicate elimination in a non-streaming fashion
   c. RemoveDuplicates-Streaming – performs duplicate elimination in a streaming fashion
3. RemoveDuplicates – contains the classes used:
   a. RemoveDuplicates – the non-streaming version of the pipeline component
   b. RemoveDuplicates-Streaming – the streaming version of the pipeline component
   c. RemoveDuplicatesStream – the custom stream used by the streaming pipeline component
4. UnitTests – contains several unit tests to verify the custom stream’s behaviour.

The solution also contains a couple of sample input files as well as a large (12 MB) sample file, zipped.

It also contains a binding file, pre-configured to run the streaming scenario.

To run the sample you will need to:

1. Open the solution in Visual Studio 2008.
2. Build the RemoveDuplicates project only.
3. Place the resulting Sabra.Samples.BizTalk.StreamingPipelineComponent.RemoveDuplicates.dll in the GAC (for example using gacutil, as shown after this list).
4. Build and deploy the entire solution to BizTalk Server.
5. Import the binding file.
6. Check the path used by the file receive location – this needs to point to a folder that exists and that the BizTalk user running the BizTalkServerApplication (or any other host you configure in the receive location) has full permissions to (default set to C:\StreamingPipelineComponentSample\In).
7. Confirm which pipeline you wish to test in the receive location.
8. Check the path used by the send port (default set to C:\StreamingPipelineComponentSample\Out).
9. Start the StreamingPipelineComponentSample application.
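
For step 3, one way to place the assembly in the GAC is to run gacutil from a Visual Studio 2008 command prompt in the project’s build output folder (the exact path to the dll depends on your build configuration):

gacutil /i Sabra.Samples.BizTalk.StreamingPipelineComponent.RemoveDuplicates.dll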


Drop the sample file in the In folder. All being well, the file should appear in the Out folder – without the “duplicates” if you are using one of the remove-duplicates pipelines.

Repeat the test with both the streaming and the non-streaming version of the pipeline.

Appendix C – Comparing the memory consumption of both components

Whilst I haven’t performed any scientific experiments, I did want to demonstrate the benefits of taking the streaming approach in terms of memory usage.

To do that, I’ve used the large sample file provided with the sample code, which is 12 MB.

I have created a couple of receive locations – one using the streaming pipeline and one the non-streaming pipeline.

I then created two hosts on my machine, running each receive location on its own host; this was in order to isolate the execution of each pipeline from the other for the testing, and so that I could show both graphs side by side.

I restarted both hosts, and then ran a small file through each receive location, to ensure that all the components were loaded before I ran the large file.

After this execution, both components were stable at around 47 MB (process private bytes.)

I then dropped a large file for each receive location, and repeated this 3 or 4 times. This produced the following graph:

Figure 4 - Memory consumption of streaming vs. non-streaming pipelines

At the beginning, you can see both hosts roughly around the 47 MB mark.

As I drop the first pair of files, each around 12 MB, the streaming pipeline’s memory consumption (represented by the blue line) climbs to around 50 MB before stabilising at around 49 MB. This host’s memory consumption then remains at the same level until the host is restarted, no matter how many files are processed by the stream.

The non-streaming pipeline, on the other hand, has both higher, and less consistent, memory consumption.

The pipeline starts at the same consumption level of around 47 MB; as the first file is being processed it grows to around 72 MB. This is significantly more than the 12 MB of the actual file being processed.


This level eventually drops to around 65 MB. However, as further files are processed, it increases again and again. Only after some time without activity will the memory consumption drop to its lowest level since the first file was processed. Even then, that’s over 60 MB.

These results were observed when processing several messages, one at a time; what happens if I drop several files together for both receive locations?

I tried, and the resulting graph was as follows.

Figure 5 - Memory consumption - several files at a time

We started again at the 47 MB mark for both pipelines (after restarting the hosts and running a small file to load all the assemblies).

The non-streaming pipeline very quickly consumed as much as 128 MB, and then remained around that figure for as long as I had patience to wait.

The streaming pipeline did use more memory than before – peaking at 58 MB and then dropping back a little bit.

My experimentation was hardly scientific or elaborate, but I believe it did demonstrate a few points:

1. Memory consumption in a streaming pipeline is significantly less than in a non-streaming pipeline.

2. Memory consumption in a streaming pipeline is fairly consistent, without any severe peaks and as such can be very predictable.

3. Memory consumed in the non-streaming pipeline component is slow to be released. I assume this is due to the resource tracker not wasting resources on cleaning up until it needs to. However, it does mean that the overall effect on the server is higher.

My tests have not looked at all at the latency factor.