Customizing Middleware to Improve Performance and Footprint

Arvind S. Krishna

[email protected]

Institute for Software Integrated Systems

Vanderbilt University Nashville, Tennessee

Motivation (1/2)

Where are we right now?
• Maturation of Distributed Object Computing (DOC) middleware
• ACE+TAO middleware
  • Open-source implementation of CORBA and Real-time CORBA
  • Highly optimized implementation of almost all CORBA features
• From stovepiped to reusable architectures

[Figure: layers labeled Applications, Middleware Services, Middleware, Operating Systems & Protocols, Hardware & Networks. Caption: Functionality factored into middleware]

Product Line Architectures
• Set of systems that share common “core features”
• Families of systems are then built using the core features
• Reduce time-to-market pressure, cost, etc.
• Example: Boeing Bold Stroke architecture

Product line architectures minimize the cost of building variants

Motivation (2/2)

Model Driven Development Paradigm (MDD)
• Reduces costs of building new families of systems
• Compose different systems at the modeling level
• Model check for correctness
• Code generators synthesize artifacts: XML deployment information, configuration information, benchmarking code, …

Models capture System properties: structure and behavior


What do we need?
Optimizations that customize middleware based on system invariants


[Figure label: Information propagation]

Middleware for Product-Lines
• Still general-purpose and layered
• Enables different variants to be hosted by different configurations
• However, not optimized for each variant

Customizing Middleware via Partial Evaluation

Partial Evaluation
• Technique of automatically specializing programs based on parameters known ahead of time
• Two-level mechanism:
  • First level: annotating information
  • Second level: synthesizing code
• Templates and template meta-programming
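As a rough illustration of the two levels, the sketch below treats C++ template arguments as the annotation of values known ahead of time and lets the compiler synthesize the specialized code. All names (`InvocationOptions`, `invoke_general`, `invoke_specialized`) are hypothetical and are not taken from TAO.

```cpp
#include <cassert>
#include <string>

// Hypothetical run-time configuration of a general-purpose invocation path.
struct InvocationOptions {
  bool oneway;      // no reply expected
  bool collocated;  // target lives in the same address space
};

// General-purpose version: every call re-checks options that never change.
inline std::string invoke_general(const InvocationOptions& opt) {
  std::string path;
  path += opt.collocated ? "direct-call" : "marshal+send";
  if (!opt.oneway) path += "+wait-reply";
  return path;
}

// Specialized version: the same options are template arguments. Each
// condition is a compile-time constant, so an optimizing compiler can drop
// the untaken branch from the generated code.
template <bool Oneway, bool Collocated>
std::string invoke_specialized() {
  std::string path;
  if (Collocated) {
    path += "direct-call";
  } else {
    path += "marshal+send";
  }
  if (!Oneway) path += "+wait-reply";
  return path;
}
```

For a two-way collocated call, `invoke_specialized<false, true>()` yields the same path as `invoke_general({false, true})`, without consulting any run-time options.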

[Figure: General Purpose Layered Architecture and Optimized Implementation Stack, with run-time options (configuration parameters) and compile-time options (type of stub, skeleton, …)]

Research will examine
• How techniques used in programming languages can be applied to middleware
• How to move from a general-purpose to a more specialized architecture

Optimize the “known knowns”, leave the “known unknowns” to the middleware, and use exceptions for the “unknown unknowns”


Existing Middleware Optimizations
• Footprint Reduction Optimizations
  • Micro ORB Architecture: Virtual Component pattern
  • Micro POA Architecture: pluggable components
• Request Demux/Dispatch Optimizations
  • Connection Management: Acceptor-Connector pattern, Reactor
  • Buffer Management Strategies
  • Request Demultiplexing: Active Demultiplexing & Perfect Hashing

Aren’t these optimizations enough?
• They have worked well for applications in different domains
• General-purpose middleware is still layered
• Techniques are needed that fold layers (code and run-time checks) to improve performance
• These will add to the general-purpose optimizations


Capturing System Invariants in Models (1/2)

Example System
• Basic Simple (BasicSP): three-component Distributed Real-time Embedded (DRE) application scenario
• Timer Component – triggers periodic refresh rates
• GPS Component – generates periodic position updates
• Airframe Component – processes input from the GPS component and feeds it to the Navigation Display
• Navigation Display – displays GPS position updates

ACE_wrappers/TAO/CIAO/DaNCE/examples/BasicSP
CoSMIC/examples/BasicSP

Hypothesis and Solution Approach
Use early binding parameters to tailor middleware

Techniques applied could range from:
• Conditional compilation
• Optimized stub/skeleton generation
• Strategy pattern to handle alternatives

Program Specialization Invariants
Must hold for all specializations:
• output(p_orig) = output(p_spl)
• speed(p_spl) > speed(p_orig)

Boeing product-line scenario – representative DRE application: rate based
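The first invariant, output(p_orig) = output(p_spl), can at least be spot-checked on sample inputs. The helper below is a minimal sketch of such a check. The two example programs and the 20 Hz rate are made up for illustration, and the speed invariant is not measured here.

```cpp
#include <cassert>
#include <vector>

// Returns true when the specialized program agrees with the original on
// every sample input, i.e. output(p_orig) = output(p_spl) on that sample.
template <typename Orig, typename Spl, typename Input>
bool outputs_match(Orig p_orig, Spl p_spl, const std::vector<Input>& inputs) {
  for (const Input& in : inputs) {
    if (p_orig(in) != p_spl(in)) return false;
  }
  return true;
}

// Hypothetical general program: scales a sample by a run-time rate.
inline int scale_general(int sample, int rate_hz) { return sample * rate_hz; }

// Hypothetical specialization for a system whose rate is fixed at 20 Hz.
inline int scale_20hz(int sample) { return sample * 20; }
```

A check like this only samples the input space; it does not prove equivalence for all inputs.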

Capturing System Invariants in Models (2/2)

Mapping Ahead of Time (AOT) System Properties to Specializations
• Periodicity: pre-create marshaled request
• Single-interface operations: pre-fetch POA, Servant, Skeleton servicing the request
• Same endianness: avoid de-marshaling (byte order swapping)
• Collocated components: specialize for target location (remove remoting)
• Same operation invoked: cache CORBA Request header, update arguments only

[Figure: Component Interactions and Component Deployment, annotated with Collocated Components, Same Endianness, Periodic Timer, and Single-method interfaces]

Specializations Implemented in TAO

[Figure: ORB architecture (Client, OBJ REF, Object (Servant), in args / operation() / out args + return, IDL Stubs, ORB Interface, IDL Skeleton, Object Adapter, ORB Core, GIOP/IIOP/ESIOPs) with specialization points numbered 1 to 5: Specialization on Location, Request Header Caching, Eliminate unnecessary checks, Pre-create Request, Optimize for Target Location]

Client Side Specialization
• Request Header Caching
• Pre-creating Requests
• Marshaling checks
• Target Location

Server Side Specialization
• Specialize Request Processing
• Avoid Demarshaling checks

Cumulative Effect
• Adding specializations gives a more-than-additive improvement
• For example:
  • Client side: request caching
  • Server side: specialize request processing
  • 1+1 = 3?

Specialize for Target Location (1/2)

Intent
Specialize a path based on the knowledge that objects are collocated

Model Invariants
• All communication between the GPS, Airframe and Display components is collocated
• All invocations are local
• Remoting code is not needed (connection code not required)

Transformations to TAO (footprint)
• Eliminate connection handling code
  • Connection Strategies, Flushing Strategies
• Eliminate invocation classes
  • Remote invocation classes
  • One-way and two-way invocation classes

Transformations to TAO (performance)
• Eliminate remoting checks
  • Object Proxy checks for remoting
  • Invocation Adapter checks for remoting for each invocation
  • Checks for one-way or two-way invocation
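The two kinds of transformation can be sketched with conditional compilation: a build flag removes both the remote code (footprint) and the per-invocation remoting check (performance). The flag `SKETCH_COLLOCATED_ONLY` and the types below are hypothetical stand-ins and do not reproduce TAO's actual classes or macros.

```cpp
#include <cassert>
#include <string>

// Hypothetical build flag: set it to 1 when the model guarantees that every
// invocation is collocated. It is not a real TAO macro.
#define SKETCH_COLLOCATED_ONLY 1

struct Servant {
  int get_data() const { return 42; }
};

struct ObjectProxy {
  Servant* local;        // non-null when the target is in this address space
  std::string endpoint;  // used only by the remote path
};

#if !SKETCH_COLLOCATED_ONLY
// Remote path: connection handling and invocation classes would live here.
inline int remote_invoke(const ObjectProxy&) { return -1; }
#endif

inline int invoke_get_data(const ObjectProxy& proxy) {
#if SKETCH_COLLOCATED_ONLY
  // The remoting check and the remote code are compiled out entirely.
  return proxy.local->get_data();
#else
  // General-purpose path: test for remoting on every invocation.
  if (proxy.local != nullptr) return proxy.local->get_data();
  return remote_invoke(proxy);
#endif
}
```

With the flag set, a call through `ObjectProxy{&servant, ""}` goes straight to the servant. The price is that the binary can no longer reach a remote target at all, which is why the invariant has to hold for the whole deployment.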

[Figure: Client and Server, labeled Address Space and Network]

Specialize for Target Location (2/2)

Configuration
• Red Hat kernel 2.4.21-27.0.1.ELsmp #1 SMP
• Dual-processor Athlon, 2 GHz
• 1 GB RAM and 256 KB cache for each processor
• Test run: TAO’s performance-tests/Latency/Collocation

TAO Implementation & Automation
• All implementations are present in the branch “TAO_PE_Collocation”
• Specialization implemented with conditional compilation: the TAO_HAS_COLLOCATION flag removes remoting
• Profiled the optimistic case of absolutely no remoting (i.e., no code to handle requests and replies)

Optimization
• Code subsetting: removed connection-related code
• Performance: elimination of remoting checks

Performance Improvements
• libTAO ~6% (100 kB reduction)
• Application ~15%
• Improved by 10% (over and above Thru_POA collocation)

CORBA Compliance & Automation
• Compliant with CORBA specification
• Realized by macros
• Invocation classes can be separated out as libraries


Specialize CORBA Request Header (1/4)

Intent
Avoid the considerable overhead of creating new CORBA requests and replies for each of a series of request calls

Model Invariants
• Timer Component periodically sends the same event
• Operations to retrieve data from the models are also the same

Update Rather than Create
• Do not create a new Request each time
• Reuse the old request and its Request Header
• Various levels of reuse are possible:
  • Reuse only the Request Header
  • Reuse both the Request Header and the Message Specific Header
  • Reuse the entire request

This approach is similar to TCP header prediction


Specialize Request Header (2/4)

TAO Implementation
• First request creates the entire request (code flow same as the normal path)
• Cache the (marshaled) header information
• On subsequent messages, update only the total size and ID after request creation
• Implemented via conditional compilation
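The steps above can be sketched as follows: the header is marshaled once, and each later request copies it, appends the new arguments, and patches only the size and request ID. The wire layout is a simplified assumption for illustration and is not real GIOP. The class and helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Simplified, hypothetical wire layout (not real GIOP):
//   bytes 0..3  total message size
//   bytes 4..7  request ID
//   bytes 8..   operation name, then the marshaled arguments
class CachedHeaderRequest {
 public:
  explicit CachedHeaderRequest(const std::string& operation) {
    // First request: build the whole header once (the normal path).
    header_.resize(8, 0);
    header_.insert(header_.end(), operation.begin(), operation.end());
  }

  // Subsequent requests: reuse the marshaled header, append the new
  // arguments, and patch only the size and request ID fields.
  std::vector<std::uint8_t> build(std::uint32_t request_id,
                                  const std::vector<std::uint8_t>& args) const {
    std::vector<std::uint8_t> msg(header_);
    msg.insert(msg.end(), args.begin(), args.end());
    const std::uint32_t size = static_cast<std::uint32_t>(msg.size());
    std::memcpy(msg.data(), &size, sizeof size);
    std::memcpy(msg.data() + 4, &request_id, sizeof request_id);
    return msg;
  }

 private:
  std::vector<std::uint8_t> header_;
};

// Helper for reading a patched 32-bit field back out of a message.
inline std::uint32_t read_u32(const std::vector<std::uint8_t>& msg,
                              std::size_t offset) {
  std::uint32_t value;
  std::memcpy(&value, msg.data() + offset, sizeof value);
  return value;
}
```

The fields are written in host byte order, which matches the same-endianness invariant assumed elsewhere in this deck. A real implementation would have to follow the byte-order rules of the protocol.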

Request Header Caching
• First-level specialization: cache only the Request Header part
• Everything else in the request is variable
• Avoid marshaling and de-marshaling costs for the header part alone
• Implemented at the client side



Optimization
• Cache GIOP Request Header part

Performance Improvements
• Roundtrip throughput improved by ~50-100 calls/sec

CORBA Compliance & Automation
• Realized by macros
• Not much gain by doing this


Message Specific Header Caching
• Cache both the Request Header and the Message Specific Header
  • Object Key is the same
  • Service context information is the same
  • Operation name is the same, e.g., get_data

Server side: only when thread-per-connection is used
GIOP formats: only for GIOP 1.2, as in 1.0 and 1.1 service contexts are written first

TAO Implementation
• Move the buffer pointer to the start of the data segment
• Write out the arguments for the call
• Update the total size of the request (SIZE) and the REQUEST_ID field in the request



Optimization
• Cache Request Header + Request Message

Performance Improvements
• Roundtrip throughput improved by ~300-350 calls/sec (~5%)
• Latency ~3 µsecs (~5%)

CORBA Compliance & Automation
• Compliant with CORBA specification (service contexts)
• Realizable by using policies at object level at client side


Intent
• Instead of caching only the header (Request + Message specific), pre-create the entire CORBA request

Model Invariants
• Timer component sends “trigger” (heartbeats) to the recipient component; timeouts are a similar situation
• Request and data contents are the same

Proposed TAO implementation
• Special IDL flag that will pre-create (marshal) the request
• The same request is sent each time
• Update only the request ID of the request
• Save the cost of request construction and marshaling
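A minimal sketch of the proposal, under the assumption of a made-up message layout (not real GIOP): the whole request is marshaled once, and each send only overwrites the request ID in place. `PrecreatedRequest` is a hypothetical name, and the IDL flag that would generate such code is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical layout (not real GIOP): a 4-byte request ID followed by the
// operation name and the already-marshaled, never-changing arguments.
class PrecreatedRequest {
 public:
  PrecreatedRequest(const std::string& operation,
                    const std::vector<std::uint8_t>& args)
      : buffer_(4) {
    // Marshal the complete request once, for example at start-up.
    buffer_.insert(buffer_.end(), operation.begin(), operation.end());
    buffer_.insert(buffer_.end(), args.begin(), args.end());
  }

  // Patch the request ID in place; nothing else is rebuilt or re-marshaled.
  const std::vector<std::uint8_t>& next(std::uint32_t request_id) {
    std::memcpy(buffer_.data(), &request_id, sizeof request_id);
    return buffer_;
  }

 private:
  std::vector<std::uint8_t> buffer_;
};
```

Unlike header caching, no bytes are copied or appended per send, so the per-request cost is a single 4-byte write plus the send itself.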

Optimization
• Entire CORBA Request

Performance Improvements
• Avoids marshaling data completely
• Can eliminate multiple layers by directly sending the request

CORBA Compliance & Automation
• Not compliant with the spec
• IDL compiler can pre-create and generate the entire request

Specialized Request Processing (1/2)

Intent
• Resolve the mapping of incoming requests to the POA, Servant, Skeleton, and operation to which they are dispatched only once, then use these precomputed results to optimize the dispatch of subsequent requests

Model Invariants
• get_data operation invokes an operation on the same component, located in the same POA and serviced by the same servant and operation

Once-Per-Connection Resolution of Dispatch
• TAO provides Active Demultiplexing + Perfect Hashing for an O(1) lookup time bound
• Caching just the POA may not give much performance improvement

Specialized Request Processing (2/2)

TAO Implementation
• As the operation names are the same, we directly cache the skeleton and advance the current buffer pointer to the beginning of the arguments
• The length is calculated only for the first request and reused; the cost is amortized over the number of operations
• Implemented via the TAO_CACHE_SERVANT_REF conditional compilation macro
• $TAO_ROOT/performance-tests/Latency/Single-Threaded
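The idea of resolving the dispatch once per connection can be sketched as below. `Dispatcher` is a hypothetical stand-in for the layered POA, servant, and skeleton lookup, and `CachedConnection` stands in for a connection handler that always sees the same operation. Neither reproduces TAO's actual classes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

using Skeleton = std::function<int(int)>;

// Hypothetical stand-in for the layered POA -> servant -> skeleton lookup.
class Dispatcher {
 public:
  void register_operation(const std::string& poa, const std::string& op,
                          Skeleton skeleton) {
    table_[poa][op] = std::move(skeleton);
  }

  // Resolves the skeleton for (poa, op) and counts the lookup.
  const Skeleton& resolve(const std::string& poa, const std::string& op) {
    ++lookups_;
    return table_.at(poa).at(op);
  }

  // General path: resolve on every request, then call the skeleton.
  int dispatch(const std::string& poa, const std::string& op, int arg) {
    return resolve(poa, op)(arg);
  }

  int lookups() const { return lookups_; }

 private:
  std::map<std::string, std::map<std::string, Skeleton>> table_;
  int lookups_ = 0;
};

// Specialized path for a connection that always invokes the same operation:
// resolve once, then call the cached skeleton directly.
class CachedConnection {
 public:
  CachedConnection(Dispatcher& dispatcher, std::string poa, std::string op)
      : dispatcher_(dispatcher), poa_(std::move(poa)), op_(std::move(op)) {}

  int handle(int arg) {
    if (cached_ == nullptr) cached_ = &dispatcher_.resolve(poa_, op_);
    return (*cached_)(arg);
  }

 private:
  Dispatcher& dispatcher_;
  std::string poa_;
  std::string op_;
  const Skeleton* cached_ = nullptr;
};
```

The cached pointer stays valid only while the registered operation is not removed or replaced, which is one way to see why this kind of caching does not fit dispatch schemes where the servant can change per request.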


Optimization
• Cache skeleton directly

Performance Improvements
• Round-trip latency ~6 µsecs (5%)
• Throughput ~300 calls/sec (~5%)

CORBA Compliance & Automation
• Caching skeletons is not compliant
• Cannot be used in Default Servant and Servant Locator classes
• Provide policies at the POA (now that it is refactored) to implement this layer folding
• Implemented as a separate IIOPConnection handler class

Specialize Marshaling/De-marshaling

Intent
• To mask endianness, the GIOP Request header contains a flag that indicates the endianness of the request
• If the endianness differs, do byte swapping

Model Invariants
• The two machines on which the components are hosted have the same endianness (byte order)
• No checks for byte order required

ACE Implementation
• ACE_CDR streams provide the ACE_SWAP_ON_WRITE and ACE_DISABLE_SWAP_ON_READ macros, which can be used to eliminate checks for byte ordering
• These macros are not set by default. Model interpreters could generate configuration settings to enable them
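The effect of such a build setting can be sketched as follows. The flag `SKETCH_SAME_BYTE_ORDER` and the function `read_ulong` are hypothetical and only illustrate the pattern. They are not the ACE macros named above, and they do not reproduce how ACE_CDR is implemented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical build flag, set (for example by a model interpreter) when
// sender and receiver are known to share a byte order.
#define SKETCH_SAME_BYTE_ORDER 1

// Reverses the byte order of a 32-bit value.
inline std::uint32_t swap32(std::uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
         ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Reads a 32-bit value from a request whose header flag says whether the
// sender's byte order differs from the receiver's.
inline std::uint32_t read_ulong(std::uint32_t wire_value,
                                bool sender_order_differs) {
#if SKETCH_SAME_BYTE_ORDER
  (void)sender_order_differs;  // invariant: never true, so no check remains
  return wire_value;
#else
  return sender_order_differs ? swap32(wire_value) : wire_value;
#endif
}
```

With the flag set, every read skips the byte-order test. The saving is small per field but applies to each marshaled value on both the client and the server side.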


Optimization
• Demarshaling check elimination

Performance Improvements
• Eliminates more than ~10 “if” conditions for a normal CORBA request
• Improvements on both the client and server side
• Used in conjunction with header caching optimizations

CORBA Compliance & Automation
• Conditional compilation techniques

Concluding Remarks & Future Work

• Specialization can be used as a technique for “folding layers” based on system invariants
• The current “first cut” implementation uses conditional compilation strategies. Examine more appropriate strategies for implementing these specializations:
  • Request Header Caching: strategies controlled by svc.conf
  • Specialize Request Processing: POA request processing policy
  • Marshaling/de-marshaling: ACE level
  • Pre-create request: IDL generated code
  • Collocation specialization: macros + strategies (invocation classes)

[Figure: component middleware architecture (Container, Component (Servant), Services, Client, OBJ REF, in args / operation() / out args + return, DII, IDL Stubs, ORB Interface, IDL Skeleton, Object Adapter, ORB Core, GIOP/IIOP/ESIOPs)]

• Examine specialization at the Component Middleware level and the Infrastructural Middleware level
