focus: persistent software attributes

Preserving Distributed Systems’ Critical Properties: A Model-Driven Approach

Cemal Yilmaz, Atif M. Memon, and Adam A. Porter, University of Maryland
Arvind S. Krishna, Douglas C. Schmidt, Aniruddha Gokhale, and Balachandran Natarajan, Vanderbilt University

The Skoll Distributed Continuous Quality Assurance process helps identify viable system and software configurations for meeting stringent QoS and PSA requirements by coordinating the use of distributed computing resources. The authors tested their process using the large, rapidly evolving ACE+TAO middleware suite.

While the connectivity and efficiency advantages of distributed systems ensure their continued expansion, our increasing dependency on distributed resources comes at a cost of reduced predictability. Much of the burden of creating the needed levels of predictability necessarily falls on the software that controls and orchestrates the distributed resources’ behaviors. The need for creating predictability in distributed systems is most often specified in terms of quality-of-service (QoS) requirements, which help define the acceptable levels of dependability with which capabilities such as processing capacity, data throughput, or service availability reach users. For longer-term properties such as scalability, maintainability, adaptability, and system security, we can similarly use persistent software attributes (PSAs) to specify how and to what degree such properties must remain intact as a network expands and evolves over time.

In the past, a common approach to ensuring the long-term persistence of important service properties was simply to freeze all changes after a system’s properties had been sufficiently validated. However, for an increasing range of important QoS-intensive systems, this kind of development process no longer suffices. Today’s global information economy strongly encourages forms of development that bring together participants from across geographical locations, time zones, and business organizations. While such distributed development processes provide a powerful and economically important development model, they unfortunately also increase the churn rates in QoS-intensive software (see the “QoS and PSAs” sidebar).


Such churn typically destabilizes QoS properties such as latency, jitter, and throughput rates as well as (perhaps even more ominously) PSA attributes such as reliability, scalability, efficiency, adaptability, maintainability, and portability. Rapid code changes can also occur due to the use of evolution-oriented processes [1], such as agile practices, where many small code change increments are routinely added to the base system. The challenge in such cases is to develop techniques that can both help coordinate remote developers and allow them to cope more efficiently with frequent software changes. These new techniques must enable distributed teams to detect, diagnose, and correct changes that could damage not only QoS constraints but also the longer-term, investment-preserving PSA properties of such systems.

In practice, budgets for development and in-house quality assurance are often too limited to perform a comprehensive exploration of the software configuration space (see the “Software Configuration Spaces” sidebar). Instead, developers typically assess PSAs for only a few software configurations and then arbitrarily extrapolate their results to the entire configuration space. This approach is risky, however, because it can let major sources of PSA degradation escape detection until systems are fielded. Even worse, since in-house-tested settings are often selected in an ad hoc manner, the points from which such generalizations are made might not even include any actual settings in fielded systems. What we need, then, are QA techniques that can provide a scalable, effective approach to understanding and navigating large software configuration spaces.

We selected and merged two existing techniques to address this challenge: distributed continuous quality assurance (DCQA) [2] and the Skoll model-driven approach to coordinating distributed activities (www.cs.umd.edu/projects/skoll). We developed Skoll DCQA to help automate QoS and PSA compliance in geographically distributed systems and teams. Skoll DCQA approaches QoS and PSA compliance in much the same way as a more traditional distributed data-processing task: by assigning subtasks that can be sent iteratively, opportunistically, and continuously to participating network clients [2].


QoS and PSAs

Quality-of-service and persistent-software-attribute requirements often are tightly linked; for example, a QoS need for minimum service availability might not be achievable if a broader PSA requirement for continued long-term dynamic scalability of the distributed system can’t also be met. QoS-intensive systems are ones in which services must either continually meet a wide range of strict QoS limits or face serious to catastrophic consequences. QoS-intensive systems are necessarily also PSA-intensive, because large-scale distributed systems aren’t sufficiently fixed over time for their QoS certifications to remain adequately stable.

QoS-intensive distributed systems also often have both real-time and embedded properties. Examples include air traffic control systems and electrical power grid control systems, for which a serious failure to meet QoS and PSA requirements could lead to significant damage and even loss of human lives. Other examples of QoS-intensive systems include high-performance scientific-computing systems, in which losing a critical system property at the wrong time could be extremely costly or could lead to the loss of one-of-a-kind data. Examples of scientific QoS-intensive systems include robotic space missions, high-energy physics experiments, and long-running computational fluid dynamics simulations.

Why Read This Article?

Persistent software attributes are akin to quality-of-service guarantees: they both require that specified functional properties remain intact during operational use. The main difference between them is that QoS requirements focus on immediate operational properties, such as throughput and latency, whereas PSAs focus on the resilience of long-term software engineering properties such as scalability, security, and maintainability. PSA and QoS concepts are closely intertwined, since a failure to specify and implement PSA requirements (for example, persistent scalability) can lead directly to serious QoS failures (for example, an inability to handle peak loads due to throughput failures).

This article addresses the difficult problem of how to meet and maintain QoS and PSA requirements in large distributed systems. The authors’ approach, Skoll DCQA, helps keep the PSA validation process manageable through its use of modeling techniques that intelligently guide the process of distributed and continuous validation of properties. The article includes examples of how Skoll DCQA uncovered previously unrecognized problems in a large, well-maintained, widely used middleware framework for distributed systems.

Terry Bollinger, Jeffrey Voas, and Maarten Boasson, guest editors


Skoll DCQA makes it easier and more efficient to automate the analysis and synthesis of customization-related artifacts such as component interfaces, implementations, glue code, configuration files, and deployment scripts [3].

The Skoll DCQA process

The Skoll DCQA process was inspired by earlier work, including the Options Configuration Modeling language [4], the Benchmarking Generation Modeling Language (BGML) [5], commercial systems efforts such as the Netscape Quality Feedback Agent and Microsoft’s XP Error Reporting, the distributed regression test suites that open source projects use, and autobuild scoreboards such as the ACE+TAO Virtual Scoreboard (www.dre.vanderbilt.edu/scoreboard) and Dart (www.public.kitware.com/Dart).

Figure 1 depicts Skoll DCQA’s process architecture; in this case, two of several Skoll client groups are communicating (shown as red bidirectional arrows) with a central collection site. Each collection site comprises a cluster of servers connected via high-speed links (blue bidirectional arrows). The (yellow) callouts show the servers’ contents in terms of the DCQA model and configuration artifacts. We use Skoll’s configuration model to intelligently guide the distribution and continuous execution of Skoll clients across a grid of computing resources. In this process, Skoll provides languages for modeling system configurations and their constraints, algorithms for scheduling and remotely executing tasks, and analysis techniques for characterizing faults [2]. The results of the distributed evaluations are returned to central Skoll servers, where they are fused together to guide subsequent iterations of the Skoll DCQA process.

Skoll DCQA’s analytical cornerstone is its use of a formal model of the software configuration space in which its processes must operate. This formal model makes it possible to capture valid configurations and use them to develop specific, well-defined QA subtasks. A configuration in Skoll is represented formally as a set {(O1, S1), (O2, S2), …, (On, Sn)}, where each Oi is a configuration option and each Si is its value, drawn from the allowable settings. Not all configurations make sense in practice (for example, feature X might not be supported on operating system Y), so Skoll DCQA also includes inter-option constraints that limit the setting of one option based on the setting of another.
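To make this concrete, the following minimal Python sketch shows one way such a configuration model could be represented; the option names are borrowed from later in the article, but the constraint and the data layout are illustrative assumptions, not Skoll’s actual implementation.

    from itertools import product

    # Options O_i and their allowable settings S_i (selection illustrative).
    OPTIONS = {
        "CORBA_MSG": (0, 1),
        "POLLER": (0, 1),
        "CALLBACK": (0, 1),
    }

    # Inter-option constraints as predicates over a configuration; this one
    # encodes "POLLER may be enabled only if CORBA_MSG is enabled" (made up
    # for illustration).
    CONSTRAINTS = [
        lambda cfg: cfg["CORBA_MSG"] == 1 or cfg["POLLER"] == 0,
    ]

    def valid_configurations():
        """Yield every configuration {(O1, S1), ..., (On, Sn)} that
        satisfies all inter-option constraints."""
        names = list(OPTIONS)
        for values in product(*(OPTIONS[n] for n in names)):
            cfg = dict(zip(names, values))
            if all(ok(cfg) for ok in CONSTRAINTS):
                yield cfg

    print(sum(1 for _ in valid_configurations()))  # here, 6 of 8 combinations survive

Scenario 1 later in the article prunes a space of 10 binary options with seven constraints in exactly this way.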

To navigate the remaining valid configurations in the space, Skoll uses an intelligent steering agent (ISA) that assigns subtasks to clients as they become available. The ISA considers four major factors when making such assignments (a simplified sketch follows the list):

■ The configuration model, which characterizes the subtasks that can be assigned legally
■ The summarized results of previously completed subtasks
■ A set of global process goals that define general policies for allocating subtasks, such as testing recently changed features more than unchanged ones
■ Clients’ characteristics and participation-level preferences, as described in a template provided by clients
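In rough Python pseudocode, one assignment step might look like the sketch below, which consumes configurations like those enumerated earlier; the helper names, churn-scoring policy, and job layout are hypothetical rather than Skoll’s actual API.

    CHANGED_FEATURES = {"CALLBACK"}  # hypothetical: options guarding recently changed code

    def client_can_run(cfg, template):
        # Factor 4: the client's template lists the settings its host supports.
        return all(cfg[opt] in allowed for opt, allowed in template.items() if opt in cfg)

    def churn_score(cfg):
        # Factor 3: a global policy, such as favoring recently changed features.
        return sum(1 for opt, enabled in cfg.items() if opt in CHANGED_FEATURES and enabled)

    def next_job(valid_configs, completed, client_template):
        # Factor 1 is implicit: valid_configs already satisfies the configuration model.
        candidates = [cfg for cfg in valid_configs
                      if frozenset(cfg.items()) not in completed   # factor 2
                      and client_can_run(cfg, client_template)]    # factor 4
        if not candidates:
            return None
        chosen = max(candidates, key=churn_score)                  # factor 3
        # Package the subtask with everything the client needs to run it.
        return {"config": chosen,
                "artifacts": ["source tree", "build instructions"],
                "qa_code": ["regression tests", "performance benchmarks"]}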


Software Configuration Spaces

A good strategy for tackling any large problem is first to identify and analyze a smaller subset of that problem, ideally one that’s still large enough to apply to a range of interesting systems. QoS-intensive systems provide just such a solid starting point in the form of dozens to hundreds of software configuration settings in the operating system, middleware, application, compiler, and runtime software of each platform in a distributed system. Focusing on configuration settings limits the problem’s total complexity yet still encompasses a large and interesting set of platform configurations.

An especially useful way to visualize the full range of settings possible in a distributed system is as a multidimensional software configuration space. Each possible total combination of settings on a network platform becomes a single unique point within this space, with the space as a whole encompassing every possible combination of settings. Points that lie close together in this space represent combinations of settings that differ from each other by only one or two individual settings. The total size of the software configuration space depends on the number of settings available for change. If only a small number of settings are treated as “in play,” the resulting subset software configuration space might have no more than a few tens of unique points within it. However, if every setting available on a typical realistic network platform is put into play, the resulting full software configuration space is typically very large indeed, containing millions or more unique points (just 20 independent binary settings, for instance, already define 2^20, or more than a million, combinations).

We can now understand QoS requirements as criteria for determining an “acceptable performance” subset of the overall software configuration space. PSA requirements further limit this QoS subspace to exclude “quick fix” solutions, that is, solutions that might meet QoS constraints in the short term but only at the cost of significantly degrading one or more longer-term PSA properties such as adaptability, portability, scalability, or maintainability.

While the vast size of the full software configuration space ensures the presence of solutions that are both adaptable and portable, it also places enormous demands on developers, who must ensure that all their design decisions keep the system within the boundaries of the QoS- and PSA-compliant subspaces. In particular, our experience has shown that PSA requirements for reliability, portability, and efficiency can’t be reliably assured without first performing an extensive QA analysis of how system requirements interact with operating environments [1]. This is equivalent to mapping out the impact of QoS and PSA requirements over a large subset of the full software configuration space.



Once the ISA has selected a configuration that meets all the factors relevant to a newly available client, it creates a job configuration that includes all the code artifacts, configuration parameters, build instructions, and QA-specific code (such as developer-supplied regression and performance tests) associated with that subtask. The job configuration is then sent to the Skoll client, which executes it and returns the results to the ISA for collection and analysis.

A notable feature of this process is that by looking at the results returned from many clients, the ISA can learn and adapt the overall Skoll DCQA process to make it more efficient. For example, the ISA can identify configurations that repeatedly fail to meet PSAs, thereby enabling developers to concentrate their efforts on fixing these problems. Conversely, once problem configurations have been characterized, the ISA can refocus efforts on other previously unexplored parts of the configuration space that might provide better PSA support.

To help developers implement specific Skoll DCQA processes, the Skoll DCQA environment provides a variety of model-driven tools, such as the BGML. This modeling tool lets developers model interaction scenarios visually, generate test code, reuse QA subtasks across configurations, generate scripts to control subtask distribution and execution by clients, monitor performance and functional behaviors, and evaluate software attributes including correctness, throughput, latency, and jitter.

Skoll DCQA includes a variety of analysis tools that help developers interpret and leverage its often-complex global process results. One such tool provides classification tree analysis (CTA) [6], which creates tree-based models that predict object class assignment based on the values of a subset of object features. CTA is used to diagnose which options and option settings are most likely causing specific PSA test failures, thus helping developers identify root causes of PSA failures quickly.
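CTA builds on the CART techniques described in reference 6. As a rough stand-in for the authors’ tooling, a scikit-learn decision tree can play the same diagnostic role on option-versus-outcome data; the rows below are fabricated to echo the failure pattern reported later in Scenario 2.

    from sklearn.tree import DecisionTreeClassifier, export_text

    options = ["CORBA_MSG", "POLLER", "CALLBACK"]
    # One row of option settings per evaluated configuration, with a label
    # recording whether the PSA test passed or failed there (synthetic data).
    X = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 1], [1, 0, 1]]
    y = ["fail",    "fail",    "pass",    "pass",    "pass",    "pass"]

    tree = DecisionTreeClassifier().fit(X, y)
    # The printed tree shows which option settings best separate failures
    # from passes, pointing developers toward likely root causes.
    print(export_text(tree, feature_names=options))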

Evaluating Skoll DCQA

To evaluate the effectiveness of the Skoll DCQA process at navigating software configuration spaces, we applied it in three different contexts, described next: support of the ACE+TAO code base, an avionics application of ACE+TAO, and early work on the problem of churn-induced degradation of distributed code bases.


Figure 1. The Skoll DCQA process architecture: several Skoll client groups communicate with a central collection site, whose servers hold the DCQA model (configuration model, automatic characterization, adaptation strategies, intelligent steering agent) and the configuration artifacts generated from models (IDL, script, .cpp, and .xml files).



Applying Skoll DCQA to ACE+TAO

In this application, we used Skoll DCQA to analyze and support the popular ACE+TAO middleware suite (www.dre.vanderbilt.edu/Download.html). ACE+TAO provided a good high-end test of the Skoll DCQA process, because it includes about two million lines of continuously evolving C++ frameworks, functional regression tests, and performance benchmarks, packaged in about 4,500 files and receiving an average of over 300 code updates per week. We focused primarily on QoS and PSA requirements affecting latency, throughput, and correctness. Subtasks assigned to the clients included activities such as compiling the system, running regression tests in a particular system configuration, and evaluating system response times under different input workloads.

From this exercise (described in detail elsewhere [2]), we identified three main benefits of using Skoll DCQA:

■ Detecting and pinpointing software portability and availability problems
■ Identifying the subsets of options that had the most impact on specific PSAs
■ Understanding the effects of system changes on PSAs, enabling developers to ensure continued support with an acceptable level of effort

Because the ACE+TAO evaluation was the most controlled of the three we conducted, we were able to use it to try out scenarios, starting with small subset configuration spaces and working toward larger ones. For simplicity and easy comparison of results, we ran all these scenarios on a single operating system (Linux 2.4.9-3) using a single compiler (gcc 2.96); since then, we’ve run other studies across multiple operating systems and compilers.

Scenario 1: Clean compilation. This scenario assessed whether each ACE+TAO feature combination compiled without error. We selected 10 binary-valued compile-time options that control build-time inclusion of features, such as asynchronous messaging, use of software interceptors, and user-specified messaging policies (a detailed explanation of the many ACE+TAO configuration options is available at www.dre.vanderbilt.edu/TAO/docs). We also identified seven inter-option constraints that take forms similar to “if option A is enabled, then option B must be disabled.” This relatively small subset configuration space had 89 valid configurations out of the 1,024 possible combinations.

By executing the Skoll DCQA process on this subset configuration space, we determined that 60 of the 89 valid configurations didn’t even build, a result that was quite a surprise to the ACE+TAO developers. By using CTA on the results from client workstations, we were also able to identify a previously undiscovered bug. This bug centered on a particular line in the TAO source code, and it occurred in eight configurations that all had a specific pair of option settings.

Even though we intentionally kept the selected configuration space small for testing purposes, it was still large enough to contain significant problems that the maintainers of the ACE+TAO code base hadn’t yet identified. Identifying these problem areas in the configuration space helped improve both QoS and PSA support by steering users toward safer regions in the configuration space and by showing ACE+TAO developers what needed to be fixed.

Scenario 2: Testing with default runtime options. This scenario assessed portability and correctness PSAs of ACE+TAO by executing regression tests on each compile-time configuration using the default runtime options (that is, the configuration that new users encounter upon installation). We used the 96 regression tests that are distributed with ACE+TAO, each containing its own test oracle that reported success or failure on exit. We expanded the configuration model to include options that captured low-level operating system and compiler information, for example, indicating the use of static versus dynamic libraries, multithreading versus single-threading, and inlining versus non-inlining. We also added test-specific options to the configuration space because some ACE+TAO tests can only run in particular configurations, such as when multithreading is enabled.

We added one test-specific option per test, called run(Ti), which indicates whether test Ti will run in a given compile-time configuration. We also defined constraints over these options; for example, some tests can run only on configurations that have more than minimum CORBA features. After making these changes, the space had 14 compile-time options with 12 constraints and an additional 120 test-specific constraints.


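As an illustration of how such test-specific options might be derived (the test names and requirements below are invented; the article doesn’t enumerate the actual 120 constraints):

    # Each test Ti carries a predicate over the compile-time configuration;
    # run(Ti) is true exactly when the predicate holds (all names invented).
    TEST_REQUIREMENTS = {
        "T_ami_callback": lambda cfg: cfg["CORBA_MSG"] == 1,
        "T_thread_pool":  lambda cfg: cfg["MULTITHREADED"] == 1,
        "T_full_corba":   lambda cfg: cfg["MINIMUM_CORBA"] == 0,
    }

    def runnable_tests(cfg):
        """Compute run(Ti) for every test under one compile-time configuration."""
        return [test for test, needs in TEST_REQUIREMENTS.items() if needs(cfg)]

    cfg = {"CORBA_MSG": 1, "MULTITHREADED": 0, "MINIMUM_CORBA": 0}
    print(runnable_tests(cfg))  # -> ['T_ami_callback', 'T_full_corba']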

After resolving the constraints, we compiled 2,077 individual tests, of which 98 did not compile. Of the 1,979 tests that did compile, 152 failed, while 1,827 passed. This process took about 52 hours of computer time on the Skoll grid that was available for the experiments.

Subsequent analysis using classification trees showed that in several cases tests failed for the same reason in the same configurations. For example, the analysis showed that test compilation always failed at a given file for the following option settings: CORBA_MSG = 1, POLLER = 0, and CALLBACK = 0. This compilation error stemmed from a previously undiscovered bug that occurred because certain TAO files assumed these settings were invalid and thus couldn’t occur. Using our model-driven DCQA environment and process, we were able to determine whether the current version of ACE+TAO successfully completed all regression tests in its default configuration.

Scenario 3: Regression testing with configurable runtime options. This scenario assessed the functional correctness of ACE+TAO by executing the ACE+TAO regression tests over all settings of their runtime options, such as when to flush cached connections or what concurrency strategies TAO should support. This was the largest scenario in terms of configuration space. We modified the configuration model to reflect six runtime configuration options. Overall, there were 648 different combinations of CORBA runtime policies.

After making these changes, the compile-time option space had 14 options and 12 constraints; there were 120 test-specific constraints and six new runtime options with no new constraints. Thus, the configuration space for this scenario grew to 18,792 valid configurations. At roughly 30 minutes per test suite, the entire test process involved around 9,400 hours of computer time on the Skoll grid.

Several tests failed in this scenario, although they hadn’t failed in Scenario 2 when they were run with default runtime options. These problems were often located in feature-specific code. Interestingly, some tests failed on every single configuration, including the default configuration tested earlier, despite succeeding in Scenario 2. Bugs in option setting and processing code were often the sources of these problems. ACE+TAO developers were intrigued by these findings because in practice they rely heavily on user testing at installation time (Scenario 2) to verify proper installation and provide feedback on system correctness. Our feasibility study raises questions about the adequacy of that approach.

Another group of tests had particularly interesting failure patterns. Three tests failed between 2,500 and 4,400 times (out of 18,792 executions). Using CTA, we automatically discovered that the failures occurred only when ORBCollocation = NO was selected (that is, no other option influenced these failures). This option lets objects in the same address space communicate directly, saving marshaling, demarshaling, and protocol-processing overhead. The fact that these tests worked when objects communicated directly but failed when they talked over the network suggested a problem related to message passing. In fact, further scrutiny revealed the problem’s source was a bug in the ACE+TAO routines for marshaling and demarshaling object references. Our DCQA process thus helped us not only to systematically evaluate the functional-correctness PSA across many different runtime configurations but also to leverage that information to help pinpoint the causes of specific failures.

Applying Skoll DCQA to avionics

Figure 2 gives a high-level overview of how we used Skoll DCQA to analyze and support PSAs for one part of an avionics mission computing system based on Boeing’s Bold Stroke software architecture [7]. This system was developed using the same ACE+TAO middleware described in the first example. This scenario’s goal was to leverage the generative capabilities of our BGML tool [5] and integrate it with Skoll DCQA. There were four major steps in applying Skoll DCQA to the avionics system.


Figure 2. Using Skoll DCQA on an avionics system: a model of the avionics scenario (a 40-Hz timer triggering a GPS component whose position updates reach a navigation display over synchronous and asynchronous communication) feeds code generators that emit interface definition files, source files, and a script file, which Skoll then distributes over the Internet to target machines.



Step 1: Define the application scenario. Using the Generic Modeling Environment model-driven tool suite (www.isis.vanderbilt.edu/Projects/gme), developers created BGML models of both the avionics software system and the PSA-specific evaluation activities that are needed to ensure acceptable operational behavior. The models detailed system configuration options and inter-option constraints, and also captured PSA-specific information such as the metrics calculated in benchmarking experiments, the number and execution frequency of low-level profiling probes, and which event patterns should be filtered out or logged by the system. For example, in the avionics mission computing scenario, we used a three-component, basic single-processor scenario (known as the BasicSP scenario) that receives global-position updates from a GPS device and then displays them on a user interface in real time.

Step 2: Create benchmarks using BGML. In the BasicSP scenario, the GPS component serves as the source for multiple components requiring position updates at regular intervals. This component’s concurrency mechanism should therefore be tuned to serve multiple requests simultaneously. Moreover, the requirement that the desired data request and display frequencies be fixed at 40 Hz is captured in the models. The BGML model interpreter processes these models to generate the lower-level XML-based configuration files, the required benchmarking code (such as IDL files and required header and source files), and the script files necessary for executing the Skoll DCQA process. This step reduces the accidental complexities associated with tedious and error-prone handcrafting of source code for a potentially large set of configurations. The configuration file is input to the Skoll ISA, which schedules the subtasks to execute as Skoll clients become available.
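The article doesn’t show the generated XML, IDL, or script formats, but as a purely illustrative sketch, the interpreter’s output for BasicSP might distill to something like the following (all field and file names are invented):

    # Hypothetical distillation of a BGML-generated benchmark description.
    basic_sp_benchmark = {
        "scenario": "BasicSP",
        "components": ["Timer", "GPS", "NavDisplay"],   # per Figure 2
        "rate_hz": 40,                                  # fixed request/display frequency
        "metrics": ["latency", "jitter", "throughput"], # measured by the benchmark
    }

    def package_job(benchmark, configuration):
        """Combine the generated benchmark description with an ISA-chosen
        configuration into one executable Skoll job (layout hypothetical)."""
        return {"benchmark": benchmark,
                "config": configuration,
                "scripts": ["build.sh", "run_benchmark.sh"]}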

Step 3: Register and download clients. Remote users register with the Skoll infrastructure and obtain the Skoll client software and configuration template that the BGML model interpreter generated. Clients can run periodically at user-specified times, continuously, or on demand.

Step 4: Execute DCQA process. As each client request arrives, the ISA examines its internal rule base and Skoll databases, selects a valid configuration, packages the job configuration, and sends it to the client. The client executes it and returns the results to the Skoll server, which updates its databases and executes any adaptation strategies triggered by the new results.

Generating the required source and configuration files from higher-level models removed the need to handcraft these files. Moreover, the generated code was syntactically and semantically correct, thus eliminating common sources of errors. The Skoll DCQA process just described illustrates how DCQA capabilities can address both performance-related PSAs and functional-correctness PSAs.

Applying Skoll DCQA to change-induced performance degradation

Given its successes in the first two studies described earlier, we were interested in whether we could also apply Skoll DCQA to the problem of ensuring consistent performance in an environment of rapid code change, which can lead to performance degradation over time. We call the strategy we used main effects screening; it is based on the following three steps.

Step 1: Generate a formal model of the most relevant effects. This formal model is based on the Skoll DCQA system configuration model and uses screening designs [8] to help reveal low-order effects, that is, small changes to single, paired, or tripled option settings that have seemingly disproportionate effects on performance. We call the most influential of these option settings main effects. This technique works when the relationships between such settings and performance remain reasonably linear.

Step 2: Execute the resulting model on the DCQA grid. Each task involves running and measuring benchmarks on a single configuration dictated by the experimental model devised in Step 1. As before, we used the model-driven BGML tool to simplify benchmark creation, execution, and analysis.



Step 3: Collect and analyze the data to identify main effects. By first centralizing and then iteratively redistributing the lessons learned by a wide range of participating developers, it becomes possible over time to form an improved consensus on the main effects’ settings and their thresholds. We’ve found that periodically monitoring just these main effects gives an inexpensive and rapid way to detect performance degradations across the configuration space.
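The screening designs of reference 8 are more sophisticated than this, but the core computation can be sketched as a per-option difference of means, which, like the technique itself, assumes roughly linear effects; the function below is a simplification, not the authors’ analysis code.

    from statistics import mean

    def main_effects(results):
        """results: list of (config_dict, measured_latency) pairs from the grid."""
        effects = {}
        for option in results[0][0]:
            on  = [latency for cfg, latency in results if cfg[option] == 1]
            off = [latency for cfg, latency in results if cfg[option] == 0]
            if on and off:
                effects[option] = mean(on) - mean(off)  # large |effect| marks a main effect
        # Options with the largest absolute effects become the cheap,
        # periodically monitored early-warning signals described above.
        return dict(sorted(effects.items(), key=lambda kv: -abs(kv[1])))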

The results from our Skoll DCQA studies have been encouraging. They’ve demonstrated the approach’s value at finding and characterizing “dangerous” regions in configuration spaces, which in turn helps developers ensure that important properties can be preserved as their code base expands over time. We aren’t the only group that has addressed this need via DCQA processes [1,2,9,10], but much more remains to be done. One important challenge is to increase the overall level of automation in DCQA processes. Specifying DCQA processes, for example, is currently a more labor-intensive process than we’d like. Another problem is applying DCQA processes to PSA properties such as usability and maintainability that aren’t easily measured by automated means and so again require labor-intensive human intervention. Other unresolved challenges and risks include how best to structure DCQA processes, what types of QA tasks can be distributed effectively, and how the costs and benefits of DCQA processes compare to conventional in-house QA processes.

To address these issues, we’re working with other researchers in the Remote Analysis and Measurement of Software Systems community (http://measure.cc.gt.atl.ga.us/ramss) to develop the tools, services, and algorithms needed to prototype and test DCQA processes. Scaling is an interesting issue; in general, we expect Skoll DCQA’s quality to improve as the number of participating clients and users increases. This positive scaling effect is reminiscent of the scaling phenomenon that Terry Bollinger postulates is behind the recent rapid economic growth in the use of open source development [11].



Despite such challenges, the overall Skoll DCQA approach of using iterative, model-coordinated, highly distributed evaluations of software properties holds considerable promise for better understanding, characterizing, and improving distributed systems. We look forward to helping this field evolve over the next few years and to evaluating the degree to which such methods end up supporting and validating important properties that must be made persistent over time to be fully usable.

Acknowledgments

We thank Terry Bollinger for extensive comments that greatly improved this article’s form and content. This material is based on work supported by the US National Science Foundation under NSF grants ITR CCR-0312859, CCR-0205265, and CCR-0098158, as well as funding from BBN Technologies, Lockheed Martin, Raytheon, and Siemens.

References

1. A. Orso et al., “Gamma System: Continuous Evolution of Software After Deployment,” Proc. Int’l Symp. Software Testing and Analysis, ACM Press, 2002, pp. 65–69.
2. A. Memon et al., “Skoll: Distributed Continuous Quality Assurance,” Proc. 26th IEEE/ACM Int’l Conf. Software Eng., IEEE CS Press, 2004, pp. 459–468.
3. G. Karsai et al., “Model-Integrated Development of Embedded Software,” Proc. IEEE, vol. 91, no. 1, 2003, pp. 145–164.
4. E. Turkaye, A. Gokhale, and B. Natarajan, “Addressing the Middleware Configuration Challenges Using Model-Based Techniques,” Proc. 42nd Ann. Southeast Conf., ACM Press, 2004.
5. A.S. Krishna et al., “CCMPerf: A Benchmarking Tool for CORBA Component Model Implementations,” Proc. 10th Real-Time Technology and Application Symp. (RTAS ’04), IEEE Press, 2004, pp. 140–147.
6. L. Breiman et al., Classification and Regression Trees, Wadsworth, 1984.
7. D.C. Sharp and W.C. Roll, “Model-Based Integration of Reusable Component-Based Avionics System,” Proc. Workshop Model-Driven Embedded Systems in RTAS 2003, IEEE Press, 2003.
8. C.F.J. Wu and M. Hamada, Experiments: Planning, Analysis, and Parameter Design Optimization, John Wiley & Sons, 2000.
9. J. Bowring, A. Orso, and M.J. Harrold, “Monitoring Deployed Software Using Software Tomography,” Proc. 2002 ACM SIGPLAN/SIGSOFT Workshop Program Analysis for Software Tools and Eng., ACM Press, 2002, pp. 2–9.
10. B. Liblit, A. Aiken, and A.X. Zheng, “Bug Isolation via Remote Program Sampling,” Proc. ACM SIGPLAN 2003 Conf. Programming Languages Design and Implementation (PLDI 03), ACM Press, 2003, pp. 141–154.
11. T. Bollinger, Software Cooperatives: Infrastructure in the Internet Era, www.terrybollinger.com/swcoops/swcoops.


About the Authors

Arvind S. Krishna is a PhD student in the Electrical Engineering and Computer Science Department at Vanderbilt University and a member of the Institute for Software Integrated Systems. He received his MA in management and his MS in computer science from the University of California, Irvine. His research interests include patterns, real-time Java technologies for Real-Time CORBA, model-integrated QA techniques, and tools for partial evaluation and specialization of middleware. He is a student member of the IEEE and ACM. Contact him at the Inst. for Software Integrated Systems, 2015 Terrace Pl., Nashville, TN 37203; [email protected].

Cemal Yilmaz is a PhD student in computer science at the University of Maryland, College Park. His research interests include distributed continuous QA, software testing, performance evaluation, and the application of design-of-experiments theory to software QA. Contact him at the Dept. of Computer Science, Univ. of Maryland, College Park, MD 20742; [email protected].

Atif M. Memon is an assistant professor in the Department of Computer Science at the University of Maryland. His research focuses on program testing, software engineering, artificial intelligence, plan generation, reverse engineering, and program structures. He received his PhD in computer science from the University of Pittsburgh and is a member of the IEEE Computer Society and ACM. Contact him at the Dept. of Computer Science, 4115 A.V. Williams Bldg., Univ. of Maryland, College Park, MD 20742; [email protected].

Adam A. Porter is an associate professor in the Department of Computer Science and the Institute for Advanced Computer Studies at the University of Maryland. His research interests include empirical methods for identifying and eliminating bottlenecks in industrial development processes, experimental evaluation of fundamental software engineering hypotheses, and development of tools that demonstrably improve the software development process. He received his PhD in information and computer science from the University of California, Irvine, and is a member of the IEEE Computer Society and ACM. Contact him at the Dept. of Computer Science, Univ. of Maryland, College Park, MD 20742; [email protected].

Douglas C. Schmidt is a professor in the Electrical Engineering and Computer Science Department at Vanderbilt University and a senior research scientist at the Institute for Software Integrated Systems. His research interests include patterns, optimization techniques, and empirical analyses of software frameworks and domain-specific modeling environments that facilitate the development of distributed real-time and embedded middleware and applications running over high-speed networks and embedded system interconnects. He received his PhD in information and computer science at the University of California, Irvine. Contact him at the Inst. for Software Integrated Systems, 2015 Terrace Pl., Nashville, TN 37203; [email protected].

Aniruddha Gokhale is an assistant professor in the Electrical Engineering and Computer Science Department at Vanderbilt University and a senior research scientist at the Institute for Software Integrated Systems. His research focuses on real-time component middleware optimizations, distributed systems and networks, model-driven software synthesis applied to component middleware-based distributed systems, and distributed resource management. He received his PhD in computer science from Washington University. Contact him at the Inst. for Software Integrated Systems, 2015 Terrace Pl., Nashville, TN 37203; [email protected].

Balachandran Natarajan is a senior staff engineer at the Institute for Software Integrated Systems and a PhD student in electrical engineering and computer science at Vanderbilt University. His research focuses on applying patterns, optimization principles, and frameworks to build high-performance, dependable, and real-time distributed systems. He received his MS in computer science from Washington University. Contact him at the Inst. for Software Integrated Systems, 2015 Terrace Pl., Nashville, TN 37203; [email protected].