
Page 1

Colloquium: Florida Tech Copyright © 2012 Cem Kaner


An Overview of High Volume Test Automation (Early Draft: Feb 24, 2012)

Cem Kaner, J.D., Ph.D.
Professor of Software Engineering
Florida Institute of Technology

Acknowledgments: Many of the ideas presented here were developed in collaboration with Douglas Hoffman.

These notes are partially based on research that was supported by NSF Grant CCLI-0717613 “Adaptation & Implementation of an Activity-Based Online or Hybrid Course in Software Testing.” Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Page 2

Abstract

This talk is an introduction to the start of a research program. Drs. Bond, Gallagher and I have some experience with high volume test automation, but we haven't done formal, funded research in the area. We've decided to explore it in more detail, with the expectation of supervising research students. We think this will be an excellent foundation for future employment in industry or at a university. If you're interested, you should talk with us.

Most discussions of automated software testing focus on automated regression testing. Regression tests rerun tests that have been run before. This type of testing makes sense for testing the manufacturing of physical objects, but it is wasteful for software. Automating regression tests *might* make them cheaper (if the test maintenance costs are low enough, which they often are not), but if a test doesn't have much value to begin with, how much should we be willing to spend to make it easier to reuse?

Suppose we decided to break away from the regression testing tradition and use our technology to create a steady stream of new tests instead. What would that look like? What would our goals be? What should we expect to achieve?

This is not yet funded research; we are still planning our initial grant proposals. We might not get funded, and if we do, we probably won't get anything for at least a year. So, if you're interested in working with us, you should expect to support yourself (e.g. via GSA) for at least a year and maybe longer.


Page 3

Typical Testing Tasks

Analyze the product & its risks
• Benefits & features
• Risks in use
• Market expectations
• Interaction with external S/W
• Diversity / stability of platforms
• Extent of prior testing
• Assess source code

Develop testing strategy
• Pick key techniques
• Prioritize testing foci

Design tests
• Select key test ideas
• Create tests for each idea

Design oracles
• Mechanisms for determining whether the program passed or failed a test

Assess the tests
• Debug the tests
• Polish their design
• Evaluate any bugs found by them

Execute the tests
• Troubleshoot failures
• Report bugs
• Identify broken tests

Document the tests
• What test ideas or spec items does each test cover?
• What algorithms generated the tests?
• What oracles are relevant?

Maintain the tests
• Recreate broken tests
• Redocument revised tests

Manage the test environment
• Set up the test lab
• Select / use hardware/software configurations
• Manage test tools

Keep archival records
• What tests have we run?
• What collections / suites provide what coverage?

Page 4

Regression testing

This is the most commonly discussed approach to automated testing (a sketch of the loop follows this list):
• Create a test case
• Run it and inspect the output
• If the program fails, report a bug and try again later
• If the program passes the test, save the resulting outputs
• In future testing:
  – Run the program
  – Compare the output to the saved results
  – Report an exception whenever the current output and the saved output don't match
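To make the save-and-compare loop concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the ./system_under_test command, the baselines/ directory, and the function names are not from the talk.

```python
import json
import subprocess
from pathlib import Path

BASELINE_DIR = Path("baselines")          # hypothetical folder for saved outputs

def run_program(args):
    """Run the system under test and capture its output (stand-in for a real driver)."""
    result = subprocess.run(["./system_under_test", *args],
                            capture_output=True, text=True)
    return result.stdout

def record_baseline(test_id, args):
    """First run: a human inspects the output, then it is saved for later comparison."""
    output = run_program(args)
    BASELINE_DIR.mkdir(exist_ok=True)
    (BASELINE_DIR / f"{test_id}.json").write_text(
        json.dumps({"args": args, "output": output}))

def rerun_and_compare(test_id):
    """Later runs: re-execute and report an exception whenever outputs don't match."""
    saved = json.loads((BASELINE_DIR / f"{test_id}.json").read_text())
    current = run_program(saved["args"])
    if current != saved["output"]:
        print(f"MISMATCH in {test_id}: expected {saved['output']!r}, got {current!r}")
```

Note how much of the work (choosing the test, judging the first run, maintaining the baseline) still happens outside the code; that is the point of the next slide.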

Page 5

Really? This is automation?

• Analyze product & its risks -- Human
• Develop testing strategy -- Human
• Design tests -- Human
• Design oracles -- Human
• Run each test the first time -- Human
• Assess the tests -- Human
• Save the code -- Human
• Save the results for comparison -- Human
• Document the tests -- Human
• (Re-)Execute the tests -- Computer
• Evaluate the results -- Computer + Human
• Maintain the tests -- Human
• Manage test environment -- Human
• Keep archival records -- Human

Page 6

This is computer-assisted testing, not automated testing.

ALL testing is computer-assisted.

Page 7

Other computer-assistance…

• Tools to help create tests
• Tools to sort, summarize or evaluate test output or test results
• Tools (simulators) to help us predict results
• Tools to build models (e.g. state models) of the software, from which we can build tests and evaluate / interpret results
• Tools to vary inputs, generating a large number of similar (but not the same) tests on the same theme, at minimal cost for the variation
• Tools to capture test output in ways that make test result replication easier
• Tools to expose the API to the non-programmer subject matter expert, improving the maintainability of SME-designed tests
• Support tools for parafunctional tests (usability, performance, etc.)

Page 8

Don't think "automated or not"
• Think continuum: more to less

Not "can we automate"
• Instead: "can we automate more?"

Page 9

A hypothetical

• System conversion (e.g. Filemaker application to SQL)
  – Database application, 100 types of transactions, extensively specified (we know the fields involved in each transaction and know their characteristics via the data dictionary)
  – 15,000 regression tests
  – Should we assess the new system by making it pass the 15,000 regression tests?
  – Maybe to start, but what about the following (sketched after this slide)?
    ° Create a test generator to create high volumes of data combinations for each transaction, THEN:
    ° Randomize the order of transactions to check for interactions that lead to intermittent failures
  – This lets us learn things we don't know, and ask / answer questions we don't know how to study in other ways
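A sketch of what such a generator plus randomized ordering might look like in Python. The transaction names, field value pools, and the execute callback are invented for illustration; a real generator would read the fields and their characteristics from the data dictionary mentioned above.

```python
import random
from itertools import product

# Hypothetical field definitions drawn from a data dictionary. The transaction
# names and the "interesting" value pools are invented for illustration.
DATA_DICTIONARY = {
    "create_customer": {"name": ["", "A", "Søren", "x" * 255],
                        "credit_limit": [0, 1, 999999, -1]},
    "post_invoice":    {"amount": [0.00, 0.01, 1e9, -0.01],
                        "currency": ["USD", "EUR", "XXX"]},
}

def generate_transaction_tests():
    """Expand each transaction type into every combination of its field values."""
    tests = []
    for txn, fields in DATA_DICTIONARY.items():
        names = list(fields)
        for values in product(*(fields[n] for n in names)):
            tests.append((txn, dict(zip(names, values))))
    return tests

def run_randomized_sequence(tests, execute, seed=None):
    """Shuffle the generated transactions and run them back to back,
    hunting for order-dependent (intermittent) failures."""
    rng = random.Random(seed)
    rng.shuffle(tests)
    for txn, payload in tests:
        execute(txn, payload)          # execute() would drive the system under test

if __name__ == "__main__":
    all_tests = generate_transaction_tests()
    run_randomized_sequence(all_tests, execute=lambda t, p: print(t, p), seed=42)
```

Recording the seed makes a randomized run repeatable, which matters once an intermittent failure shows up.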

Page 10

Suppose you decided to never run another regression test. What kind of automation could you do?


Page 11

[Diagram: a matrix relating the things we can vary (inputs: input filters, function, consequences, output filters; combinations; task sequences; file contents: input / reference / config; state transitions; execution environment) to high-volume techniques (fuzzing, sampling system, long-sequence regression) and oracle types (model, reference, diagnostic, constraint).]

Page 12

Issues that Drive Design of Test Automation

• Theory of error: What kinds of errors do we hope to expose?
• Input data: How will we select and generate input data and conditions?
• Sequential dependence: Should tests be independent? If not, what information should persist or drive the sequence from test N to test N+1?
• Execution: How will test suites be run, especially in the case of individual test failures?
• Output data: Which outputs do we observe, and what dimensions of them?
• Comparison data: If detection is via comparison to oracle data, where do we get the data?
• Detection: What heuristics/rules tell us there might be a problem?
• Evaluation: How do we decide whether X is a problem or not?
• Troubleshooting support: What further data collection does a failure trigger?
• Notification: How and when is a failure reported?
• Retention: In general, what data do we keep?
• Maintenance: How are tests / suites updated / replaced?
• Relevant contexts: Under what circumstances is this approach relevant/desirable?

Page 13

Primary drivers of our designs

The primary driver of a design is the key factor that motivates us or makes the testing possible. In Doug's and my experience, the most common primary drivers have been:
• Theory of error
  – We're hunting a class of bug that we have no better way to find
• Available oracle
  – We have an opportunity to verify or validate a behavior with a tool
• Ability to drive long sequences
  – We can execute a lot of these tests cheaply


Page 14

More on … Theory of Error

• Computational errors
• Communications problems
  – protocol errors
  – their-fault interoperability failures
• Resource unavailability or corruption, driven by
  – history of operations
  – competition for the resource
• Race conditions or other time-related or thread-related errors
• Failure caused by toxic data value combinations
  – that span a large portion or a small portion of the data space
  – that are likely or unlikely to be visible in "obvious" tests based on customer usage or common heuristics

Page 15

Simulate Events with Diagnostic Probes

• 1984. First phone on the market with an LCD display.
• One of the first PBXs with integrated voice and data.
• 108 voice features, 110 data features.

Simulate traffic on the system, with (a sketch follows this slide):
• Settable probabilities of state transitions
• Diagnostic reporting whenever a suspicious event is detected
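A minimal sketch of traffic simulation with settable transition probabilities and a diagnostic probe, written in Python. The states, probabilities, and the "stuck on hold" check are assumptions for illustration, not details of the 1984 system.

```python
import random

# Hypothetical call states and settable transition probabilities
# (illustrative only; not the states or rates of the PBX project).
TRANSITIONS = {
    "idle":      [("dialing", 0.7), ("idle", 0.3)],
    "dialing":   [("ringing", 0.8), ("idle", 0.2)],
    "ringing":   [("connected", 0.6), ("idle", 0.4)],
    "connected": [("on_hold", 0.2), ("idle", 0.8)],
    "on_hold":   [("connected", 0.9), ("idle", 0.1)],
}

def step(state, rng):
    """Pick the next state according to its transition probabilities."""
    targets, weights = zip(*TRANSITIONS[state])
    return rng.choices(targets, weights=weights)[0]

def simulate(steps=100_000, seed=1):
    """Drive simulated traffic and report whenever a suspicious event is detected."""
    rng = random.Random(seed)
    state, held_for = "idle", 0
    for i in range(steps):
        state = step(state, rng)
        held_for = held_for + 1 if state == "on_hold" else 0
        if held_for > 50:   # diagnostic probe: a call stuck on hold looks suspicious
            print(f"step {i}: suspicious event, call on hold for {held_for} steps")
            held_for = 0

if __name__ == "__main__":
    simulate()
```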

Page 16

More on … Available Oracle

Typical oracles used in test automation:
• Reference program
• Model that predicts results
• Embedded or self-verifying data
• Checks for known constraints (see the sketch below)
• Diagnostics
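As a small illustration of the "checks for known constraints" item, here is a hedged Python sketch of a constraint oracle; the isqrt_under_test function is a hypothetical stand-in for whatever routine is being checked.

```python
import random

def isqrt_under_test(x):
    """Hypothetical implementation under test (a naive stand-in)."""
    return round(x ** 0.5)

def constraint_oracle(x, result):
    """A correct integer square root r must satisfy r*r <= x < (r+1)*(r+1).
    The check needs no predicted expected value."""
    return result >= 0 and result * result <= x < (result + 1) * (result + 1)

def run_constraint_checks(trials=100_000, seed=0):
    """Generate random inputs and flag any result that violates the constraint."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(32)
        r = isqrt_under_test(x)
        if not constraint_oracle(x, r):
            print(f"constraint violated for input {x}: got {r}")

if __name__ == "__main__":
    run_constraint_checks()
```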

Page 17

Function Equivalence Testing

• MASPAR (the Massively Parallel computer, 64K parallel processors).
• The MASPAR computer has several built-in mathematical functions. We're going to consider the integer square root.
• This function takes a 32-bit word as an input. Any bit pattern in that word can be interpreted as an integer whose value is between 0 and 2^32 - 1. There are 4,294,967,296 possible inputs to this function.
• Tested against a reference implementation of square root.

Page 18

Function Equivalence Test

• The 32-bit tests took the computer only 6 minutes to run and to compare the results to an oracle.
• There were 2 (two) errors, neither of them near any boundary. (The underlying error was that a bit was sometimes mis-set, but in most error cases there was no effect on the final calculated result.) Without an exhaustive test, these errors probably wouldn't have shown up.
• For the 64-bit integer square root, function equivalence testing used random sampling rather than exhaustive testing, because the full set would have required 6 minutes x 2^32. A sketch of both styles follows this slide.
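A sketch of both styles of function-equivalence testing in Python, using math.isqrt as the reference oracle and a deliberately naive candidate as the implementation under test. Exhaustively enumerating all 2^32 inputs is feasible but slow in pure Python; none of this code is from the MASPAR work.

```python
import math
import random

def candidate_isqrt(x):
    """Stand-in for the implementation under test (e.g. a built-in library routine).
    Deliberately naive: float rounding makes it wrong for some large inputs."""
    return int(math.sqrt(x))

def equivalence_test_exhaustive(bits=32):
    """Compare the candidate against the reference oracle for every possible input."""
    for x in range(2 ** bits):
        if candidate_isqrt(x) != math.isqrt(x):
            print(f"MISMATCH at {x}: got {candidate_isqrt(x)}, expected {math.isqrt(x)}")

def equivalence_test_sampled(bits=64, samples=100_000, seed=0):
    """When exhaustive testing is too expensive, fall back to random sampling."""
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.getrandbits(bits)
        if candidate_isqrt(x) != math.isqrt(x):
            print(f"MISMATCH at {x}: got {candidate_isqrt(x)}, expected {math.isqrt(x)}")

if __name__ == "__main__":
    equivalence_test_sampled()
```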

Page 19

This tests for equivalence of functions, but it is less exhaustive than it looks.
(Acknowledgement: From Doug Hoffman)

[Diagram: the system under test and the reference function each receive the intended inputs plus unmonitored influences (program state, system state, configuration and system resources, cooperating processes, clients or servers), and each produces the monitored outputs plus unmonitored effects (program state and uninspected outputs, system state, impacts on connected devices / resources and on cooperating processes, clients or servers). Only the intended inputs and monitored outputs are compared.]

Page 20

More on … Ability to Drive Long Sequences

Any execution engine will (potentially) do:
• Commercial regression-test execution tools
• Customized tools for driving programs with (for example)
  – Messages (to be sent to other systems or subsystems)
  – Inputs that will cause state transitions
  – Inputs for evaluation (e.g. inputs to functions)

Page 21

Long-sequence regression

• Tests are taken from the pool of tests the program has passed in this build.
• The sampled tests are run in random order until the software under test fails (e.g. crashes). A sketch of this loop follows this slide.
• Typical defects found include timing problems, memory corruption (including stack corruption), and memory leaks.
• Recent (2004) release: 293 reported failures exposed 74 distinct bugs, including 14 showstoppers.

Note:
• these tests are no longer testing for the failures they were designed to expose.
• these tests add nothing to typical measures of coverage, because the statements, branches and subpaths within these tests were covered the first time the tests were run in this build.
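A sketch of the long-sequence regression loop in Python. The passing_tests pool, the run_test callback, and the run budget are placeholders; a real harness would also log the seed and execution order so that a failure can be replayed.

```python
import random

def long_sequence_regression(passing_tests, run_test, sample_size=500,
                             max_runs=100_000, seed=None):
    """Sample tests this build has already passed, then rerun them in random order
    until the software under test fails (e.g. crashes) or the run budget is spent.

    `passing_tests` is a pool of test identifiers and `run_test(test)` executes one,
    returning True on pass; both are stand-ins for a real harness."""
    rng = random.Random(seed)
    pool = list(passing_tests)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    executed = 0
    while executed < max_runs:
        rng.shuffle(sample)
        for test in sample:
            executed += 1
            try:
                if not run_test(test):
                    return executed, test            # stop at the first failure
            except Exception as crash:               # crashes are what we hope to provoke
                return executed, (test, crash)
    return executed, None                            # budget exhausted, no failure seen
```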

Page 22

Imagining a structure for high-volume automated testing

Page 23

Some common characteristics

• The tester codes a testing process rather than individual tests. (A skeleton of such a process follows this slide.)
• Following the tester's algorithms, the computer creates tests (maybe millions of tests), runs them, evaluates their results, reports suspicious results (possible failures), and reports a summary of its testing session.
• The tests often expose bugs that we don't know how to design focused tests to look for.
  – They expose memory leaks, wild pointers, stack corruption, timing errors and many other problems that are not anticipated in the specification, but are clearly inappropriate (i.e. bugs).
  – Traditional expected results (the expected result of 2+3 is 5) are often irrelevant.
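A skeleton of that kind of coded testing process in Python: the tester supplies a generator, an execution step, and heuristics for flagging suspicious results, and the harness reports a session summary. All of the names here are hypothetical.

```python
import random

def high_volume_session(generate_test, execute, looks_suspicious,
                        n_tests=100_000, seed=None):
    """Skeleton of a coded testing process: generate tests, run them, flag
    suspicious results, and summarize the session. The three callbacks are the
    tester's algorithms and are stand-ins here."""
    rng = random.Random(seed)
    suspicious = []
    for i in range(n_tests):
        test = generate_test(rng)
        try:
            result = execute(test)
        except Exception as crash:                 # crashes count as suspicious results
            suspicious.append((i, test, repr(crash)))
            continue
        if looks_suspicious(test, result):
            suspicious.append((i, test, result))
    print(f"session summary: {n_tests} tests run, {len(suspicious)} suspicious results")
    return suspicious
```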

Page 24

What can we vary?

• Inputs to functions
  – To check input filters
  – To check operation of the function
  – To check consequences (what the other parts of the program do with the results of the function)
  – To drive the program's outputs
• Combinations of data
• Sequences of tasks
• Contents of files
  – Input files
  – Reference files
  – Configuration files
• State transitions
  – Sequences in a state model
  – Sequences that drive toward a result
• Execution environment
  – Background activity
  – Competition for specific resources
• Message streams

Page 25

How can we vary them?

Fuzzing:
• Random generation / selection of tests
• Execution engine
• Weak oracle (run till crash)

Fuzzing examples (see the sketch after this slide):
• Random inputs
• Random state transitions (dumb monkey)
• File contents
• Message streams
• Grammars

Statistical or AI sampling:
• Test selection optimized against some criteria

Long-sequence regression

Model-based oracle:
• E.g. state machine
• E.g. mathematical model

Reference program
Diagnostic oracle
Constraint oracle
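A tiny example of the fuzzing end of this list in Python: random input generation, a simple execution engine, and a weak oracle (anything other than a clean pass or an expected rejection is reported). The parse_record target is made up for illustration.

```python
import random
import string

def parse_record(data):
    """Made-up target: parse a 'name:age' record. Deliberately fragile."""
    name, age = data.split(":")        # raises on missing ':' or extra fields
    return name, int(age)              # raises on a non-numeric age

def random_text(rng, max_len=20):
    """Random printable input of random length."""
    return "".join(rng.choice(string.printable) for _ in range(rng.randrange(max_len)))

def fuzz(trials=50_000, seed=0):
    """Weak-oracle fuzzing: generate random inputs and report anything unexpected."""
    rng = random.Random(seed)
    unexpected = 0
    for _ in range(trials):
        data = random_text(rng)
        try:
            parse_record(data)
        except ValueError:
            pass                        # input rejected: treated as acceptable here
        except Exception as crash:
            unexpected += 1
            print(f"unexpected failure on {data!r}: {crash!r}")
    print(f"{trials} random inputs, {unexpected} unexpected failures")

if __name__ == "__main__":
    fuzz()
```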

Page 26

[Diagram: repeats the Page 11 matrix relating the things we can vary (inputs: input filters, function, consequences, output filters; combinations; task sequences; file contents: input / reference / config; state transitions; execution environment) to high-volume techniques (fuzzing, sampling system, long-sequence regression) and oracle types (model, reference, diagnostic, constraint).]

Page 27

Issues that Drive Design of Test Automation

• Theory of error: What kinds of errors do we hope to expose?
• Input data: How will we select and generate input data and conditions?
• Sequential dependence: Should tests be independent? If not, what information should persist or drive the sequence from test N to test N+1?
• Execution: How will test suites be run, especially in the case of individual test failures?
• Output data: Which outputs do we observe, and what dimensions of them?
• Comparison data: If detection is via comparison to oracle data, where do we get the data?
• Detection: What heuristics/rules tell us there might be a problem?
• Evaluation: How do we decide whether X is a problem or not?
• Troubleshooting support: What further data collection does a failure trigger?
• Notification: How and when is a failure reported?
• Retention: In general, what data do we keep?
• Maintenance: How are tests / suites updated / replaced?
• Relevant contexts: Under what circumstances is this approach relevant/desirable?

Page 28

About Cem Kaner

• Professor of Software Engineering, Florida Tech
• I've worked in all areas of product development: programmer, tester, writer, teacher, user interface designer, software salesperson, organization development consultant, manager of user documentation, software testing, and software development, and attorney focusing on the law of software quality.
• Senior author of three books:
  – Lessons Learned in Software Testing (with James Bach & Bret Pettichord)
  – Bad Software (with David Pels)
  – Testing Computer Software (with Jack Falk & Hung Quoc Nguyen)
• My doctoral research on psychophysics (perceptual measurement) nurtured my interests in human factors (usable computer systems) and measurement theory.