
Practical GUI Testing of Android Applications via Model Abstraction and Refinement

1 Tianxiao Gu, 1 Chengnian Sun, 2 Xiaoxing Ma, 2 Chun Cao, 2 Chang Xu, 2 Yuan Yao, 3 Qirun Zhang, 2 Jian Lu, 1,4 Zhendong Su

1 University of California, Davis, USA   2 State Key Laboratory for Novel Software Technology, Nanjing University, China   3 Georgia Institute of Technology, USA   4 ETH Zurich, Switzerland

{txgu,cnsun}@ucdavis.edu, {xxm,caochun,changxu,y.yao,lj}@nju.edu.cn, [email protected], [email protected]

Abstract—This paper introduces a new, fully automated model-based approach for effective testing of Android apps. Different from existing model-based approaches that guide testing with a static GUI model (i.e., the model does not evolve its abstraction during testing, and is thus often imprecise), our approach dynamically optimizes the model by leveraging the runtime information during testing. This capability of model evolution significantly improves model precision, and thus dramatically enhances the testing effectiveness compared to existing approaches, which our evaluation confirms. We have realized our technique in a practical tool, APE. On 15 large, widely-used apps from the Google Play Store, APE outperforms the state-of-the-art Android GUI testing tools in terms of both testing coverage and the number of detected unique crashes. To further demonstrate APE's effectiveness and usability, we conduct another evaluation of APE on 1,316 popular apps, where it found 537 unique crashes. Out of the 38 reported crashes, 13 have been fixed and 5 have been confirmed.

Index Terms—GUI testing; mobile app testing; CEGAR;

I. INTRODUCTION

Mobile application (app) testing heavily involves human effort [1]. For example, a human tester writes code to simulate GUI actions (e.g., clicking a button) to drive the execution of Android apps [2]. This process is not only time-consuming but also error-prone. Moreover, when the GUI changes, the tester has to make nontrivial modifications to their existing test scripts [1]. To mitigate these problems, many automated GUI testing techniques have recently been proposed [3], [4].

Automated GUI Testing for Android Apps. Monkey [5], a GUI fuzzing tool developed by Google, generates purely random events [6] as test input without any guidance. Thus, it does not guarantee steering test exploration to uniformly traverse the GUIs (i.e., low activity coverage [7]), and cannot incorporate user-defined rules such as inputting a password [7] or prohibiting logging out. Additionally, the generated events are low-level, with hard-coded coordinates, and usually excessively long, which complicates reproduction and debugging [3], [8]. To overcome Monkey's limitations, techniques such as evolutionary algorithms [9] and symbolic execution [10] have been adapted to guide input generation. However, such approaches are computationally expensive and do not yet scale in practice [11].

An alternative approach for performing Android GUI testing is model-based [4], [12], [13]. A model is usually a finite state machine, where each state has a set of model actions, and each transition between states is labeled with a model action of the source state. In practice, almost no app comes with a model. Existing testing tools thus build a GUI-based model by abstracting/mapping GUI actions to model actions and GUI views to states, respectively.

A model offers at least three types of benefits to GUI testing. First, a model can be used to guide the exploitation of an app. A testing tool can traverse the model using specific guidance to systematically generate action sequences and then replay the action sequences to test the app [4]. Second, a model-based testing tool generates input sequences composed of high-level model actions rather than low-level events, which can facilitate replaying [14]. Third, a proper abstraction can be applied to the model, which in turn can help mitigate the explosion of GUI actions. Through abstraction, many GUI actions with the same behavior can be mapped to the same model action. Since these GUI actions behave the same, the testing tool does not need to exercise each of them and can instead select a representative GUI action among them when executing the model action.

State Abstraction. Mapping each GUI action to a model action is the most critical step in state abstraction. A state is usually identified by the set of its model actions, since states with the same set of model actions can be merged [15]. All existing model-based approaches apply a static abstraction based on certain heuristics throughout the testing of an app. Designing a proper abstraction is challenging. First, if the model is overly fine-grained, the testing tool cannot systematically explore the model due to state explosion. Second, if the model is overly coarse-grained, the testing tool cannot gather sufficiently accurate knowledge on model actions to realize effective guidance (i.e., it is difficult to replay model action sequences on the tested app). In particular, an ineffective abstraction may map multiple GUI actions with different behaviors (e.g., leading to different target states) to the same model action. Therefore, a model action cannot be replayed as expected if the testing tool chooses a GUI action behaving differently from the one chosen during model construction. Accordingly, none of the subsequent actions in the sequence that depend on the model action can be successfully replayed either.

Our Approach. This paper proposes APE, a new, practical model-based automated GUI testing technique for Android apps via effective dynamic model abstraction.


At the beginning, a default abstraction is used to initiate the testing process. This initial abstraction may be ineffective. Based on the runtime information observed during testing, APE gradually refines the model by searching for a more suitable abstraction, an abstraction that effectively balances the size and precision of the model. This dynamic nature distinguishes our approach from all existing state-of-the-art techniques, as they rely solely on static abstractions. Instead of operating on a fixed abstraction granularity, our approach dynamically adjusts the granularity as needed. Specifically, APE represents the dynamic abstraction with a decision tree, and tunes it on-the-fly with the feedback obtained during testing. This decision tree-based representation of model abstractions greatly improves testing effectiveness, as demonstrated in our comprehensive evaluation (Section IV).

We compared APE with the state-of-the-art testing tools (i.e., Monkey [5], SAPIENZ [3], and STOAT [4]) on 15 large, widely-used apps from the Google Play Store. APE complements the state-of-the-art tools (see Section IV-C). APE achieved higher code coverage and detected more unique crashes (i.e., unique stack traces as defined in Section IV-C) than the other tools on the 15 benchmark apps. Specifically, APE consistently resulted in 26–78%, 17–22%, and 14–26% relative improvements over the three tools in terms of average activity coverage, method coverage, and instruction coverage, respectively. Though these apps have been well tested, within one hour APE managed to find 62 unique crashes by testing their GUIs, whereas Monkey found 44, SAPIENZ 40, and STOAT 31. To further demonstrate the usability and effectiveness of APE, we conducted another large-scale evaluation on 1,316 apps from the Google Play Store. APE found 537 crashes from 281 apps in total. We reported 38 crashes to developers with detailed steps to reliably reproduce these crashes, where 13 crashes have already been fixed and 5 crashes have been confirmed (fixes pending).

Contributions. This paper makes the following contributions:

• We propose a novel, fully automated, model-based technique for Android GUI testing. The major difference of our approach from existing techniques is the ability to dynamically evolve GUI models toward ones that discard all irrelevant GUI details while faithfully reflecting the runtime states of the app under test.

• We realize dynamic model abstraction via a novel type of decision trees, which can expressively represent a wide range of abstractions and thus enables APE to effectively and dynamically refine and coarsen abstractions to balance model precision and size.

• Our extensive evaluation demonstrates that APE outperforms the state-of-the-art tools. It can automatically explore many more GUIs (i.e., activities) than the other tools, and thus not only increased code coverage but also found more unique crashes.

• We implement the proposed technique into a practical tool, APE, and make it publicly available.1

1 http://gutianxiao.com/ape

Paper Organization. The remainder of this paper is organized as follows. Section II introduces necessary background. Section III presents the approach. Section IV details our extensive evaluation. Section V surveys related work, and Section VI concludes.

II. BACKGROUND

This section introduces relevant background on model-based Android GUI testing and its challenges.

A. GUI of Android Apps

In an Android app, an activity [16] is a composition of widgets. These widgets are organized into a tree-like structure, named GUI tree in this paper [4], [12], [17]. A widget can be a button, a text box, or a container with a layout. It supports various GUI actions such as clicking and swiping. A widget has four categories of attributes describing its type (e.g., class), appearance (e.g., text), functionalities (e.g., clickable and scrollable), and the designated order among sibling widgets (i.e., index). Each attribute is a key-value pair. We use i, c, and t to denote the key of an index, class, and text attribute, respectively. For example, an index attribute whose value is 0 can be denoted by i = 0.
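To make the attribute notation concrete, here is a minimal Python sketch (illustrative only, not APE's internal representation) of a widget's attributes as a key/value mapping:

# Illustrative only: a widget's attributes as one key/value mapping
# covering the four categories named above.
widget = {
    "i": 0,             # designated order among siblings (index)
    "c": "TextView",    # type (class)
    "t": "XLSX",        # appearance (text)
    "clickable": True,  # functionality
}
assert widget["i"] == 0  # the paper writes this attribute as i = 0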

[Figure 1 shows two GUI snapshots (a, b) and their GUI trees (c, d). Tree Ti is rooted at wi0 (a ListView with index 0) with children wi1 (index 0, TextView, XLSX), wi2 (index 1, TextView, PPTX), and wi3 (index 2, TextView, DOCX). Tree Tj is rooted at wj0 and lists the same files reordered: wj1 (index 0, DOCX), wj2 (index 1, XLSX), wj3 (index 2, PPTX).]

Fig. 1: Figs. 1a and 1b are two GUI snapshots of Google Drive. Figs. 1c and 1d are their corresponding GUI trees.

A GUI tree T is a rooted, ordered tree where each node w is a GUI widget that has a set of attributes, denoted by attributes(w). The Android SDK provides a tool [18] to obtain the GUI tree of an activity. Figs. 1c and 1d show the corresponding (simplified) GUI trees for Figs. 1a and 1b, respectively. Since only the bold parts are of interest in this paper, we assume that Ti and Tj are rooted at wi0 and wj0, respectively.

B. Attribute Path

A testing tool needs to properly identify widgets to accumulate testing knowledge on them. The memory address of the runtime object representing the widget cannot be used to identify the widget, because the same GUI may be created and disposed many times. In a GUI tree T, given a widget wn, a node path ω = 〈w1, w2, · · · , wn〉 is a sequence of tree nodes (i.e., widgets) that are on the traversal path from one of its ancestors w1 to wn. This node path is referred to as an absolute node path if the first node w1 is the root of the tree, and otherwise a relative node path, which is similar to the concept of absolute and relative file paths in file systems. An absolute node path uniquely identifies a widget in the tree. For example, wi1's unique node path in Fig. 1c is 〈wi0, wi1〉.

Definition 1 (Attribute Path): Given a widget wn and one of its node paths ω = 〈w1, w2, · · · , wn〉, an attribute path π = 〈a1, a2, · · · , an〉 is a projection of ω such that

∀ i ∈ [1, n]: ai ⊆ attributes(wi).

That is, each ai is a subset of the attributes of the corresponding widget wi.

Definition 2 (Full Attribute Path): An attribute path π = 〈a1, a2, · · · , an〉 is a full attribute path if
• its ω = 〈w1, w2, · · · , wn〉 is an absolute node path, and
• ∀ i ∈ [1, n]: ai = attributes(wi).

To distinguish it from an arbitrary attribute path, we use σ to denote a full attribute path. In this paper, a widget can be uniquely identified (in the tree) by its full attribute path, because the order of the widgets on the full attribute path resembles the hierarchy of widgets in the tree, and the index attribute of each widget determines the unique location of the widget among its siblings. Therefore, a GUI tree T can essentially be represented with, and is equivalent to, the set of full attribute paths of all its widgets. Table I shows the full attribute path for each widget in Fig. 1.

Definition 3 (Attribute Path Reduction): An attribute path reducer is a function R that takes as input an attribute path π = 〈a1, a2, · · · , an〉, and returns a new attribute path π′ = 〈bm, · · · , bn〉 such that

1 ≤ m ≤ n ∧ ∀ i ∈ [m, n]: bi ⊆ ai.

In other words, π′ is a suffix of π, and each element of π′ is a subset of the corresponding element of π.

Widget Assumption. A GUI action is identified by its widget σ and action type τ, i.e., 〈σ, τ〉. For a simpler presentation, we assume that every widget supports only one GUI action, and omit action types and functionality attributes throughout the paper. The relation between widgets and GUI actions then becomes bijective, and we use them (i.e., σ and 〈σ, τ〉) interchangeably. Note that this assumption is for illustration only — our technique also supports widgets with multiple actions by using functionality attributes (see Section III-B).

[Figure 2: a model-based testing tool, consisting of a GUI model, an abstraction, and an exploration strategy, iteratively injects events into the app and dumps a GUI tree from it.]

Fig. 2: The typical workflow of model-based GUI testing.

C. Model-Based Android GUI Testing

Fig. 2 depicts the typical workflow of a model-based testing tool [4], [12], [15], [17]. Such a tool interacts iteratively with the app under test. Initially, the testing tool starts with an empty state machine as the model. During each iteration, the testing tool (1) obtains the current GUI tree of the app, (2) identifies an existing, corresponding state, or creates a new state for this GUI tree, and (3) picks a model action and determines a concrete GUI action to interact with the app.

A model-based testing tool aims at exploring the app to discover new widgets and exploiting the app to exercise interactions among the discovered widgets. In practice, modern Android apps usually have a large number of widgets. To mitigate this scalability challenge, a model-based testing tool further attempts to identify equivalent GUI actions and abstract them to the same model action to reduce the search space. However, it is nontrivial to determine whether two widgets are equivalent. A full attribute path usually contains irrelevant information, which prevents two "semantically" equivalent widgets from being discovered. For example, wi1 in Fig. 1c and wj2 in Fig. 1d are essentially equivalent but have different full attribute paths, as shown in Table I. Relying solely on full attribute paths, a testing tool will view the two widgets differently, and the model size is consequently increased.

State Abstraction. State abstraction refers to the procedure that identifies equivalent GUI trees and actions, and maps them to the same state and the same model action, respectively. In general, state abstraction is realized based on the similarity of full attribute paths of GUI actions [4], [12], [15]. Specifically, state abstraction determines two GUI actions as equivalent if (1) they have the same action type and (2) their full attribute paths can be reduced to the same attribute path π by removing irrelevant attributes under certain reduction rules. The corresponding model action is denoted as π (〈π, τ〉 when action type τ is considered). Similarly, state abstraction determines two GUI trees as equivalent if all of their GUI actions can be reduced to the same set of attribute paths. The corresponding model state is denoted as the set of all model actions (i.e., the set of attribute paths π) [15].

However, an automated testing tool has no prior knowledge of the apps under test and can only define reduction rules heuristically. Note that a testing tool can apply different reductions to different widgets to achieve a proper abstraction granularity. For example, STOAT [4] assumes that child widgets of a ListView behave equivalently and realizes this heuristic in its state abstraction. Specifically, STOAT removes all attributes from child widgets of a ListView, and removes only types and texts from other widgets.


TABLE I: Model actions (i.e., reduced attribute paths) by different abstractions.

Tree | Widget | Full Attribute Path | STOAT | AMOLA (C-Lv4) | AMOLA (C-Lv5) | Text-Only
Ti | wi0 | 〈{i=0, c=LV, t=∅}〉 | 〈{i=0}〉 | 〈{i=0}〉 | 〈{i=0, t=∅}〉 | 〈{i=0}〉
Ti | wi1 | 〈{i=0, c=LV, t=∅}, {i=0, c=TV, t=XLSX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=0}〉 | 〈{i=0, t=∅}, {i=0, t=XLSX}〉 | 〈{i=0}, {t=XLSX}〉
Ti | wi2 | 〈{i=0, c=LV, t=∅}, {i=1, c=TV, t=PPTX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=1}〉 | 〈{i=0, t=∅}, {i=1, t=PPTX}〉 | 〈{i=0}, {t=PPTX}〉
Ti | wi3 | 〈{i=0, c=LV, t=∅}, {i=2, c=TV, t=DOCX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=2}〉 | 〈{i=0, t=∅}, {i=2, t=DOCX}〉 | 〈{i=0}, {t=DOCX}〉
Tj | wj0 | 〈{i=0, c=LV, t=∅}〉 | 〈{i=0}〉 | 〈{i=0}〉 | 〈{i=0, t=∅}〉 | 〈{i=0}〉
Tj | wj1 | 〈{i=0, c=LV, t=∅}, {i=0, c=TV, t=DOCX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=0}〉 | 〈{i=0, t=∅}, {i=0, t=DOCX}〉 | 〈{i=0}, {t=DOCX}〉
Tj | wj2 | 〈{i=0, c=LV, t=∅}, {i=1, c=TV, t=XLSX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=1}〉 | 〈{i=0, t=∅}, {i=1, t=XLSX}〉 | 〈{i=0}, {t=XLSX}〉
Tj | wj3 | 〈{i=0, c=LV, t=∅}, {i=2, c=TV, t=PPTX}〉 | 〈{i=0}, ∅〉 | 〈{i=0}, {i=2}〉 | 〈{i=0, t=∅}, {i=2, t=PPTX}〉 | 〈{i=0}, {t=PPTX}〉

* LV and TV are abbreviations for ListView and TextView, respectively.

For the examples in Fig. 1, STOAT maps all child widgets of the ListView to the same model action (i.e., 〈{i = 0},∅〉). As shown in Table I and Fig. 3a, STOAT assigns both Figs. 1a and 1b to the same state (i.e., state 1), since they have the same set of model actions. In contrast, AMOLA supports five static abstractions; its C-Lv5 criterion [12] takes into account both the indices and the text content, and identifies a model action for each widget. As shown in Table I and Fig. 3c, AMOLA assigns Figs. 1a and 1b to two states, i.e., states 3 and 4, respectively.

[Figure 3 shows four partial models: (a) STOAT yields one state (1) whose single action 〈{i=0},∅〉 leads to states W, X, and P; (b) AMOLA (C-Lv4) yields one state (2) with index-based actions 〈{i=0},{i=0}〉, 〈{i=0},{i=1}〉, 〈{i=0},{i=2}〉; (c) AMOLA (C-Lv5) yields two states (3 and 4) with index-and-text actions; (d) Text-Only yields one state (5) with actions 〈{i=0},{t=DOCX}〉, 〈{i=0},{t=PPTX}〉, 〈{i=0},{t=XLSX}〉.]

Fig. 3: Partial models. Each circle is a state and each edge is a state transition labeled with a model action. States W, X and P are w.r.t. GUI trees of file viewers for DOCX, XLSX and PPTX, respectively. Other states are w.r.t. GUI trees in Fig. 1. Model actions of the ListView (i.e., wi0 and wj0) are omitted.

A good state abstraction should balance the trade-off between model size and model precision. On the one hand, the state abstraction should be coarse enough to avoid state explosion by tolerating differences between GUIs that are irrelevant to testing. For example, AMOLA builds very fine-grained models, but the models can quickly explode in size as they incorporate much testing-irrelevant GUI information. In contrast, STOAT does not have this problem as it builds a coarse-grained model.

On the other hand, the state abstraction should precisely model the runtime states of an app in order to interact with the app. For example, AMOLA (C-Lv4) keeps indices only, and thus can distinguish different file items in the same GUI but not among different GUIs. STOAT cannot distinguish different file items even in the same GUI. As shown in Fig. 3a, the model action 〈{i = 0},∅〉 actually represents a set of inequivalent widgets and leads to three different target states (i.e., W, X and P). When the guidance in STOAT prompts it to open a DOCX file (i.e., reach state W), the actually opened file may be a PPTX or XLSX file (i.e., the actual target state may be X or P). This divergent behavior will misguide the testing tool.

A proper way to identify model actions here is to include the file name (text) only. This state abstraction identifies three actions and a single state (i.e., state 5 in Fig. 3d) for the GUIs in Figs. 1a and 1b, and also avoids mapping inequivalent widgets to the same model action. Unfortunately, none of the existing techniques supports such a proper abstraction. To address these difficulties, we propose a technique that can dynamically optimize the abstraction during testing. Specifically, we start the testing from a default abstraction. During the testing, we systematically and efficiently improve the abstraction toward one that balances the size and precision of the model.

III. APPROACH

A. Model

Definition 4 (Model): A model M in APE is a tuple (S, A, T, L), where
• S is the set of states. Each S ∈ S is a set of attribute paths.
• A is the set of model actions. Each model action π ∈ A is an attribute path, and A = ⋃_{S∈S} S.
• T ⊆ S × A × S is the set of state transitions. Each transition (S, π, S′) has a source state S, a target state S′, and a transition label that is a model action π.
• L is the abstraction function, which reduces a full attribute path σ to an attribute path π, i.e., L(σ) = π.

We also record every GUI transition between iterations. Each GUI transition is a tuple (T, σ, T′), where T and T′ are GUI trees and σ is the chosen GUI action for simulation.

To facilitate the presentation, we overload L to also process a GUI tree and a GUI transition. Given a GUI tree T (i.e., a set of full attribute paths), L finds its corresponding state S,

S = L(T) = {π | π = L(σ) ∧ σ ∈ T}.

And given a GUI transition (T, σ, T′), L finds its corresponding state transition (S, π, S′),

(S, π, S′) = L((T, σ, T′)) = (L(T), L(σ), L(T′)).

Algorithm 1 sketches the workflow of model-based testing shown in Fig. 2. Initially, the model M is empty. At the beginning of each iteration, S and π refer to the state and the chosen model action of the previous iteration.


Algorithm 1: Model construction.
Input: Testing budget B, initial abstraction function L.
Output: The model M.

1  (M, S, π) ← ((∅, ∅, ∅, L), ∅, ∅)        ▷ Initialization
2  while B > 0 do
3      B ← B − 1                            ▷ Decrease the testing budget
4      T′ ← CaptureGUITree()                ▷ Use uiautomator
5      M ← UptAndOptModel(M, S, π, T′)      ▷ See Algorithm 2
6      S ← L(T′)                            ▷ Get the current state
7      π ← SelectAndSimulateAction(S)       ▷ See Section III-D
8  return M

Next, function UptAndOptModel adds the state L(T′) and the state transition (S, π, L(T′)) to the model, and also attempts to optimize the model as needed. Finally, we update S with the current state L(T′) and determine a new model action π on L(T′). The model action is determined by the function SelectAndSimulateAction, which will be informally described in Section III-D. When executing π, we first determine the set of GUI actions that map to π (i.e., {σ | σ ∈ T ∧ L(σ) = π}) and then randomly choose one of them to simulate.
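The loop of Algorithm 1 can be summarized by the following Python sketch; all parameters (capture_gui_tree, upt_and_opt_model, and so on) are placeholders we introduce for the components described above, not actual APE APIs:

import random

def model_based_testing(budget, model, capture_gui_tree, upt_and_opt_model,
                        abstract, select_action):
    """Sketch of Algorithm 1's loop with placeholder components."""
    state, action = None, None
    while budget > 0:
        budget -= 1
        tree = capture_gui_tree()                              # line 4: dump the GUI tree
        model = upt_and_opt_model(model, state, action, tree)  # line 5: Algorithm 2
        state = abstract(model, tree)                          # line 6: S <- L(T')
        action = select_action(model, state)                   # line 7: Section III-D
    return model

def pick_gui_action(tree, L, pi):
    """Executing a model action pi: randomly pick one of the GUI actions
    in the current tree that L abstracts to pi."""
    candidates = [sigma for sigma in tree if L(sigma) == pi]
    return random.choice(candidates)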

B. Dynamic Abstraction Functions

As aforementioned in Section I, the key novelty of APE is its dynamic abstraction function that can dynamically adapt the model abstraction to balance model precision and size based on runtime information. However, realizing dynamic abstraction functions is nontrivial. First, a dynamic abstraction function should be adaptable at runtime. All previous approaches use a statically crafted abstraction function with a fixed abstraction granularity; adapting the abstraction hence requires changing the source code implementing the static abstraction function. Second, a dynamic abstraction function should be generalizable. For the examples in Fig. 1, an abstraction function should apply not only to Ti and Tj but also to any GUI trees after reordering files. We cannot simply use a hash map between full attribute paths and attribute paths as the abstraction function, because new GUI trees and new attribute paths are continuously being discovered as the testing progresses, and they are not in the hash map of such an abstraction function. Third, a dynamic abstraction should also be human-interpretable so that users can incorporate critical rules to further improve the dynamic abstraction.

Different from existing approaches, we represent an abstraction function as a decision tree that determines a reducer for a given σ. A decision tree is a rooted tree, where each node is a reducer R and each edge (branch) is labeled with an attribute path π. Here, π is called the selector of the branch.

Selector. A branch selects a full attribute path σ if σ can be reduced to the selector π of the branch. Given a GUI tree T, π selects the set of full attribute paths in T that can be reduced to π, i.e., {σ | σ ∈ T ∧ ∃R : R(σ) = π}.

Decision Procedure. To determine the reducer for a given σ, we start from the root node of the decision tree and check whether σ can be selected by any branch of the current node. If yes, we move to the target node n of the branch and continue to recursively check the branches of n. Otherwise, the reducer of n is used as the output to reduce σ, and n is referred to as the output node in later discussions. As a function, an important property of the decision tree is that it should determine one and only one reducer for σ. To guarantee this property, we enforce that any σ is selected by at most one branch.

Reducer. A reducer is the aggregation of a set of primitive reducers. We first define two types of primitive reducers.

Definition 5 (Local Reducer): Let A denote a set of attribute keys. Given a full attribute path σ = 〈a1, a2, . . . , an〉, a local reducer RA removes a1, a2, . . . , an−1 and retains the attributes in A for an:

RA(σ) = 〈{(k, v) | (k, v) ∈ an ∧ k ∈ A}〉.

In this paper, we use four types of local reducers: Rc to select the type attributes, Rt to select the appearance attributes, Ri to select the index attribute, and R∅ to select no attribute.

Definition 6 (Parent Reducer): Given a full attribute path σ = 〈a1, a2, . . . , an〉, a parent reducer Rp reuses the output of the parent full attribute path 〈a1, a2, . . . , an−1〉 and retains nothing for an:

Rp(σ) = L(〈a1, a2, . . . , an−1〉) ⊕ R∅(σ),

where ⊕ is the concatenation of sequences.

Definition 7 (Reducer Aggregation): Given a full attribute path σ = 〈a1, a2, . . . , an〉 and two reducers R and R′, suppose that R(σ) = 〈bm, bm+1, . . . , bn〉, R′(σ) = 〈ck, ck+1, . . . , cn〉 and m ≤ k. The aggregation R ⋈ R′(σ) is the reverse element-wise union of R(σ) and R′(σ):

R ⋈ R′(σ) = 〈bm, . . . , bk−1, bk ∪ ck, bk+1 ∪ ck+1, . . . , bn ∪ cn〉.

We have defined five primitive reducers in this paper and obtain 24 reducers in total (denoted by R), since R∅ ⋈ R = R. We also define a partial order over all reducers in R. We say that a reducer R′ is finer than another reducer R (i.e., R′ ⊏ R) if R′ consists of more primitive reducers than R.

Multiple Action Types. Recall that we assumed in Section II-A that a widget supports only one action type. To support multiple action types, we can represent a GUI action as 〈σ, τ〉 and a model action as 〈π, τ〉. A selector π selects all 〈σ, τ〉 in which σ can be reduced to π. The decision tree can still be used to realize L(〈σ, τ〉) = 〈π, τ〉.

[Figure 4: a decision tree whose root node holds the reducer Ri ⋈ Rp and has one branch with selector 〈{c=LV},∅〉 leading to a node with reducer Rp.]

Fig. 4: The decision tree of STOAT's abstraction function (Ls).

Example. The decision tree in Fig. 4 implements STOAT's abstraction function (denoted by Ls), which we will use as the starting abstraction function to illustrate the dynamic optimization of our approach in Section III-C. Take widgets wi0 and wi1 in Fig. 1c as examples. Ls assigns Ri ⋈ Rp to wi0, because wi0 is not a child of a ListView and cannot be selected by 〈{c = LV},∅〉. Ls assigns Rp to wi1, because wi1 is a child of a ListView and can be selected by 〈{c = LV},∅〉. Since wi0 is the root and has no parent, Rp selects nothing for it; the attribute path of wi0 is actually created by Ri, which yields 〈{i = 0}〉. Since wi1 is a child of wi0, Rp reuses 〈{i = 0}〉 for wi1. The final attribute path of wi1 is 〈{i = 0},∅〉.
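The decision procedure and Fig. 4's tree can be sketched as follows (hypothetical structures; the subset-based selector test is our simplification of "σ can be reduced to π"):

def selects(selector, sigma):
    """True if the selector attribute path is a suffix-wise subset of sigma."""
    if len(selector) > len(sigma):
        return False
    tail = sigma[-len(selector):]
    return all(sel.items() <= attrs.items() for sel, attrs in zip(selector, tail))

def decide(node, sigma):
    """Walk the tree: follow the (unique) selecting branch if any,
    otherwise use this node's reducer as the output reducer for sigma."""
    reducer, branches = node
    for selector, child in branches:
        if selects(selector, sigma):
            return decide(child, sigma)
    return reducer

# Fig. 4 (STOAT's Ls): root reducer "Ri ⋈ Rp"; one branch with selector
# <{c=ListView}, ∅> leading to a node with reducer "Rp".
L_s = ("Ri⋈Rp", [([{"c": "ListView"}, {}], ("Rp", []))])

wi0 = [{"i": 0, "c": "ListView", "t": None}]
wi1 = wi0 + [{"i": 0, "c": "TextView", "t": "XLSX"}]
print(decide(L_s, wi0))  # Ri⋈Rp (the root is not a ListView child)
print(decide(L_s, wi1))  # Rp    (selected by <{c=LV}, ∅>)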

C. Optimizing Abstraction Functions

APE starts with an initial abstraction function and continuously refines and coarsens it toward a proper model precision. Specifically, refinement either replaces the reducer R of a leaf output node in the decision tree with a reducer R′ that is not coarser (i.e., R′ ̸⊐ R) or inserts a new branch with a finer reducer R′ (i.e., R′ ⊏ R) at an output node, while coarsening reverts the current abstraction function L to the previous one L′ from which L was refined. Hence, the initial abstraction function represents the lower bound of abstraction and should be coarse enough.

An extremely coarse abstraction function would map all GUI actions to the same model action and all GUI trees to the same state. Such a function builds a very simple model, and this model has no non-deterministic transitions, because every GUI tree is mapped to the same state. To avoid this trivial case, we first refine the abstraction function until no model action abstracts more than α GUI actions. Second, a state transition (S, π, S′) is non-deterministic if there is another transition (S, π, S″), where S′ ≠ S″. Such a pair of non-deterministic transitions usually indicates that the abstraction of the source state S or the model action π is overly coarse and needs to be refined. Hence, we refine the abstraction function aiming to eliminate every non-deterministic transition. Third, the previous refinement monotonically increases the model size, and may eventually lead to a state explosion. Thereby, we revert the abstraction function L to the previous one L′ if L refines a state created by L′ into β new states. Here, α and β are two configurable threshold values.

[Figure 5: the decision tree evolves in four steps: (1) refinement appends a branch with selector 〈{i=0},∅〉 and reducer Ri ⋈ Rp to Ls, yielding L1; (2) a further refinement to the finer reducer Ri ⋈ Rt ⋈ Rp yields L2; (3) L2 is reverted back to L1; (4) a later refinement with reducer Rt ⋈ Rp yields L3.]

Fig. 5: Example of the optimization of abstraction functions.

Example. Fig. 5 depicts the evolution from STOAT's abstraction function Ls to the text-only abstraction function (i.e., L3) discussed in Section II-C. As shown in Table I and Fig. 3a, suppose that APE visits the model action 〈{i = 0},∅〉 only twice, on Ti and Tj, and simulates two GUI actions to click widgets wi2 and wj1, respectively. Since clicking wi2 and wj1 leads to two different target states (i.e., P and W), APE can detect that 〈{i=0},∅〉 is non-deterministic. Suppose that APE prefers to use Ri ⋈ Rp to refine 〈{i=0},∅〉 and maps wi2 and wj1 to 〈{i=0}, {i=1}〉 and 〈{i=0}, {i=0}〉, respectively. Since the initial decision tree may incorporate user-defined rules, APE does not modify its nodes (i.e., the rectangle nodes) and instead inserts a new branch into Ls to create L1. L1 only temporarily eliminates the non-deterministic transitions, because APE cannot detect that 〈{i = 0}, {i = 1}〉 or 〈{i = 0}, {i = 0}〉 is non-deterministic while both have a single transition each, i.e., by clicking wi2 and wj1, respectively. APE will detect the non-determinism (see Fig. 3b) when there are more transitions. Suppose that APE further refines L1 to L2 with a finer reducer Ri ⋈ Rt ⋈ Rp. Similar to AMOLA C-Lv5, L2 cannot tolerate file reordering and leads to state explosion. Then, APE reverts L2 to L1 and blacklists L2. A later refinement will refine L1 to L3, the text-only abstraction discussed in Section II-C.

Algorithm 2: Update and optimize the model.

1  Function UptAndOptModel(M = (S, A, T, L), S, π, T′)
       Input: The new GUI tree T′, the model M = (S, A, T, L), the previous state S, and the previous action π.
       Output: The new model.
2      S′ ← L(T′)
3      M ← (S ∪ {S′}, A ∪ S′, T ∪ {(S, π, S′)}, L)
4      repeat M ← ActionRefinement(M, T′) until M is not updated
5      M ← StateCoarsening(M, T′)
6      return StateRefinement(M, (S, π, S′))

7  Function ActionRefinement(M = (S, A, T, L), T′)
8      foreach π′ ∈ L(T′) do
9          if |L⁻¹(π′) ∩ T′| > α then
10             R ← GetReducer(π′)
11             foreach R′ ∈ {R′ | R′ ∈ R ∧ R′ ̸⊐ R} do
12                 L′ ← L ∪ {R → R′} | L ∪ {(R, π′, R′)}
13                 Π ← {L′(σ) | σ ∈ L⁻¹(π′) ∩ T′}
14                 if |Π| > 1 then
15                     return RebuildModel(M, {L(T′)}, L′)
16     return M

17 Function StateCoarsening(M = (S, A, T, L), T′)
18     L′ ← GetPrev(L)
19     S ← {L(T) | T ∈ L′⁻¹(L′(T′))}
20     if |S| > β then return RebuildModel(M, S, L′) else return M

21 Function StateRefinement(M = (S, A, T, L), (S, π, S′))
22     foreach (S, π, S″) ∈ {(S, π, S″) | (S, π, S″) ∈ T ∧ S″ ≠ S′} do
23         foreach π′ ∈ S do
24             R ← GetReducer(π′)
25             foreach R′ ∈ {R′ | R′ ∈ R ∧ R′ ̸⊐ R} do
26                 L′ ← L ∪ {R → R′} | L ∪ {(R, π′, R′)}
27                 S1 ← {L′(T) | (T, σ, ·) ∈ L⁻¹((S, π, S′))}
28                 S2 ← {L′(T) | (T, σ, ·) ∈ L⁻¹((S, π, S″))}
29                 if S1 ∩ S2 = ∅ then
30                     return RebuildModel(M, S, L′)
31     return M

We use three requirements with thresholds α and β to measure the balance between model precision and model size.

1) No model action abstracts more than α GUI actions in a GUI tree, meaning that we should refine each model action until it abstracts fewer than α GUI actions.
2) Non-deterministic transitions should be as few as possible, meaning that we should keep optimizing the abstraction function until no non-deterministic transition can be eliminated.
3) No model state created by a previous L is refined into β new states by the refined version of L.

Algorithm 2 aims at balancing the model precision and size. We first introduce inverse mappings to facilitate explaining it.
• L⁻¹(π) is the set of GUI actions that are abstracted to π.
• L⁻¹(S) is the set of GUI trees that are abstracted to S.
• L⁻¹((S, π, S′)) is the set of GUI transitions that are abstracted to (S, π, S′).

Function UptAndOptModel first adds the new state S′ and the transition (S, π, S′) into the model. Then, it checks whether S′ and (S, π, S′) violate any requirement, and if so, attempts to optimize them to incrementally optimize the model via functions ActionRefinement, StateRefinement, and StateCoarsening. APE keeps using the current model if it fails to optimize S′ or (S, π, S′).

Action Refinement. Function ActionRefinement refines each model action that abstracts more than α GUI actions in the GUI tree T′ (i.e., |L⁻¹(π′) ∩ T′| > α). For such a π′, the function tries to find a new abstraction function that refines π′ into multiple model actions. Function GetReducer returns the reducer R that creates π′. Since any reducer R′ that is not coarser than R (i.e., R′ ̸⊐ R) may refine π′, a new abstraction function is created by either replacing R with such an R′ (denoted by L ∪ {R → R′}) or appending a new branch with a finer R′ to the output node of R (denoted by L ∪ {(R, π′, R′)}). If L′ refines π′ into multiple model actions, we replace L by L′ and rebuild the model accordingly. Since we only replace leaf nodes and append at most one branch at a time, no σ can be selected by multiple branches; the refined decision tree is still guaranteed to be a valid function. Function RebuildModel takes the current model, the set of affected states, and the new abstraction function as input; it removes all affected model actions, states, and transitions, and applies the new abstraction function to re-add states and transitions for all affected GUI trees and GUI transitions. Since the model may be rebuilt, S′ at line 8 may be stale. Hence, function ActionRefinement repeats until M is not updated, i.e., until no model action needs refinement.

State Coarsening. Function StateCoarsening (1) applies a previous L′ to T′ to obtain the old state L′(T′), (2) retrieves all GUI trees that can be abstracted to the old state, i.e., L′⁻¹(L′(T′)), (3) obtains the current states S of these GUI trees, and (4) reverts L to L′ if |S| > β, which means L is much finer than L′.

State Refinement. To eliminate non-deterministic transitions (S, π, S′) and (S, π, S″), function StateRefinement tries to refine each model action of S following the same steps as ActionRefinement, except that L′ must refine S into different states or refine π into different model actions. We have implemented the latter case but omit its details in Algorithm 2 since it is just a variant of ActionRefinement.

Parameter Selection. A model action will be refined if it leads to non-deterministic transitions, no matter what α is. The threshold α is set to 3, which works well in practice. If α is too large, the exploration strategy has to sample each model action more times; if α is too small, the testing tool cannot tolerate minor differences. The threshold β is calculated based on the number of primitive reducers of the finest reducer in the decision tree; its largest value is 8 by default.

Algorithm Decisions. When two refinements can both eliminate the non-deterministic transitions, we prefer the refinement that creates fewer states, and then fewer model actions, in the new model, e.g., we prefer L1 to L2 in Fig. 5. Since refinement may conflict with coarsening, a strictly balanced model under α and β may not even exist. When such a conflict arises, we prefer coarsening to suppress state explosion, because the corresponding GUIs have already been explored sufficiently by the time the explosion is detected.

D. Exploration Strategy

We combine random and greedy strategies with depth-first search. First, we found that some non-determinism can be tolerated by keeping the locality of the exploration. For example, APE can tolerate non-determinism caused by access time by repeatedly clicking the same file in Fig. 1. Hence, we discover connected sub-graphs and try to explore all model actions in a connected sub-graph before jumping to other states. Second, we greedily visit every newly added unvisited model action, where model actions are matched by their attribute paths. Third, we randomly visit other model actions, giving a higher priority to model actions that have not been visited or that abstract more GUI actions. When failing to replay a transition, APE detaches the target state to avoid unnecessary replaying attempts. When reaching the state again via random actions, APE re-attaches the state to the model. Hence, APE evolves not only the state abstraction but also the state connectivity.
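A hedged sketch of this action-selection policy (our simplification, not APE's actual scheduler):

import random

def select_action(actions, visits, abstracted_count):
    """actions: model actions available in the current state.
    visits[a]: how often a was executed; abstracted_count[a]: how many
    GUI actions a abstracts. Greedy on unvisited actions; otherwise a
    random choice biased toward actions abstracting more GUI actions."""
    unvisited = [a for a in actions if visits.get(a, 0) == 0]
    if unvisited:
        return unvisited[0]                 # greedy: exercise new actions first
    weights = [abstracted_count.get(a, 1) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]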

E. Implementation

APE is built on top of Monkey and runs on both emulators and real devices. It is designed to be a drop-in replacement for Monkey and requires no customization of either the app under test or the Android device. We have tested APE's compatibility on Android 6 and 7 devices because they dominated the market share when we started to develop APE [19]. GUI trees are dumped with the accessibility API [20], the same one used in [18]. The initial decision tree uses Rc for every widget. Initially, there is only one decision tree for all states. If a state needs refinement, we build a new decision tree for it in a copy-on-write manner. Hence, we can apply different abstractions to widgets with the same full attribute paths when necessary.

IV. EVALUATION

We evaluate APE on 15 widely-used apps from the Google Play Store and compare it with three state-of-the-art testing tools, i.e., Monkey, SAPIENZ [3] and STOAT [4]. On average, APE achieved higher code coverage and detected more crashes than all the others. Specifically, APE consistently resulted in 26–78%, 17–22% and 14–26% relative improvements over the other three tools in terms of average activity coverage, method coverage, and instruction coverage, respectively. Moreover, APE detected 62 unique crashes, while Monkey detected 44, SAPIENZ 40, and STOAT 31.

To further demonstrate APE's usability and effectiveness, we applied APE to test 1,316 apps in the Google Play Store for 30 minutes each. APE detected 537 unique crashes in 42 error types from 281 of the 1,316 apps. We reported 38 crashes to developers with steps to reproduce the crashes, where 13 have already been fixed, and 5 have been accepted (fixes pending).


A. Experimental Setup

Setup. All tools can test apps directly without any modification to the apps or to Android. However, the publicly available version of SAPIENZ only supports Android 4.4 emulators, and therefore we ran all comparative experiments on emulators. We ran SAPIENZ and STOAT on the same version of Android (i.e., Android 4.4) as described in [3], [4], respectively, and ran APE and Monkey on Android 6. All emulators were run on a MacBook Pro 2016 (Intel Core i7 3.3 GHz, 16 GB, macOS 10.13). We ran all tools with their default configurations. For Monkey and SAPIENZ, we additionally followed previous work [11], [21] to set a 200-millisecond delay. We reused artifacts released by STOAT and pushed them to emulators before testing each app. We followed previous work [3], [11], [12] to give all testing tools except STOAT one hour to test each app. We gave STOAT one hour for model construction and two additional hours for model-based fuzzing to assess its full capability. Each experiment was repeated five times. We report the average coverage and the total crashes over the five runs. Method and instruction coverage was collected every minute using our VM-based tracing tool.

Benchmark Collection. We selected 15 widely-used apps that are compatible with x86 emulators, as shown in Table II. Specifically, we first randomly selected 11 apps from the editor's choice collection of the Google Play Store,2 where four apps were also used by [21]. Second, we randomly selected another four well-maintained (i.e., updated since 2017) open-source benchmark apps used by SAPIENZ and STOAT.

TABLE II: Statistics of benchmark apps.

#  | Application      | Activity (#) | Method (#) | Instruction (#) | Install (#)
1  | Citymapper       | 55  | 70,182  | 1,126,836  | 5,000,000
2  | Duolingo         | 57  | 63,303  | 1,150,165  | 100,000,000
3  | Flowx            | 16  | 22,081  | 449,899    | 100,000
4  | Google Drive     | 87  | 55,130  | 1,233,004  | 1,000,000,000
5  | Google Translate | 35  | 36,126  | 787,527    | 500,000,000
6  | Nuzzel           | 42  | 32,587  | 533,618    | 100,000
7  | 30 Day Fitness   | 34  | 48,292  | 668,396    | 10,000,000
8  | Zillow           | 85  | 109,338 | 1,554,545  | 10,000,000
9  | Flipboard        | 69  | 49,535  | 985,299    | 500,000,000
10 | VLC              | 22  | 28,264  | 372,488    | 100,000,000
11 | Tunein Radio     | 52  | 90,865  | 1,480,064  | 100,000,000
12 | Amaze            | 6   | 41,475  | 752,809    | 500,000
13 | Book Catalogue   | 39  | 6,683   | 127,685    | 100,000
14 | Any Memo         | 30  | 56,474  | 669,027    | 100,000
15 | My Expenses      | 32  | 58,389  | 1,091,503  | 500,000
   | Total            | 661 | 768,724 | 12,982,865 |

B. Coverage of Benchmark Apps

As shown in Table III, APE is clearly the most effective tool among the four in terms of coverage. Specifically, APE achieved the best per-app coverage on almost all apps and consistently resulted in 26–78%, 17–22% and 14–26% relative improvements over the other three tools in terms of average activity coverage, method coverage, and instruction coverage, respectively. Moreover, the coverage of APE also grows much faster than that of the other three tools, as shown in Fig. 6. For STOAT, we report data for both the model construction phase and the entire testing.

2 https://play.google.com/store/apps/topic?id=editors_choice

TABLE III: Results on benchmark apps.

#  | Activity (%): Ape, Mo, Sa, St1 | Method (%): Ape, Mo, Sa, St1, St3 | Instruction (%): Ape, Mo, Sa, St1, St3 | Crashes (#): Ape, Mo, Sa, St1, St3
1  | 41, 35, 31, 26 | 45, 43, 38, 41, 42 | 37, 35, 30, 33, 34 | 0, 7, 1, 0, 0
2  | 24, 17, 22, 19 | 29, 26, 26, 25, 26 | 23, 20, 20, 19, 20 | 3, 0, 2, 0, 4
3  | 63, 53, 60, 28 | 44, 40, 39, 32, 34 | 38, 34, 32, 25, 26 | 1, 1, 1, 0, 4
4  | 30, 27, 26, 12 | 41, 39, 37, 37, 38 | 33, 31, 29, 29, 30 | 0, 2, 0, 0, 0
5  | 50, 44, 46, 37 | 34, 33, 30, 30, 30 | 25, 24, 23, 22, 22 | 1, 0, 1, 2, 2
6  | 27, 15, 26, 15 | 24, 17, 22, 20, 20 | 21, 14, 19, 17, 17 | 5, 3, 3, 0, 0
7  | 50, 28, 40, 30 | 22, 15, 19, 19, 20 | 20, 13, 17, 17, 17 | 0, 0, 1, 0, 0
8  | 30, 26, 19, –  | 21, 19, 15, –, –   | 19, 17, 13, –, –   | 5, 7, 2, –, 0
9  | 33, 18, 18, –  | 35, 26, 21, –, –   | 28, 20, 15, –, –   | 4, 0, 2, –, 0
10 | 47, 35, 37, 26 | 34, 29, 28, 26, 30 | 36, 31, 30, 27, 31 | 3, 3, 2, 1, 2
11 | 27, 20, 21, –  | 25, 21, 20, –, –   | 24, 20, 16, –, –   | 2, 1, 3, –, 0
12 | 67, 63, 60, 50 | 15, 14, 12, 12, 13 | 12, 11, 9, 9, 10   | 20, 16, 16, 2, 5
13 | 63, 43, 56, 17 | 28, 23, 20, 10, 12 | 25, 22, 18, 9, 11  | 0, 0, 0, 0, 0
14 | 76, 47, 71, 27 | 15, 11, 13, 10, 12 | 16, 11, 14, 10, 12 | 15, 1, 3, 3, 13
15 | 45, 39, 36, 30 | 18, 18, 14, 14, 15 | 13, 14, 10, 10, 11 | 3, 3, 3, 1, 1
Avg/Total | 39, 29, 31, 22 | 28, 24, 23, 24, 25 | 24, 21, 19, 20, 21 | 62, 44, 40, 9, 31

* Mo, Sa, and St are abbreviations for Monkey, SAPIENZ, and STOAT, respectively. St1 and St3 represent data from the model construction phase and the entire testing, respectively.

STOAT did not report covered activities during model-based fuzzing, because STOAT has an internal null intent fuzzer that directly starts activities with empty intents. In practice, STOAT can achieve 100% activity coverage on emulators. Due to the intrusive null intent fuzzing, STOAT always resulted in app-not-responding errors on three apps (#8, #9 and #11). Hence, the average coverage of STOAT counts 12 apps. All tools obtained relatively low coverage on some apps, e.g., apps #6–8 and #11–15. The reason is that a proper environment is needed to further improve the coverage. For example, app #15 requires premium user accounts to unlock certain functionalities.

[Figure 6 shows three line plots of instruction coverage over 60 minutes, for the Editor's Choices apps (1–11), the open-source apps (12–15), and all apps, with one curve per tool (APE, Monkey, SAPIENZ, STOAT).]

Fig. 6: Progressive instruction coverage. The X axis is time in minutes and the Y axis is instruction coverage.

C. Detected Crashes of Benchmark Apps

We counted only fatal errors [22] that crashed app processes during GUI exploration. This differs from the evaluation method of STOAT. First, STOAT counted as crashes many exceptions that do not terminate app processes, e.g., window-leaked exceptions [4], [23]. Second, STOAT also aimed to detect crashes triggered by null intent fuzzing [24] in addition to GUI testing. We believe that the class of crashes found by GUI testing is more important than that found by null intent fuzzing, because the former can usually be triggered with legitimate user interactions, whereas the latter has been largely prohibited by certain security mechanisms on real devices [25] and is mostly reproducible only on emulators. Moreover, the results of our evaluation method are also consistent with their latest study [26].


TABLE IV: Crashes detected by GUI testing.

Tool    | Java Exceptions: BR, UC, BA | Native Crashes: BR, UC, BA | All: BR, UC, BA
APE     | 161, 51, 9  | 67, 11, 5  | 228, 62, 11
Monkey  | 90, 34, 8   | 43, 10, 6  | 133, 44, 10
SAPIENZ | 101, 33, 10 | 102, 7, 7  | 203, 40, 13
STOAT1  | 47, 9, 5    | 0, 0, 0    | 47, 9, 5
STOAT3  | 315, 31, 7  | 0, 0, 0    | 315, 31, 7

* BR, UC, and BA are abbreviations for bug reports, unique crashes, and buggy apps, respectively.

We show statistics of crashes detected by GUI testing in Table IV. APE detected the most unique crashes by GUI testing. To identify unique crashes, we followed STOAT in using full normalized stack traces [4], i.e., stack traces [22] without irrelevant information such as exception messages or the pid of processes [27]. Since the benchmark apps were developed in Java, native crashes [27] were mostly triggered by defects of the framework rather than the apps. For example, almost half of the native crashes detected by APE, Monkey, and SAPIENZ were caused by a defect of the WebView [28].
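A normalization of this kind can be sketched as follows (an assumption-laden sketch, not the authors' exact rules): strip exception messages and pid-like tokens, keep the exception type and stack frames, and compare the result:

import re

def normalize_stack_trace(trace: str) -> str:
    """Sketch of crash de-duplication by normalized stack traces: drop the
    exception message and pid-like tokens, keep exception type and frames."""
    lines = []
    for line in trace.splitlines():
        line = re.sub(r"\(pid \d+\)", "(pid N)", line)  # mask process ids
        if ":" in line and not line.lstrip().startswith("at "):
            line = line.split(":", 1)[0]                 # drop the exception message
        lines.append(line.strip())
    return "\n".join(lines)

t1 = "java.lang.NullPointerException: msg A\nat com.example.Foo.bar(Foo.java:10)"
t2 = "java.lang.NullPointerException: msg B\nat com.example.Foo.bar(Foo.java:10)"
assert normalize_stack_trace(t1) == normalize_stack_trace(t2)  # same unique crash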

[Figure 7 shows two Venn diagrams: (a) on Android 6.0, 53 unique crashes were found only by APE, 9 by both APE and Monkey, and 35 only by Monkey; (b) on Android 4.4, 36 were found only by SAPIENZ, 4 by both, and 27 only by STOAT.]

Fig. 7: Pairwise comparison of unique crashes (stack traces).

[Figure 8 shows four bar charts, one per tool (APE, Monkey, SAPIENZ, STOAT), giving the number of crashes per exception type, e.g., NullPointerException, native crashes, ClassCastException, and OutOfMemoryError.]

Fig. 8: Distribution of crash types by GUI testing.

Fig. 7 depicts the pairwise comparison of unique normalized stack traces for tools on the same version of Android. We do not compare crashes detected by tools on different versions of Android via normalized stack traces, because different versions of Android have different Android framework code. In particular, Android 4.4 employs the Dalvik VM while Android 6.0 employs the ART runtime [29], and the two runtime environments have different thread entry methods. Based on the data in Figs. 7 and 8, each of the compared tools complements the others in crash detection and has its own advantages.

APE. APE can exhaustively exercise parts of the app locally and also discover more activities than the other tools with its dynamic GUI model. For app #14, APE detected more crashes than the others, because triggering these crashes requires navigating the file system thoroughly to search for particular files and then performing complicated interactions on those files. This requires the model to be fine enough; otherwise the search for the files may terminate prematurely. For app #10, APE detected a crash on an activity that was not discovered by any other tool in five runs.

STOAT. STOAT supports the most types of events and can find deep crashes via model-based fuzzing [4]. We plan to incorporate the powerful fuzzing strategy of STOAT into APE in future work. Due to a limitation of its implementation, STOAT did not detect crashes triggered via out-of-order events [4], unlike the other three tools.

Monkey. Monkey spent almost all of its testing budget on generating events, e.g., with no restarts and no model construction. Hence, Monkey detected some crashes that can only be triggered under certain stress. As shown in Fig. 8, only Monkey detected six OutOfMemoryError crashes.

SAPIENZ. SAPIENZ inherits from Monkey the same ability to trigger some particular types of crashes (Fig. 8). For example, Monkey and SAPIENZ found many ClassCastException and ConcurrentModificationException crashes, all of which were triggered via trackball and directional pad events. However, these crashes are less important, because trackballs and directional pads are generally not available on modern Android phones. Moreover, SAPIENZ aims at minimizing event sequences at testing time and thereby missed deep crashes that can only be detected via long, meaningful interactions [4]. Actually, one can reduce sequences of interest afterwards [14] without sacrificing the ability to detect deep crashes.

D. Comparative Analysis of Results on Benchmark Apps

1) Model-based vs. Model-free: A model enables APE to generate events effectively (e.g., avoiding clicks on non-interactive widgets), achieving higher activity coverage with fewer events than the other tools. On average, APE, SAPIENZ, and Monkey generated 21,819, 33,376, and 71,125 events per hour, respectively. STOAT is built on top of a high-level testing framework and thus did not report such events.
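As a rough illustration of why a model reduces the event count, a model-based tool can enumerate actions only for widgets marked actionable in the GUI tree, rather than firing events at arbitrary screen coordinates. The Widget type and its fields in this Java sketch are assumptions for illustration:

import java.util.*;

// Sketch only: derive model actions from actionable widgets.
public class ActionEnumerator {
    record Widget(String className, boolean clickable, boolean scrollable,
                  List<Widget> children) {}

    static List<String> modelActions(Widget root) {
        List<String> actions = new ArrayList<>();
        Deque<Widget> work = new ArrayDeque<>(List.of(root));
        while (!work.isEmpty()) {
            Widget w = work.pop();
            if (w.clickable()) actions.add("click " + w.className());
            if (w.scrollable()) actions.add("scroll " + w.className());
            work.addAll(w.children());  // traverse the whole GUI tree
        }
        return actions;
    }

    public static void main(String[] args) {
        Widget button = new Widget("Button", true, false, List.of());
        Widget list = new Widget("ListView", false, true, List.of(button));
        Widget root = new Widget("Root", false, false, List.of(list));
        System.out.println(modelActions(root));
        // prints [scroll ListView, click Button]
    }
}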

A model also enables APE to tolerate non-determinism related to coordinate changes, which model-free tools such as Monkey and SAPIENZ rarely can. To reduce non-determinism, SAPIENZ additionally cleared the local app data before replaying each test script. However, non-determinism caused by data in the file system or from servers remains a potential threat to replaying. Besides, SAPIENZ can easily be misguided due to its lack of connectivity information between GUIs. For example, app #8 has a welcome activity that requires users to customize the app. Since local app data were cleared, this activity appeared during every replay of every test script. Events that skip or finish all customizations must therefore be included in every test script, which is difficult to guarantee without connectivity information. In contrast, APE easily overcomes this limitation by traversing the model.
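Traversing the model amounts to a graph search over states and labeled transitions. Below is a minimal Java sketch of computing an event path to a target state, such as the screen past app #8's welcome activity; the state and action names are hypothetical, and BFS is one reasonable choice rather than necessarily APE's strategy.

import java.util.*;

// Sketch only: find a sequence of model actions from start to target.
public class ModelNavigator {
    // transitions: state -> (action label -> successor state)
    static List<String> pathTo(String start, String target,
                               Map<String, Map<String, String>> transitions) {
        Map<String, List<String>> paths = new HashMap<>();
        paths.put(start, List.of());
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            String s = queue.poll();
            if (s.equals(target)) return paths.get(s);
            for (Map.Entry<String, String> e
                    : transitions.getOrDefault(s, Map.of()).entrySet()) {
                if (!paths.containsKey(e.getValue())) {
                    List<String> p = new ArrayList<>(paths.get(s));
                    p.add(e.getKey());  // record the action taken
                    paths.put(e.getValue(), p);
                    queue.add(e.getValue());
                }
            }
        }
        return null;  // target unreachable in the current model
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> m = Map.of(
                "Welcome", Map.of("click Skip", "Main"),
                "Main", Map.of("click Files", "FileList"));
        System.out.println(pathTo("Welcome", "FileList", m));
        // prints [click Skip, click Files]
    }
}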


TABLE V: Statistics of models built by APE and STOAT.

                        APE                     STOAT
                  min   median     max    min   median     max
State (#)         131      347     725     12       87     517
Action (#)      1,277    6,531  16,141     21      820   1,307
Transition (#)    836    2,282   3,240     21      834   1,340

2) Dynamic Model vs. Static Model: APE first tries fine states and actions to explore the apps thoroughly, then coarsens states to avoid unnecessarily repeated exploration of the same GUI; it thus produced finer models than STOAT, as shown in Table V. Note that APE can detach model states to mitigate state explosion caused by certain non-determinism during systematic exploration (see Section III-D). Our results are consistent with previous work [12]: fine models are generally better than coarse models (see Table III).
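To convey the refine/coarsen intuition in code (the decision-tree mechanism itself is described in Section III), here is a deliberately simplified Java sketch; the thresholds and inputs are illustrative assumptions, not APE's actual criteria:

// Sketch only: if one model action is observed to lead to many distinct
// successor states, the abstraction is too coarse there and should be
// refined; if refining floods the model with near-duplicate states, it
// should be coarsened back.
import java.util.*;

public class GranularityCheck {
    static final int MAX_SUCCESSORS = 3;       // assumed refinement trigger
    static final int MAX_SIBLING_STATES = 20;  // assumed coarsening trigger

    enum Decision { REFINE, COARSEN, KEEP }

    static Decision decide(Set<String> successorsOfOneAction,
                           int statesUnderSameAbstraction) {
        if (successorsOfOneAction.size() > MAX_SUCCESSORS)
            return Decision.REFINE;   // one action, many outcomes
        if (statesUnderSameAbstraction > MAX_SIBLING_STATES)
            return Decision.COARSEN;  // abstraction fragments the model
        return Decision.KEEP;
    }
}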

E. Large-Scale Evaluation on Apps from Google Play Store

We additionally ran APE on real Nexus 5 phones to test 1,316 apps from the Google Play Store for 30 minutes each. The results are quite encouraging: APE detected 537 unique crashes in 42 error types from 281 of the 1,316 apps. We reported 38 crashes to developers with steps to reproduce them; 13 have already been fixed, and 5 have been accepted (fixes pending). These results show that APE is an industry-strength, practical tool for testing real-world apps.

TABLE VI: Distribution of major crash types.

Error Type                          #
NullPointerException              226
IllegalStateException              58
IllegalArgumentException           47
NumberFormatException              45
Native Crash                       33
RuntimeException                   27
ArrayIndexOutOfBoundsException     13
IndexOutOfBoundsException          12

Table VI presents the major crash types, i.e., those with more than 10 crashes each. Since we conducted the evaluation on real Nexus phones, the results exclude trivial or spurious crashes that can only be detected on emulators or by null-intent fuzzing, as well as crashes that can only be detected on third-party Android builds (e.g., on Samsung phones [30]) due to defects in third-party framework or library code.

V. RELATED WORK

We survey representative related work in this section. Ensuring the quality of mobile apps is challenging and involves manual work in practice [1]. Much existing work addresses specific problems of Android apps, such as context [31], [32], concurrent and asynchronous features [33], [34], fragmentation [30], [35], adverse conditions [36]–[38], and energy [39]–[41]. Android apps generally comprise various foreground and background components, which require different testing techniques. Hence, some testing tools focus on background services [42], while the majority aim at testing the GUI [7], [10], [12], [15], [17], [43]–[45]; many also have limited support for both [4], [43].

Many tools attempt to improve black-box testing by ripping the GUI [7], [43], [46]. Moreover, traditional model-based [12], [17], [44], symbolic-execution-based [10], [47], [48], and search-based approaches [9] are more guided but face scalability issues on large apps [11]. Recently, SAPIENZ, a novel search-based approach, has significantly advanced the state of the art and practice [3], [21], [49]. The success of SAPIENZ motivates us to leverage search-based techniques to improve APE's exploration strategy.

Many efforts have been devoted to mitigating these scalability issues, e.g., hacking the app or the framework to optimize performance [13], [36], [50]. For model-based testing, one can also apply a coarse-grained model [13], [15] or a probabilistic model [4], but a finer model is generally more effective [12]. APE applies a novel technique to adaptively refine and coarsen the model, inspired by counterexample-guided abstraction refinement [51]. Currently, APE uses basic properties (i.e., the numbers of model actions and states) to check counterexamples. We plan to incorporate GUI model checking [52], [53] in future work, e.g., checking counterexamples with temporal logic. With fine models, APE may identify overly many actions; this problem can be mitigated by action summary techniques [21], [54].

VI. CONCLUSION

This paper presents APE, a practical, fully automated model-based tool for effective testing of Android apps. APE has a general, decision-tree-based representation of abstraction functions for building a good testing model, as well as a novel evolution mechanism to dynamically and continuously update the model toward a good balance between model size and model precision.

APE consistently outperforms state-of-the-art techniques, which are either model-less or based on static testing models, on 15 widely used benchmark apps in terms of code coverage (14–78% relative improvements) and the number of detected bugs (61 unique crashes vs. 31–44). We have also applied APE to test 1,316 apps from the Google Play Store, resulting in 537 crashes; 38 were reported to developers with steps to reproduce them, of which 13 have been fixed and 5 have been confirmed.

Our evaluations demonstrate that APE is effective, practical, and promising. APE has been adopted by our industry partner and integrated into their testing process.

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable feedback. This work was supported by the National Key R&D Program of China (Grant No. 2018YFB1004805) and the National Natural Science Foundation of China (Grant No. 61690204). Tianxiao Gu, Qirun Zhang, and Zhendong Su were supported in part by United States NSF Grants 1528133 and 1618158, and by Google and Mozilla Faculty Research Awards.


REFERENCES

[1] M. E. Joorabchi, A. Mesbah, and P. Kruchten, "Real Challenges in Mobile App Development," in Proceedings of the 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2013), 2013, pp. 15–24.

[2] Android, "Android Testing Support Library," https://developer.android.com/topic/libraries/testing-support-library/index.html, accessed: 2018-04-01, 2018.

[3] K. Mao, M. Harman, and Y. Jia, "Sapienz: Multi-objective Automated Testing for Android Applications," in Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016), 2016, pp. 94–105.

[4] T. Su, G. Meng, Y. Chen, K. Wu, W. Yang, Y. Yao, G. Pu, Y. Liu, and Z. Su, "Guided, Stochastic Model-Based GUI Testing of Android Apps," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017), 2017, pp. 245–256.

[5] Android, "UI/Application Exerciser Monkey," https://developer.android.com/studio/test/monkey.html, accessed: 2018-04-01, 2018.

[6] "UI Events," https://developer.android.com/guide/topics/ui/ui-events, accessed: 2018-04-01.

[7] X. Zeng, D. Li, W. Zheng, F. Xia, Y. Deng, W. Lam, W. Yang, and T. Xie, "Automated Test Input Generation for Android: Are We Really There Yet in an Industrial Case?" in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), 2016, pp. 987–992.

[8] L. Clapp, O. Bastani, S. Anand, and A. Aiken, "Minimizing GUI Event Traces," in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), 2016, pp. 422–434.

[9] R. Mahmood, N. Mirzaei, and S. Malek, "EvoDroid: Segmented Evolutionary Testing of Android Apps," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), 2014, pp. 599–609.

[10] S. Anand, M. Naik, M. J. Harrold, and H. Yang, "Automated Concolic Testing of Smartphone Apps," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE 2012), 2012, pp. 59:1–59:11.

[11] S. R. Choudhary, A. Gorla, and A. Orso, "Automated Test Input Generation for Android: Are We There Yet? (E)," in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), 2015, pp. 429–440.

[12] Y.-M. Baek and D.-H. Bae, "Automated Model-based Android GUI Testing Using Multi-level GUI Comparison Criteria," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), 2016, pp. 238–249.

[13] T. Gu, C. Cao, T. Liu, C. Sun, J. Deng, X. Ma, and J. Lu, "AimDroid: Activity-Insulated Multi-level Automated Testing for Android Applications," in Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME 2017), 2017, pp. 103–114.

[14] W. Choi, K. Sen, G. Necula, and W. Wang, "DetReduce: Minimizing Android GUI Test Suites for Regression Testing," in Proceedings of the 40th International Conference on Software Engineering (ICSE 2018), 2018, pp. 445–455.

[15] W. Choi, G. Necula, and K. Sen, "Guided GUI Testing of Android Apps with Minimal Restart and Approximate Learning," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA 2013), 2013, pp. 623–640.

[16] Android, "Activity," https://developer.android.com/reference/android/app/Activity.html, accessed: 2018-04-01, 2018.

[17] W. Yang, M. R. Prasad, and T. Xie, "A Grey-box Approach for Automated GUI-model Generation of Mobile Applications," in International Conference on Fundamental Approaches to Software Engineering (FASE 2013). Springer, 2013, pp. 250–265.

[18] "UI Automator Viewer," https://developer.android.com/topic/libraries/testing-support-library/index.html#uia-viewer, accessed: 2018-04-01.

[19] Android, "Android Dashboard," https://developer.android.com/about/dashboards/index.html, accessed: 2018-04-01, 2018.

[20] "Accessibility," https://developer.android.com/guide/topics/ui/accessibility/, accessed: 2018-04-01.

[21] K. Mao, M. Harman, and Y. Jia, "Crowd Intelligence Enhances Automated Mobile Testing," in Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), 2017, pp. 16–26.

[22] "Android Crashes," https://developer.android.com/topic/performance/vitals/crash, accessed: 2018-04-01.

[23] "Window Leak," https://stackoverflow.com/questions/2850573/activity-has-leaked-window-that-was-originally-added, accessed: 2018-04-01.

[24] R. Sasnauskas and J. Regehr, "Intent Fuzzer: Crafting Intents of Death," in Proceedings of the 2014 Joint International Workshop on Dynamic Analysis (WODA) and Software and System Performance Testing, Debugging, and Analytics (PERTEA), 2014, pp. 1–5.

[25] "Export Activity," https://developer.android.com/guide/topics/manifest/activity-element#exported, accessed: 2018-04-01.

[26] L. Fan, T. Su, S. Chen, G. Meng, Y. Liu, L. Xu, G. Pu, and Z. Su, "Large-scale Analysis of Framework-specific Exceptions in Android Apps," in Proceedings of the 40th International Conference on Software Engineering (ICSE 2018), 2018, pp. 408–419.

[27] "Android Native Crashes," https://source.android.com/devices/tech/debug/native-crash, accessed: 2018-04-01.

[28] "Android WebView," https://developer.android.com/guide/webapps/webview, accessed: 2018-04-01.

[29] "ART and Dalvik," https://source.android.com/devices/tech/dalvik/, accessed: 2018-04-01.

[30] L. Wei, Y. Liu, and S. Cheung, "Taming Android Fragmentation: Characterizing and Detecting Compatibility Issues for Android Apps," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), 2016, pp. 226–237.

[31] C.-J. M. Liang, N. D. Lane, N. Brouwers, L. Zhang, B. F. Karlsson, H. Liu, Y. Liu, J. Tang, X. Shan, R. Chandra, and F. Zhao, "Caiipa: Automated Large-scale Mobile App Testing Through Contextual Fuzzing," in Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom 2014), 2014, pp. 519–530.

[32] K. Moran, M. Linares-Vasquez, C. Bernal-Cardenas, C. Vendome, and D. Poshyvanyk, "CrashScope: A Practical Tool for Automated Testing of Android Applications," in Proceedings of the 39th International Conference on Software Engineering Companion (ICSE-C 2017), 2017, pp. 15–18.

[33] L. Fan, T. Su, S. Chen, G. Meng, Y. Liu, L. Xu, and G. Pu, "Efficiently Manifesting Asynchronous Programming Errors in Android Apps," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 2018, pp. 486–497.

[34] J. Wang, Y. Jiang, C. Xu, Q. Li, T. Gu, J. Ma, X. Ma, and J. Lu, "AATT+: Effectively Manifesting Concurrency Bugs in Android Apps," Science of Computer Programming, vol. 163, pp. 1–18, 2018.

[35] H. Khalid, M. Nagappan, E. Shihab, and A. E. Hassan, "Prioritizing the Devices to Test Your App on: A Case Study of Android Game Apps," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), 2014, pp. 610–620.

[36] G. Hu, X. Yuan, Y. Tang, and J. Yang, "Efficiently, Effectively Detecting Mobile App Bugs with AppDoctor," in Proceedings of the Ninth European Conference on Computer Systems (EuroSys 2014), 2014, pp. 18:1–18:15.

[37] C. Q. Adamsen, G. Mezzetti, and A. Møller, "Systematic Execution of Android Test Suites in Adverse Conditions," in Proceedings of the 2015 International Symposium on Software Testing and Analysis (ISSTA 2015), 2015, pp. 83–93.

[38] Z. Shan, T. Azim, and I. Neamtiu, "Finding Resume and Restart Errors in Android Applications," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016), 2016, pp. 864–880.

[39] Y. Liu, C. Xu, S. Cheung, and J. Lu, "GreenDroid: Automated Diagnosis of Energy Inefficiency for Smartphone Applications," IEEE Transactions on Software Engineering, vol. 40, no. 9, pp. 911–940, Sept 2014.

[40] D. Li, Y. Lyu, J. Gui, and W. G. J. Halfond, "Automated Energy Optimization of HTTP Requests for Mobile Applications," in Proceedings of the 38th International Conference on Software Engineering (ICSE 2016), 2016, pp. 249–260.

[41] M. Linares-Vasquez, G. Bavota, C. E. B. Cardenas, R. Oliveto, M. Di Penta, and D. Poshyvanyk, "Optimizing Energy Consumption of GUIs in Android Apps: A Multi-objective Approach," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015), 2015, pp. 143–154.

[42] L. L. Zhang, C.-J. M. Liang, Y. Liu, and E. Chen, "Systematically Testing Background Services of Mobile Apps," in Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), 2017, pp. 4–15.

[43] A. Machiry, R. Tahiliani, and M. Naik, "Dynodroid: An Input Generation System for Android Apps," in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013), 2013, pp. 224–234.

[44] T. Azim and I. Neamtiu, "Targeted and Depth-First Exploration for Systematic Testing of Android Apps," in Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA 2013), 2013, pp. 641–660.

[45] A. Sadeghi, R. Jabbarvand, and S. Malek, "PATDroid: Permission-aware GUI Testing of Android," in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017), 2017, pp. 220–232.

[46] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon, "Using GUI Ripping for Automated Testing of Android Applications," in Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE 2012), 2012, pp. 258–261.

[47] C. S. Jensen, M. R. Prasad, and A. Møller, "Automated Testing with Targeted Event Sequence Generation," in Proceedings of the 2013 International Symposium on Software Testing and Analysis (ISSTA 2013), 2013, pp. 67–77.

[48] H. van der Merwe, B. van der Merwe, and W. Visser, "Execution and Property Specifications for JPF-android," SIGSOFT Software Engineering Notes, vol. 39, no. 1, pp. 1–5, Feb. 2014.

[49] N. Alshahwan, X. Gao, M. Harman, Y. Jia, K. Mao, A. Mols, T. Tei, and I. Zorin, "Deploying Search Based Software Engineering with Sapienz at Facebook," in International Symposium on Search Based Software Engineering (SSBSE 2018), 2018, pp. 3–45.

[50] W. Song, X. Qian, and J. Huang, "EHBDroid: Beyond GUI Testing for Android Applications," in Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), 2017, pp. 27–37.

[51] E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith, "Counterexample-Guided Abstraction Refinement," in Proceedings of the 12th International Conference on Computer Aided Verification (CAV 2000), 2000, pp. 154–169.

[52] M. L. Bolton, E. J. Bass, and R. I. Siminiceanu, "Using Formal Verification to Evaluate Human-Automation Interaction: A Review," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 43, no. 3, pp. 488–503, May 2013.

[53] M. B. Dwyer, V. Carr, and L. Hines, "Model Checking Graphical User Interfaces Using Abstractions," in Proceedings of the 6th European Software Engineering Conference Held Jointly with the 5th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE 1997), 1997, pp. 244–261.

[54] M. Ermuth and M. Pradel, "Monkey See, Monkey Do: Effective Generation of GUI Tests with Inferred Macro Events," in Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016), 2016, pp. 82–93.