
Mälardalen University
School of Innovation, Design and Engineering

Västerås, Sweden

Thesis for the Degree of Master of Science (60 credits) in Computer Science - Software Engineering, 15.0 credits

SYSTEM-LEVEL AUTOMATED TESTING FOR HOME DIGITAL

VOICE ASSISTANTS

Ismail Tlemcani
[email protected]

Examiner: Kristina Lundqvist
Mälardalen University, Västerås, Sweden

Supervisors: Eduard Paul Enoiu
Mälardalen University, Västerås, Sweden

September 2, 2020


Acknowledgments

At the beginning of this report, I would like to deeply thank my supervisor, Mr. Eduard Paul Enoiu, for all his guidance and supervision throughout the thesis. His recommendations, encouragement, and advice were very precious to me in this work and helped me avoid many mistakes that I might otherwise have fallen into. I am very thankful for his support during our meetings. Secondly, I would like to thank all the professors of my home university in Morocco, the "Institut National des Postes et Télécommunications", for all the knowledge and skills I gained during the years I spent there. Finally, I would like to thank the people who keep supporting me in all conditions: my parents and my brother, to whom I will be grateful for life.


Abstract

Home Digital Voice Assistants (HDVA) are devices that perform tasks based on voice commands. A normal user can use these devices to perform daily tasks like sending an email, playing a song, or checking for an event online, just to name a few. These systems have become very popular in recent years due to their ease of use and the evolution of their technology, which now handles many commands and is able to perform complex tasks. HDVA devices are nowadays also used in some critical cases, such as door opening and in some healthcare services.

On the other hand, software testing is an important verification and validation activity used to reveal software faults in systems that include a software part. This activity is used to make sure that the expected behavior of the system matches the actual software execution. This activity results in the creation of test cases that are run as scripts in an automatic way.

Because HDVA devices are nowadays used in some critical use cases, it is of utmost importance that these devices are thoroughly tested to make sure that they behave in the correct way. In this thesis, we first investigated the current automation testing frameworks for HDVA devices that exist on the market by doing a multivocal literature review. This is an important step in order to discover the existing frameworks on the market and therefore decide on the most appropriate research that can be carried out on these. After doing the multivocal literature review and listing the available automation testing tools for HDVA devices, we evaluated one tool selected from this review and assessed its usefulness and applicability for professionals and researchers in terms of ease of use and the resources it uses during test execution. During the evaluation, we focused on system testing and on automation testing tools for the Amazon Echo device, because of its popularity on the market and the great amount of resources available online on this device. The multivocal literature review showed that the Botium framework is the only framework available for testing the Amazon Echo device on a system level. We therefore took Botium as the framework to be evaluated and performed an evaluation of it from a test automation capability perspective. The evaluation was done on a virtual machine which was set up locally with the VMware software. The evaluation showed a slow test execution capability of the Botium tool. More studies are needed on testing the other popular HDVA devices and on the lower testing levels.


Table of Contents

1 Introduction
   1.1 Motivation
   1.2 Problem formulation
   1.3 Objectives
   1.4 Research Methodology
   1.5 Contributions

2 Background and Related Work
   2.1 Software Testing
      2.1.1 Levels of Testing
      2.1.2 Testing techniques
      2.1.3 System testing
      2.1.4 Test Automation and Manual Testing
   2.2 Home Digital Voice Assistants
      2.2.1 Introduction to HDVA Devices
      2.2.2 Differences between Chatbots and HDVA Devices
      2.2.3 Functional Features of Home Digital Voice Assistants
   2.3 Alexa Devices
      2.3.1 Alexa Voice Service
      2.3.2 The Echo Device: How Alexa Skills Work
      2.3.3 Faults and Vulnerabilities in Home Digital Voice Assistants
   2.4 Multivocal Literature Review
      2.4.1 Grey Literature
      2.4.2 Multivocal Literature Reviews
   2.5 Related Work

3 Research methodology
   3.1 Research Method
   3.2 Multivocal Literature Review
      3.2.1 Detailed description about the MLR steps
   3.3 Case Study Design
   3.4 Test Automation Tool Evaluation

4 Results
   4.1 Multivocal Literature Review Results
      4.1.1 Planning the MLR: Results
      4.1.2 Conducting the MLR: Results
      4.1.3 Review of the Test Automation Frameworks found from the MLR
   4.2 Automation Testing Framework Evaluation
      4.2.1 Introduction
      4.2.2 Architecture of the Botium Stack
      4.2.3 How is a Chatbot tested with Botium?
      4.2.4 Overall Comparison of the Botium Framework with Bespoken for System Testing on Alexa Devices
      4.2.5 Botium Test Cases
      4.2.6 Botium Package
      4.2.7 Test Runner Report Example
      4.2.8 System Characteristics
      4.2.9 Experiment: Botium Package
      4.2.10 Alexa Skills Under Test
      4.2.11 Metrics
      4.2.12 Results
   4.3 Discussions

5 Conclusions


References

6 Appendix A: MLR Results per Database


List of Figures

1   An example of a V-Model process used in software and system development
2   Overall flow of Black-Box Testing
3   Overall flow of White-box Testing
4   Generic workflow of testing requests for Alexa devices
5   The overall steps of the MLR process used in this thesis
6   Source selection Results
7   Architecture of the Botium Stack
8   How is a chatbot tested using Botium
9   Botium scripts
10  Test runner report for a passing test suite
11  Test runner report for a failing test suite
12  Characteristics of the Windows Machine during Evaluation
13  Characteristics of the Ubuntu virtual machine used in the experiment

List of Tables

1   Types of Alexa Skills
2   Questions to clarify the motivations to do the MLR process
3   Search strings used in the MLR process
4   Inclusion Criteria used in the MLR Process
5   Exclusion Criteria used in the MLR Process
6   Questions and answers to clarify the motivations to do the MLR process
7   Overall Results of the MLR
8   Comparison of Bespoken and Botium
9   Evaluation results for the United States quiz skill
10  Evaluation results for the High Low game skill
11  MLR Results


1 Introduction

Software testing is an activity in software engineering that checks whether a given software behaves as it should [1]. This activity is time consuming and usually expensive, and it is usually done with the help of automation tools that support the tester throughout the software testing process [2]. If the behavior of the software matches the expected behavior, we say that the test cases are passing; otherwise, we say that the test cases are failing. Automation testing frameworks are the tools that execute the test cases, and they are generally capable of producing reports that show the results of testing, including how many test cases have passed and how many have failed.

Software testing consists of the activities of validation and verification [3]. The validation process is mainly about checking whether the software captures the client requirements, and it involves executing the code to check that its output matches the expected output [3]. On the other hand, verification is the process of checking that the system was built in the right way [4]. The verification process is also about checking the non-functional aspects of the software and does not involve executing the code.

Software testing can also be categorized into two classes: black-box testing and white-box testing [3]. Black-box testing is about deriving the test cases from the specifications [3]. In black-box testing, the tester does not know the internal workings of the system under test [5]. The system under test is treated as a black box, and the test cases are derived only from the system specifications. Meanwhile, white-box testing is the testing of the internal structure of the software system [6]. In white-box testing, the tester must have access to the source code; the system under test is executed under certain inputs, and the objective is to try to cover the code in different ways [5].

As a special case of hardware-software systems, Home Digital Voice Assistants (HDVA) are physical devices that act as agents that can perform tasks based on voice commands. Normal users of these devices can use them to perform different daily tasks: ask their assistant a question, register an event on their calendar, send an email, play a song, check for an event online, pay for a flight ticket, and so on. HDVA devices are nowadays able to interpret human speech and respond via synthesized voices with great precision. These systems have become very popular in recent years due to their ease of use and the evolution of their technology, which now handles many commands and is able to perform many tasks. Nowadays, HDVA devices are also used in some critical use cases, such as door opening and in some healthcare services [7]. In 2019, some authorized partners already started building skills that send and receive sensitive health information [7]. For these reasons, it is of utmost importance that these devices are thoroughly tested to make sure that they behave in the correct way.

In this context, this thesis investigates the existing automation testing frameworks for HDVA devices by doing a multivocal literature review and performing an exploratory empirical evaluation. Using the results of the literature review, we evaluated one automation testing framework for HDVA devices in terms of its applicability and usability for researchers and professionals working in this field. We used the Amazon Echo device as a case study in our research because of its popularity [8] and the number of online resources available on the usage of this device.

1.1 Motivation

From the early days when computers were invented, understanding the natural language of humans has always been a dream, and this motivated many companies to develop the technologies behind voice assistants. Nowadays, home digital voice assistants are becoming more popular because of their ease of use and the advancements in their technology, which is now able to handle many tasks.

On the other hand, there is a lack of rigorous research on testing these devices and making sure they are not behaving in a way that could cause failures (e.g., harm to their users). As an example, searching the Google Scholar database with the phrase "automation testing framework for hdva device" returns only four results. The results of the multivocal literature review performed in this thesis will also show evidence supporting this claim. This motivated us to bring more knowledge on automation testing for voice assistants.

In addition, because of the introduction of these devices in some critical use cases [7], it is of utmost importance to carefully test the behavior of these devices in order to make sure that they are not causing any problems to their users.

1.2 Problem formulation

The control and intelligence of the Amazon Echo device is based on the software part inside the device that is responsible for its functions. The functions of the Echo devices are called Alexa skills. These are the functions that can, for example, answer a user's questions or perform some specific task asked by the user.

There are many approaches for developing an Alexa skill. However, few approaches have been proposed for automated testing of the skills developed by Amazon or the Alexa developers involved in its ecosystem. For this purpose, a review of the existing test automation frameworks for HDVA devices is performed. In addition, an evaluation of a testing framework for HDVA devices was done in the context of this thesis. The thesis surveyed the existing body of knowledge on test automation frameworks for HDVAs, particularly for Amazon Echo, the most popular HDVA on the market [8], and evaluated one automation testing framework for the Echo device in terms of applicability and usability for researchers and professionals working in this field.

The research questions answered in this thesis are:

• What are the existing and available automation testing frameworks for HDVA devices?

• How can an automation testing framework for HDVA devices be evaluated in terms of applicability and usability for professionals and researchers?

The objective of this thesis is to map the body of knowledge on automated test execution for HDVA devices and evaluate an existing automatic test execution framework for HDVA systems to identify the empirical evidence for, or against, its use in practice.

1.3 Objectives

The first objective of this thesis is to map the area of test automation frameworks for HDVA devices by doing a multivocal literature review on the existing tools for automation testing for HDVA devices. The second objective of this thesis is to evaluate an automation testing tool, taken from the results of the review, that can perform system testing for HDVA devices. The metrics used to evaluate the framework were inspired by the measures of usability as defined by ISO 9241-11: efficiency, effectiveness, and subjective satisfaction. The metrics we have chosen are:

• Time to execute the test cases. This proxy measure is used to measure the time needed for the tool to execute a given test set. The less time an automation testing framework takes to execute a test suite, the better its test execution efficiency in terms of performance and usability.

• The number of test cases that can be executed in parallel. This measure relates to the maximum number of test cases that can be executed at the same time using the automation framework. It is an indication of the size of the test suites that can be executed with the tool and of the projects under test for which the test automation framework can be used.

• User experience. We also evaluated the user experience of working with the test automation framework. This aspect encompasses the learning time spent to understand and master its basic and advanced features, as well as the time needed to configure, install, and use all the features of the tool.

• Fault detection capability, assessed by manually injecting faults into the test suite. This proxy metric relates to how the system behaves when given unusual inputs or when stressed in unusual ways. The fault injection process used was a random injection of faults at different places in the expected outcomes of the test suite (a minimal sketch of this idea is given below).
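To make the last metric concrete, the following sketch illustrates the fault-injection idea in TypeScript. It is an illustration only, not the exact procedure used in the evaluation: the TestCase shape and the corruption strategy are assumptions made for the example.

```typescript
// Minimal sketch of fault injection into a test suite, assuming a test
// case pairs an input utterance with the expected spoken response that
// serves as the test oracle.
interface TestCase {
  input: string;    // utterance sent to the device under test
  expected: string; // expected response used as the oracle
}

// Randomly corrupt the expected outcome of roughly `rate` of the cases;
// a correctly working tool should then report exactly those cases as failing.
function injectFaults(suite: TestCase[], rate = 0.2): TestCase[] {
  return suite.map((tc) =>
    Math.random() < rate
      ? { ...tc, expected: tc.expected + ' [INJECTED FAULT]' }
      : tc
  );
}
```

Comparing the number of injected faults with the number of failures the tool actually reports is one way to quantify its fault detection capability.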


1.4 Research Methodology

For the purpose of this thesis, and to answer the research questions, a multivocal literature review was performed to map the existing testing frameworks for HDVA devices. The second step of the work is an evaluation of one automation testing framework for the Echo device. The chosen framework was evaluated in terms of applicability, that is, whether it would allow a tester to create scripts containing a dictionary of test inputs and expected outcomes for validation. We were also interested in usability, which refers to the quality of the experience that the tester will have when using the software, as well as in the fault detection capability of the evaluated tool.

For the purpose of this study, we used the Amazon Echo as the HDVA under test. We decided to choose this device because of its popularity among HDVA devices on the market. Amazon Echo is in fact the most popular smart speaker on the market by a large margin [8].

The multivocal literature review had the objective of gathering the current knowledge about HDVA devices and their current state on the market, and then searching for the automation testing tools for HDVA devices. In this work, we followed the guidelines specified by Garousi et al. [9].

Overall, the steps we followed are as follows:

• Planning the research, setting the goals and the motivations of the research, and setting up the research questions.

• Conducting the research, which consists of setting the search keywords, setting the inclusion and exclusion criteria, conducting the search, gathering the first pool of results, applying the inclusion/exclusion criteria, and gathering the final pool of results that will be analysed.

• The last step is the analysis of the data obtained from the final pool of results and the extraction of the results needed to answer our research questions.

To evaluate the chosen testing framework, we identified a set of metrics inspired by a methodological framework written by a group of researchers who have defined several metrics necessary to evaluate an automation testing tool [10]. The chosen metrics are detailed in the Objectives section. In practice, we used the testing tool to be evaluated on a local machine, connected the tool to the Alexa service, selected random skills from the Alexa repository that could be tested with this framework, wrote and executed the test cases needed, and reported the results.
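To give an impression of what this looks like in practice, the sketch below drives a single test case through Botium's fluent API. It is a hedged sketch based on the public Botium documentation: the capability values, in particular the connector name for the Alexa service, are assumptions and may differ between Botium versions.

```typescript
// Hedged sketch of a Botium test run; the CONTAINERMODE value naming the
// Alexa connector is an assumption, not a verified configuration.
import { BotDriver } from 'botium-core';

const driver = new BotDriver({
  PROJECTNAME: 'hdva-evaluation', // arbitrary project label
  CONTAINERMODE: 'alexa-avs',     // assumed name of the Alexa AVS connector
});

driver
  .BuildFluent()
  .Start()
  .UserSaysText('open the united states quiz')  // simulated voice command
  .WaitBotSaysText((text: string) => console.log('bot:', text))
  .Stop()
  .Clean()
  .Exec()
  .then(() => console.log('test case passed'))
  .catch((err: Error) => console.error('test case failed:', err));
```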

1.5 Contributions

Firstly, this thesis focused on a multivocal literature review of the existing automation frameworks for HDVA devices in order to map the current knowledge on these tools.

Secondly, this thesis focused on an evaluation of an automation testing framework for the Amazon Echo device. The evaluation had the objective of testing the tool following the different criteria presented in the Objectives section.

We expect these contributions to be useful for professionals and researchers working in software testing, since the current knowledge around testing and test automation frameworks for HDVA devices is very limited. We argue that the results of the review are also useful for the companies developing skills for the Echo device and possibly for the team behind the framework evaluated in this thesis.


2 Background and Related Work

In this section, we explain the concepts that are used in this thesis, from software testing to HDVA devices, as well as the multivocal literature review guidelines used in this thesis work.

2.1 Software Testing

Software testing is an essential activity in the software engineering pipeline, and it has the purpose of finding potential defects or misbehaviors in the system under test in order to correct these by debugging. According to the IEEE standard [11], software testing is "the process of exercising or evaluating a system or system component by manual or automated means to verify that it satisfies specified requirements or to identify differences between expected and actual results".

This activity is usually done with the help of software that executes the system and compares the actual behavior with the expected behavior. Because the requirements involved in real software projects are complex, they are not easy to test manually, and therefore test automation is helpful to make sure that at each development step the software captures the expected needs.

2.1.1 Levels of Testing

Software testing is an important step in the software engineering process and cannot be removed or replaced in practice by other verification techniques. Generally, as shown in the V-model in Figure 1, the levels of testing can be classified as follows: unit testing, integration testing, system testing, and acceptance testing.

Figure 1: An example of a V-Model process used in software and system development.

These levels are generic and can be described as follows (a minimal code sketch contrasting the unit and system levels follows the list):

1. Unit testing: It is the level of testing where the smallest units of the software are tested. In unit testing, the tester checks the functions and components of the software to make sure that they produce the expected result in isolation.

2. Integration testing: It is the level of testing where individual components of the software are integrated and then tested as a whole, to make sure that their interaction is not causing any subsequent defects.


3. System testing: This level of testing is about testing the product as a whole and not just units in isolation. The objective of system testing is to test the overall functionality of the software from an end-to-end perspective.

4. Acceptance testing: It is generally used in the later stages of software development to test the complete product together with the client. It usually entails testing the acceptance of the business requirements by the client.
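As announced above, the following minimal sketch contrasts the two extremes of this list. The calculator function and the sendCommand interface are purely illustrative and not taken from any real system.

```typescript
// Unit level: the smallest unit (a single function) is checked in isolation.
function add(a: number, b: number): number {
  return a + b;
}
console.assert(add(2, 3) === 5, 'unit test: add(2, 3) should be 5');

// System level: the assembled product is exercised end to end through its
// external interface, represented here by an abstract sendCommand function.
async function systemTest(sendCommand: (cmd: string) => Promise<string>): Promise<void> {
  const answer = await sendCommand('what is two plus three');
  console.assert(answer.includes('five'), 'system test: answer should mention "five"');
}
```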

In this thesis we are focusing on system-level testing of HDVA devices, but some of the other testing levels are touched upon during the literature review and experimental evaluation.

2.1.2 Testing techniques

Testing as an activity can be divided into at least two categories: (i) checking that the specified behavior is correctly produced by the system under test, or (ii) checking the correct behavior of the internal structure of the system. The testing activity that only checks the conformance between the implementation and the specifications of the software is called black-box testing. The testing activity that checks the internal structure of the software can be classified as white-box or grey-box testing.

• Black-box Testing: It is a testing technique in which the internal logical structure of the software under test is not known to or used by the tester [5]. The overall flow of black-box testing is represented in Figure 2.

Figure 2: Overall flow of Black-Box Testing

Black-box testing can be categorized into at least two classes: functional and non-functional testing. Functional testing is about testing the functional features that the software is supposed to have. In other terms, functional testing focuses only on the functional inputs and outputs of the system without taking into account the internal structure of the system. On the other hand, non-functional testing is not directly related to the specific features that the software is supposed to have, but rather focuses on the other aspects that the system should have, such as safety, security, performance, and usability.

• White-box testing: It is a testing technique for testing the internal structure of an application, as opposed to testing only its functionality as in black-box testing. In white-box testing, the internal logic of the application is investigated [5]. The term white-box is used since, in this testing technique, the internal structure of the software is known to the tester, and therefore the tester can check what inputs are needed to cover the code from different perspectives. The overall flow of white-box testing is shown in Figure 3. When performing this kind of testing, the application code is used to guide the creation of test scenarios containing specific test case inputs as well as certain test case outputs (see the sketch after this list).
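The sketch below makes the contrast concrete; the classifyTemperature function is invented for the example. The black-box test is derived from a specification alone ("temperatures below zero are freezing"), while the white-box tests pick inputs, with knowledge of the code, so that every branch is executed.

```typescript
// Hypothetical function under test, used only for illustration.
function classifyTemperature(celsius: number): string {
  if (celsius < 0) return 'freezing';
  if (celsius < 25) return 'mild';
  return 'hot';
}

// Black-box: derived from the specification, without reading the code.
console.assert(classifyTemperature(-5) === 'freezing', 'spec: below zero is freezing');

// White-box: inputs chosen to execute every branch, including the
// boundaries between the three return statements.
for (const input of [-1, 0, 24, 25]) {
  console.log(`classifyTemperature(${input}) = ${classifyTemperature(input)}`);
}
```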

2.1.3 System testing

In software testing, according to the official ISTQB glossary [12], system testing is the level of testing at which the system as a whole is checked against its system specifications. At this level, the system is verified in terms of meeting both the functional and non-functional requirements specified in the requirements specification document. This level of testing is performed after integration testing and therefore takes as input all the integrated components that have passed the integration testing phase.


Figure 3: Overall flow of White-box Testing

Why is system testing important? The current software market is putting pressure on developers and testers, as well as on software organizations as a whole, to develop higher quality software. It is of utmost importance for testers to follow certain software testing processes in their projects to make sure that their product meets their clients' needs in all its aspects.

System testing is important because it is the first level of testing where the software system is globally tested. In fact, specific functions can work separately while their integration might not work as expected. In addition, system testing is important because it is the first level where the functional as well as the non-functional requirements are tested.

2.1.4 Test Automation and Manual Testing

The general goal of software testing can be defined as follows: to find bugs and errors in software as early as possible in the development process [13]. It is an expensive activity and "can require up to 50% of software development costs (and even more for safety critical applications)" [3]. One of the goals of this activity is to reduce the costs and human errors as much as possible [3].

Test automation is the activity that has the objective of automating the testing process with the use of effective automation tools [13]. This activity improves the quality of the software testing process and minimizes human intervention. The tools used in test automation can either be commercial tools or open source tools.

On the other hand, manual testing (also known as exploratory testing) uses the experience of the tester to find errors in the parts of the system that are more prone to contain them [14]. In manual testing, the tester manually creates and executes the test steps on the software with the objective of finding errors.

Test automation and manual testing are complementary [14]. Manual testing is effective for capturing the special cases that might contain errors, but it cannot give extensive coverage of all the possibilities because of the testing costs associated with executing a large number of test cases. Conversely, automated test cases are good at providing good coverage of all the possible test cases, but they might not be as good as the ones created by manual testing at capturing certain special cases. In this thesis, we focus on test automation tools used by testers, since this practice has been shown to be of benefit for testing highly complex software systems.

2.2 Home Digital Voice Assistants

Home digital voice assistants are devices equipped with software that has the ability to interpret human speech and respond via synthesized voices [15]. A normal user can use these devices to search for any information one might otherwise look for in a regular search engine. A user can also program these devices to perform a certain task, like sending an email or sending a message to a contact at a specific time. Another use of home digital voice assistants is related to interaction with the physical environment (e.g., opening a door, turning on the home TV, or linking the software of the device with a real physical object).

Apple’s Siri, Amazon’s Echo, Microsoft’s Cortana, and Google’s Assistant are the most popularvoice assistants nowadays and they can be used in smartphones or in smart home speakers.


2.2.1 Introduction to HDVA Devices

Intelligent home digital voice assistants have opened to normal users a new world of communication and interaction. Creating devices that can communicate with humans in their natural language has always been a great challenge for computer science professionals. In fact, from the time computers appeared on the market, humans were able to communicate with these machines only by typing text. It is only in recent years that the possibility of communicating with a machine by voice has gained popularity. In addition, the advancements in voice recognition technology, as well as in different areas of machine learning and artificial intelligence, have made these machines capable of learning new tasks by having access to a large number of examples (similarly to how humans learn). This has made these intelligent devices more powerful and able to respond to many requests.

2.2.2 Differences between Chatbots and HDVA Devices

Home digital voice assistants evolved from Chatbots, which are software systems designed to answer human requests, typically using text. HDVA devices extend these Chatbots and can be seen as cyber-physical devices. The following differences between Chatbots and voice assistants [16] are useful from a testing point of view:

• Chatbots are typically text-based applications and are usually able to respond only to an already existing set of questions, while voice assistants usually have more extended features. They can not only recognize what the user is saying but also understand the meaning of what the user is saying, with the help of machine learning algorithms implemented on those devices.

• Chatbots are designed to respond to limited use cases and usually do not have sophisticated algorithms implemented in their controller. Virtual assistants have a wider scope and are nowadays able to perform complicated tasks, like choosing the best product to buy according to a defined set of criteria or controlling electronic devices in a user's room.

2.2.3 Functional Features of Home Digital Voice Assistants

Even though each voice assistant is different, they share some basic functions and are generally able to perform the following tasks [15]:

• send and read text messages, make phone calls, send and read email messages.

• answer basic queries, like giving the current time or giving information about the current weather for a certain location.

• create a list or perform basic math calculations.

In addition to these features, the software in these voice assistants is known as a skill (i.e., in the Alexa ecosystem), which is responsible for performing the specific tasks requested by the users. Skills are generally available in the online store of the manufacturer, and users can enable these skills for their device. As an example, a skill can be an entertainment quiz asking the user questions about the geography and history of the United States of America or Sweden. Another example would be a skill that works as a game. In the case of the Google Home device, the software-related part of the device is named an action.

The skills or actions responsible for the device logic are usually built by third-party developers, in a similar way to how web applications are developed. The company behind the home digital voice assistant device usually provides a platform for developing the skill from start to end. This is the case for the Amazon Echo device, the Google Home device, and many other major manufacturers.
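To illustrate what such third-party skill code looks like in the Alexa ecosystem, the sketch below shows a minimal custom-skill request handler using the ask-sdk-core package for Node.js. The intent name and the response wording are invented for the example, and the SDK surface may differ between versions.

```typescript
import * as Alexa from 'ask-sdk-core';

// Handler for a hypothetical QuizIntent; the intent name and the spoken
// text are illustrative, not taken from a published skill.
const QuizIntentHandler: Alexa.RequestHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'QuizIntent';
  },
  handle(handlerInput) {
    return handlerInput.responseBuilder
      .speak('Here is your first question: what is the capital of Sweden?')
      .reprompt('What is the capital of Sweden?')
      .getResponse();
  },
};

// The skill entry point wires the handlers together; it is typically
// deployed as an AWS Lambda function behind the skill's endpoint.
export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(QuizIntentHandler)
  .lambda();
```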

The features of the voice assistants are available on smartphones and on physical smart hardware devices. In the case of Google Home, its features are integrated into Android devices and can be installed as a separate application on a phone (e.g., an iPhone). In the case of the Amazon Echo device and the Microsoft Cortana device, their features are integrated into separate Android and iPhone apps.


2.3 Alexa Devices

As examples of HDVA systems, there are three types of Alexa devices: Amazon Echo, Amazon Tap, and Echo Dot [17]. All Amazon Alexa devices work with a cloud-based voice service which analyzes the voice commands. The voice service decodes the voice command and translates it into text. These text commands are then processed by an Alexa skill, which responds accordingly.

2.3.1 Alexa Voice Service

The Alexa voice service is part of the Amazon services that are built around the Alexa devices to support voice commands. The Alexa voice service was first introduced with the Echo device. To interact with a smart device, a user or tester must say the wake word "Alexa" and then say a certain command. The device then sends the user's commands to the Alexa voice service, which is the cloud service responsible for processing the voice commands and returning the response. The Alexa voice service is also responsible for making requests to external third-party services if needed. The general functionality of the Alexa voice service is shown in Figure 4.

Figure 4: Generic workflow of testing requests for Alexa devices

The generic workflow explains how a tester or a user can input a voice command into the Alexa device. The test inputs are interpreted as voice commands that are recognized. The actual output can be seen as a voice answer given by the Alexa device to the user.
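The following sketch models this request/response cycle at a conceptual level. It is only a model of the workflow in Figure 4: the real Alexa Voice Service is a cloud HTTP API, and every interface and function name here is invented for illustration.

```typescript
// Conceptual model of the Figure 4 workflow; not the real AVS API.
interface VoiceService {
  recognize(audio: Uint8Array): Promise<string>;     // speech-to-text
  dispatchToSkill(command: string): Promise<string>; // skill produces a reply
  synthesize(reply: string): Promise<Uint8Array>;    // text-to-speech
}

// A test input (voice command) flows through recognition, skill dispatch,
// and synthesis; the returned audio is the actual output heard by the user.
async function handleUtterance(avs: VoiceService, audio: Uint8Array): Promise<Uint8Array> {
  const command = await avs.recognize(audio);        // e.g. "what time is it"
  const reply = await avs.dispatchToSkill(command);
  return avs.synthesize(reply);                      // spoken answer played back
}
```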

2.3.2 The Echo Device: How Alexa Skills Work

In this section, we describe the specific HDVA device under study in this thesis (i.e., the Alexa/Echo device). This device provides a set of capabilities to its users. These capabilities are referred to as Alexa skills. As this device is a home digital voice assistant device, it can respond to user requests in sophisticated ways. The following question is important when developing and testing Alexa skills: how are skills created? New Alexa skills are developed using the Alexa developer console, which is a special platform made by Amazon for the development of Alexa skills. When a developer wants to create a new skill, and depending on the features of the skill one is planning to create, the developer has the possibility of choosing from a set of predefined models that will make the skill creation easier. The following aspects are recognized as important when developing skills:

• Types of skills: According to the official Alexa skill documentation [18], a developer can choose from a list of different types of skills, depending on what the skill is meant to perform. The types of skills available to Alexa skill developers [18] are shown in Table 1.


Purpose of the skill | Type of skill used
A skill that has the objective of controlling home devices such as lights, doors, and so on. | Smart Home Skill API
A skill that has the objective of controlling cloud-enabled video services. | Video Skill API
A skill that has the objective of providing flash briefing content. | Flash briefing skill API
A skill that has the objective of enabling users to select or listen to audio content. | Music Skill API
A skill that can do any type of requests. | Custom skill

Table 1: Types of Alexa Skills

• Interaction model: For complete control over the user experience, the developer of the Alexa skill should define the interaction model [19]. It consists of:

– The name the user will use to call the skill. This is the invocation name of the skill.

– The requests the skill can handle. These are defined as intents.

– The words the user should say to invoke those requests. These are the sample utterances.

– The visual or touch interactions the user can have with the skill, in the case of Alexa-enabled devices.

Because the development of Alexa skills is not directly related to the purpose of this work, we will not go into further details on this topic and leave this for future work.

2.3.3 Faults and Vulnerabilities in Home Digital Voice Assistants

While voice assistants have interesting features for users, they can exhibit failures and security problems that need to be taken into account when testing [17]. Many of the security issues in home digital voice assistants are due to the open nature of their input voice channel [17]. In fact, any person can turn on the device by just saying the wake word, and therefore anyone can start interacting with the device and may access some sensitive information about the holder of the device. Since the authentication in home digital voice assistants relies on a single authentication factor, these devices can also accept voice commands even from people who are not in the close space of the device. In fact, a team of researchers has shown how to perform remote attacks on an Alexa device and even how to place a fake order on Amazon.com [17]. This is indeed a serious issue, since the voice assistant usually contains sensitive information about the device holder, like the email account, phone number, and even credit card information.

Usual vulnerabilities in Alexa devices. The reception of the voice commands is done by the Alexa voice service. However, the Alexa voice service has insecure access control because it relies only on voice commands. A group of researchers has identified two vulnerabilities related to Alexa's access control and one vulnerability in Alexa-enabled smart devices [17]. These vulnerabilities can be categorized as follows:

• Weak Single-factor Authentication. This vulnerability is related to the fact that the authentication method employed in Alexa devices uses a limited set of wake words. By default, the wake word is "Alexa". In the Alexa mobile application or on the Alexa website, it can be changed to either "Amazon", "Echo", or "Computer". Therefore, any other user can make a request to an Alexa device by only trying those four possible options. Moreover, the Alexa voice service does not yet have the ability to recognize the voice of the device owner.

• No Physical Presence based Access Control. Alexa devices do not have any verification system that checks the physical presence of the person sending a request. The Alexa device can still accept voice requests even if no people are around. Therefore, a request coming from a user who is not even in the physical vicinity of the device might be successfully executed on the device.

• Insecure Access Control on Alexa-enabled devices. Alexa owners can control the devices enabled by Alexa by just saying the name of the object. To enable a smart door, a user can, as an example, just say "My Door". It is possible to replace the names of the smart objects, but it is generally easy for a malicious user to guess the voice command that controls a smart object.

2.4 Multivocal Literature Review

A Multivocal Literature Review (MLR) is a form of review which includes grey literature taken from non-academic sources, such as blog posts and white papers, in addition to the formal academic sources published in journals or conferences [9].

Multivocal literature reviews are known to be useful for both researchers and practitioners because, in addition to providing the current state-of-the-art knowledge of a given subject, they provide practical knowledge of the subject under research obtained from non-academic sources.

2.4.1 Grey Literature

There exist many definitions of grey literature. The most common interpretation defines it as follows: "grey literature is produced on all levels of government, academics, business and industry in print and electronic formats, but which is not controlled by commercial publishers, i.e., where publishing is not the primary activity of the producing body" [20]. Grey literature is therefore an important way to broaden the scope of research on a certain topic, simply because informal resources by far exceed the formal literature sources.

Since the grey literature comprises a wide number of diverse, non-controlled sources, carrying out a grey literature review can turn out to be a difficult task to manage, because many of the sources will lack some important bibliographic information, such as the author and the publisher of the content [21]. However, studies such as the one we use as a guideline [9] provide researchers with a detailed process on how to conduct a multivocal literature review in general and a grey literature review in particular. The detailed steps of the process that we followed in this work are explained in the Research methodology section.

2.4.2 Multivocal Literature Reviews

A Multivocal Literature Review (MLR) is a study which is performed when there is not sufficient knowledge about a certain subject in the scientific literature. For this reason, this process recommends using all the accessible knowledge on a topic [22]. By using a variety of sources, the researcher often gets a clearer and wider view of the topic, and therefore the experiments and conclusions will all be, to some extent, more appropriate and precise.

Because the sources that are involved in multivocal literature reviews include a lot of documents and pages that do not follow a review process or are not intended for commercial purposes, the researcher must follow a specific process while doing a multivocal literature review to remove all the non-appropriate sources as well as all the non-relevant sources. The process we are following is based on the work of Garousi et al. [9], which presents a detailed process for conducting a multivocal literature review in software engineering. The first phase of this process consists of establishing the need for conducting an MLR. This is needed when the available sources about the topic of interest are not sufficient. After establishing the need for conducting an MLR, the researcher must set the goal of the MLR and set the research questions to investigate. The next phase of the process is to start performing the MLR, and in this phase, the researcher must set a list of databases to use as well as a list of search keywords. After conducting the search, a pool of results will be available to the researcher, and a list of criteria must then be applied to this pool of results to exclude the non-relevant sources and keep only the more appropriate and more relevant results. The last step consists of extracting from the final list of results the data of interest in order to answer the research questions.

2.5 Related Work

The first step of this thesis work is about performing a multivocal literature review around the existing automation testing frameworks for HDVA devices. We used for that purpose the guidelines for performing a multivocal literature review in software engineering suggested by Garousi et al. [9]. Vahid Garousi and Mika Mäntylä applied these guidelines to conduct a multivocal literature review answering the question of what should be automated in software testing [23]. The work done in this thesis was performed based on these steps.

In the field of automated testing, a group of researchers has established a methodological framework for setting the necessary metrics for evaluating an automation testing tool [10]. The metrics we presented in the Objectives section are what we used to evaluate the automation testing tool.

Regarding the automation testing tool used, Vahid Garousi, Mika Mäntylä, and Päivi Raulamo-Jurvanen have conducted a grey literature review on the criteria chosen by practitioners when adopting a test automation tool [24]. In the context of this thesis work, the choice of the tool to evaluate was based solely on the research questions and the objective of the thesis, which was to evaluate an available automation testing tool for system testing of the Echo device.

In addition, there is some research on evaluating conversational agents and voice applications in both the academic literature and the grey literature sources. As an example, Ruane et al. [25] and Kaleem et al. [26] both propose new frameworks for testing the quality of conversational agents and assessing the response of the frameworks in different contexts. These two frameworks were evaluated using custom Chatbots built for the purpose of those experiments. Another work, by Soodeh Atefi and Mohammad Amin Alipour [27], proposes a new automation testing framework for conversational systems built using the Python language; the paper explains the challenges that the developers faced in the development of such a system. In addition, the paper of Kulvinder Panesar [28] from York St John University evaluated a linguistically motivated conversational software agent with the objective of investigating the accuracy of the dialogue that the conversational software agent can carry out.

Overall, we observed from the related work that there is a lack of scientific evidence on testing HDVA devices and on the use of automated test frameworks for voice assistants.


3 Research methodology

In this chapter, we will introduce the steps followed in the multivocal literature review, as well as the criteria used in the evaluation of the automation testing tool using a case study method.

3.1 Research Method

The first step performed in this work was a multivocal literature review on HDVA devices in general and on automation testing frameworks for HDVA devices in particular. The following research question guided the exploration of the area of test automation for HDVA devices:

• What are the existing and available automation testing frameworks for HDVA devices?

The specific device this work focuses on is the Echo platform, since it is the most popular HDVA device on the market. According to a recent report, nearly 70% of smart speaker owners use Echo devices [8]. The objective of this literature review is to examine the existing work on testing HDVA devices and then evaluate the tooling in realistic scenarios. The results of this literature review are shown in the Results section.

The second step of this work consisted of evaluating one automation testing framework for the Echo device. The evaluation used the set of metrics that were defined in the Objectives section. In addition, the characteristics of the machine used in the experiment and the test execution data obtained are presented in the Results section.

3.2 Multivocal Literature Review

In our multivocal literature review, we used the guidelines proposed by Garousi et al. [9]. The process presented in this paper is shown in Figure 5.

Figure 5: The overall steps of the MLR process used in this thesis¹

The goal of the multivocal literature review performed in this thesis is to map the knowledge around testing HDVA devices. The motivation behind doing this research is mainly the limited amount of knowledge available around that subject. The research question the multivocal literature review answers is:

¹ Process outlined in Garousi, V., Felderer, M., and Mäntylä, M. (2017). Guidelines for including the grey literature and conducting multivocal literature reviews in software engineering.


• What are the existing and available automation testing frameworks for HDVA devices?

Planning the MLR: Prior to conducting the multivocal literature review, we ensured that conducting such a work was necessary. In particular, we identified and reviewed the existing research about the phenomenon of interest. In our case, the phenomenon we are trying to understand is the testing of HDVA devices, and in particular the existence of automation testing frameworks for HDVA devices (i.e., for the Echo device).

Following the guidelines of Garousi et al. [9], the next step was to answer a list of questions related to conducting a multivocal literature review in the area of test automation for HDVA devices. The list of questions we answered is shown in Table 2.

Question | Possible answers
Is the subject complex and not solvable by considering only the formal literature? | Yes/No
Is there a lack of volume or quality of evidence, or a lack of consensus of outcome measurement in the formal literature? | Yes/No
Is it the goal to validate or corroborate scientific outcomes with practical experiences? | Yes/No
Is it the goal to challenge assumptions or falsify results from practice using academic research or vice versa? | Yes/No
Would a synthesis of insights and evidence from the industrial and academic community be useful to one or even both communities? | Yes/No
Is there a large volume of practitioner sources indicating high practitioner interest in the topic? | Yes/No

Table 2: Questions to clarify the motivations to do the MLR process

Conducting the MLR: After setting the motivations for conducting the MLR process and making sure that it is useful in our context, the MLR process should be conducted. The process for performing a multivocal literature review [9] is composed of the following five steps:

• Search Process.

• Source Selection.

• Study Quality Assessment.

• Data Extraction.

• Data Synthesis.

The results obtained from each of these steps are presented in the Results section. In the following subsection, we give a detailed description of each step.

3.2.1 Detailed description about the MLR steps

We explain in this section the steps of the MLR process according to the MLR guidelines forconducting an MLR literature review in software engineering [9]. We follow in detail the followingsteps:


• Search process: This first step is about choosing the appropriate databases to search for information. According to our base guidelines [9], the appropriate databases are those that properly and systematically allow the researcher to answer the research questions. Because MLR research involves searching the grey literature in addition to the formal literature, the researcher has to choose appropriate databases for both grey and formal literature. We used the following two types of databases:

– A generic web search engine: We used the Google web search engine to search for the grey literature sources related to our subject. The main reasons for choosing this particular database are the generic range of topics it covers and its very high performance [29]. This database is commonly used in grey literature studies and is a valid choice for grey literature in software engineering [9].

– Specialized databases and websites: Searching for formal research papers on HDVA devices is an important task, since these sources are important for researchers and practitioners. Our work was performed within the Software Testing Research Group at Malardalen University [30]. The databases we used to search for academic research papers on the subject of this thesis are Google Scholar and IEEE Xplore. We chose Google Scholar because of its high precision and its ability to retrieve a large number of relevant results from a variety of sources [31]; in addition, it contains many relevant articles in computer science. We also used IEEE Xplore, since it is one of the major databases for computer science research [32], to improve the completeness of our search process.

• Source selection: This step of the MLR process is about performing the search on the chosen databases with fixed settings and then selecting only the most appropriate information from the list of extracted results. The details of each part of the source selection step are explained below:

– The first step is choosing the search strings to be used on the databases selected in the previous step. In our case, we conducted a search on the selected sources (i.e., the Google web search engine, Google Scholar, IEEE Xplore) with the search strings shown in Table 3.

A | B | C
Test | Framework | HDVA; Amazon Alexa; Conversational Agent; Voice assistant; Google Home; Google Nest; Apple Home; Amazon Echo

Table 3: Search strings used in the MLR process (the terms in columns A and B were combined with at least one of the terms in column C)

– The second step is fixing the search settings to be used on the chosen databases. We selected only articles from 2014 onwards because of our focus on the Alexa device, which was released in 2014. For our thesis work, the search settings used are:

∗ On the Google Scholar database: The following settings were used in the Google Scholar advanced search:

· With all of the words: Test Framework

· With at least one of the words: “HDVA” “Amazon Alexa” “Conversational Agent” “Voice assistant” “Google Home” “Google Nest” “Microsoft Cortana” “Apple Home” “Amazon Echo”.


· Return articles dated between: 2014 - 2020

We excluded patents and citations from these results.

∗ On the IEEE Xplore database: The search conducted on the IEEE Xplore database was based on the following command:

((“Full Text & Metadata”: Test Framework) AND ((“All Metadata”: HDVA) OR (“All Metadata”: Amazon Alexa) OR (“All Metadata”: Conversational Agent) OR (“All Metadata”: Voice assistant) OR (“All Metadata”: Google Home) OR (“All Metadata”: Google Nest) OR (“All Metadata”: Microsoft Cortana) OR (“All Metadata”: Apple Home) OR (“All Metadata”: Amazon Echo)))

We restricted this search to articles from 2014 onwards as well.

∗ On the Google web search engine: We used the following settings in the advanced search of the Google search engine:

· all these words: Test Framework

· Any of these words: HDVA Alexa ConversationalAgent VoiceAssistant GoogleHome GoogleNest Cortana AppleHome AmazonEcho

· Language: English

– The third step is choosing the inclusion and exclusion criteria for keeping the relevant results and excluding the non-relevant ones. Searching online databases such as the regular Google web search engine or Google Scholar provides a large number of results, so it is important to set criteria to retrieve only the most relevant ones.

For instance, according to our base guidelines for conducting MLR research [9], the date of publication, the author’s reputation, and the journal’s reputation are criteria that can help a researcher judge the quality of a source. One benefit of applying inclusion/exclusion criteria is that the more sources a researcher can exclude with these criteria, the less time one has to spend evaluating the quality of each source. The inclusion/exclusion criteria we used in our research are shown in Tables 4 and 5.

Inclusion criteria:

Criteria related to the authors:
– The authors or the institution involved in the document are well known.

Criteria related to the content of the document:
– The date of publication of the document is recent.
– The content is related to software testing of HDVA devices.
– The content is related to automation testing frameworks for HDVA devices.

Criteria related to the tool used:
– There is evidence of demonstration of the usage of the tool.
– The tool can be evaluated.

Table 4: Inclusion Criteria used in the MLR Process


Exclusion criteria:

Criteria related to the content of the source:
– The content is not related to software engineering.
– The content is not related to software testing.

Criteria related to the date of publication:
– The date of publication of the document is old.

Criteria related to the tool used:
– No tool is presented.

Table 5: Exclusion Criteria used in the MLR Process

– The last step is fixing the stopping rules that, in addition to the inclusion and exclusion criteria, are used to keep only the most relevant results from the list provided by the chosen databases.

There are several reasons to set stopping rules in an MLR process. Firstly, the number of results provided by the databases is large; as an example, typing “automation testing framework” into Google Scholar returns more than one million results, so a stopping rule is needed because of the practical impossibility of going through all the results. Secondly, many results are similar to each other, and going through a lot of similar results does not bring additional value to the research. Thirdly, the relevance and quality of the results decrease with the number of pages; the most relevant results are located on the first pages, so there is no point in searching all the pages, since many of the later results are not relevant.

By following the guidelines of Garousi et al. [9], we devised the following stopping rules:

1. Theoretical saturation: We noticed that, at a given point, no new concepts emerged anymore.

2. Effort bounded: We included only the first 100 results, which is a common practice in MLR research [9].

At the end of this step, the researcher obtains a list of relevant sources. The next step to perform is the quality assessment step.

• Quality assessment of sources: The quality assessment of the sources is about determining “the extent to which a source is valid and free from bias” [9]. Because the publication process for grey literature is less controlled than for formal literature, where each publication goes through a controlled review process [9], grey literature sources are generally more difficult to assess [9].

The authors of the base guidelines for conducting MLR research in software engineering [9] use a list of quality assessment items to assess the results obtained in the source selection step, and these items can be reduced or extended to fit the goal of the research. The items proposed by Garousi et al. [9] are similar to the ones we used in the inclusion/exclusion criteria, and we therefore consider the obtained results to already be relevant to our research.

• Data extraction: The data extraction step consists of setting the attributes used to extract the relevant data from the pool of results. These attributes should allow the researcher to answer the research questions. In our case, the attributes were chosen according to our base guidelines [9], based on past knowledge of the subject and the objective of the research. Since the objective of the MLR is to map the body of knowledge around existing automation testing frameworks for HDVA devices, the attributes were chosen for that purpose.

We used the following attributes:


– The name of the test automation framework

– The device under test: the physical device containing the software that is tested by the automation testing framework. The Echo device and the Google Home device are examples of such devices of interest for this work.

– The level of testing performed by the test automation framework: the levels of testing refer to the different testing activities involved in a software development life cycle [3], and we were interested in knowing the exact testing level at which the tool works.

– Level of automation: this refers to the part of the testing process that is automated by the test automation framework.

In addition, for a successful data extraction process, the MLR guidelines [9] suggest mapping the research questions to the chosen attributes and extracting as much quantitative and qualitative data as needed to sufficiently address each research question. In our case, the research question for which the attributes were chosen is:

– What are the existing available automation testing frameworks for HDVA devices?

The next step of the process is the data synthesis step, in which we synthesize the obtained data to answer the research question.

• Data synthesis: This last step of the MLR process is about providing a synthesis of the results in a form suitable for answering the research questions. According to the MLR guidelines [9], the right way to do the data synthesis depends on the research question. We collected the results in tabular form by doing a qualitative and quantitative analysis of the obtained results. The data synthesis results are shown in the Results section.

3.3 Case Study Design

A case study is an empirical research method that aims to investigate a given phenomenon in its specific context [33]. In computer science, this method is performed to achieve a comprehensive understanding of a technology, tool, or method through an extensive description and analysis of the object under study.

In computer science, the case study methodology is used mainly when it is not possible to separate the context from the phenomenon of interest, or when a researcher wants to investigate some theory or tool in a specific setting [34]. It is also difficult to generalize the results of case studies, since the research is conducted in specific settings and these very specific conditions influence the phenomenon under study [34].

Software engineering is a process that involves the design, development, and maintenance of various software artifacts. Research in software engineering mainly aims at understanding how these activities are conducted by software engineers and the other humans involved in the process [33]. This makes the case study method well suited for software engineering research, because the subjects under study are tied to their context and cannot be studied in isolation [33].

For this thesis work, the second step was an evaluation of the available automation testing tools that can perform system testing for the Alexa device. This is a case study with the objective of evaluating a given framework following the set of metrics mentioned in the Objectives section. According to the guidelines for conducting case study research in software engineering by Per Runeson and Martin Host [33], five steps need to be performed:

1. Case study design: This step is about defining the objective of the case study, the subject under study, and how the research will be conducted. In our case, the objective is to evaluate an automation testing tool capable of doing system testing for the Alexa device. The subject under study is the tool that is evaluated following the set of metrics defined previously in the Objectives section. The evaluation is conducted by running the tool on a set of test suites and reporting the results.


2. Preparation for data collection: This step describes the procedures used to collect data. In our case, the execution time measurements are collected from the report generated by the test automation framework; an example of the report returned by the tool is shown in Section 4.2.7. The maximum number of test cases is determined by experimenting with test sets of different sizes and observing at which test suite size the tool stops working. The third aspect evaluated relates to our user experience with the tool; since this information is subjective, we use the learning time spent understanding the tool and its features, as well as the time needed to install and configure the required tools. The fault detection measurements are collected by running a test suite containing faults at random places and checking whether the tool is capable of detecting them.

3. Collecting evidence: This step describes the execution of the experiment on the system under study. In our case, it is about executing the tool on the created test suites.

4. Analysis of the collected data: This step is about analysing the collected data to obtain evidence answering our research question. It was performed after collecting all the data from all the experiments on the object under study.

5. The last step is about reporting the results and the conclusions drawn from the collected data in a form suitable for answering the research question.

3.4 Test Automation Tool Evaluation

There exists a need in the software engineering industry for a set of guidelines on how to compare software testing tools [10]. A group of researchers has defined a methodological framework for deriving the criteria that any testing practitioner can use in the evaluation of a testing tool [10]. This methodological framework was inspired by the measures of usability defined in ISO 9241-11, which focus on efficiency, effectiveness, and subjective satisfaction. According to this research [10], the metrics used in the evaluation of a testing framework should answer the following questions:

• How does the tool contribute to the effectiveness of testing (fault-finding capabilities) when it is used in real settings?

• How does the tool contribute to the efficiency of testing (e.g., learning time, test execution time) when it is used in real settings?

• How satisfied are the testing practitioners during the learning, installation, configuration, and usage of the tool when it is used in real settings?

In our evaluation of the automation testing framework obtained from the MLR research, we chose metrics based on the methodological framework [10] mentioned above. As mentioned in the Objectives section, the chosen metrics are:

• Time to execute the test cases.

• The maximum number of test cases that can be executed in parallel.

• A description of our personal experience using the automation testing framework. The criteria that we took into consideration for this aspect were described in the Objectives section.

• Fault detection capability by manually injecting faults in the test set.

In this thesis, we used the MLR search process to map the current state of practice in automated testing of HDVA devices and to obtain the data needed to select a suitable framework for further experimentation.


4 Results

In this section we show the results of our multivocal literature review and of the experimental evaluation of the automation testing tool for HDVA devices.

4.1 Multivocal Literature Review Results

In this section, we present the results of the work done in the multivocal literature review process.

4.1.1 Planning the MLR: Results

This step is about clarifying the motivations behind conducting a multivocal literature review. Following the guidelines of Garousi et al. [9], we answered the list of questions related to conducting a multivocal literature review mentioned in the Research Methodology section. The questions and answers are shown in Table 6.

Question | Possible answers | Answer
Is the subject complex and not solvable by considering only the formal literature? | Yes/No | Yes
Is there a lack of volume or quality of evidence, or a lack of consensus of outcome measurement in the formal literature? | Yes/No | Yes
Is it the goal to validate or corroborate scientific outcomes with practical experiences? | Yes/No | Yes
Is it the goal to challenge assumptions or falsify results from practice using academic research or vice versa? | Yes/No | No
Would a synthesis of insights and evidence from the industrial and academic community be useful to one or even both communities? | Yes/No | Yes
Is there a large volume of practitioner sources indicating high practitioner interest in the topic? | Yes/No | Yes

Table 6: Questions and answers to clarify the motivations to do the MLR process

Explanations regarding the answers: In this paragraph we explain the reasons behind our answers to each of the questions that help a researcher establish the motivations for doing an MLR:

• Is the subject complex and not solvable by considering only the formal literature? This is the case for the subject of this multivocal literature review, since little research has been done on automation testing frameworks for HDVA devices. As stated in the Motivation section, searching the Google Scholar database for the sentence “automation testing framework for hdva device” returns only 4 results. The results of this MLR also support this statement.

• Is there a lack of volume or quality of evidence, or a lack of consensus of outcome measurement in the formal literature? The answer to this question is also yes: little research has been done on this subject, and there is therefore a lack of evidence regarding the subject of this multivocal literature review.


• Is it the goal to validate or corroborate scientific outcomes with practical experiences? Since this thesis work includes an evaluation of one automation testing tool for the Echo device, the answer is yes.

• Is it the goal to challenge assumptions or falsify results from practice using academic research or vice versa? Our goal in this research is not to challenge or falsify any previous result or assumption. The goal is to map the knowledge around the available automation testing tools for HDVA devices, and therefore the answer is no.

• Would a synthesis of insights and evidence from the industrial and academic community be useful to one or even both communities? The results of the MLR will be useful for software testing researchers, and we also think they will be beneficial to the people involved in the Alexa ecosystem.

• Is there a large volume of practitioner sources indicating high practitioner interest in the topic? The first reason for answering yes is the importance of software testing for everyone working in software engineering [35]. The second reason is that this thesis work was done in the software testing laboratory of a well-known university [30], and the interest in this topic was communicated to us by the people working in this laboratory. In addition, most of the work done on voice assistants has focused on user experience [36], and few attempts have been made at testing voice assistants [25].

4.1.2 Conducting the MLR: Results

1. Search process: This step is about choosing the appropriate databases to look for information. As stated in the Research Methodology section, we used the Google web search engine to search for grey literature sources related to our subject, and the IEEE Xplore and Google Scholar databases for the academic research papers related to the subject of this thesis.

2. Source selection: The source selection step is about performing the search with the right settings, as explained in the Research Methodology section. After searching with the search strings and settings explained there, we obtained 7770 results from the Google Scholar database, 249 results from the IEEE Xplore database, and 167 results from the regular Google web search engine. We applied to those results the inclusion/exclusion criteria explained in the Research Methodology section.

• Stopping rules: Following the guidelines of Garousi et al. [9], the stopping rules we set are:

(a) Theoretical saturation: we noticed that no new concepts emerged after 80 results.

(b) Effort bounded: we included only the first 100 results, following a common practice of authors in MLR research.

These stopping rules were applied to the first pool of results we obtained.

• First pool of results: After conducting the search on the chosen databases and applying the inclusion/exclusion criteria, we obtained 11 relevant results from the regular Google web search engine and two relevant results from Google Scholar. After removing the duplicates, we did not find any relevant result in the IEEE Xplore database. The results are shown in Appendix A. An overview of the source selection step is shown in Figure 6.


Figure 6: Source selection results

3. Study quality assessment: Because the inclusion/exclusion criteria we used are similar to the list of items proposed by Garousi et al. [9], the results we obtained follow those recommendations and are therefore relevant to this MLR research.

4. Final pool of results: After obtaining a pool of relevant results, we conducted the data extraction and data synthesis steps explained in the Research Methodology section to obtain the final results of this process and answer the research question. The results are shown in Table 7, and the tools are described in more detail in Section 4.1.3.


Test automation framework | Device under test | Level of testing | Level of automation
Botium [37] | Chatbots in general and Alexa devices in particular | Unit testing, end-to-end testing and system testing | Automated test execution
Amazon Alexa simulator [38] | Alexa device | Simulation tool | Automated response in the console
Bespoken [39] | Alexa device, Google Home and chatbots in general | Manual testing, unit testing, end-to-end testing and system testing | Automated test execution
Jovo framework [40] | Alexa device and Google Home | Unit testing | Automated test execution
BrianMacIntosh's Alexa-Skill-Test-Framework [41] | Alexa device | Unit testing | Automated test execution
Google Actions simulator [42] | Google Home | Simulation tool | Automated response in the console
Microsoft Bot Framework emulator [43] | Microsoft Cortana | Simulation tool | Automated response in the console

Table 7: Overall Results of the MLR.

4.1.3 Review of the Test Automation Frameworks Found in the MLR

This section describes each of the tools found through the MLR process. The availability of the tools, the level of testing that each tool targets, and the limitations of these tools are explained here. We also explain the reasons behind our choice of which automation testing framework to evaluate within this thesis work.

• Botium Tool Set: This set of tools can perform testing for chatbots in general and voice applications in particular. The company behind the tool supports a free version and several commercial versions with different features [37]. The free version can perform system testing for the Alexa device, and it is therefore a valid choice to be evaluated in our thesis work. We explain the features of this tool in detail in the next section.

• Amazon Alexa Simulator: The free tool developed by Amazon for Alexa skill developers to simulate the interaction with the custom skills they develop [38]. It is provided in the Test section of the developer console, and the input and output are displayed directly in the console. This tool is made for simulating the interaction only with custom skills developed inside the console; it cannot simulate an interaction with an external skill enabled by an Alexa user, because a normal user can only enable an arbitrary Alexa skill on his device and has no access to the code of the skill. The tool also does not offer the possibility to write test scripts that include expected outputs. Because the objective of this work is to evaluate a tool that does system testing for any skill, this tool is not a good candidate for this case study.

• Bespoken: A set of tools developed by the private company “Bespoken”. The company offers a free version that can perform manual testing and unit testing, and several commercial plans offering different features [39]. The tool offers a user interface from which the tester can run tests. The tool performs automated testing as follows [44]:

– Turning the script into audio using text-to-speech technology.

– Sending the audio to the voice service of the smart device.

– Converting the audio response to text and displaying it in the dashboard.

System testing is only available in the commercial version, and therefore Bespoken cannot be evaluated in this work.


• Jovo Framework: A set of tools for building and testing Alexa skills and Google Home actions. The company behind the tool offers an open-source version as well as commercial plans supporting different features [40]. The free version cannot perform system testing of Alexa skills, and therefore it is not a suitable candidate to be evaluated in this work.

• BrianMacIntosh's Alexa-Skill-Test-Framework: An open-source tool made for testing the individual intents of an Alexa skill. The tool does not perform system testing of an Alexa skill, and therefore it is not a suitable candidate to be evaluated in this work.

• Google Actions Simulator: The tool developed by Google to simulate the interactions with a custom Google action. The tool offers an interface that can be used to simulate the interaction with the hardware device and its settings. Since this tool cannot be used with the Alexa device, we cannot evaluate it in this work.

• Microsoft Bot Framework Emulator: A tool developed by Microsoft to simulate the interactions with custom Cortana skills. This tool targets Cortana-based devices and is outside the scope of this thesis.

4.2 Automation Testing Framework Evaluation

After performing the multivocal literature review and based on the qualitative review of these frameworks, we identified that the only available framework that can be used for system testing on the Echo device is the Botium framework. We present this framework in this section, as well as the component used in this work.

4.2.1 Introduction

Botium is a framework “for testing and training conversational AI” [45]. It is a set of open-source components that help chatbot developers test their chatbots.

4.2.2 Architecture of the Botium Stack

As mentioned earlier, the Botium stack is composed of several components, each with a different role. All components in the Botium stack are built on top of the Botium Core component [46]. The architecture showing how the components of the Botium stack interact with each other is shown in Figure 7.

Figure 7: Architecture of the Botium Stack

The Botium stack: The components of the Botium stack are [45]:


• Botium Core: The component providing the core functionality. All other components are built on top of Botium Core.

• Botium CLI: The command line interface for interacting with Botium Core and each of its features. Botium CLI does not have a graphical interface, but it provides a set of commands to use every feature of Botium.

• Botium Bindings: As shown in Figure 7, Botium Bindings is the component that links Botium to an external testing framework responsible for running the test scripts. After running a test script containing a set of test cases, the test runner outputs a summary of the execution showing the failing test cases, if any.

• Botium Box: The complete software product that includes all the Botium components. Botium Box provides a graphical interface that can be used to interact with every feature of Botium.

4.2.3 How is a Chatbot tested with Botium?

As stated previously, the core component in the Botium stack responsible for executing the tests is Botium Core. It receives the test scripts and transforms them into the correct format to be sent to the voice service. Botium Core is also responsible for receiving the answer from the voice service, comparing it to the expected output, and producing the test report. Figure 8 shows the process [44] used in this thesis, from the creation of test data to the generation of test reports.

Figure 8: How a chatbot is tested using Botium

4.2.4 Overall Comparison of the Botium Framework with Bespoken for System Testing on Alexa Devices

From the results of our MLR research, we identified the available automation testing frameworks for HDVA devices and found that the Botium tool and the Bespoken tool are the only two that can perform system testing for the Alexa device. Since Bespoken is a commercial tool, we show in this section a brief comparison of the two tools, based on data collected during our MLR, such as the comparison made by the company “Emtec Digital” in one of their blog posts [47], and on our own research on these tools. Some details of this comparison are shown in Table 8.


This comparison is based on qualitative data collected in terms of the cost of acquiring the framework, ease of use, time needed to configure the tooling, the state of the documentation, the number of features, and the technical support.

Feature | Bespoken | Botium
Cost of acquisition | Commercial product | Open-source product
Ease of use | Easy to use | Relatively hard to use
Time needed to install and configure the tool | Low | Significant
Documentation | Good documentation | Lacks good documentation; needs to be improved
Features supported | High | Low
Technical support | High | Low

Table 8: Comparison of Bespoken and Botium

These results suggest that Botium is a good alternative to Bespoken, although it is not easy to use without training and the time needed to configure the tool can increase the test session preparation time. In addition, the documentation lacks many details and can be improved. In the next sections, we use Botium to further evaluate its features in practice.

4.2.5 Botium Test Cases

In Botium, the test cases are described in scripts that capture the user interaction with the chatbot or the smart device under test. These scripts are saved in files called convo files. An example of a Botium test set is shown in Figure 9.

Figure 9: Botium scripts

Detailed description of a test case for a Botium script: According to the book “Introduction to Software Testing” [3], a test case is composed of “the test case values, expected results, prefix values, and postfix values necessary for a complete execution and evaluation of the software under test”. In the case of a Botium script, we instantiate this definition as follows:

• The test case values are the input values necessary to complete the execution of the test set on the software under test. In a Botium script, these are written after the marker “#me”, as shown in Figure 9.

• The expected results are the results that should be produced if the program satisfies its intended behavior. In a Botium script, these are written after the marker “#bot”, as shown in Figure 9.


• The prefix values are any inputs necessary to put the software into an appropriate state to receive the test case values. For a Botium script testing an Alexa skill, no prefix values are needed.

• The postfix values are any inputs that need to be sent to the software after the test case values are received. For a Botium script testing an Alexa skill, no postfix values are needed either. A complete convo file sketch is shown after this list.
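To make the convo file format concrete, the following is a minimal sketch of a convo file for a quiz-style skill. The convo name and the utterances are hypothetical illustrations, not the actual test data used in this experiment; the first line of a convo file names the conversation, and the #me and #bot sections then alternate between user input and expected response.

    TC01 - open the quiz skill

    #me
    open united states quiz

    #bot
    Welcome to the United States Quiz Game!

Saved under a name such as tc01_open_quiz.convo.txt in the convo directory configured for Botium, such a file is picked up and executed as one test case; additional #me and #bot pairs can be appended to script longer conversations.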

4.2.6 Botium Package

The Botium package used during this experiment is Botium Bindings [48]. It is a connector used to link the Botium Core software to a test runner such as Mocha [49], the JavaScript test runner used in this experiment. We also used the freely available Botium connector for Alexa skills [50].
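To show how Botium Bindings plugs the convo files into Mocha, the following is a minimal sketch based on the public botium-bindings documentation; the project name is a placeholder, and the connector-specific capabilities of botium-connector-alexa-smapi (such as the skill ID, locale, and Skill Management API authentication) must be added to the capabilities object but are omitted here rather than guessed.

    // spec/botium.spec.js: registers every convo file found by Botium as a Mocha test case
    const bb = require('botium-bindings')
    bb.helper.mocha().setupMochaTestSuite()

The accompanying botium.json selects the Alexa connector [50] through the CONTAINERMODE capability:

    {
      "botium": {
        "Capabilities": {
          "PROJECTNAME": "alexa-skill-system-test",
          "CONTAINERMODE": "alexa-smapi"
        }
      }
    }

Running Mocha on the spec directory then executes all convo files against the skill and prints the report described in Section 4.2.7.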

4.2.7 Test Runner Report Example

The Mocha test runner, which is responsible for running the Botium test cases, produces results showing the files containing the executed test suites and the time it took to run them. If a test suite contains a failing test case, the test runner reports it, showing the mismatch between the expected output and the actual output.

As an example, the results produced by the test runner for a passing test suite are shown in Figure 10.

Figure 10: Test runner report for a passing test suite

For a test suite containing failing test cases, the results produced by the test runner are shown in Figure 11.


Figure 11: Test runner report for a failing test suite

Overall, one can see that the results are actionable: a tester can execute multiple test cases and debug the code using the information in the logging output. We used this test runner report for executing multiple test suites and collecting the measurement data.

4.2.8 System Characteristics

The machine used in this experiment is an Ubuntu 20 virtual machine, running in VMware [51] on a Windows 8.1 laptop. The characteristics of the Windows machine are shown in Figure 12. We report these characteristics for reproducibility purposes, since they influence the execution time.

Figure 12: Characteristics of the Windows Machine during Evaluation

The characteristics of the Ubuntu 20 virtual machine are shown in Figure 13. The amount of memory available can influence the test execution data collected during the experiment.


Figure 13: Characteristics of the Ubuntu virtual machine used in the experiment

4.2.9 Experiment: Botium Package

As stated previously, the objective of evaluating the Botium tool is to assess its efficiency, effectiveness, and usefulness for software engineers and researchers working in software testing, as well as for practitioners involved in the Alexa ecosystem. More specifically, we are interested in evaluating the efficiency and effectiveness of using the Botium tool to test an arbitrary Alexa skill. The experiment conducted on the Botium package consisted of the following steps (a shell sketch of the install-and-run flow is given after this list):

• Installing and configuring the Botium Bindings package on the local machine.

• Choosing random, publicly available Alexa skills that can be tested with the Botium Bindings package.

• Configuring the test runner on our local machine.

• Defining a set of test suites, executing the test runner, and reporting the results for the execution time metric and for the maximum number of test cases that can be executed at the same time.

• Injecting random faults at different places in the expected outcomes and reporting the fault detection metric.
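Assuming a Node.js environment, the following shell sketch summarizes the install-and-run flow behind these steps; the package names are the public npm packages referenced in Section 4.2.6, while the directory layout mirrors the configuration sketch given there.

    # install Botium Bindings, the Alexa SMAPI connector and the Mocha test runner
    npm install --save-dev botium-bindings botium-connector-alexa-smapi mocha

    # with the convo files and botium.json in place, execute the whole test suite
    npx mocha ./spec

The execution times and test verdicts are then read from the Mocha report, as described in Section 4.2.7.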

4.2.10 Alexa Skills Under Test

The Alexa skills tested in this experiment are described as follows:

• United States Quiz skill [52]: An Alexa skill that users can use as a quiz about the United States and the states' capitals. Users can also ask the device about a state or a capital, and the skill replies with relevant information about it.

• High Low Game skill [53]: In this skill, the Alexa device chooses a random number between 0 and 100 and the user has to guess it. Alexa tells the user whether each guess was higher or lower than the random number.

We argue that these are realistic skills that can show how applicable a test automation framework is for testing HDVA devices.


4.2.11 Metrics

The metrics used in the experiment are:

• Time to execute test cases. This was collected directly from the Botium test results.

• The maximum number of test cases that can be executed at the same time. This is a proxy measure for test parallelization that can show the efficiency of a test automation framework.

• A description of our personal experience using the tool. The aspects that we took into consideration were explained in the Objectives section.

• Fault detection capability by manually injecting faults in the test set. We randomly injected faults at different places in the expected outcomes.

4.2.12 Results

United States Quiz Skill results: The results of the evaluation on the machine used show that the maximum number of test cases that could be executed at the same time was 22. The rest of the results are shown in Table 9. It is worth mentioning that the time needed to execute the test suites in this experiment was between 28 and 59 seconds, which shows that, for this quiz skill, the automation framework can take quite a lot of time when testing larger numbers of test suites. In addition, we show the results of the fault detection process, in which 100% of the faults added to the expected outcomes were found. Given that these are rather simple artificial faults, more studies are needed to assess a more realistic fault detection scenario.

Test suite | Faults introduced in the expected outcomes | Number of test cases | Time needed to execute the test set | Fault detection (1/0)
1 | None | 10 | 33.45 s | 0
2 | In test case No. 10 | 10 | 34.400 s | 1
3 | None | 16 | 47.84 s | 0
4 | In test case No. 16 | 16 | 42 s | 1
5 | None | 20 | 52.63 s | 0
6 | In test case No. 10 | 20 | 28 s | 1
7 | None | 22 | 58.80 s | 0
8 | In test case No. 22 | 22 | 57.391 s | 1

Table 9: Evaluation results for the United States quiz skill

An additional testing experience: We introduced four faults in the JavaScript code of the skill. Because an Alexa skill developer has to deploy the skill for changes to take effect, the skill will not function properly if the code contains errors; in that case it raises the error message “There was a problem with the requested skill's response”. After connecting the Botium Bindings tool to the Alexa skill and running a test case containing this error message, the tool was capable of detecting it.
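As an illustration of this kind of code-level fault, the following sketch shows a hypothetical ask-sdk-core launch handler with an injected defect. This is not the exact fault used in the experiment, only an example of a change that, once deployed, makes the skill answer with the generic error message that the test case then matches.

    const Alexa = require('ask-sdk-core');

    const LaunchRequestHandler = {
      canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === 'LaunchRequest';
      },
      handle(handlerInput) {
        const speakOutput = 'Welcome to the United States Quiz Game!';
        // Injected fault: "responseBuilde" is misspelled, so handle() throws a
        // TypeError at runtime and the skill replies with "There was a problem
        // with the requested skill's response" instead of the welcome message.
        return handlerInput.responseBuilde.speak(speakOutput).getResponse();
      },
    };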

High Low Game Skill results: The maximum number of test cases that could be executed at the same time for this skill was 23. The rest of the results are shown in Table 10. Similarly to the previous results, the time needed to execute the test sets in this experiment was between 27 and 59 seconds, which shows that the tool might take significant time to execute larger pools of test suites. In addition, we show the results of the fault detection process, in which 100% of the faults were found, suggesting that this framework can also be used for mutation testing.


Test suite | Faults introduced in the expected outcomes | Number of test cases | Time needed to execute the test set | Fault detection (1/0)
1 | None | 10 | 27.85 s | 0
2 | In test case No. 10 | 10 | 27.96 s | 1
3 | None | 16 | 39.97 s | 0
4 | In test case No. 16 | 16 | 44 s | 1
5 | None | 20 | 50.36 s | 0
6 | In test case No. 20 | 20 | 52 s | 1
7 | None | 23 | 57.80 s | 0
8 | In test case No. 23 | 23 | 58.64 s | 1

Table 10: Evaluation results for the High Low Game skill

An additional testing experience: As with the first skill, we introduced four faults in the JavaScript code of the skill. After deploying the changes, the skill stopped functioning properly and raised the error message “There was a problem with the requested skill's response”. After connecting the Botium Bindings package to the Alexa skill and running a test case containing this error message, the tool was capable of detecting it.

4.3 Discussions

Overall, we have provided empirical evidence for the first research question, related to the available automation testing frameworks for HDVA devices. We found several tools serving different testing objectives. The manufacturers of these devices have developed tools integrated in their developer consoles to simulate the interaction that a normal user has with the HDVA device. We note that there are few external tools capable of performing higher levels of testing for HDVA devices; in fact, our results show only two such tools that can perform system testing for the Alexa device and the Google Home device.

Regarding the second research question of this thesis work (i.e., evaluating an automation testing framework for HDVA devices in terms of applicability, usability, and resources used), our results suggest that the test execution time of the Botium tool limits test efficiency in terms of the time to execute a given test set. As an example, for a test set of only 16 test cases, the test runner took approximately 45 seconds, i.e., roughly 2.8 seconds per test case. Even though the evaluation was performed on a virtual machine with limited capabilities, this shows that executing test cases using Botium can be expensive from a test running time point of view.

Secondly, we observed that for the faults injected in the test sets, which were mainly introduced in the expected outputs, the Botium package achieved 100% fault detection. This result suggests that Botium can also be used for mutation testing.

Our third observation regarding the usage of the Botium tool concerns the time needed to understand all the documentation and to figure out the correct component to use. We think it would be greatly beneficial for the users of the tool to have clearer documentation showing what each component in the Botium stack can do.

We also note that the features offered by the tool, in both the commercial and the open-source version, are limited in comparison with other commercial tools. A comparison of the Botium tool with another tool was performed in this thesis.

As a general observation, these testing tools for HDVA devices, such as the Botium tool, do not test non-functional requirements; they are only developed for testing the functionality of an HDVA skill. The security, performance, and maintainability of a skill are not tested when using these tools. It would be useful for the testing community if such tools existed.


5 Conclusions

The work performed in this thesis focused on mapping the current knowledge around testing home digital voice assistants, devices that are already changing the lives of many people around the world. The objective of this work was to investigate the extent to which these devices are reliable and to what extent the software part of these devices is properly tested.

The first contribution of this work was to map the current knowledge around the available automation testing frameworks for HDVA devices, and in particular for the Alexa device. The second contribution was the evaluation of an automation testing tool in terms of applicability, usability, and resources used.

Our first motivation for doing this work is that HDVA devices are very popular and are starting to be used in critical use cases [7], so extensive testing of these devices is an important task. Our second motivation is the limited amount of research done on testing these smart devices.

To achieve the objective related to the first contribution, we performed a multivocal literature review on automation testing frameworks for home digital voice assistants in general and for Alexa devices in particular, following the guidelines of Garousi et al. [9].

To achieve the objective related to the second contribution, we evaluated an automation testing framework capable of doing system testing for the Echo device. The evaluation was based on a set of metrics defined in the Objectives section and inspired by a common standard for usability, ISO 9241-11. We chose to evaluate the Botium software because it was the only available software platform for system testing of the Echo device.

The work done in this thesis has some limitations. The first is the inclusion of only the Echo device in the study; repeating these experiments on the other HDVA devices would benefit everyone to whom this work is addressed. The second is the limited capacity of the machine used in the experiments; repeating them on machines with higher performance would show whether the conclusions about the tool still hold.

As a possible future direction, an evaluation of the existing automation testing frameworks for Alexa devices at lower testing levels could provide a deeper understanding. A second direction would be to do similar work for another HDVA device, such as the Google Home device, which is the second most popular HDVA device on the market [54].


References

[1] A. Bertolino, “Software testing research: Achievements, challenges, dreams,” in Future ofSoftware Engineering (FOSE ’07), 2007, pp. 85–103.

[2] R. Angmo and M. Sharma, “Performance evaluation of web based automation testing tools,”09 2014, pp. 731–735.

[3] P. Ammann and J. Offutt, Introduction to Software Testing, 12 2016.

[4] A. Dasso and A. Funes, Verification, validation and testing in software engineering, 01 2006.

[5] M. Ehmer and F. Khan, “A comparative study of white box, black box and grey box testingtechniques,” International Journal of Advanced Computer Science and Applications, vol. 3,06 2012.

[6] S. Nidhra, “Black box and white box testing techniques - a literature review,” InternationalJournal of Embedded Systems and Applications, vol. 2, pp. 29–50, 06 2012.

[7] “Introducing new alexa healthcare skills,” https://developer.amazon.com/blogs/alexa/post/ff33dbc7-6cf5-4db8-b203-99144a251a21/introducing-new-alexa-healthcare-skills.

[8] “Nearly 70% of us smart speakers owners use amazon echo devices,” https://techcrunch.com/2020/02/10/nearly-70-of-u-s-smart-speaker-owners-use-amazon-echo-devices/.

[9] V. Garousi, M. Felderer, and M. V. Mantylad, “Guidelines for including grey literature andconducting multivocal literature reviews in software engineering,” Information and SoftwareTechnology (IST) journal, 2017.

[10] T. Vos, B. Marın, M. Escalona, A. Escalona, and A. Marchetto, “A methodological frameworkfor evaluating software testing techniques and tools,” 08 2012.

[11] “Ieee standard glossary of software engineering terminology,” ANSI/ IEEE Std 729-1983,1983.

[12] “Istqb glossary,” https://glossary.istqb.org/en/search/.

[13] S. Gojare, R. Joshi, and D. Gaigaware, “Analysis and design of selenium webdriver automationtesting framework,” Procedia Computer Science, vol. 50, pp. 341–346, 12 2015.

[14] A. Leitner, I. Ciupa, B. Meyer, and M. Howard, “Reconciling manual and automated testing:The autotest experience,” 01 2007, p. 261.

[15] M. Hoy, “Alexa, siri, cortana, and more: An introduction to voice assistants,” Medical Refer-ence Services Quarterly, vol. 37, pp. 81–88, 01 2018.

[16] “What is the difference between a chatbot and virtual assistant,” https://analyticsindiamag.com/what-is-the-difference-between-a-chatbot-and-virtual-assistant/.

[17] X. Lei, G.-H. Tu, A. Liu, C. Li, and T. Xie, “The insecurity of home digital voice assistants -amazon alexa as a case study,” 12 2017.

[18] “Build skills with the alexa skills kit,” https://developer.amazon.com/en-US/docs/alexa/ask-overviews/build-skills-with-the-alexa-skills-kit.html/.

[19] “Understand the different skill models,” https://developer.amazon.com/en-US/docs/alexa/ask-overviews/understanding-the-different-types-of-skills.html.

[20] J. Schopfel and D. J. Farace, Encyclopedia of Library and Information Sciences. CRC Press,2010.

[21] Q. Mahood, D. Van Eerd, and E. Irvin, “Searching for grey literature for systematic reviews:Challenges and benefits,” Research Synthesis Methods, vol. 5, 09 2014.

32

Page 39: SYSTEM-LEVEL AUTOMATED TESTING FOR HOME DIGITAL …1473113/... · 2020. 10. 5. · Ismail Tlemcani System level testing for HDVA Devices Abstract Home Digital Voice Assistants (HDVA)

Ismail Tlemcani System level testing for HDVA Devices

[22] R. Ogawa and B. Malen, “Towards rigor in reviews of multivocal literatures: Applying theexploratory case study method,” Review of Educational Research, vol. 61, pp. 265–286, 091991.

[23] V. Garousi and M. Mantyla, “When and what to automate in software testing? a multi-vocalliterature review,” Information and Software Technology, vol. 76, 04 2016.

[24] P. Raulamo-Jurvanen, M. Mantyla, and V. Garousi, “Choosing the right test automation tool:a grey literature review of practitioner sources,” 06 2017, pp. 21–30.

[25] E. Ruane, T. Faure, R. Smith, D. Bean, J. Carson-Berndsen, and A. Ventresque, “Botest: aframework to test the quality of conversational agents using divergent input examples,” 032018, pp. 1–2.

[26] J. O. K. C. Mohammed Kaleem, Omar Alobadi, “Framework for the formulation of metricsfor conversational agent evaluation,” 05 2016.

[27] S. Atefi and M. Alipour, “An automated testing framework for conversational agents,” 022019.

[28] K. Panesar, “An evaluation of a linguistically motivated conversational software agent frame-work,” Journal of Computer-Assisted Linguistic Research, vol. 3, p. 41, 07 2019.

[29] S. Deka and N. Lahkar, “Performance evaluation and comparison of the five most used searchengines in retrieving web resources,” Online Information Review, vol. 34, pp. 757–771, 092010.

[30] “Mdh software testing laboratory,” http://www.es.mdh.se/research-groups/27-SoftwareTesting Laboratory.

[31] W. Walters, “Comparative recall and precision of simple and expert searches in google scholarand eight other databases,” Portal: Libraries and the Academy, vol. 11, pp. 971–1006, 102011.

[32] “The top list of computer science research databases,” https://paperpile.com/g/research-databases-computer-science/.

[33] P. Runeson and M. Host, “Guidelines for conducting and reporting case study research insoftware engineering,” Empirical Software Engineering, vol. 14, pp. 131–164, 04 2009.

[34] W. Afzal, “Dva 463: Research methods in computer science 2019,” 2019.

[35] A. Uddin and A. Anand, “Importance of software testing in the process of software develop-ment,” pp. 2321–0613, 01 2019.

[36] E. Luger and A. Sellen, “”like having a really bad pa”: The gulf between user expectationand experience of conversational agents,” 05 2016, pp. 5286–5297.

[37] “Botium,” https://www.botium.at/.

[38] “Amazon alexa simulator,” https://developer.amazon.com/en-US/docs/alexa/devconsole/test-your-skill.html/.

[39] “Bespoken,” https://bespoken.io/.

[40] “Jovo framework,” https://www.jovo.tech/.

[41] “BrianMacIntosh’s alexa-skill-test-framework,” https://github.com/BrianMacIntosh/alexa-skill-test-framework.

[42] “Actions simulator,” https://developers.google.com/assistant/console/simulator.

[43] “Bot framework emulator,” https://docs.microsoft.com/en-us/azure/bot-service/bot-service-debug-cortana-skill?view=azure-bot-service-3.0.


[44] “Automated testing for alexa skills and more,” https://bespoken.io/blog/automated-testing-alexa-skills/.

[45] “Botium box release notes,” https://botium.atlassian.net/wiki/spaces/BOTIUM/pages/20807681/Botium+Box+Release+Notes.

[46] “Botium in a nutshell, part 1: Overview,” https://medium.com/@floriantreml/botium-in-a-nutshell-part-1-overview-f8d0ceaf8fb4.

[47] “Testing voice user interfaces: Botium vs. bespoken,” https://www.emtec.digital/think-hub/blogs/testing-voice-user-interfaces-botium-vs-bespoken/.

[48] “botium-bindings,” https://www.npmjs.com/package/botium-bindings.

[49] “Mocha - the fun, simple, flexible javascript test framework,” https://mochajs.org/.

[50] “Botium connector for amazon alexa skills api,” https://github.com/codeforequity-at/botium-connector-alexa-smapi.

[51] “Vmware,” https://www.vmware.com.

[52] “United states quiz,” https://www.amazon.com/Jeff-Blankenburg-United-States-Quiz/dp/B06X9GQBRL.

[53] “High low game,” https://www.amazon.com/Iplay-Games-High-Low-Game/dp/B072XH5JZJ.

[54] “Smart speaker with intelligent personal assistant quarterly shipment share from 2016 to 2019, by vendor,” https://www.statista.com/statistics/792604/worldwide-smart-speaker-market-share/.

[55] “alexa-skill-test-framework,” https://www.npmjs.com/package/alexa-skill-test-framework.

[56] “Alexa skill test framework for typescript and ask2,” https://github.com/taimos/ask-sdk-test.

[57] “Effective integration testing of alexa skills,” https://medium.com/@bachlmayr/effective-integration-testing-of-alexa-skills-f5734f99931a.

[58] “Testing alexa skills,” https://hackernoon.com/testing-alexa-skills-d1e6c25b1dea.

[59] “A framework for testing alexa skills developed in python with the alexa-skills-kit-sdk-for-python.” https://pypi.org/project/py-ask-sdk-test/.


6 Appendix A: MLR Results per Database

Database          | Title                                                                                                      | Device under test
------------------|------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------
Google Scholar    | BoTest: a Framework to Test the Quality of Conversational Agents Using Divergent Input Examples [25]         | A custom bot built for the purpose of the research: ChitchatBot
Google Scholar    | An Automated Testing Framework for Conversational Agents [27]                                                | Not specified
Google Web search | Alexa Skill Test Framework [55]                                                                              | Alexa Device
Google Web search | Framework for easy offline black-box testing of Alexa skills [41]                                            | Alexa Device
Google Web search | Alexa Skill Test Framework for Typescript and ASK2 [56]                                                      | Alexa Device
Google Web search | Effective Integration Testing of Alexa Skills [57]                                                           | Alexa Device
Google Web search | Testing Alexa Skills [58]                                                                                    | Alexa Device
Google Web search | Unit Testing for Voice Apps [41]                                                                             | Alexa Device
Google Web search | Bespoken [39]                                                                                                | Alexa, Google Home, and Chatbots
Google Web search | Testing and debugging Cortana skills [43]                                                                    | Microsoft Cortana
Google Web search | A framework for testing Alexa Skills developed in Python with the alexa-skills-kit-sdk-for-python [59]       | Alexa Device
Google Web search | An Automated Testing Framework for Conversational Agents [27]                                                | Conversational agent in general
Google Web search | Botium - Bots testing bots [37]                                                                              | Chatbots in general

Table 11: MLR Results.
