Stats Final




“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” - Edsger W. Dijkstra

The following was my final exam submission. We had a total of two hours to read the following prompt and write a research proposal.

“Norberg cites, the popular iPad sketching app, as inspiration. One of its great innovations was the “Expressive Ink Engine,” which interprets scrawls and smooths them into beautiful lines on screen. “Autopilot is our Ink Engine,” Norberg says.

On mobile devices, where input is limited, algorithms and AI can serve as powerful mediators, smoothing rough edges both literally and figuratively. But as our devices shrink and input becomes ever tougher, this mediation becomes increasingly important. On the iPad’s large screen, you can let users do, say, 90 percent of the driving and let algorithms fill in the rest. As the new Pacemaker app suggests, the opposite ratio might be best.

It’s an approach we’ll almost certainly continue to see developers experiment with. The smaller our devices become, the more chances there are for apps that act all on their own.” - Kyle Vanhemert, On the Apple Watch, the Best DJ is an AI.

Recommend a research plan to design an experiment and investigate this trend. For example, generate relevant hypotheses, and discuss the intended population, independent and dependent variables, design of the experiment, storage of data, and the statistical/graphical tests/results you expect to provide in support of your findings.

You do not need to consider every possible IV and DV for your research plan. Choose the most important ones according to you, briefly argue why they are important, and consistently lay out the data collection and data analysis plan.

Fluency of Experience vs. User Control

Statistical Analysis Final Essay

Research Proposal:

According to Kyle Vanhemert’s article in Wired, when it came to making this iPad-friendly app available for the Apple Watch, “The watch is really interesting because of the constraints. To do something with those limitations is extremely hard…[thus] Pacemaker’s creators consciously opted to give users less control.” However, with a larger device, the iPad, apps like Sketch allow for “90% of the driving.” This initial observation leads us to begin generating a theory, or possible explanation, for this occurrence. From this theory we can generate a hypothesis. Since we are looking at many factors in this example, we can propose multiple appropriate hypotheses:

H1: Participants will report significantly less satisfaction when performing tasks on small devices with low levels of AI than on small devices with high levels of AI.

H2: Participant satisfaction will be highest when the level of AI increases as device size decreases; that is, device size and the level of AI that maximizes satisfaction are negatively correlated.

Alternative Hypothesis: Small devices with low levels of AI increase user satisfaction more than large devices with high levels of AI.

Null Hypothesis: Level of AI and device size have no correlation and no impact on user satisfaction.


The next step is to determine the details of the experiment. It is vital to identify the intended population so that we can collect a random sample from this population to test. Given that mobile devices are sold to the masses and thus used by a wide variety of users, we would want a sample of participants with varying levels of computing experience. We also want to include a variety of ages. It is dangerous to assume that only “young people” will buy an Apple Watch. Younger people may have more disposable income and be interested in the newest tech gear, but mobile devices can assist people of any age, gender, or income bracket, as even lower-income demographics purchase mobile devices.

Having identified a population and gathered a sample of participants, it is necessary to determine how we are going to perform this experiment. While watching participants use their devices as they go about their daily lives would be a very informative ethnographic study, conducting an actual experiment in a lab will probably give us the clearest, most useful data. Being in a lab will give us the control to manipulate variables and carefully measure exactly the variables we are looking for. A lab-based experiment will reduce the random errors that might occur in, say, a diary study in which users write about their experiences: they may make mistakes, use the wrong device for the wrong task, and so on. Furthermore, conducting an experiment in which we have the user perform tasks and then measure their time, errors, and perceived workload will enable us to ensure that the right tasks were performed with the right device (validity) and that participants were given the same instructions in the same manner and used the devices in a similar, distraction-free environment (reliability).

Now that we’ve determined to conduct an experiment in a lab, it’s necessary to define the variables in this study and determine what we are manipulating and what we are measuring. Independent variables are those that we manipulate. Therefore, my first independent variable would be the mobile device. To keep the experiment relatively straightforward, users would be tested on the Apple Watch, the iPhone, and the iPad. Each device would also be tested at three levels of AI: low, in which the user controls about 90% of the experience and algorithms fill in the rest; medium, in which control is split 50/50 between human and AI; and high, in which only about 10% of the control is in the hands of the user. I would also want to test different types of tasks to see whether some tasks, say drawing, are affected differently than, say, making selections in small boxes and entering text, as with survey software. Therefore, the independent variables for this study will be level of AI, mobile device, and task.

In order to evaluate the fluency of the user’s experience, I have determined several dependent variables. First, during the test, I’d like to collect quantitative data on efficiency (the time it takes to complete each task) and effectiveness (error count). Then, to get a further sense of the user’s overall satisfaction with each device at each level, I would have participants complete a NASA-TLX questionnaire after each task with each device. This captures the user’s self-reported levels of mental demand, physical demand, temporal demand, performance, effort, and frustration. I believe these measurements would help us determine the fluency of the user’s experience: is the user frustrated, making many errors, and struggling, or is the user able to complete the task quickly and with ease? I also feel confident using this well-established questionnaire, as it has a proven track record of content and criterion validity. Not only is it an established instrument for measuring perceived workload, but I believe its questions help us determine the fluency of the user’s experience, so the NASA-TLX has high content validity for this study as well.
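To show how an overall workload score might be derived from the six subscales, here is a minimal R sketch, assuming an unweighted (“raw TLX”) average and hypothetical column names and ratings:

# Minimal sketch: unweighted ("raw TLX") overall workload score.
# Column names and example ratings are hypothetical.
tlx <- data.frame(Mental = 45, Physical = 30, Temporal = 50,
                  Performance = 40, Effort = 55, Frustration = 75)
tlx$Workload <- rowMeans(tlx[, c("Mental", "Physical", "Temporal",
                                 "Performance", "Effort", "Frustration")])
tlx$Workload  # about 49.2 on the 0-100 TLX scale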

It is imperative to note that the sample of our intended population that participates in the study will vary depending on several factors: funding (can we pay participants, or do we hope they volunteer their time, and can we advertise the study widely?), the time allowed to complete the study, the number of researchers able to help conduct it, and the number of mobile devices we can provide to participants. However, for this study we would require a fair number of participants, because I would like to conduct a between-subjects experiment. I choose this design because it is important that participants do not become familiar with and learn the apps as they go through the experiment; I want them to use each app for the first time. If, for example, they complete a task on the Apple Watch with limited AI and then perform the same task on the Apple Watch with high AI, it will be difficult to determine whether their output differences are due to the additional AI or to the familiarity they gained in the previous trial.

Based on this, the experiment would be set up as follows. This user study will explore the effect of the level of AI control and mobile device size on the fluency of the users’ experience.


As many users as possible will take part in the study; the goal is at least 50. Each user will complete two tasks: one that involves drawing and swiping, and a second that involves checking boxes and entering text.

Each of the two tasks will be completed with each of the three devices (Apple Watch, iPhone, iPad) at each of the three levels of AI control (high, medium, and low). Users’ time and error count will be recorded for each task.

After completing the tasks, users will then complete a NASA-TLX questionnaire for each task at each of the three levels of AI.
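To make the size of this design concrete, a minimal R sketch (with assumed factor labels) can enumerate the full device-by-AI-level-by-task grid that the measurements must cover:

# Minimal sketch of the full condition grid; factor labels are assumed.
devices   <- c("Apple Watch", "iPhone", "iPad")
ai_levels <- c("Low", "Medium", "High")
tasks     <- c("Trace house", "Complete survey")

conditions <- expand.grid(Device = devices, LevelOfAI = ai_levels, Task = tasks)
nrow(conditions)  # 18 device x AI x task combinations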

It’s important to use care with the data being collected in the experiment. When collecting this data, data integrity (data that is accurate and verifiable) is paramount. Storing the data in a spreadsheet is simply not going to suffice; it can lead to data inconsistencies and redundancies. Even though we can use .csv files for R, a spreadsheet is not a database and thus does not provide all the advantages that a database does.

One of the more confusing, but nonetheless essential, aspects of designing a database is determining the database schema. Each value within a relation is in a relationship with the relation as a whole, and understanding how to set these relationships is key to a successful database. For example, the primary key(s) must be specified; in this case, the task ID and trial ID will serve as keys. Insertion anomalies, deletion anomalies, and update anomalies are bound to occur when unrelated data is stored together. Normalization reduces this problem by keeping related data together. My normalized data is shown below.

One relation that could prove very useful is a demographic relation. The age, gender, handedness, and mobile device experience of the user could be interesting to examine in the end and may provide useful information if analyzed; it doesn’t hurt to gather this information and store it in a relation. The attributes for the first relation would be Participant, Age, Gender, and Handedness. The attributes for the second relation would be Participant (the primary key, the participant’s ID), iPhone Experience, iPad Experience, and Apple Watch Experience. It would look something like the tables below, which are in 3NF:

Participant  Age  Gender  Handedness
001          46   M       Right

And the second relation for the demographics would be:

Participant  iPhone Experience  iPad Experience  Apple Watch Experience
001          Yes                No               No

And finally:

Participant  Participant Name
001          Joe

Of course, for running data analysis in R, these tables would still work if mobile device experience were included as attributes in the same .csv file, but when setting up a well-designed database, it is important to normalize the data to the highest degree possible.
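For the analysis side, the normalized relations can simply be joined back together in R on the shared Participant key. Here is a minimal sketch with hypothetical column names and an example row taken from the tables above:

# Minimal sketch: re-joining normalized demographic relations in R.
# Column names and values are illustrative.
demographics <- data.frame(Participant = "001", Age = 46,
                           Gender = "M", Handedness = "Right")
experience   <- data.frame(Participant = "001", iPhoneExperience = "Yes",
                           iPadExperience = "No", WatchExperience = "No")

# merge() performs a relational join on the shared primary key
demo_full <- merge(demographics, experience, by = "Participant")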

I would also want a relation for task satisfaction that would include the following attributes: Participant, Device, Level of AI, and all of the NASA-TLX variables (mental demand, physical demand, temporal demand, performance, effort, and frustration). This would need to be normalized as well:


Task  Task Description
1     Trace house
2     Complete survey

Trial ID  Participant  Task  Device  Level of AI  Mental  Frustration  All other NASA scores…
001       001          1     Watch   Low          45      75
002       001          1     iPhone  Low          45      85
003       001          1     iPad    Low          35      25
004       001          1     Watch   Medium       20      35
005       001          1     iPhone  Medium       95      65
006       001          1     iPad    Medium       75      85
007       001          1     Watch   High         20      40

And this would continue for all the users at all levels of AI, all devices, and for both tasks.

In addition to NASA-TLX scores, I would also collect information on time and errors. This data would be stored as follows:

Trial ID  Participant  Task  Device  Level of AI  Time (ms)
001       001          1     Watch   Low          4545
002       001          1     iPhone  Low          1245
003       001          1     iPad    Low          3345
004       001          1     Watch   Medium       2032
005       001          1     iPhone  Medium       9586
006       001          1     iPad    Medium       7525
007       001          1     Watch   High         2083

Trial ID  Participant  Task  Device  Level of AI  Error Count
001       001          1     Watch   Low          34
002       001          1     iPhone  Low          5
003       001          1     iPad    Low          3
004       001          1     Watch   Medium       22
005       001          1     iPhone  Medium       9
006       001          1     iPad    Medium       5
007       001          1     Watch   High         3

Once the experiment is complete and all the data has been collected into a database, it is time to analyze the data. Before even booting up R, it is important to review how the experiment went. Were there any systematic or random errors? If so, remove that data. Were all the instruments working properly to produce valid results? Was the experiment run in a reliable manner, with all researchers on the same page about how to administer it? Were all parts of the experiment measured, i.e., did all participants take the NASA-TLX questionnaire after each AI level, device change, and task?

Then comes the statistical analysis. There are many tests I would run so that I can look at the results in a variety of ways. First, I would compute descriptive statistics (the mean, standard deviation, standard error, and 95% confidence interval of the mean) for the overall efficiency, effectiveness, and mean NASA-TLX score of each task. This will give me an idea of which type of task may be easier to perform in general. I would then run the same computations for each condition, for example, the Watch at the high level of AI for task 1, as sketched below.
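A minimal R sketch of these descriptive statistics, assuming the trial-level relation has been exported to a hypothetical trials.csv and that its columns are named Time, Task, Device, and LevelOfAI:

# Minimal sketch of per-task and per-condition descriptive statistics.
# "trials.csv" and its column names are assumed.
trials <- read.csv("trials.csv")

describe <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  se <- sd(x) / sqrt(n)
  c(mean = m, sd = sd(x), se = se,
    ci_low  = m - qt(0.975, n - 1) * se,
    ci_high = m + qt(0.975, n - 1) * se)
}

aggregate(Time ~ Task, data = trials, FUN = describe)                      # per task
aggregate(Time ~ Task + Device + LevelOfAI, data = trials, FUN = describe) # per condition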

I would also like to check whether my dependent variables are normally distributed at each level of AI on each device. I will also run tests to find the median and inter-quartile range for each condition.
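For example, a Shapiro-Wilk test per condition plus robust summaries could look like the following sketch, reusing the assumed trials data frame from above:

# Minimal sketch: normality checks and robust summaries per condition.
# Assumes the "trials" data frame defined in the previous sketch.
by(trials$Time, interaction(trials$Device, trials$LevelOfAI),
   function(x) shapiro.test(x)$p.value)          # Shapiro-Wilk p-values

aggregate(Time ~ Device + LevelOfAI, data = trials,
          FUN = function(x) c(median = median(x), IQR = IQR(x)))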

Most importantly, I would like to examine correlations. I would run a Pearson’s correlation to determine whether there is a correlation between the size of the device and the level of AI that provides the most fluent experience. This will be essential in supporting or rejecting my hypotheses.
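One way to operationalize that correlation in R is sketched below, assuming numeric codings for device size and AI level and a Workload column (e.g., the raw-TLX mean from earlier); all of these column names are assumptions:

# Minimal sketch: Pearson correlation between device size and the AI level
# that gave each participant their lowest workload on that device.
# SizeCode, AICode, and Workload are assumed/derived columns.
trials$SizeCode <- as.numeric(factor(trials$Device,
                                     levels = c("Watch", "iPhone", "iPad")))
trials$AICode   <- as.numeric(factor(trials$LevelOfAI,
                                     levels = c("Low", "Medium", "High")))

# For each participant x device, keep the row with the lowest workload
best <- do.call(rbind, by(trials, list(trials$Participant, trials$Device),
                          function(d) d[which.min(d$Workload), ]))

cor.test(best$SizeCode, best$AICode, method = "pearson")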

I will also make a line chart showing mean effectiveness, mean efficiency, and mean workload (the mean of the mean NASA-TLX scores) for both tasks 1 and 2, at the three levels of AI and on the three devices. I will also chart error bars at the 95% confidence interval of the mean and compare means at alpha = .05.
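A minimal ggplot2 sketch of one such chart (mean workload with 95% CI error bars), again assuming the trials data frame and a Workload column as above:

# Minimal sketch: line chart of mean workload with 95% CI error bars.
# Assumes the "trials" data frame with a Workload column as above.
library(ggplot2)

summary_df <- aggregate(Workload ~ Device + LevelOfAI + Task, data = trials,
                        FUN = function(x) c(mean = mean(x),
                                            ci = qt(0.975, length(x) - 1) *
                                                 sd(x) / sqrt(length(x))))
summary_df <- do.call(data.frame, summary_df)  # flatten the matrix column

ggplot(summary_df, aes(x = LevelOfAI, y = Workload.mean,
                       group = Device, colour = Device)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = Workload.mean - Workload.ci,
                    ymax = Workload.mean + Workload.ci), width = 0.1) +
  facet_wrap(~ Task) +
  labs(x = "Level of AI", y = "Mean NASA-TLX workload")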