Serial Hook-ups: A Comparative Usability Study of Secure Device...

Serial Hook-ups: A Comparative Usability Studyof Secure Device Pairing Methods

Alfred KobsaDept. of Informatics

University of California, [email protected]

Rahim SonawallaDept. of Informatics


Gene TsudikDept. of Computer Science


Ersin Uzun∗

Dept. of Computer ScienceUniversity of California, Irvine

[email protected]

Yang WangDept. of Informatics


ABSTRACTSecure Device Pairing is the bootstrapping of secure commu-nication between two previously unassociated devices over awireless channel. The human-imperceptible nature of wire-less communication, lack of any prior security context, andabsence of a common trust infrastructure open the doorfor Man-in-the-Middle (aka Evil Twin) attacks. A num-ber of methods have been proposed to mitigate these at-tacks, each requiring user assistance in authenticating infor-mation exchanged over the wireless channel via some human-perceptible auxiliary channels, e.g., visual, acoustic or tac-tile.

In this paper, we present results of the first comprehensiveand comparative study of eleven notable secure device pair-ing methods. Usability measures include: task performancetimes, ratings on System Usability Scale (SUS), task com-pletion rates, and perceived security. Study subjects werecontrolled for age, gender and prior experience with devicepairing. We present overall results and identify problem-atic methods for certain classes of users as well as methodsbest-suited for various device configurations.

1. INTRODUCTIONWireless communication is very popular and is likely to

remain so in the future. In particular, medium- and short-range wireless communication methods (such as Bluetooth,WiFi, Zigbee and WUSB) are becoming ubiquitous on per-sonal devices, such as cell-phones, headsets, cameras andmemory sticks. In the past, wireless devices communicatedmostly with the (wired) infrastructure, e.g., cell-phones withbase stations or laptops with access points. However, mod-ern devices increasingly need to communicate among them-

∗Corresponding author.

Copyright is held by the author/owner. Permission to make digital or hardcopies of all or part of this work for personal or classroom use is grantedwithout fee.Symposium on Usable Privacy and Security (SOUPS) 2009, July 15–17,2009, Mountain View, CA USA.

selves, e.g., a Bluetooth headset with a cell-phone, a mem-ory stick with a PDA, a PDA with a wireless printer, or awireless access point with a laptop.

The convenience of seamless mobility and ubiquitous con-nectivity that comes with personal wireless devices is tem-pered by increased security and privacy risks. Compared toits wired counterpart, wireless communication is subject toeasier eavesdropping and other attacks. Specifically, the pro-cess of setting up an initial security context between wirelessdevices is prone to so-called Man-in-the-Middle (MiTM) at-tacks, also known as Evil Twin attacks. Countering MiTMattacks requires the communication channel to be authen-ticated, which in turn requires either a pre-shared secretor a common trust infrastructure, neither of which exists.Secure device pairing therefore aims at authenticating com-munication channels by using a human-perceptible auxiliarychannel.

The main challenge in secure device pairing stems fromtwo factors: (1) inherent exposure to attacks and (2) humanimperceptibility of wireless channels. Traditional crypto-graphic means of establishing secure communication (suchas authenticated key exchange protocols) are unsuitable forthe problem at hand, since the communication channel isnot authenticated and unfamiliar devices have no prior secu-rity context or common point of trust. Among a multitudeof device types and their manufacturers, there is no com-mon security infrastructure and none is likely to materializein the near future. This is due in part to the diversity ofdevices, lack of standards, and glacial progress of standard-ization bodies. However, there is wide-spread acceptance onthe part of device manufacturers and the research commu-nity that some form of human user involvement in securedevice pairing is unavoidable [32].

One natural and well-explored research direction aimed ataddressing the problem is the use of auxiliary “out-of-band”(OOB) channels, which are both perceivable and manage-able by the human user. An OOB channel takes advan-tage of human sensory capabilities to authenticate human-imperceptible information exchanged over the wireless chan-nel (which is subject to MiTM attacks). OOB channels canbe realized using acoustic, visual and tactile senses. Themain idea is that a human-perceivable OOB channel, un-like the main wireless channel, exposes MiTM attacks to theuser.

Since some degree of user involvement is unavoidable, us-ability becomes a crucial issue. Also, since a typical OOBchannel is low-bandwidth, the amount of information trans-ferred over it needs to be minimized for reasons of bothusability and efficiency. Most pairing methods (see Section2) involve sending only a few bits (e.g., 15) over the OOBchannel to achieve reasonable security. At the same time,some devices (e.g., Bluetooth headsets and wireless accesspoints) have very limited hardware capacities and poor orvery rudimentary user interfaces, making it a challenge tocommunicate even a few bits.

In the last decade, many secure device pairing methodshave been proposed, each claiming certain advantages andexhibiting certain shortcomings. As described in Section 2.2,their fundamental distinguishing characteristics concern thenature of the OOB channel and assumptions regarding theuser interface and device features.

Even though some methods have been field-tested on theirown (e.g, [29, 22]), no comprehensive and comparative us-ability evaluation of these methods has been carried out1.Very recently, Kumar, et al. [18] reported on the first studycomparing a number of prominent methods. However, sinceit focused mainly on security, [18] has not yielded notableusability results.2 Furthermore, the study in [18] has thefollowing issues:

1. The set of participants was narrow, comprising mostlyyoung (and male) graduate students very familiar withthe newest technology.

2. Test administrators were also the developers of someof the tested methods.This undermines the perceivedneutrality of the study.

3. Although the sequence of methods tested by each sub-ject was random, it was not controlled for overall uni-form distribution.

4. Each method was tested multiple times with differenterrors that simulated MiTM attacks. Although nec-essary for security measurements, this undermines theaccuracy of usability evaluation, due to subject fatigue,uneven number of test cases and varying degrees oflearning effect among methods.3

In general, we observe that many (if not most) secure de-vice pairing methods have been developed by security re-searchers who, not surprisingly, are experts in security andnot usability or HCI. What seems simple and user-friendlyto a seasoned security professional might not be either to anaverage user. Non-specialist users are often initially clue-less about manipulating new devices and have insufficientunderstanding of security issues and the very meaning ofuser participation in secure device pairing. This disconnectbetween developers and average users as well as aforemen-tioned issues with [18], serve as the chief motivation for thestudy presented in this paper.1Concurrent with and independent of our work, similar re-search has been conducted by Kainda, et al. [14].2The only usability insight was obtained by asking partic-ipants to rank the perceived difficulty of tested methods,from easy to hard within each pre-defined group. Partici-pants’ post-hoc perception is only one of several usabilityindicators though.3The large number of error scenarios also resulted in thestudy being broken into three batches, each batch separatedby several days and taking place in a different environment.

Organization: Section 2, reviews prominent secure de-vice pairing methods. Section 3 discusses our criteria for se-lecting candidate methods. The rest of the paper describesthe design of our study (Section 4), presents its results (Sec-tion 5) and discusses its implications.

2. BACKGROUNDAs a background, we now overview relevant cryptographic

protocols and secure device pairing methods that use theseprotocols. (Those familiar with secure device pairing maywish to skip this section with no lack of continuity). Theterm cryptographic protocol denotes the entire interactioninvolved, and information exchanged, in the course of thepairing method. The term pairing method refers to the pair-ing process as viewed by the user, i.e., user actions. Asdiscussed below, a particular cryptographic protocol can berealized using many pairing methods.

2.1 Cryptographic ProtocolsA very simple protocol for device pairing was first sug-

gested in [2]: devices A and B exchange their respectivepublic keys pkA and pkB over the insecure channel, and thecorresponding hashes H(pkA) and H(pkB) over the OOBchannel. Although non-interactive, this protocol requiresH() to be a (weakly) collision-resistant hash function andthus needs at least 80 bits of OOB data in each direction.MANA protocols [9] reduce the size of OOB messages tok bits while limiting attacker’s success probability to 2−k.However, they impose a stronger assumption on the OOBchannel: the adversary cannot delay or replay any OOBmessages.

An alternative approach involves Short Authenticated Strings(SAS). The first SAS protocol was proposed in [34]. It lim-its attack probability to 2−k for a k-bit OOB channel, evenwhen the adversary can delay and/or replay OOB messages.This protocol uses commitment schemes (which can be basedon hash functions such as SHA-2) and requires four rounds ofcommunication over the wireless channel. Subsequent work[19, 24] yielded 3-round SAS protocols.4 Generally, SASprotocols are used in device pairing settings where either:(1) the OOB channel is used to transmit something (i.e.,the SAS itself) from one device to another, or (2) the useris asked to compare two values emitted by the respectivedevices.

Some pairing methods require the user to generate a se-cret random value and somehow enter it into both devices.Devices then perform authenticated key exchange, using theuser-generated secret as a means of one-time authentication.Cryptographic protocols used for this purpose are calledPassword-Authenticated Key Exchange (PAKE) protocols[4].

2.2 Device Pairing MethodsBased on cryptographic protocols described above, a num-

ber of pairing methods have been proposed. They operateover different OOB channels and offer varying degrees ofsecurity and usability.

“Resurrecting Duckling” [31] is the initial attempt to ad-dress the device pairing problem in the presence of MiTMattacks. It requires standardized physical interfaces and ca-

4Recently, [27, 28] proposed even more efficient SAS proto-cols which are used in several pairing methods we studied.

bles. Though appropriate in the 1990s, it is clearly obso-lete today, due to the greatly increased diversity of devices.Requiring physical equipment (i.e., a cable) also defeats thepurpose of using wireless connections. Another early methodis“Talking to Strangers”[2], where infrared (IR) communica-tion is used as the OOB channel. It requires almost no userinvolvement, except for the initial setup. Unlike many othermethods, it has been extensively tested [1]. This method isdeceptively simple: since IR is line-of-sight, set-up requiresthe user to find IR ports on both devices – not a trivial taskfor many – and align them. Also, despite its line-of-sightproperty, IR is not completely immune to MiTM attacks.The main drawback is that IR has been largely displaced byother wireless technologies, such as Bluetooth, and is avail-able on very few modern devices.

Another early approach involves image comparison. Itencodes the OOB data into images and asks the user tocompare them on two devices. Prominent examples include“Snowflake” [10], “Random Arts Visual Hash” [25] and “Col-orful Flag” [7]. Such methods require both devices to havedisplays with sufficiently high resolution. Their applicabilityis thus limited to high-end devices, such as laptops, PDAsand certain cell-phones.

In [22], McCune, et al. proposed the “Seeing-is-Believing”(SiB) method. In its original form, SiB requires a bidirec-tional visual OOB channel: each device, one after the other,encodes OOB data into a two-dimensional barcode whichit displays on its screen and the other device “reads it” us-ing a photo camera, operated by the user. At a minimum,SiB requires both devices to have a camera and a displayfor bidirectional authentication. Thus, it is unsuitable forlower-end devices. Our study includes an SiB variant from[27, 28] which only requires one device to have a camera.We refer to it as “See-Believe”.

A related approach, called “Visual authentication basedon Integrity Checking” (VIC) was explored in [27]. Like SiB,it uses the visual OOB channel and requires one device tohave a continuous visual receiver, e.g., a light detector or avideo camera. The other device must have at least one LED.The LED-equipped device transmits OOB data by blinking,while the other receives it by recording the transmission andextracting information based on inter-blink gaps. The re-ceiver device indicates success/failure to the user who, inturn, informs the other to accept or abort. We refer to thismethod as “Video”.

[26] proposed several pairing methods based on synchro-nized audio-visual patterns: “Blink-Blink”, “Beep-Beep” and“Beep-Blink”. All of them involve users comparing very sim-ple audiovisual patterns, e.g., in the form of “beeping” and“blinking”, transmitted as simultaneous streams, formingtwo synchronized channels. One advantage of these meth-ods is that they only require devices to have two LEDs or abasic speaker.

Another recent method is “Loud-and-Clear” (L&C) [11].It uses the audio (acoustic) OOB channel along with vocal-ized MadLib sentences which represent the digest of infor-mation exchanged over the main wireless channel. There aretwo L&C variants: “Display-Speaker”and“Speaker-Speaker”.In the latter, the user compares two vocalized sentences andin the former – a displayed sentence with its vocalized coun-terpart. Minimal device requirements include a speaker (oraudio-out port) on one device and a speaker or a display onthe other. The user is required to compare two respective

(vocalized and/or displayed) MadLib sentences and eitheraccept or abort the protocol based on the outcome of thecomparison. In this paper, we use the L&C variant basedon SAS protocols [24, 19] to reduce the number of wordsin the MadLib sentences. Depending on the required userinteraction, we call the two L&C variants as “Listen-Look”and “Listen-Listen”.

As a follow-on to L&C, HAPADEP [30] considered pair-ing devices that have no common wireless channel, at leastnot at pairing time. HAPADEP uses pure audio to trans-mit cryptographic protocol messages and asks the user tomerely monitor device interaction for any extraneous audiointerference. It requires both devices to have speakers andmicrophones. To cater to very basic device settings, we in-clude a HAPADEP variant that uses the wireless channelfor cryptographic protocol messages, and the audio as theOOB channel. We call this variant “Over-Audio”. It needsonly one device to be equipped with a speaker, and the otherwith a microphone. Also, on the user’s part, it involves noentry of data and no comparisons.

A small-scale experimental investigation [33] presentedthe results of a comparative usability study of four simplepairing methods for devices with displays capable of showinga few (4-8) decimal digits:

Compare-and-Confirm: The user compares two (4-, 6- or8-digit) numbers displayed by respective devices.

Select-and-Confirm: One device displays a single number.The other displays a set of numbers and user selectsone that matches the number displayed by the firstdevice.

Copy-and-Confirm: The user copies a number from onedevice to the other.

Choose-and-Enter: The user picks a “random” 4-to-8-digitnumber and enters it into both devices.

Though all these methods are very simple, [33] shows thatSelect-and-Confirm and Copy-and-Confirm are slow and error-prone. Furthermore, Choose-and-Enter is insecure, sincestudies show that numbers selected by users exhibit verypoor randomness.

Yet another approach – BEDA [29] – involves the userpressing device buttons, thus utilizing the tactile OOB chan-nel. BEDA has several variants: LED-Button, Beep-Button,Vibration-Button and Button-Button. In the first two (basedon the SAS protocol [27]), whenever the sending device blinksits LED (or vibrates or beeps), the user presses a button onthe receiving device. Each 3-bit block of the SAS string isencoded as the delay between consecutive blinks (or vibra-tions or beeps). Thus, repeated button presses transmit theSAS from one device to another. In the Button-Button vari-ant – which works with any Password-Authenticated KeyExchange (PAKE) protocol [4] – the user simultaneouslypresses buttons on both devices and random user-controlledinter-button-press delays are used as a means of establishinga common secret. In this paper, we refer to BEDA variantsLED-Button, Beep-Button and Vibration-Button as “LED-Press”, “Beep-Press” and “Vibrate-Press”, respectively.

There are other methods involving technologies that arecurrently expensive and/or uncommon on commodity de-vices. We briefly summarize a few. [15] suggested usingultrasound as the OOB channel. A related technique useslaser and requires each device to have a laser transceiver

Device/Equipment Requirements User Actions

Pairing Method Sending Device Receiving Device Phase I: Setup Phase II: Exchange Phase III: Outcome OOB Channels

Visual Comparison BasedImage-ComparePIN-CompareSentence-Compare

Display + user-input on both NONE

Compare:two imagestwo numberstwo phrases

Abort or accept on both devices Visual

Seeing is Believing (SiB)See-Believe

Display + user-input

Photo camera + user-output

NONEAlign camera on receiving device with displayed barcode on sending device, take picture

Abort or accept on sending device based on receiving device decision

Visual

Visual Integrity Code (VIC)Video

LED + user-input

User-output +Light detector or video camera

NONEInitiate transmittal of OOB data by sending device, align camera or light detector on receiving device.


Visual

Loud & Clear (L&C)Listen-LookListen-Listen

User-input on both +display on one & speaker on the other, orspeaker on both

NONECompare:two vocalizationsDisplayed phrase with vocalization

Abort or accept on both devicesAcoustic, or Acoustic+ visual

Button-Enabled (BEDA) Vibrate-PressLED-PressBeep-Press

User input +vibration LEDbeeper

User output +One button Touch or hold both devices

For each signal (display, sound or vibration) by sending device, press a button on receiving device

Abort or accept on sending device based receiving device decision

TactileVisual + tactileAcoustic+ tactile

Audio Pairing (HAPADEP)Over-Audio

Speaker + user-input

Microphone + user-output

NONEWait for signal from receiving device.

Abort or accept on sending device

Acoustic

Resurrecting Duckling Hardware port (e.g., USB) on and a cable Connect cable to devices NONE NONE Cable

Talking to Strangers IR port on both Find, activate, align IR ports NONE NONE IR

Copy–and-ConfirmDisplay +user-input

Keypad + user-output

NONEEnter value displayed by sending device into receiving device


Visual

Choose-and-Enter User input on both devices NONESelect “random” value and enter it into each device

NONE(unless synch. Error)

Tactile

Audio/Visual Synch.Beep-BeepBlink-BlinkBlink-Beep

User-input on both +Beeper on each LED on eachBeeper on one & LED on the other

NONE

Monitor synchronized:beeping, or blinking, or beeping & blinking

Abort on both devices if no synchrony

Visual AudioAudio + visual

Smart-its-Friends, Shake-Well-Before-Use

2-axis accelerometers on both +user-output on one

Hold both devicesShake/twirl devices together, until output signal

NONE (unless synch. error)

Tactile + motion

Figure 1: Feature Summary of Notable Device Pairing Methods

[21]. In “Smart-Its-Friends” [13], a common movement pat-tern is used to communicate a shared secret to both devicesas they are twirled and shaken together by the user. A sim-ilar approach is developed in “Shake Well Before Use” [20].Both techniques require devices to have 2-axis accelerome-ters. Although some new cell-phones (e.g., the iPhone) arethus equipped, accelerometers are rare on most other de-vices. Also, physical shaking/twirling is an activity unsuit-able for delicate as well as stationary or large/bulky devices.

Methods SummaryFigure 1 summarizes our discussion of existing methods bycomparing their salient features. The following terminologyis used:

Sending/Receiving Device: applies to all methods wherethe OOB channel is used in one direction.

Phase I: Setup: user actions to bootstrap the method.

Phase II: Exchange: user actions as part of the protocol.

Phase III: Outcome: user actions finalizing the method.

User-input: any means of user input, e.g., a button.

User-output: any user-perceivable means of output, e.g.,an LED.

3. SELECTION OF PAIRING METHODSAs follows from the above overview, there is a large body

of prior research on secure device pairing and many proposedmethods. As shown in Figure 1, there are about 20 notablemethods (counting variations). In the course of performingextensive pilot tests, we determined that only about half ofall methods ought to be included in a within-subject study,mainly to avoid user fatigue. We therefore eliminated thefollowing methods (at the bottom of Figure 1):

Resurrecting-Duckling: obsolete, requires cables.

Talking-to-Strangers: obsolete, IR ports are uncommon.

Copy-and-Confirm: performed poorly in prior evaluationsdue to high user error rate.

Choose-and-Enter: performed poorly in prior evaluationsdue to low security.

Simple Audio/Visual Synchronization (Beep-Beep, Blink-Blink, Beep-Blink): performed poorly in prior evalua-tions due to user annoyance and high error rate.

Smart-its-Friends, Shake-Well-Before-Use, Ultrasound- andLaser-based methods: require interfaces that are un-common on many current types of devices.

All remaining methods were included in our study.

4. EXPERIMENTAL DESIGN

4.1 ApparatusWe used the two Nokia cell-phone models E51 and E615

as test devices. Both models have been released for at leasttwo years and do not represent the cutting edge. We se-lected these particular models to avoid devices with exotic orexpensive features and faster-than-average processors. An-other reason for choosing these devices is the number ofcommonly available interfaces, such as:

User-input: keypad (subsumes button), microphone, videocamera (subsumes photo camera)

User-output: vibration, speaker (subsumes beeper), colorscreen (subsumes LED)

Wireless: Bluetooth, Wi-Fi and cellular (GSM)

In all our tests, Bluetooth was used as the (human-impercept-ible) wireless channel; it is both inexpensive and widelyavailable. For methods that involve beeping, the cell-phonespeaker is trivial to use as a beeper. Whenever a buttonis needed, one of the keypad keys is easily configured forthat purpose. An LED is simulated with a small LED-likeimage glowing (alternating between light and dark) on thecell-phone screen.6

In comparative usability studies, meaningful and fair re-sults can only be achieved if all methods are tested undersimilar conditions. In our case, the fair comparison basisis formed by using (1) the same test devices, (2) consistentGUI design practices (e.g., safe defaults), and (3) the sametargeted level of security for all methods. We also auto-mated timing and logging to minimize administrator errorsand biases.

In order to have a unified test platform, our implementa-tion of the eleven selected device pairing methods was basedupon the open-source comparative usability testing frame-work developed by Kostiainen, et al. [17]. It provides basiccommunication primitives as well as automated logging andtiming functionality. However, we still had to implementseparate user interfaces and simulated functionality for alltested methods in JAVA-MIDP. For all methods, we keptSAS string length (and secret OOB string length in Button-Button) constant at 15 bits. It is well-known that this sizeprovides a reasonable level of security [34].

As implemented on our test platforms, each device pair-ing method very closely approximates user experience witha real implementation. One difference is that our versions oftested methods omit initial rounds of the underlying cryp-tographic protocol over the (human-imperceptible) wirelesschannel. However, this omission is completely transparentto users.

The only methods noticeably different from their real-world implementations were Seeing-is-Believing and Video.Due to the difficulty of implementing image and video pro-cessing on cell-phones, we chose to simulate their opera-tions.7 Specifically, we saved the captured barcode image

5See http://www.nokiausa.com/A4579382 and europe.nokia.com/A4142101 for their respective specifications.6Even though both tested cell-phones have LEDs, there areunfortunately no system calls to access them via Java MIDP.7The current CMU implementation of Seeing-is-Believing issupported on Nokia models N70 [23] and 6620 [22] as receiv-

in Seeing-is-Believing (and the recorded video of blinkingscreen in Video) on the test device and manually analyzedlater whether their quality was sufficient for image recog-nition. From the user’s perspective, the only difference isthat these pairing methods do not fail, which is not prob-lematic since each user only tests these methods once. Also,execution times of these two methods were penalized by afew seconds in our tested implementations, since a systemsecurity notification popped up each time the camera wasactivated by our third-party testing software.

4.2 SubjectsThe 22 study participants were adults, mainly from Irvine,

California. Most of them were University of California stu-dents and staff. They were balanced by age group (eightwere between 18 and 25, seven between 26 and 40, and sevenwere 41 and over), and also separately by gender (i.e., elevenfrom each gender).

4.3 ProceduresWe conducted a within-subjects experiment, in which all

participants were subject to the following procedures:Background Questionnaire: Subjects were polled on

age, gender, ownership of mobile device, experience withdevice pairing, and experience with different functionalityoffered by mobile devices, e.g., messaging, gaming, music.

Scenario Presentation: Subjects were asked to imaginethat they had just bought a new cell-phone and that a storeemployee had already set up everything for them. Whenthey returned home they wanted to pair their new cell-phonewith their old one.

Experiment with pairing methods: Subjects sequen-tially performed the following procedures for each of theeleven tested methods.

1. They were given brief and simple instructions on thenext pairing method, both textually on one of the de-vices and orally by the test administrator.

2. They tried pairing the devices with one of the testedmethods to establish a connection, and thereby per-formed one of the following actions:

Beep-Press: When device A beeps, the user presses akey on device B. The user then accepts or rejectsthe outcome on A based on the output (green orred LED) on B.

LED-Press: When an LED on A turns ON, the userpresses a key on B. The user then accepts orrejects the outcome on A based on the output(green or red LED) on B.

Image-Compare: Both A and B display a visual pat-tern. The user compares the two patterns and de-cides whether they match and enters the decisioninto both devices.

ing devices. The current Nokia implementation of Videois supported only on Nokia 6630 [16] as the receiving de-vice. Since we wanted to perform our tests on the samedevices throughout, neither implementation could be used.Moreover, porting existing implementations onto our deviceswas not viable since characteristics of cameras on these cell-phones are quite different and each performs its own adjust-ments to images and video, at the operating system level.

Listen-Listen: Both phones “vocalize” a 3-word sen-tence. The user decides whether the two sen-tences match and enters the decision into bothdevices.

Listen-Look: A displays a 3-word sentence, while Bvocalizes a 3-word sentence. The user decideswhether the sentences match and enter the de-cision into both devices.

PIN-Compare: Both A and B display a 5-digit num-ber. The user decides whether the numbers areidentical and enters the decision into both devices.

Sentence-Compare: Both A and B display a 3-wordsentence. The user decides whether the sentencesmatch and enters the decision into both devices.

Over-Audio: A transmits data over audio and B re-ceives the transmission (records it). User confirmsthat no other nearby source emits audio duringthe process.

Seeing-is-Believing (See-Believe): With device A,the user takes a photo of a barcode displayed byB. Based on the output by A, the user either ac-cepts or rejects the outcome on B.

Vibrate-Press: Whenever A vibrates, the user pressesa key on B. The user then accepts or rejects theoutcome on A based on the output (green or redLED) of B.

Video: With device A, the user takes a video clip ofa blinking pattern displayed by B. Based on theoutput by A, the user either accepts or rejects theoutcome on B.

To avoid order effects (particularly due to training andfatigue), the sequence of performing the eleven pair-ing methods tasks was counter-balanced using a LatinSquare design.

3. Subjective Perceptions: Subjects completed theSystem Usability Scale (SUS) questionnaire [5], a widelyused and highly reliable 10-item Likert scale that pollssubjects’ satisfaction with computer systems [3]. Weused the original questions from [5], but replaced “sys-tem” with “method”. Subjects also rated the perceivedsecurity of each method, on the same scale.

4. Observable usability indicators: The following mea-sures of observable usability indicators were taken foreach device pairing: task performance time, errors (ifany), and task completion (i.e., whether or not a con-nection was established). Subjects were videotapedduring the pairing process (but were told beforehandthat only their hands and the devices would be cap-tured).

5. Qualitative post-test questionnaire and inter-view: Subjects completed a brief questionnaire thatasked them to name the three easiest and the threehardest methods. It also asked them to pick two meth-ods they would like to see on their personal device andto indicate why (the options given were “easy”, “se-cure” and “fun”). Subjects could explain orally if theypreferred a method that was not tested.

For each subject, the entire experiment lasted between 30and 45 minutes.

5. RESULTSAs described above, the following measures were collected

before and during the experiment, which form the within-subjects usability measures and between-subjects factors ofour study:

Within-subjects usability measures: task performancetime, SUS score, perceived security, and task comple-tion (a categorical variable).

Between-subjects factors: age group, gender, and priorexperience with device pairing.

Unless indicated otherwise, statistical significance will bereported at the 5% level (flagged as “*” or “significant”, or“<”“or >” in comparisons), and at the 1% level (flagged as“**” or “highly significant”, or “�” or “�” in comparisons).

5.1 Cross-correlation of usability measuresIt is a common assumption in HCI that usability measures

are typically not independent of each other, but rather cor-related. For instance, user satisfaction is assumed to benegatively correlated to some degree with task performancetimes. A broad meta-study by Frøjær et al. [8] challengesthis view though. The authors recommend that “unless do-main specific studies suggest otherwise, effectiveness, effi-ciency, and satisfaction should be considered independentaspects of usability and all be included in usability testing.”

We therefore performed linear cross-correlations of thefour usability measures. Table 1 shows the correlation coef-ficients and their statistical significance.

Task Perceivedperformance SUS security

timeSUS -0.383** -

Perceived Security -0.211** 0.512** -Task completion -0.248** 0.126 0.039

Table 1: Cross-Correlation of Usability Measures

In the Social Sciences, coefficients from -0.3 to -0.1 and0.1 to 0.3 are generally regarded as small, and coefficientsbetween -0.5 to -0.3 and 0.3 to 0.5 as medium [6]. In linewith the findings of [8], we cannot regard any of our usabil-ity measures as sufficiently correlated with others that theycould be justifiably omitted. On the other hand, since themeasures are lowly correlated, it does make sense to alsolook at them as a whole. In the following, we are going topresent an analysis of each usability measure individually,and thereafter perform a cluster analysis based on a princi-pal component analysis globally for all usability measures.

5.2 Individual Usability MeasuresA Repeated Measures Analysis of Variance was performed

for each usability measure except for task completion (whichis categorical). The between-subjects factors are age group,gender, and prior experience with device pairing, while thewithin-subjects factor is method. Pairwise (unpaired one-tailed) t-tests were performed between different levels of thebetween-subjects factors on each within-subjects measure,except for task completion that was subject to a Chi-square

Figure 2: Effects of method on the four usability measures

test instead. Pairwise (paired one-tailed) t-tests between dif-ferent methods were performed on each within-subjects mea-sure. Pairwise (one-tailed) McNemar’s Chi-squared testswere performed between different methods on task comple-tion.

5.2.1 OverviewFigure 2 shows the averages of these usability measures

for each method, normalized by the maximum average forthe respective measure which equals 100%. Coincidentally,PIN-Compare fares best and thus represents 100% along allusability measures. To improve the visibility of the three-dimensional bar chart and to consistently associate greaterheight with ”good”, the inverses of the task performancestimes are plotted in the front layer. Again to improve vis-ibility, the methods are sorted from left to right by inversetask performance time. This order should however not beinterpreted as an overall ranking (except for PIN-Comparethat tops all four measures).

A first impression from Figure 2 is that a few methodsrank equally along all four usability measures, whereas con-siderable differences along the different measures exist forall other methods. This suggests that a ranking of methodsshould consider all four usability measures rather than onlya single one. Section 5.3 will present such a ranking based ona Principal Component Analysis of all four measures. An-other observation from Figure 2 is that the task performancetimes of the different methods vary considerably (rangingfrom an average of 15.7 sec for PIN-Compare to 93.4 sec forVibrate-Press), while the other measures show a far lowervariability. This suggests that more attention should be paidto the task performance times than to the other measures,since it is a very distinguishing characteristic.

5.2.2 Effect of MethodRepeated measures analysis of variance revealed that meth-

od has a highly significant effect on task performance time,SUS score, and perceived security (p�0.001 in all cases).Below we will discuss the effects on these individual usabil-ity measures in more detail.

Task performance time: The following differences betweenthe means of the pairing methods were statisticallysignificant, including their transitive hulls (19 pairwisecomparisons were made to arrive at these results)8:

PIN-Compare � Image-Compare < Listen-LookPIN-Compare < Sentence-Compare < Over-AudioOver-Audio < Listen-Listen < LED-PressListen-Look � See-Believe < Video-CompareListen-Look < Listen-Listen < LED-PressVideo-Compare < LED-Press < Beep-PressVideo-Compare < Vibrate-Press

Particularly noteworthy is the high significance of thedifference between the means of two fastest methods:PIN-Compare and Sentence-Compare.

8In the case of multiple comparisons, researchers can as-sign an acceptable type I error α (false positives) either toeach individual comparison or jointly across all comparisons.Hochberg and Tamhane [12] advise that if“inferences are un-related in terms of their content or intended use (althoughthey may be statistically dependent), then they should betreated separately and not jointly.” This position has beenadopted in this study. To judge the results along a joint α,a significance level α/n can be chosen for each test, with nbeing the number of pairwise comparisons performed (Bon-ferroni correction).

Figure 3: Average task performance time by age group

SUS score: The following significant differences betweenmeans were observed, including their transitive hulls(22 pairwise comparisons were performed to arrive atthese results):

PIN-Compare > Sentence-Compare > Over-AudioSentence-Compare � Image-Compare > See-BelieveOver-Audio > Listen-Look > Listen-ListenListen-Look > LED-Press, Vibrate-PressSee-Believe > Video-Compare > Beep-Press

Perceived Security: The following groups of methods canbe distinguished, in decreasing order of perceived se-curity (40 pairwise comparisons were made to arriveat these results):

1. PIN-Compare

2. Sentence-Compare, Over-Audio, Listen-Look

3. See-Believe, LED-Press, Vibrate-Press, Beep-Press,Listen-Listen, Video

Differences in means between these groups were foundto be (highly) significant, while difference in meanswithin the groups were not.

Task completion: Task completion rates were generallyvery high, and no noteworthy statistically significantpairwise difference could be found.

5.2.3 Effect of AgeAs one would expect, the 18-25 year age group exhibited

the shortest task performance time (30.6 sec on average).Surprisingly, though, it was not the oldest age group butrather the middle age group that had the longest task per-formance time (26-40 years: 57.1 sec; 41 and above: 44.2sec).9 Figure 3 shows the task performance times of the9The difference between the young and the middle group issignificant at the p�0.001 level, and the difference betweenthe young and the old group significant at the p=0.019 level.

different pairing methods listed in the same sequence as inFigure 2 (in contrast to this prior figure though, the taskperformance times and not their inverses are plotted, andhence short height is “good”). The youngest age group canbe seen in the front plane, the oldest age group in the mid-dle plane, and the middle age group in the rear plane. Thetime differences between ages were particularly stark for thefollowing methods:

Beep-Press: means of age group 18-25: 39.4 sec; means ofpooled age groups 26-40 and 41 and above: 89.4 sec(the difference is highly significant).

Vibrate-Press: means of age group 18-25: 68.4 sec; meansof pooled age groups 26-40 and 41 and above: 107.7 sec(the difference is approaching significance, p=0.089).

LED-Press: means of pooled age groups 18-25 and 41 andabove:10 53.3 sec; means of age group 26-40: 96.9 sec(the difference is significant).

Different age groups also had somewhat different taskcompletion rates: 94.4% on average for age group 18-25,88.7% for the age group 26-40, and 86.3% for the age group41 and above. The difference between the young and theold age group is significant, and the difference between theyoung and middle age group approaches significance (p=0.074). Different age groups also exhibit differences with re-gard to perceived security. The security rating of the 26-40year olds (7.1 out of 10 on average) is significantly higherthan that of the 41 and above group (5.9 out of 10). The18-25 year olds perceived the security somewhat in between(6.3 out of 10).

10Note that this is a pool of all participants but for the middleage group.

Figure 4: Average task performance time by gender

5.2.4 Effect of GenderMales generally assigned higher SUS scores than females

(average for males = 76.5, average for females = 66.07, p <0.001), and also perceived the security of the pairing meth-ods higher than females (average for males = 7.0, averagefor females = 5.9, p < 0.001). Figure 4 shows that femaleswere generally also somewhat slower than males (average formales = 39.4 sec, average for females = 47.1 sec, p=0.07).The differences in time were particularly stark for Listen-Listen (average for males = 28.4 sec, average for females= 53.5 sec, significant) and for Vibrate-Press (average formales = 76.9 sec, average for females = 109.9 sec, not sig-nificant).

5.2.5 Effect of ExperienceFigure 5 shows the average task performance time per

method, split by subjects who had prior experience with de-vice pairing and those who had not. No overall significanteffect of experience could be observed. This can probably beregarded as an indicator that the pairing methods were easyenough and the experimental instructions effective enoughthat all subjects attained roughly the same skill level, inde-pendent of prior experience. The sizable task performancetime difference for LED-Press was not statistically signifi-cant due to enormous within-group variability.

5.3 Cluster AnalysisA cluster analysis based on principal components was per-

formed to determine methods that are closely related withregard on our usability measures. Table 2 lists the four prin-cipal components that explain 100% of the variance in thedata. The first component PC1 explains nearly 75% of thevariance, and the second component adds just 16.6% to this.

PC1 PC2 PC3 PC4Standard deviation 1.7292 0.8137 0.5307 0.2575

Proportion of Variance 0.7475 0.1655 0.0704 0.0166Cumulative Proportion 0.7475 0.9130 0.9834 1.0000

Table 2: Principle Components of Usability Mea-sures

Figure 5: Average task performance time by expe-rience

For all practical purposes, we may thus disregard PC2through PC4 since they contribute little. Table 3 shows thefactor loadings of PC1. Not surprisingly, task performancetime loads negatively while all other factors show a positiveloading.

xTask performance time -0.47

SUS 0.56Perceived security 0.53

Task completion 0.42

Table 3: Factor Loadings of PC1

Figure 6 shows the result of a cluster analysis on prin-cipal components. Three clusters with six, two and threeconnection methods, respectively, can be distinguished (analternative two-cluster solution would have merged clusters2 and 3, but it makes less sense conceptually). Since Com-ponent 2 can be largely disregarded, methods towards theleft side of Figure 6 can be regarded as “good” and methodstowards the right as “bad”.

5.4 Post-experimental ranking of easiest andhardest methods

As part of the exit questionnaire, subjects were asked torank-order the three easiest and the three hardest methodsin their view. The rank-order average across all subjects ona 1 (easiest) to 11 (hardest) scale can be seen in Table 4.

Easiest 1. PIN-Compare 1.82. Sentence-Compare 2.13. Over-Audio 2.9

↓ ......... ...9. Seeing-is-Believing 9.210. LED-Press 9.6

Hardest 11. Video 10.3

Table 4: Post-Experimental Ranking of Easiest andHardest Methods

Figure 6: Result of cluster analysis based on principal components

6. DISCUSSIONAs shown in Table 1 further above, several highly signifi-

cant correlations can be observed between our usability mea-sures (p<0.001). Particularly noteworthy is the medium-strong positive correlation between perceived security andSUS score. Few participants had a technical backgroundsufficient to objectively assess the security of different de-vice pairing methods. It seems that they partly relied ontheir usability rating of the methods instead.

We believe that Figure 6 is the clearest representation ofour study’s overall results. In it, the two methods in Clus-ter 2 (PIN- and Sentence-Compare) perform best overall,and the three methods in Cluster 3 (Over-Audio, Image-Compare and Listen-Look) come in as close second. How-ever, viewed in isolation, PIN-Compare stands out againstall others.

The main common feature of all methods in the two topclusters is that each requires a user comparison and a de-cision based on presented visual information, except Over-Audio which does not present any information to the userand is thus least taxing. In the remaining cases, the compar-ison is only between visual information, except for Listen-Look where it is between visual and audio information. Asexpected, the comparison of limited visual information (shortPINs in PIN-Compare) ranks higher in usability than meth-ods that require comparing more extensive visual informa-tion (Sentence- and Image-Compare11). In contrast, meth-ods in the lower-ranked Cluster 1 require users to performmanual actions (press buttons, take pictures or video clips)

11The images generated by the Image-Compare method arepatterns and do not contain recognizable objects.

or to listen to two successively spoken sentences, which isgenerally more taxing. Such added requirements seem tohave a negative effect on the usability measures in our study.

It is heartening that subjects’ post-experimental rankingof the easiest and hardiest methods (see Table 4) matchesexactly the ranking along the first principal component ofour usability measures (see Figure 6). Subjects reportedthat ease-of-use is by far the most important reason for themto favor a method. They praised methods which involved“noguesswork,” and they liked “comparisons that require littleeffort”. Only a few participants listed perceived security asa preference criterion (mostly in tandem with “ease”), andonly one person cited “fun”.12

One direct and practical consequence of our cluster anal-ysis is the following set of design guidelines, based on thecapabilities of the two devices involved:

1. If both devices have (even rudimentary) displays, theadvisable methods are, in order of preference: PIN-Compare, Sentence-Compare and Image-Compare.

2. If only one device has a display, and the other audiooutput, then Listen-Look is the best choice.

3. If neither device has a display, but one has audio out-put and the other audio input (microphone), then Over-Audio is recommended.

These guidelines can help manufacturers to implement meth-ods best-suited for specific pairs of devices. On a powerful

12This might be interpreted as indicating that tasks involvingsecurity should not be perceived as “fun”.

mobile device (e.g., a high-end cell-phone or a PDA) withrich I/O (and user) interfaces, the guidelines can also beused to determine the optimal pairing method based on thecapabilities of the other device.13

A practical consequence of our study of between-subjectsdifferences is the following set of population-specific guide-lines:

• Listen-Listen should be avoided, particularly for fe-male users, who took nearly twice as long to use thismethods compared with males.

• Beep-Press is suitable for the younger age group only(if at all), since the other age groups took more thantwice as long.

• LED-Press should be avoided, particularly for the mid-dle age group, since this group took nearly twice aslong as other age groups.

7. CONCLUSIONS AND FUTURE WORKThe study described in this paper sheds some much needed

light on usability factors of many secure device pairing meth-ods. Our experiment yielded numerous interesting results.In particular, the study clearly points to some methods thatshould be avoided altogether and several others (especially,those based on visual comparisons) that are well-suited formost users. It helps spot methods that are not well-suitedfor certain subgroups of the user population with regard toage, gender, and possibly also prior experience with devicepairing. It also helps identify methods best-suited for set-tings where one or both device(s) lack displays.

However, there remain a number of issues for future work,such as:

• Since each secure device pairing method aims to pro-tect the user(s) against MiTM attacks, a comparativeevaluation of all methods under such attacks needs tobe performed. Individual methods vary in terms offragility and specifics of applicable attacks.

• On a related note, it would be useful to investigate var-ious pairing methods in non-ideal settings, i.e., whenthe environment is not conducive to a specific method.Examples include performing visual comparisons withinsufficient light, or using the audio channel in thepresence of ambient noise.

• Our study was conducted with the population of healthy(physically unimpaired) adults. It would be valuableto perform similar studies with handicapped users, e.g.,vision- or hearing-impaired as well as those with lim-ited manual dexterity.

• All methods in our study were new to the participants.However, the effect of learning over time may signif-icantly change the security and usability results. Along-term study is needed to investigate this effect.

• Finally, our study only considered a situation whereone user pairs two devices. When two users need topair their respective devices, the setting changes anda separate effort must be made to evaluate usabilityfactors of various methods.

13To do so, either the user would have to be involved, ordevices would exchange their respective capabilities over thewireless channel.

8. ACKNOWLEDGMENTSThis research was supported by NSF grant CNS-0831526.

We thank Stacey Arnold and Nazia Chorwadwala for theirhelp in conducting usability experiments, and Sameer Patilfor his assistance in evaluating the results. Finally, we arevery grateful to the participants who made this study pos-sible.

9. REFERENCES[1] D. Balfanz, G. Durfee, R. Grinter, D. Smetters, and

P. Stewart. Network-in-a-Box: how to set up a securewireless network in under a minute. In USENIXSecurity, pages 207–222, 2004.

[2] D. Balfanz, D. Smetters, P. Stewart, and H. Wong.Talking to strangers: Authentication in ad-hocwireless networks. In Network and Distributed SystemSecurity Symposium (NDSS), 2002.

[3] A. Bangor, P. T. Kortum, and J. T. Miller. Anempirical evaluation of the system usability scale.International Journal of Human-ComputerInteraction, 24(6):574–594, 2008. DOI10.1080/10447310802205776.

[4] V. Boyko, P. MacKenzie, and S. Patel. Provablysecure password-authenticated key exchange usingdiffie-hellman. In Advances in Cryptology-Eurocrypt,pages 156–171. Springer, 2000.

[5] J. Brooke. SUS: a “quick and dirty” usability scale. InP. W. Jordan, B. Thomas, B. A. Weerdmeester, andA. L. McClelland, editors, Usability Evaluation inIndustry. Taylor and Francis, London, 1996.

[6] J. Cohen, P. Cohen, S. G. West, and L. S. Aiken.Applied multiple regression/correlation analysis for thebehavioral sciences. Lawrence Erlbaum Associates,Hillsdale, NJ, 1983.

[7] C. M. Ellison and S. Dohrmann. Public-key supportfor group collaboration. ACM Transactions onInformation and System Security (TISSEC),6(4):547–565, 2003.

[8] E. Frøkjær, M. Hertzum, and K. Hornbæk. Measuringusability: are effectiveness, efficiency, and satisfactionreally correlated? In CHI ’00: Proceedings of theSIGCHI conference on Human factors in computingsystems, pages 345–352, 2000.

[9] C. Gehrmann, C. J. Mitchell, and K. Nyberg. Manualauthentication for wireless devices. RSA CryptoBytes,7(1):29–37, 2004.

[10] I. Goldberg. Visual key fingerprint code.http://www.cs.berkeley.edu/iang/visprint.c,1996.

[11] M. T. Goodrich, M. Sirivianos, J. Solis, G. Tsudik,and E. Uzun. Loud and clear: Human-verifiableauthentication based on audio. In ICDCS ’06:Proceedings of the 26th IEEE International Conferenceon Distributed Computing Systems, page 10, 2006.

[12] Y. Hochberg and A. C. Tamhane. MultipleComparison Procedures. Wiley, New York, 1987.

[13] L. Holmquist, F. Mattern, B. Schiele, P. Alahuhta,M. Beigl, and H. Gellersen. Smart-its friends: Atechnique for users to easily establish connectionsbetween smart artefacts. In Ubiquitous Computing(UbiComp), pages 116–122, London, UK, 2001.Springer-Verlag.

[14] R. Kainda, I. Flechais, and A. W. Roscoe. Usabilityand security of out-of-band channels in secure devicepairing protocols. In 2009 Symposium On UsablePrivacy and Security (SOUPS), Mountain View, CA(this volume), 2009.

[15] T. Kindberg and K. Zhang. Validating and securingspontaneous associations between wireless devices. InInformation Security Conference (ISC), pages 44–53,2003.

[16] K. Kostiainen. Personal Communication, Mar 2008.

[17] K. Kostiainen and E. Uzun. Framework forcomparative usability testing of distributedapplications. In Security User Studies: Methodologiesand Best Practices Workshop, 2007.

[18] A. Kumar, N. Saxena, G. Tsudik, and E. Uzun.Caveat Emptor: A Comparative Study of SecureDevice Pairing Methods. In IEEE InternationalConference on Pervasive Computing andCommunications (IEEE PerCom’09), 2009.

[19] S. Laur and K. Nyberg. Efficient mutual dataauthentication using manually authenticated strings.In International Conference on Cryptology andNetwork Security (CANS), volume 4301, pages90–107, 2006.

[20] R. Mayrhofer and H. Gellersen. Shake well before use:Authentication based on accelerometer data. InPervasive Computing (PERVASIVE), pages 144–161.

[21] R. Mayrhofer and M. Welch. A human-verifiableauthentication protocol using visible laser light. InInternational Conference on Availability, Reliabilityand Security (ARES), pages 1143–1148, 2007.

[22] J. McCune, A. Perrig, and M. Reiter.Seeing-Is-Believing: using camera phones forhuman-verifiable authentication. In Proceedings of the2005 IEEE Symposium on Security and Privacy, pages110–124, 2005.

[23] J. M. McCune. Personal Communication, Mar 2008.

[24] S. Pasini and S. Vaudenay. SAS-Based AuthenticatedKey Agreement. In Public key cryptography-PKC2006: 9th International Conference on Theory AndPractice in Public-Key Cryptography, pages 395–409,2006.

[25] A. Perrig and D. Song. Hash visualization: a newtechnique to improve real-world security. InInternational Workshop on Cryptographic Techniquesand E-Commerce, 1999.

[26] R. Prasad and N. Saxena. Efficient device pairingusing ”human-comparable” synchronized audiovisualpatterns. In Conference on Applied Cryptography andNetwork Security (ACNS), pages 328–345, 2008.

[27] N. Saxena, J. Ekberg, K. Kostiainen, and N. Asokan.Secure device pairing based on a visual channel. In2006 IEEE Symposium on Security and Privacy, pages306–313, 2006.

[28] N. Saxena and M. B. Uddin. Automated devicepairing for asymmetric pairing scenarios. InInformation and Communications Security (ICICS),pages 311–327, 2008.

[29] C. Soriente, G. Tsudik, and E. Uzun. BEDA:button-enabled device association. In UbiCompWorkshop Proceedings: International Workshop onSecurity for Spontaneous Interaction (IWSSI), 2007.

[30] C. Soriente, G. Tsudik, and E. Uzun. HAPADEP:human-assisted pure audio device pairing. InInformation Security, pages 385–400, 2008.

[31] F. Stajano and R. J. Anderson. The resurrectingduckling: Security issues for ad-hoc wireless networks.In Security Protocols Workshop, 1999.

[32] J. Suomalainen, J. Valkonen, and N. Asokan. Securityassociations in personal networks: A comparativeanalysis. In F. Stajano, C. Meadows, S. Capkun, andT. Moore, editors, Security and Privacy in Ad-hoc andSensor Networks Workshop (ESAS), pages 43–57,2007.

[33] E. Uzun, K. Karvonen, and N. Asokan. Usabilityanalysis of secure pairing methods. In FinancialCryptography and Data Security (FC’07) & UsableSecurity (USEC’07), pages 307–324, 2007.

[34] S. Vaudenay. Secure communications over insecurechannels based on short authenticated strings. InAdvances in Cryptology-CRYPTO, pages 309–326,2005.

Serial Hook-ups: A Comparative Usability Study of Secure Device...

Documents

Transcript of Serial Hook-ups: A Comparative Usability Study of Secure Device...