
“Can Mobile Diaries Accurately Capture Consumer Behavior? A Large-scale Test on TV

Viewing and Empirically-based Guidelines” © 2015 Mitchell J. Lovett and Renana Peres;

Report Summary © 2015 Marketing Science Institute

MSI working papers are distributed for the benefit of MSI corporate and academic members

and the general public. Reports are not to be reproduced or published in any form or by any

means, electronic or mechanical, without written permission.

Marketing Science Institute Working Paper Series 2015

Report No. 15-125

Can Mobile Diaries Accurately Capture Consumer

Behavior? A Large-scale Test on TV Viewing and

Empirically-based Guidelines

Mitchell J. Lovett and Renana Peres

Report Summary

Mobile consumer diaries can capture many interesting behaviors not captured in passive data

measurement. They are increasingly used in psychology, geography, medicine, and

commercial marketing. However, scholarly quantitative research in marketing tends to focus on

passive data measurements rather than on self-reports. To become a standard tool, mobile diary

methods require more research on their accuracy, as well as guidance on how to design more-

accurate consumer mobile diaries.

Here, in a large-scale mobile diary study, Mitchell Lovett and Renana Peres evaluate the

accuracy of mobile diary studies and provide empirically-based design guidance. They use

mobile diary data collected by the Council for Research Excellence (CRE), the Keller Fay

Group, and Nielsen. Over a three-week period, 1,702 U.S. TV viewers reported on viewing and

communications related to prime-time TV shows. A subsample of 151 respondents was passively

monitored using Nielsen’s People Meter. Their data were used to evaluate the accuracy of the

diary reports.

Findings

Overall, respondent compliance was high, with 92% of the diarists completing three weeks of

reporting. Respondents tended not to generate false reports, but they failed to report

some viewings. Specifically, 65% of People Meter records were reported in the diary and 93% of

diary reports had a matching People Meter record. Mobile diary reports were highly correlated

(.90) with aggregate ratings.

Among other findings:

Long viewings (3 minutes or longer) had a higher recall rate (up to 80%).

Reporting showed a pulsing pattern: individuals either did not report at all on a given day or reported accurately.

Alarms increased the recall rate, but generated a small increase in false entries.

Respondents’ compliance level was high, with only a slight decrease in accuracy after

participation incentives ended.

Non-smartphone owners were more accurate, but otherwise exhibited minimal biases in

activity.

Marketing implications

Mobile diaries can complement passive data measurement such as Nielsen’s People Meter Panel;

diaries capture out-of-home viewing and viewing on non-metered devices (associated with sports

programs, day-time periods, younger viewers, and viewing with other people).

In addition, mobile diaries can be a powerful tool to capture behavioral and process variables that

cannot be monitored through passive data measurement: consumption through multiple channels,

lower stages in the hierarchy of effects, exposures and perceptions, and experiences in the

moment.


Mitchell J. Lovett is Associate Professor of Marketing, Simon Business School, University of

Rochester. Renana Peres is Professor of Marketing, School of Business Administration, Hebrew

University of Jerusalem.

Acknowledgments

We thank all those who helped to collect the dataset. Our industry collaborators include Ed

Keller, Brad Fay, and Ben Schneider from the Keller Fay Group, Beth Rockwood and Richard

Zackon from the Council for Research Excellence, and Jessica Hogue and David Chester from

Nielsen. We thank Peter Fader from the Wharton School for creating the contact with the CRE.

We gratefully thank our research assistants at the Hebrew University: Aliza Busbib, Yoav

Haimi, Dana Leikehmacher, Sria Louis, and Haneen Matar for the hours and hours of their lives.

We thank Garrett Johnson, Jacob Goldenberg, Eitan Muller, and Christine Pierce for their

insights on an earlier draft of the paper.

This study was also supported by the Marketing Science Institute, The Israel Internet

Association, Kmart International Center for Marketing and Retailing at the Hebrew University of

Jerusalem, and the Israel Science Foundation.


Introduction

Diaries recorded on smartphones are growing as a mode of research in a variety of

domains including marketing, psychology, geography, health, and medicine. Such diaries, hereafter referred to as "mobile diaries," have been extensively used in health research, including studies on

physical exercise (Heinonen et al 2012), sexual encounters (Hensel et al 2012) and alcohol

consumption (Collins, Kashdan, and Gollnisch 2003), and in research on family dynamics

(Rönkä et al 2010), mood (Matthews et al 2008) and mental symptoms such as anxiety or stress

(Proudfoot et al 2010). In marketing practice, mobile diaries are increasingly used: research companies have developed mobile diary practices and some companies now specialize in mobile diary research (e.g., OnDevice Research). Scholarly marketing research,

however, has thus far used mobile diaries sparingly and primarily for qualitative research

(Patterson 2005; Elliott and Elliott 2003). Quantitative researchers are only now starting to

consider using mobile diaries to collect data (Cooke and Zubcsek 2015).

This lack of adoption by marketing scholars might not be surprising in the current world

where large-scale databases collected via passive measurement are widely available. Passive data

collection is less intrusive, and produces more objective data that is believed to be more accurate

(e.g., Einav, Leibtag, and Nevo 2010). Though many scholarly marketing studies in the past relied on pen-and-paper diaries (Wind and Lerner 1979; Kahn, Kalwani, and Morrison 1986; see Sudman and Ferber 2011 for a review), quantitative research shifted away from diaries and self-reports in general as technology enabled more passive measurement.

However, we suggest that mobile diaries represent a large, untapped potential for

research in marketing. Many phenomena in marketing cannot be captured using passive data

measurement, and, even for those that can, passive measurements may provide too little


information to identify the underlying processes. Mobile diaries have the potential to fill this data

gap. For example, while the recent explosion in word-of-mouth (WOM) research has focused on where the data are (online WOM), Keller and Fay (2012) estimate that over 85% of conversations

about brands occur offline where passive measurement is non-existent. Mobile diaries can

provide data on such offline WOM activity. Other currently "hot" topics that can benefit from

data collected via mobile diaries are product usage (Lin and Chang 2012), social interactions

(Chen, Wang and Xie 2011), the use of subjective expectations to evaluate dynamic trade-offs

(Khan, Chu, and Kalra 2011), attitudes, and experiences in the moment. In each of these

domains, mobile diaries have the potential to complement passive measurement to provide a

richer understanding of the process and context of decisions.

For mobile diaries to provide value as a research tool in quantitative marketing, they need

to be able to attain an acceptable level of accuracy. As with any self-report, the accuracy of

mobile diaries might suffer from various issues including forgetfulness (McKenzie 1983),

subjective retrospective interpretation of events (recall bias), and compliance issues (Toh and Hu 2009; Bolger, Davis, and Rafaeli 2003; Green et al 2006). Prior to the advent of mobile diaries,

electronic collection of diary data (e.g., personal digital assistants (PDAs)) already had the

potential to alleviate some of these concerns through the ability to signal or alert respondents,

obtain time-stamps, and check and ensure response completeness during entry (Bolger, Davis,

and Rafaeli 2003). Indeed, research documents that PDA diaries have higher compliance rates

than traditional diaries (Stone et al. 2002). However, despite this higher compliance, in three

studies across varied contexts, Green et al. (2006) find that paper diaries and electronic diaries

collected via PDAs and beepers were equivalent in data quality.


Mobile diaries (via smartphones) have the potential to provide more accurate data than

collection via PDAs. In contrast to PDAs, which never served a meaningful role in the lives of a

majority of the population, smartphones have a 71% penetration rate (Nielsen 2014a) and are deeply embedded in their owners' lives (Smith 2015), with 80% of owners having their smartphones with them 22 hours a day (Stadd 2013). This constant availability and attention that smartphones

receive could translate into high levels of diary compliance and accuracy. Yet, to the best of our

knowledge, the accuracy of mobile diaries has not been evaluated. The first goal of this paper is

to provide an estimate of the level of accuracy for a large-scale mobile diary study.

The second goal of this paper is to provide empirical evidence on the relationship

between key mobile diary design decisions and the accuracy of the resulting data. Research has

also not yet provided empirically-based guidance on the design of mobile diary studies.

Currently, guidance on designing mobile diaries is based on logical argument, rather than

empirical evidence (e.g., Bolger, Davis, & Rafaeli 2003; Shiffman, Stone, and Hufford 2008;

Reis, Gable, and Maniaci 2014). Because mobile diaries are a new methodology, understanding

how compliance and accuracy vary by the research design can help to inform appropriate use of

this tool in marketing.

To achieve our research goals we sought data on a sample of individuals that both

completed a mobile diary study and were simultaneously monitored via a widely-accepted

measurement device that can serve as a benchmark. The data we use come from a large-scale,

mobile diary study that was a collaboration with the Council for Research Excellence (CRE), a

consortium of research professionals from TV networks and media and research agencies who

support research on audience measurement issues, the Keller Fay Group, a marketing research

firm, and Nielsen, the primary organization that provides TV ratings. Although the primary focus


of the study was on the drivers of TV viewing, the study also contained a “research-on-research”

component that allowed evaluation of the mobile diary methodology itself. To be clear, this

context is attractive not because we in any way wish to suggest mobile diaries should replace

People Meters, but rather because of the availability of the widely-accepted benchmark on which

to base our accuracy measurements.

The study was fielded over a six-week period during the opening of the fall television

season of 2013. The sample contains 1702 U.S. TV viewers, ages 15-54, who reported on

viewing and communications related to prime time TV shows. In a rare opportunity for an

academic study, our mobile diary sample includes 151 individuals who have a Nielsen People

Meter (NPM) installed in their home (they are part of the Nielsen Convergence Panel, which we

describe in more detail below). From this group we obtain over 420,000 People Meter log

entries. Similar to the evaluation of paper and pencil diaries (e.g., McKenzie 1983, Lee, Hu, and

Toh 2000), we compare these People Meter entries to their diary reports.

We find that respondents were compliant with the procedure: 92% completed the

incentivized 3-week period. Comparing their reports to the individual-level People Meter

records, we find that Precision (the percent of viewing reports that correspond to actual viewing) is 92.7%, indicating low levels of false positive reports (diary reports with no matching People Meter records), and Recall (the percent of metered viewing that is captured by diary entries) is 64.7%. Many of the false positive reports can be attributed to out-of-home

viewing, delayed viewing, and other non-metered viewing (e.g., Hulu, network website, or non-

TV viewing). Many of the false negative reports can be attributed to non-alarmed periods and to

short viewings of less than 15 minutes. We also evaluate the accuracy by comparing aggregate

Nielsen National People Meter ratings to the weighted sample percentage of viewing from the


mobile diaries. We find a correlation of 0.90 and that 85% of the ratings predicted by the mobile

diary study fall within one rating point of the metered Nielsen ratings. Overall, both the individual-level and the aggregate accuracy results paint quite a positive picture of the accuracy of mobile diary data.

However, we also document two issues that can reduce accuracy and are important to

address in any mobile diary study. First, mobile diary reporting exhibits a "pulsing" pattern

across days where people often report on most of their viewing, but sometimes do not report at

all in a day. We describe the potential measurement problems, and provide an example of how to

test for it. Second, Recall increases with the length of viewing: for viewings of less than four minutes Recall is below 30%, rising to 77% for 24-minute viewings. Hence, short activities may need special care in order to capture them accurately with mobile diaries.

We also examined the data for guidance on alarming, study length, incentives, and the

need to include non-smartphone owners. Because alarm timing was randomly assigned, we used

this experimental variation to identify that alarms increase the likelihood of reporting by 16

percentage points with only a small increase in false positives. However, alarms alone may not

be sufficient for many settings, as even for viewing 33.7% of responses are not alarmed. We

recommend a combination of both relatively frequent alarms and an allowance for self-initiated

entries.

How long can mobile diaries be? Our mobile diary study is long and demanding for

respondents, yet reporting levels and the levels of Recall (False Positive Rate) are consistently

high (low) for the three incentivized weeks. Further, 39% of respondents voluntarily completed

at least one additional week, levels of Recall decline only slightly after incentives end, and

activity levels remained consistent with passive measurement. This lack of fatigue is notable


since sweeps-week paper diaries last only one week. These results suggest that longer diary studies are

feasible, but incentives appear important to maintain participation.

Do mobile diary studies need to recruit non-smartphone owners? In our study,

respondents from this group were given a smartphone to complete the diary. Although these

individuals differ demographically (belong more to the youngest and oldest age groups, and

contain relatively fewer African-Americans), their activity and reporting levels are similar to

smartphone owners, and their accuracy level is slightly higher. As a result, we find that accounting for non-smartphone owners is not critical to measurement in our setting.

Because the context of our study is television viewing, this study also contributes to the

practice of measuring TV audiences. TV ratings are a central measurement in marketing practice,

forming the basis of more than $60 billion in annual television advertising (eMarketer 2014).

The national measurement system largely relies on People Meters and has faced a number of

critical challenges and debate (Boyer 1987; Carter 1990; Milavsky 1992; Danaher and Beed

1993; Napoli 2005; Carter and Steel 2014). Viewing is now shifting to multiple devices, out-of-

home, and on-the-go (Ericsson 2013). Out-of-home viewing is important to ratings (Nielsen 2009); 72% of TV viewers watch videos on a mobile device at least weekly, and 42% do so out of the home (Ericsson 2013). This shift leaves the traditional People Meter measurement, which is

connected to a TV at home, unable to capture these important behaviors. By contrast, mobile

diaries almost always stay with the owner, suggesting their potential to capture these new

viewing behaviors. Indeed, we demonstrate that mobile diaries can capture these non-metered

behaviors. This ability to capture behaviors that metered measurements miss is indicative of the

potential of mobile diaries to augment passively collected data in quantitative marketing

research.


The rest of the paper is organized as follows. First, we present the study methodology and

discuss the key design decisions. Second, we evaluate the quality of the mobile diary data by examining reporting activity and by comparing the accuracy of diary reports to Nielsen People Meter data. Finally, we discuss the implications of these results for designing mobile diary studies.


Designing the Mobile Diary Study

As mentioned above, our data come from a CRE-supported industry collaboration whose

primary purpose was to evaluate drivers of television viewing. Our industry partners at Nielsen and Keller Fay managed the programming and data collection implementation. Although we were able to make recommendations regarding the study design, we did not have full control over the study design or direct involvement in data collection.

The study period covers the beginning of the fall broadcast season running from

September to November of 2013. The mobile diary study focused on primetime TV and aimed to

capture both viewing and communications about these programs. We focus on the viewing diary

entries. Below we discuss the design. We focus on four design decisions that take on greater

importance for gaining accuracy in mobile diary studies--the sample, the diary entry task, the

monitoring design, and the study duration.

Decision: The Sample of Diarists

Samples for mobile diaries are, in most respects, ideally constructed in the same manner as other diary studies, where the main issue is ensuring a sufficient sample for low-response groups (see

Toh and Hu 2009). We use a quota sample with 1,702 respondents that for all aggregate

calculations is weighted to be representative of the U.S. population age 15-54. We use standard

approaches here, which we discuss in Appendix A.

Two issues specific to mobile diaries relate to smartphones. First, the majority of

participants (1,386, 81.4%) used their personal smartphones as the platform for their mobile

diary. However, some participants were recruited who did not own a smartphone (316, 18.6%) in

order to ensure some representation of non-smartphone users. These participants were provided


with a smartphone for the duration of the study. We discuss in Appendix A how this sample

differs from the main sample in terms of demographics, and, in the results section, we address

how this sample differs in terms of activity, reporting, and accuracy. The second issue is the

mobile phone operating system. The app was available for panelists using iOS (58.7%) or

Android (41.3%), but quotas were not set based on the operating system. We did not find

significant differences in demographic, viewing, or reporting patterns between these two groups.

Decision: The Diary Entry Task

Each diary entry began via a "home screen" similar to the one depicted in the left panel of Figure

1. If respondents indicated that they had watched or communicated about a prime time TV show in the past hour, they were led through a series of questions about that experience. The relevant questions for

a viewing entry are presented in schematic form in Figure 2. If respondents completed an entry,

they were prompted to complete another if they had watched or communicated about multiple

programs in the past hour. (Figures follow References throughout.)

An important aspect of diary surveys is keeping a low burden for completing diary

entries. In question 4 of Figure 2, which is illustrated in the right panel of Figure 1, we seek to

obtain a precise indicator of the program the respondents viewed so that we can link our diary

data to other data. This linking is challenging because thousands of programs are available,

Tribune program lists available prior to the airings are imperfect, respondents may not have

knowledge of the exact show name or spelling, and respondents might become frustrated if they

cannot find a desired program. We designed the program identification question so that

respondents would choose the program they viewed from a dynamic "look-up" list where they

entered the first few letters and show options appeared. The list included 2755 shows. The list

was designed to include the most viewed shows on national TV, as well as new shows for the


season. If the show was not on the list, the respondent had the option to type it manually. These

manual entries were then recoded into the complete show listing.
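As an illustration of this kind of dynamic look-up, the sketch below filters a show list by a typed prefix. The function and the example titles are hypothetical; they are not taken from the study's actual 2,755-show list or the app's code.

```python
# Minimal sketch of a prefix-based show look-up (hypothetical; not the app's code).
SHOW_LIST = ["The Big Bang Theory", "The Blacklist", "The Voice", "Sunday Night Football"]

def lookup_shows(prefix, shows=SHOW_LIST, limit=10):
    """Return up to `limit` show titles starting with the typed prefix,
    or None so the app can fall back to manual entry."""
    p = prefix.strip().lower()
    matches = [s for s in shows if s.lower().startswith(p)]
    return matches[:limit] if matches else None

print(lookup_shows("the b"))  # ['The Big Bang Theory', 'The Blacklist']
print(lookup_shows("zzz"))    # None -> respondent types the show name manually
```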

Decision: Time-based and Event-based Monitoring Designs

The next decision relates to the way entries are initiated. Two different designs are common:

diarists can be prompted according to a schedule (signal-contingent design) or the time of entries

can be selected by the diarists with the intention of matching the activity of interest (event-

contingent design) (Reis, Gable, and Maniaci 2014). Those authors suggest the signal-contingent

design is best for establishing relative frequencies of events, whereas the event-contingent design

is better for relatively rare activities. In our case, we were trying to monitor two different

activities: communications and viewing. Based on prior research, communications were expected

to be relatively rare and sporadic, whereas viewing was expected to be more predictable and

common. The concern was to balance prompting to ensure some coverage of the day for

communications and non-live viewing, while still obtaining sufficient information about prime

time without being overly taxing.

As a result, we used a combination of signal-contingent alarms and event-contingent self-

initiated reports. The signal-contingent component included 3 daily alarms during prime time

(20:00-23:00 Eastern).1 The time of the first prime time alarm was randomly set to 20:30 or

21:00 (EST), and the following two alarms were in intervals of one hour. In addition,

respondents were prompted randomly twice during non-prime time hours – one alarm at a

random time between 8:00-13:30, and one at a random time between 14:30 and 19:30. Alarms

were random both within and between participants. If a participant did not respond to an alarm, there were no reminders, and he/she was prompted as scheduled for the next alarm. The respondents were given the option to indicate "going to sleep," which stopped alarms until the next random alarm after 08:00 the next day.

1 Most shows are broadcast in dual feed mode, where a 20:00 show is broadcast at 20:00 in both Eastern and Pacific time zones, but 19:00 in Central and Mountain. Therefore, primetime starts at 20:00 in Eastern and Pacific, and 19:00 in Central and Mountain.
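To make the alarm protocol concrete, the following sketch draws one day's alarm schedule under the rules just described; the minute-level granularity of the random draws is our assumption, not a detail reported in the study.

```python
import random
from datetime import datetime, timedelta

def draw_daily_alarms(day, rng=random):
    """Sketch of one day's alarm schedule (times in Eastern)."""
    def at(hour, minute=0):
        return datetime(day.year, day.month, day.day, hour, minute)

    # Two non-prime-time alarms at random times in the stated windows.
    morning = at(8) + timedelta(minutes=rng.randint(0, 330))         # 8:00-13:30
    afternoon = at(14, 30) + timedelta(minutes=rng.randint(0, 300))  # 14:30-19:30

    # First prime-time alarm at 20:30 or 21:00, then two more at one-hour intervals.
    first = at(20, 30) if rng.random() < 0.5 else at(21)
    prime = [first + timedelta(hours=k) for k in range(3)]

    return sorted([morning, afternoon] + prime)

print(draw_daily_alarms(datetime(2013, 9, 30)))
```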

The app also allowed so-called event-contingent entries; that is, diarists were encouraged

to self-initiate responses in the app at any time they were viewing (or communicating about) TV

but not prompted. A self-initiated report did not cancel the subsequent alarm, so alarm timing

was independent of the self-initiated reports. In this way, we balance the need for a signal-

contingent design to obtain the frequency of viewing activities and the need for an event-

contingent design to obtain the less regularly occurring viewing and communications events.

Although we allow self-initiated entries at any time, we focus our analysis on the prime time

period. In the analysis below, we also leverage our random alarming and self-initiated design to

investigate reporting accuracy under different alarm conditions.

Decision: Study Duration and Incentives

The appropriate study length and incentives depend heavily on the population studied, but since

our study focused on a general population, our experience and choices could be informative to

others. We designed the incentive to last for the first three weeks of each panelist's diary period.

Diarists received a monetary incentive of $50, which required a minimum of 14 days of

participation during the first three weeks of their diary period. This was considered an aggressive

target by the field team and raised concerns about fatigue and attrition. For comparison, a

standard paper and pencil viewing diary during sweeps weeks lasts only 1 week, but involves

tracking more time periods per day.


We also investigated how willing respondents were to participate in even longer studies.

After three weeks, the respondents were instructed that they had completed the study for

payment, but that they could continue to participate with no payment for up to 3 additional

weeks. If the participant did not remove the app from their smart phone, alarms would continue,

and many respondents continued to complete diary entries. In the next section we examine how

the diary tenure of respondents relates to activity, reporting, and attrition.


Evaluating the Quality of Mobile Diary Data

Based on this design, we obtained 173,035 diary reports. We define “report” to include every

diary entry. A “viewing” report is defined as any report in which option 1 was selected in Q1

(see Figure 2 above). Aggregating to the respondent day, we have 42,380 days in which

respondents completed at least one report (which we call respondent reporting days). In line with

the literature on mobile diaries from medical research (Hensel et al 2012), respondents

demonstrated high reporting activity with 1559 (92%) panelists completing at least 21 days in the

diary. Hence, attrition was not a major concern here (Toh and Hu 2009).

Table 1 provides summary statistics on reporting activity. The average lifetime (number

of days from first report to last report) of a respondent is 27.7 days, on which he/she reported on

almost 90% of the days (24.9 days) on average. The average number of reports per respondent-

day is 4.1. Our focus in this paper is on the viewing reports. On average, 1.7 out of the 4.1 (41%)

reports (per respondent reporting day) were viewing reports, and not all respondent reporting

days include a viewing report. On average, a respondent had 24.9 days in which he/she generated at least one diary entry, and only 17.8 days with viewing reports. For the rest of the section, we

evaluate the quality of the mobile diary data, focusing on just the viewing reports. (Tables follow

References throughout.)

Individual-level People Meter Data for Benchmarking

We evaluate the quality of the mobile diary data by comparing the respondents’ self-

reported viewing to their viewing records from Nielsen's People Meter. For these analyses, we

use the sample of 151 individuals from the Nielsen Convergence Panel who have People Meters

installed in their home and who also completed the mobile diary study.


Convergence Panel households are recruited from those exiting the Nielsen National

People Meter panel who provide passive viewing data for Nielsen ratings. When they are forced

to exit the National People Meter panel after two years, some panelists are recruited to become

members of the Convergence Panel. In this capacity, they continue to be monitored following the

standard procedure for National NPM panel members. Hence, the Convergence Panel

households are experienced with the People Meter procedure and considered by Nielsen as

highly cooperative (Nielsen 2015). However, their data are no longer included in the rating

calculations, and are instead used for testing purposes. In a rare opportunity for academic studies, we

were able to obtain access to this panel for our study.

For this group we have both their diary reports as well as People Meter data for the same

period as the diary study. We term this group the “metered-diary” group. Over the survey

period, the metered-diary group generated 3,927 mobile diary viewing reports, and over 420,000

People Meter records for 28,000 airings.2 Appendix A describes how this sample differs from the

main sample. We use this data for most of the analysis contained in this section. Except where

noted, we focus our analysis on prime-time viewing hours during the first 3 diary weeks when

participants were incentivized (i.e., the study focus).

In the following subsections, we use this individual-level data to address six design-

related issues and offer empirically-based guidelines. We end this section by demonstrating the

level of aggregate accuracy of our mobile diary study by comparing it to aggregate TV ratings

provided by Nielsen.

2 Matching the program and channel name of the people meter (which is based on internal Nielsen coding) to the

viewer-familiar name used in the mobile diary was a major challenge. The match combined automated matching

with a large-scale manual matching process.


Reporting Exhibits “Random” Pulsing

We observe a non-uniform reporting pattern, which we term "pulsing." Figure 3 presents

three panelists as examples of pulsing. For the first 21 days of each respondent's diary, the figure shows the number of daily prime time diary viewing reports (number of boxes) along with the extent of prime time viewing according to the People Meter (vertical lines, with longer lines indicating more viewing in prime time). Although the diary reports clearly relate to the

metered behavior, the diary reports appear to have a "pulsing" pattern where reports tend to

cluster together more than actual activity. Of the days when the People Meter identifies that a person had at least one prime time viewing, 66.9% also have at least one diary entry about viewing during prime time. For comparison, 79.9% of days with diary entries about prime time

viewing (henceforth, active diary days) have People Meter viewing during prime time. This

difference, we suggest, in part, arises from pulsing.

To measure pulsing we modify the entropy measure of Zhang, Bradlow and Small

(2013), which they call "clumpiness," to allow comparison between individuals and between the

two datasets. Specifically, we normalize the entropy measure by dividing it by the logarithm of

the number of viewings. We find that the clumpiness for the diary data is 10.9, whereas that of

the People Meter data is 6.6 (the difference is significant with t=3.87 and p-value<.001). Hence,

metered viewings are more uniformly spread than diary reporting.

How problematic is pulsing for capturing activities? First, we find that 28% of total prime

time viewing minutes (as measured by the People Meter) is on days without a diary viewing

entry about prime time, and only 22% is on days without any viewing diary entries. So most

viewing is occurring when people are "pulsed" on. Second, according to the People Meter, the

average minutes of prime time viewing on days without a viewing diary entry (5.7 minutes) is


much lower than with one (22.0 minutes). Third, the percent of days with at least 15 minutes of prime time viewing is much smaller among days without diary reports than among days with diary reports (15.6% vs. 56.6%). Hence, in our setting the mobile diary reports are active during the time periods

when most viewing (according to the People Meter) occurs.

Hence, although pulsing may not be too severe an issue in our setting, other settings

could face greater problems. Pulsing generates missingness in intervals at the individual-level,

and such missingness could lead to bias in individual-level and correlational analyses if not

handled properly. As practical advice, prior to conducting the study, the researcher can (a)

pretest for pulsing behavior to identify whether it exists for the specific study context, (b) pretest

for non-random arrival (e.g., checking for day of week effects and relationship with types of

activity), and (c) adjust design elements in order to reduce it (e.g., emphasis in instructions,

alternative alarms, check-ins, etc.).
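As one example of pretest (b), the sketch below cross-tabulates day of week against whether a respondent reported at all that day and computes a Pearson chi-squared statistic; a large value (relative to a chi-squared distribution with 6 degrees of freedom) would suggest non-reporting is not arriving at random. The data and function names are illustrative, not the study's own code.

```python
import datetime as dt
import numpy as np

def day_of_week_nonreporting_chi2(dates, reported):
    """Cross-tabulate weekday (0-6) against reported-that-day (0/1) and
    return the Pearson chi-squared statistic (6 degrees of freedom)."""
    table = np.zeros((7, 2))
    for d, r in zip(dates, reported):
        table[d.weekday(), int(bool(r))] += 1.0
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = expected > 0
    return (((table - expected) ** 2)[mask] / expected[mask]).sum()

# Illustrative 21-day reporting pattern for one respondent.
days = [dt.date(2013, 9, 30) + dt.timedelta(days=i) for i in range(21)]
reported = [1, 1, 0, 1, 1, 1, 0] * 3
print(day_of_week_nonreporting_chi2(days, reported))
```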

After the study is complete, if non-reporting is concentrated in a few individuals or times,

those individuals or times can be considered for dropping. That said, we found that identifying

pulsing without knowledge of the passive measurement was very difficult in our setting. We

found little systematic variation (e.g., it does not correlate significantly with day-of-week or

week of diary effects), and it is not overly concentrated in a few individuals suggesting it could

be a more universal aspect of compliance in mobile diaries. Hence, if pulsing is more severe in a

given study, it may not be easy to identify heavy pulsing times or individuals in order to reduce

the problem. Further, respondents in our study were alarmed multiple times every day, so the alerting scheme we used on the smartphone was not sufficient to avoid the observed pulsing in the reporting behaviors. Though our data do not allow a closer examination of pulsing, future


research on mobile diary methods should examine the causes of and potential ways to reduce

pulsing.

Individual-Level Accuracy Measures Show 93% Precision and 65% Recall

We now provide a more detailed examination of reporting accuracy. We present the

accuracy at the level of a respondent-day for the half-hours in the prime time period (8-8:30,

8:30-9:00, etc. up to 10:30-11:00 pm Eastern). We count any viewing with a People Meter record

(even of 1 minute) in that half-hour period as a People Meter viewing, and mobile diary entries

are assumed to last for the length of the corresponding People Meter viewing if a match is found; otherwise, the entry is assumed to start in the half-hour prior and last for the length of the telecast.

We present the accuracy measures building on a 2x2 contingency matrix as presented in

Figure 4. Assuming the People Meter data represent the true condition (with two caveats

explained below), the numbers in the four cells of the matrix (from the top left clockwise) are the

True Positive (A), False Negative (B), False Positive (C), and True Negative (D). Based on these

numbers and following ROC terminology (Fawcett 2006, Powers 2011), we present the

following measures:

The % Recall (% Reporting) is the percent of all half-hours viewed according to the

People Meter that have a matching diary viewing report (A/(A+B) = 64.7%).

The % False Positive Rate (C/(C+D) = 7.7%) is the percent of all half-hours with no

viewing according to the People Meter that have a diary viewing report.

The % False Omission Rate (B/(B+D) = 36.6%) is the percent of all half-hours with no

viewing according to the diary that NPM indicates are viewing.

The % Precision (A/(A+C) = 92.7%) is the percent of all half-hours with viewing

according to the diary that NPM indicates are viewing.
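For concreteness, a minimal sketch of how the four measures follow from the cell counts A-D of Figure 4. The counts in the example are illustrative placeholders that roughly reproduce the reported percentages; they are not the study's actual cells.

```python
def accuracy_measures(A, B, C, D):
    """Treating the People Meter as truth: A = true positives, B = false negatives,
    C = false positives, D = true negatives (respondent-day half-hour units)."""
    return {
        "recall":              A / (A + B),  # metered viewing half-hours that were reported
        "false_positive_rate": C / (C + D),  # non-viewing half-hours with a viewing report
        "false_omission_rate": B / (B + D),  # non-reported half-hours that were viewed
        "precision":           A / (A + C),  # reported half-hours with metered viewing
    }

# Illustrative counts only.
print(accuracy_measures(A=650, B=355, C=51, D=615))
```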


The contingency matrix was calculated on active diary days (i.e., a day when the

respondent had at least one diary report). We augment the People Meter data to identify true

positives. Diary entries without corresponding People Meter records are considered as true

positives if (1) they are reported as live viewing and when evaluated against the Tribune program

listings they match a live airing (to account for the fact that out-of-home viewing is not captured

by the People Meter) or (2) they are reported as viewed, but in a way not measured by the People

Meter (e.g., not on TV). Appendix B provides more detail on the measurement and considers

alternative approaches to identifying true positives.

As depicted in Figure 4 the Precision (92.7%) and False Positive Rate (7.7%) indicate

high levels of accuracy. To put these numbers in context, the People Meter is estimated to have an 8-10% error rate due to the person ID entry (Sharot 1991, Danaher and Beed 1993). Hence, the

magnitude of these errors is similar to the expected People Meter error (and in fact these errors

could entirely result from errors in the People Meter).

The False Omission Rate is 36.6% and Recall is 64.7%. Previewing the results in the

remainder of this section, we find two main reasons that Recall appears to be lower than

Precision: (1) short viewings are not reported and (2) our alarming scheme. First, many viewings

are quite short and these short viewings are counted in Cell B of Figure 4. Of the half-hours in

Cell B, 33.3% have no more than 15 minutes of viewing, whereas in Cell A only 13.0% do. In

the section below that examines length of viewing, we find that respondents are unlikely to

report these short viewings and that viewings of 24 minutes have 77% Recall, twelve percentage

points higher than the average. Second, as we show below, the half-hours with alarms have

almost 16 percentage points higher Recall. As a result, our design of alarming every other half-

hour leads to lower Recall for the non-alarmed half-hours.


We now investigate whether the errors are concentrated in a few individuals. In Figure 5,

we present the individual-level Recall (x’s) and False Positive Rate (circles) in descending order

by respondent. For Recall the distribution is wide, but not overly concentrated on either end, with

relatively few individuals having very low Recall. For the False Positive Rate, most of the distribution has no false positive diary entries, and no individual has a very large rate, with a maximum around 30%. This suggests that although individuals vary, the errors are not overly concentrated in a

small set of individuals. Further, the correlation between Recall and False Positive Rate is 0.045,

suggesting no meaningful correlation in who makes these two errors.

In the subsections that follow, we analyze how the various design elements relate to the level of accuracy in order to provide direction on how to further improve accuracy. The remainder of this

section considers how reporting is related to viewing duration, alarms, diary tenure, and non-

smartphone owners. We conclude this section by evaluating the diary accuracy in predicting

aggregate ratings.

Longer Viewing Increases Recall

Consistent with prior research (Deng and Mela 2014), the NPM data contain many

instances of individuals sampling programs for a short period of time: 27% of the viewings are

three minutes or less in length, and 43% are less than 1/4 of the telecast length. We expect

respondents will be less likely to report such short viewings. Figure 6 presents the percent

reporting by minutes of activity duration within the half-hour. The expected positive relationship

is evident in the lowess-smoothed curve (solid line). Half-hours with less than four minutes of

viewing have below 30% Recall, and Recall increases rapidly until ten minutes of viewing, when it slows as it reaches 60% reporting. The highest point estimate of Recall is for half-hours with

24 minutes of viewing, which have a 77% Recall.


The under-reporting of shorter activities leads to systematic reporting biases. Although this is less concerning for audience measurement, since shorter viewings naturally should have proportionately less weight, for some activities, such as word-of-mouth conversations, the length of the activity might be unrelated to its influence and importance to the measures.

Alarms Improve Recall With Only a Small Increase in False Positives

We used a combination of time-based (alarms) and event-based (self-initiation) designs.

On average, panelists responded to 46% of the alarms, and 33.7% of the total reports were self-

initiated. Alarms should prompt respondents to enter a diary report if they are watching, have just watched, or plan to watch when the alarm arrives, but alarms could also potentially generate false entries (Adams et al 2005). We test whether alarms have an impact on our

individual-level accuracy measures using the randomization of the alarms to provide

experimental variation in the alarm “treatment” vs. no alarm “control” conditions.3 Table 2

tabulates the accuracy measures broken out by alarmed and non-alarmed conditions.

The results indicate a statistically significant (via chi-squared tests) improvement in

Recall (16.1%), a significant worsening of the False Positive Rate (8.8%) and Precision

(6.6%), but no significant change in the False Omission Rate. Importantly, the increase in Recall

is much larger than the increase in the False Positive Rate or the decrease in Precision. Overall,

these results suggest that the design decision to alarm participants trades off the percent

reporting with the likelihood of false positives, but that, at the frequency of alarms in this study,

more is gained from alarming than lost. Hence, it appears that more alarms lead to more accurate data without sacrificing too much in terms of the False Positive Rate and Precision.

3 Because the alarm periods were randomized and not separately identified in the data, we can only identify which

half-hour the person was alarmed by later diary reports that were alarmed. Therefore, the analysis could be done

only for active days.


These results suggest that studies should alarm at time intervals when the activity is most likely to occur and should be relatively aggressive in the frequency of alarms. Future research could

evaluate how many more alarms are feasible without leading to overall study fatigue and a

significantly higher False Positive Rate.

Until Incentives End, Recall and Precision Levels Are High and Flat

A major issue in diary studies relates to how the amount and accuracy of reporting vary by how

long the diarist is participating (i.e., fatigue). Recall that, to receive full incentive payment,

participants had to have at least 14 days between the first and last report. Since the diary app was

not automatically removed after 21 days, most (61.5%) of the panelists continued to report: 656

(39%) completed the 4th week (28 days), 415 (24%) continued to complete the 5th week, and 234 (14%) completed 6 full weeks of reporting. This high level of voluntary continuation is

surprising given the high demands on respondents. However, even these high voluntary

participation rates reduce the sample size and potentially generate attrition bias (Winer 1983).

We first study the quantity and accuracy of reporting over time and then turn to whether

any differences could be attributed to attrition. First, we check whether the weekly reporting

quantity of respondents is aligned with the weekly amount of viewing as measured in the People

Meter. Figure 7 presents the percent of total programs viewed (bars with vertical black lines for

+/- 2 standard errors) and viewing reports per respondent (solid line) by tenure in the diary study.

The two measures track very closely over time and are statistically indistinguishable. Hence,

total viewing activity and reporting are closely linked, and the correspondence does not diminish

with tenure or incentivized vs. non-incentivized periods. These reporting quantity results suggest

mobile diaries longer than 3 weeks are feasible.


Second, we check how accuracy, namely Recall and False Positive Rate, vary over the

respondents’ diary tenure. Figure 8 presents the average percent Recall and False Positive Rate

by the week of the respondent in the diary. The False Positive Rate is flat with no significant

week-to-week differences. Recall levels are flat while the incentive is in place, and the week-to-week differences within the first three weeks are not significant. However, the Recall point estimates decrease from week 3 to week 4 (from 67% to 60%, chi-squared=11.9, p-value<.001) after the

incentives end. No other week-to-week differences in Recall are statistically significant,

suggesting the decrease is due to the incentives rather than fatigue. Overall, the average Recall is

64.7% for the first 3 weeks versus 59.8% for the second three weeks, and this difference is statistically significant (chi-squared=11.9 again, p-value<.001). By contrast, the False Positive

Rates for the first 3 weeks (7.7%) and the latter 3 weeks (7.6%) are not statistically different.

Hence, the modest decrease in Recall after incentives end does not also bring a worsening in

terms of increased false positives.

We now turn to whether the lower recall in the post-incentive period is likely to be due to

a selection of who voluntarily continues. We evaluate whether the accuracy for those that

continue differs from those that do not during the incentive period. We find the difference is

small (1.3%) and insignificant. However, unsurprisingly, the individuals who continue did

exhibit significantly higher activity levels (44.8 vs. 31.2 half-hours of viewing, t-stat for

difference is 3.54). Demographically, in Appendix A, we show that women and the two youngest

age groups are significantly more likely to voluntarily continue after the incentive period. Hence,

selection exists in voluntary participation periods, but that selection can't explain the decrease in

reporting (volunteers report more, not less, than non-volunteers) and it doesn't cause a worsening

in accuracy or compliance.


To summarize, these results suggest that the long diary (3 weeks) did not induce fatigue

in respondent reporting levels, but that incentives are important to keep (representative)

respondents in the sample, and to a lesser degree to maintain higher reporting accuracy levels.

Hence, we find that mobile diaries of up to six weeks for a regular activity like TV viewing

appear to be feasible while maintaining a consistent level of accuracy.

Non-Smartphone Owners Report More Accurately

The last design-related issue we discuss is related to smartphone ownership. Because smartphone

ownership is non-random and owners make up a large, but far from complete proportion of the

population, obtaining a representative sample could require including respondents who do not

own smartphones. As noted in Appendix A, compared to the owners, the non-owners have a higher percentage of men, belong more to the youngest (15-17) and oldest (45-54) age groups, and have a lower percentage of ethnic minorities.

We examine how this sub-population differs in terms of viewing and reporting activity levels as well as the accuracy of the reporting. Non-smartphone owners appear to report

(26.4 vs. 26.2 half-hours with viewing reports during the first three weeks) and view (36.7 vs.

37.7 half-hours of viewing during the first three weeks) approximately the same amount as

smartphone owners, and neither difference is statistically significant. Hence, activity and

reporting levels do not appear to be related to smartphone ownership. Interestingly, we find that

the non-smartphone owners are significantly more accurate in their reporting on all four

dimensions, as indicated in Table 3 below.

To summarize, although the non-smartphone population appears to differ in terms of

demographics, they do not differ in terms of activity and reporting levels, and are slightly more

accurate. These results suggest that smartphone ownership is a non-issue for our study. However,


we recommend caution in generalizing these findings about smartphone ownership because this study covers only a single type of activity (television viewing) and a nationally representative sample. Although our results suggest it is possible that including non-smartphone owners may

not be necessary for representativeness of activity levels, the type of activity could affect the

relative influence of smartphone ownership on activity levels. A pre-test could evaluate the

potential size of this issue for other domains.

Mobile Diary Ratings Very Accurately Match Nielsen Telecast Ratings

The above analysis evaluates the individual-level diary data against individual-level

People Meter data for the metered diary group (a subset of the mobile diarists). We now evaluate

how aggregates calculated based on our full mobile diary sample (1702 respondents) correlate

with aggregate NPM ratings. For this comparison, we obtain from Nielsen the (aggregate)

National NPM TV ratings for 15-54 year olds (the same population as our mobile diary sample)

for the top 200 programs (as determined by our mobile diary study). We consider only original

telecasts that overlap with our mobile diary study period. We use the Nielsen "MC US AA %"

measure, which is the most current national average audience measure (Nielsen 2014b). Nielsen

calculates these ratings by taking the weighted minutes of viewing by the National People Meter

panel members and dividing by the weighted total sample (the “weighted intab”). While in the

individual-level analysis we included all the People Meter viewings, whether live or delayed, for

this analysis we focus on the core measure used in TV ratings, the “live plus same day” Nielsen

rating measure. This measure captures both live viewing and delayed viewing by DVR that

occurs on the same day as the telecast.

From our mobile diary we calculate the aggregate viewing percentage for each relevant

telecast. To match with these Nielsen ratings, we consider only entries on the same day as the


program telecast that were self-reported as live or DVR TV viewing, and not as an older episode.

We use the sample weights, $w_i$, provided by the survey provider, which are designed to produce a

demographically representative sample (see Appendix A). These weights were calculated to be

the ratio between the sample and the quotas, in order to correct for the small discrepancies in the

sample. The quotas were constructed from population values or best estimates of those values.

The percent of viewing (i.e., ratings) based on the mobile diary, $s_j^m$, is a weighted average of the viewing indicators, $v_{i,j}$: $s_j^m = \sum_i v_{i,j} w_i / \sum_i w_i$.
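A minimal sketch of this weighted-average calculation (the array values are illustrative, not the study's data):

```python
import numpy as np

def diary_rating(viewed, weights):
    """s_j^m: weighted share of respondents reporting viewing of telecast j,
    where `viewed` holds the 0/1 indicators v_{i,j} and `weights` the w_i."""
    viewed = np.asarray(viewed, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (viewed * weights).sum() / weights.sum()

# Three of five (weighted) respondents report viewing telecast j.
print(diary_rating([1, 0, 1, 0, 1], [0.8, 1.2, 1.0, 0.9, 1.1]))  # 0.58
```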

We focus our discussion on telecasts that aired when our mobile diary sample was

relatively large (n>1660). For this sample, we have 243 telecasts over 10 days. The simple

correlation between this data and the NPM ratings is 0.90. This high correlation suggests we can

recover the basic pattern of viewing well. The high level of accuracy is robust to changes in the

required daily sample size and to including non-prime time telecasts of the top 200 shows.4

Further, we find, even after introducing both program fixed effects and date effects, the mobile

diary estimate of percent viewing has a significant relationship with the Nielsen ratings (p-value

<0.001), suggesting the mobile diary data can capture not only cross-sectional, but also within-

program time variation.

We previously demonstrated that diary reports miss some viewings, with lower recall for shorter viewings. This would suggest that the diary reports would understate the People Meter

viewing. However, we do not obtain self-reports of the viewing length for diary entries. We

instead assume the viewing is the full program length, an upper bound on the potential length of

the viewing. As a result, the diary could over or understate the total viewing minutes.

4 For example, if we include airings when the mobile diary sample size is 1000 (n=446 over 22 days), the correlation is

0.87. Similarly, if we include both prime and non-prime time telecasts (n=364 over 10 days), the correlation is 0.86.


To directly compare the mobile diary viewing percent, $s_j^m$, with NPM ratings, $s_j^N$, we need to allow a scaling adjustment for this under/over-reporting. To do so, we run a regression of $s_j^N = \beta s_j^m + \epsilon_j$ to get the optimal homogeneous weighting. We find that $\beta = 0.62$ for our sample.
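A minimal sketch of this no-intercept scaling regression and the within-one-rating-point check (illustrative data, not the study's ratings):

```python
import numpy as np

def fit_scaling(diary_ratings, npm_ratings):
    """Fit s^N = beta * s^m + e with no intercept and report the share of
    scaled diary ratings within 1 rating point of the NPM ratings."""
    x = np.asarray(diary_ratings, dtype=float)
    y = np.asarray(npm_ratings, dtype=float)
    beta = (x @ y) / (x @ x)                  # closed-form slope through the origin
    within_1pt = np.mean(np.abs(beta * x - y) <= 1.0)
    return beta, within_1pt

beta, share = fit_scaling([2.0, 4.0, 6.0], [1.3, 2.4, 3.8])  # illustrative ratings
print(round(beta, 2), share)  # 0.62 1.0
```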

Using this weighting, a full 85% of the mobile diary ratings are within 1 rating point of the NPM

measure and 72% are within 0.5 rating points. Like the correlation reported above, the accuracy

level and weighting are quite robust to alternative required sample sizes. Again, this suggests

accuracy is quite high, since one could likely improve these estimates by using a more

complicated scaling model with heterogeneous $\beta$ weights, for example based on the program

length (e.g., 30 vs. 60 minutes) or type (e.g., sporting events, episodic programs). Figure 9

presents the plot of the NPM ratings vs. these ratings based on the mobile diary data. Overall, the

comparison suggests that mobile diaries can quite accurately match metered data.


Contributing to Television Audience Measurement

Like most metered measurements of behavior, the People Meter doesn't perfectly capture

viewing. People Meter errors can arise from at least three causes: (a) errors in entering the person

ID when watching (Sharot 1991 and Danaher and Beed 1993), (b) viewing in unmetered ways

including on laptops, tablets, smartphones, and on TV via Hulu and other network or streaming

apps, or (c) viewing out of the home. In this section, we shed light on the extent of viewing

behaviors that People Meters currently miss and mobile diaries capture. In the process, we

demonstrate that mobile diaries can complement observational data in valuable ways.

For this analysis, we classify all mobile diary viewing reports by whether the viewing was via metered TV or not, and whether the TV viewing has a matching People Meter record. Because respondents were

asked to report on their viewing in the past hour, we considered as a match any diary entry with a

People Meter record within 1.5 hours after and 1 hour before the diary entry and a matching

program name. Importantly, this analysis differs from that of the previous section in three ways:

(1) the unit of analysis is a diary entry, not a half-hour, (2) the time window is not restricted to

prime time, and (3) TV listings are not used to refine the accuracy of live viewing diary entries.
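A minimal sketch of the matching rule described above, under our literal reading that a People Meter record may fall from one hour before to 1.5 hours after the diary entry; the field names are illustrative, not the study's code.

```python
from datetime import datetime, timedelta

def is_match(diary_entry, meter_record):
    """Match if program names agree and the meter record's time lies within
    1 hour before to 1.5 hours after the diary entry's time."""
    if diary_entry["program"] != meter_record["program"]:
        return False
    gap = meter_record["time"] - diary_entry["time"]
    return -timedelta(hours=1) <= gap <= timedelta(hours=1.5)

entry = {"program": "The Voice", "time": datetime(2013, 10, 1, 21, 15)}
record = {"program": "The Voice", "time": datetime(2013, 10, 1, 20, 30)}
print(is_match(entry, record))  # True: the record is 45 minutes before the entry
```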

The results aggregated for the metered-diary group (i.e., those with both mobile diary and People Meter data) are presented in Table 4. As Table 4 indicates, 67% of the mobile diary reports have a matching People Meter record (Category 1), indicating that the diary entry was an accurate report of metered viewing on TV. Approximately 5% of diary viewing reports are self-reported as being non-metered, including viewing through an app or on a non-TV device (Category 2). Consistent with this self-report, only 1.4% of these reports have a corresponding People Meter record (i.e., on TV). Also consistent with diary viewing not on metered TV, Category 2 reports are significantly less likely to be watched with someone else (26% less) or during prime time (24% less). The remaining entries (28%) are viewing entries that are self-reported as on TV but have no matching People Meter record (Category 3).

We argue that Category 3 entries are likely to be out-of-home or on-the-go viewing. In

2014, Nielsen found that ratings increased by 7% to 9% after accounting for out-of-home

viewing and that the lift was largest for daytime programming and sports programming (Nielsen

2014c). An earlier study by Nielsen (2009) also found that out-of-home viewing was higher for

daytime and sports programming, and that the impact of out-of-home viewing was higher for

weekend programming and among younger persons compared to older persons.

We find that Category 3 viewing is consistent with these patterns for out-of-home viewing. Category 3 is significantly higher than Category 1 for daytime programming (15 percentage points higher, t = 11.7, 11% base rate), sports programming (12 percentage points, t = 11.2, 7% base rate), and weekend viewing (10 percentage points higher, t = 6.35, 21% base rate). To examine the age relationship, we regress the percent of non-matching (Category 3) reports per person on gender (male), ethnicity (non-white), age (in years), and activity (number of diary viewing reports). Only age and activity are significant, with coefficients -0.006 (std. err. = 0.002, p < .05) and -0.002 (std. err. = 0.001, p < .05), respectively. Although the R-squared is relatively low (0.10), the qualitative finding is consistent with Nielsen (2009). Further, we find that Category 3 is more likely to be viewed with others (6 percentage points higher, t = 3.47) and more likely to be self-initiated (7 percentage points higher, t = 4.66), both of which appear consistent with out-of-home viewing. Overall, these results suggest that out-of-home viewing plays a meaningful role in the make-up of our Category 3 diary entries.
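A minimal sketch of this person-level regression under hypothetical column names (the actual variable construction and estimation code are not from the paper):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical person-level data; column names are illustrative only.
rng = np.random.default_rng(1)
n = 151
df = pd.DataFrame({
    "pct_category3": rng.uniform(0, 1, n),   # share of non-matching (Category 3) reports
    "male": rng.integers(0, 2, n),
    "non_white": rng.integers(0, 2, n),
    "age": rng.integers(15, 55, n),
    "n_viewing_reports": rng.integers(0, 100, n),
})

X = sm.add_constant(df[["male", "non_white", "age", "n_viewing_reports"]])
ols = sm.OLS(df["pct_category3"], X).fit()
print(ols.summary())  # coefficients, standard errors, and R-squared
```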

Taken together and combined with our finding of 92.7% Precision, we conclude that the mobile diary can be useful for capturing viewing both on unmetered devices and out of the home, two types of viewing that People Meter measurement misses and that are increasingly important to measuring viewing behaviors today. We are not arguing that mobile diaries can or should replace People Meters, but rather that they can be used to gain information about behaviors that People Meters cannot capture.


Discussion

This paper aims to be the first to provide an evaluation of the accuracy of mobile diaries and empirically-based recommendations for how to design mobile diary studies in marketing. We carried out a 21-day mobile diary study covering TV viewing with a representative sample of 1,702 respondents. Despite high demands on respondents, compliance was high throughout both the incentivized first three weeks and the following voluntary three weeks. Comparing self-reports for a subset of 151 respondents to their individual-level Nielsen People Meter data, we find that Recall is 64.7% and Precision is 92.7%. Comparing the self-reports to the overall ratings of the shows, we find a high correlation of 0.90. Together, our findings indicate a high level of accuracy. In the previous section, we demonstrated that mobile diaries can capture activities that are not measured by the People Meter, such as viewing on non-metered devices and out-of-home viewing. Hence, our findings indicate that mobile diaries can be a reliable source of new data for future marketing research that can augment passive measurement. Our study also provides empirically-based guidance on how to conduct future mobile diary studies. Table 5 summarizes our main findings and the implied guidelines.

Although we evaluated accuracy against metered viewing, we do not recommend replacing passive measurement. The power of mobile diary studies is to provide insights on behaviors that are hard to capture using passive measures. We demonstrated this in the context of TV viewing, where some types of viewing cannot be captured by the People Meter. More broadly, mobile diaries can be used as a standalone data source or in combination with other data sources such as People Meter, scanner, web browsing, or location-tracking data. Specifically, we think mobile diary data can benefit studies that aim to (1) capture a spectrum of influences/behaviors on an individual; (2) focus on earlier stages in the hierarchy of effects; (3) focus on exposures (the customer perception) rather than resource allocation by the firm; (4) focus on describing process rather than simply outcomes; or (5) understand the deeper context of decisions and behaviors. Here are some specific examples:

1. The relative importance of marketing communication mix elements on purchase – Mobile diaries enable a customer to report exposures to different elements of the marketing communication mix: social interactions, advertising, PR, in-store promotions, and events. Such data can serve as an input for evaluating the effect of communications on consumer attitudes and behaviors of interest, including purchase, and these data are generally not available at the individual level.

2. Processes in the personal social network – The structure and information flow of the

personal social network of an individual are important to the formation of beliefs and

attitudes as well as to purchase. Mobile diaries can document this structure, flow, and

interactions. Of special importance are the offline interactions, which are not captured in

available online social network data.

3. Determining the choice set – In many purchases, the exact choice set faced by the customer depends both on the context and on decisions of the consumer that are not observed. Mobile diary entries can provide such information. For example, using the phone camera, customers can take photos of the store shelves they are looking at, so that the exact brands, prices, and shelf space become available.

4. Measuring brand encounters – Attitudes and brand choices are formed through brand encounters. Using a mobile diary, respondents can report exposure to brands not only in advertising but also through interactions with people and displays. As importantly, brand usage and experiences can be tracked, including the location, the actual experience (e.g., wait time), the perceived experience, and the context (e.g., a picture of the complete meal or restaurant at the time of ordering).

5. Measuring brand attitudes – On-the-go recording of emotions and attitudes towards

brands using the mobile handset (as a report or through recording) could provide a richer

and more reliable measure than brand perception questionnaires used today.


6. Purchase and use of services – A considerable share of individuals' purchases is for services (e.g., movies, financial services, restaurants). Mobile diaries can complement scanner data to provide a fuller picture of purchases through self-reports and receipt scanning via the phone camera. The mobile diary can also add information such as the time of purchase, location, social setting, and weather that can further enrich data on the context of the decision.

We have demonstrated that mobile diaries can be accurate and can augment existing metered measures. However, our empirical analysis also revealed two potential issues, pulsing and short activities, where mobile diaries may have limitations. Below we offer a brief summary and recommendations:

1. Pulsing – Pulsing can limit the usefulness of individual-level data. While pulsing in our

context of TV viewing appears to be random and not too severe, studies in other contexts

that plan to use individual-level data need to evaluate for the presence of pulsing and

whether it is systematic. The presence of pulsing could lead to selection biases.

2. Short behaviors – Our results indicate that short viewing behaviors are not captured well by the mobile diary. Particularly for studies focused on short activities, pre-testing the accuracy of reporting can be important. One might consider enhancing the collection of such events by using observational methods via the smartphone to predict likely times when an activity is occurring and to prompt the individual to confirm whether an event is taking place. For instance, if tracking consumer commuting activities, one could detect when the smartphone leaves the home and alarm the individual to complete a diary entry, as in the sketch below.
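For example, a minimal sketch of such a trigger, assuming a hypothetical stream of location fixes and a placeholder prompt function (neither is part of the study's application):

```python
import math

HOME = (40.75, -73.99)        # hypothetical home coordinates (lat, lon)
RADIUS_KM = 0.5               # hypothetical "left home" threshold

def distance_km(a, b):
    """Rough equirectangular distance; adequate for short distances."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371 * math.hypot(x, y)

def send_diary_prompt():
    # Placeholder for a push notification or in-app alarm.
    print("Prompt: are you commuting right now? Please complete a diary entry.")

def monitor(location_stream):
    """Fire one prompt the first time the device moves outside the home radius."""
    away = False
    for loc in location_stream:
        outside = distance_km(loc, HOME) > RADIUS_KM
        if outside and not away:
            send_diary_prompt()
        away = outside

# Example: a short, made-up trace of (lat, lon) fixes.
monitor([(40.7501, -73.9901), (40.7560, -73.9820), (40.7620, -73.9700)])
```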

Limitations and future research

This paper presents a first assessment of the accuracy of mobile diaries in a marketing context as well as empirically-based guidance on how to design mobile diaries. While it uses large-scale data, it has several shortcomings that should be addressed in further research. First, the study was designed in the context of TV viewing, and some findings might be limited to this context. Future studies can help to more confidently generalize these findings to other domains. Second, the study is largely descriptive. Although we used randomized assignment to assess the effect of alarms, we did not conduct experiments to isolate the other design factors. Third, some aspects such as incentives or duration were not varied at all. In these cases, our positive findings serve as an existence proof, and based on them we speculate, for instance, that longer studies are feasible as long as incentives are maintained throughout. Fourth, we believe the pulsing phenomenon in particular needs further investigation. We identify the potential issue of pulsing and find it present, but not too severe, in our data. Solving this issue could greatly increase the value of mobile diaries for individual-level data collection. Hence, future research could evaluate the ubiquity, causes, and effects of pulsing.


References

Adams, Swann Arp, Charles E. Matthews, Cara B. Ebbeling, Charity G. Moore, Joan E. Cunningham, Jeanette Fulton, and James R. Hebert (2005), "The effect of social desirability and social approval on self-reports of physical activity," American Journal of Epidemiology, 161 (4), 389-398.

Bolger, Niall, Angelina Davis, and Eshkol Rafaeli (2003), "Diary methods: Capturing life as it is lived," Annual Review of Psychology, 54 (1), 579-616.

Boyer, Peter J (1987), "TV turning to people meters to find who watches what". The New

York Times, (June 1).

Broderick, Joan E. (2008), "Electronic Diaries," Pharmaceutical Medicine, 22 (2), 69-74.

Carter, Bill (1990) "The media business: Television; are there fewer viewers? Networks

challenge Nielsen", The New York Times (April 30).

———, and Emily Steel (2014), "TV ratings by Nielsen had errors for months," The New York Times (October 10).

Chen, Yubo, Qi Wang, and Jinhong Xie (2011) "Online Social Interactions: A Natural

Experiment on Word of Mouth Versus Observational Learning", Journal of Marketing

Research, 48 (2), 238-254.

Collins, R. Lorraine, Todd B. Kashdan, and Gernot Gollnisch (2003), "The feasibility of using cellular phones to collect ecological momentary assessment data: Application to alcohol consumption," Experimental and Clinical Psychopharmacology, 11 (1), 73.

Cooke, Alan D. and Peter P. Zubcsek (2015), “The Promise and Peril of Behavioral Research

on Mobile Devices” Working paper, University of Florida.

Danaher, Peter J., and Terence W. Beed (1993), "A coincidental survey of people meter

panelists: comparing what people say with what they do." Journal of Advertising

Research, 33 (1), 86-92.


Deng, Yiting and Carl F. Mela (2014), "A Household Level Model of Television Viewing

with Implications for Advertising Targeting." Working paper.

Elliott, Richard and Nick Jankel-Elliott (2003), "Using ethnography in strategic consumer research," Qualitative Market Research: An International Journal, 6 (4), 215-223.

eMarketer, (2014), "US TV Ad Market Still Growing More than Digital Video. " (accessed

July 27, 2015),[available at http://www.emarketer.com/Article/US-TV-Ad-Market-Still-

Growing-More-than-Digital-Video/1010923].

Ephron, Erwin (1997), "Forum: how to curb TV’s sweeps ratings game: buyers can ease the problem without foisting costs on stations," Advertising Age (February 3).

Ericsson (2013), "TV and Media – Identifying the needs of tomorrow's video consumers," Company Report, Ericsson ConsumerLab.

Fawcett, Tom (2006), "An introduction to ROC analysis," Pattern Recognition Letters, 27 (8), 861-874.

Heinonen, Reetta, Riitta Luoto, Pirjo Lindfors, and Clas-Håkan Nygård (2012), "Usability

and feasibility of mobile phone diaries in an experimental physical exercise

study", Telemedicine and e-Health 18 (2), 115-119.

Hensel, Devon J., James D. Fortenberry, Jaroslaw Harezlak, and Dorothy Craig (2012), "The feasibility of cell phone based electronic diaries for STI/HIV research," BMC Medical Research Methodology, 12 (75), 1-12.

Kahn, Barbara E., Manohar U. Kalwani and Donald G. Morrison (1986), "Measuring

Variety-Seeking and Reinforcement Behaviors Using Panel Data", Journal of Marketing

Research, 23 (2), 89-100.

Keller, Ed, and Brad Fay (2012), The Face-To-Face Book, New York: Free Press.

Lin, Ying-Ching, and Chiu-chi Angela Chang (2012), "Double Standard: The Role of

Environmental Consciousness in Green Product Usage", Journal of Marketing, 76 (5), 125-

134.


Lee, Eunkyu, Michael Y. Hu and Rex S. Toh (2000), "Are Consumer Survey Results

Distorted? Systematic Impact of Behavioral Frequency and Duration on Survey Response

Errors", Journal of Marketing Research, 37 (1), 125-133.

Lovett, Mitchell J., Renana Peres, and Ron Shachar (2013), "On Brands and Word of Mouth," Journal of Marketing Research, 50 (4), 427-444.

Matthews, Mark, Gavin Doherty, John Sharry, and Carol Fitzpatrick (2008). "Mobile phone

mood charting for adolescents." British Journal of Guidance & Counselling, 36 (2), 113-129.

McKenzie, John (1983), "The accuracy of telephone call data collected by diary methods," Journal of Marketing Research, 20 (4), 417-427.

Milavsky, J. Ronald (1992), "How good is the AC Nielsen people-meter system? A review of the report by the committee on nationwide television audience measurement," Public Opinion Quarterly, 56 (Spring), 102-115.

Napoli, Philip M. (2005) "Audience measurement and media policy: Audience economics,

the diversity principle, and the local people meter." Communication Law and Policy, 10 (4)

349-382.

Nielsen (2009), "A Close Look at Out-Of-Home Viewing," Company Report.

——— (2014a), "Mobile Millennials: Over 85% of Generation Y Owns Smartphones,"

Company Report (September 5).

——— (2014b), "National TV Toolbox User Guide Version 7.1," Company Report, revised

08/24/2014.

——— (2014c), "Nielsen Measures 7-9% Ratings Lift From Out-Of-Home TV Test in

Chicago," (accessed July 27, 2015), [available at http://www.nielsen.com/us/en/press-

room/2014/nielsen-measures-7-9-percent-ratings-lift-from-out-of-home-tv-test-in-

chicago.html].

Nielsen (2015) An interview of the authors with Christine Pierce, SVP of Data Science at

Nielsen, October 2, 2015, 3:30-4:15.


Nonis, Sarath A., Melodie J. Philhours, and Gail I. Hudson (2006), "Where does the time go? A diary approach to business and marketing students’ time use," Journal of Marketing Education, 28 (2), 121-134.

Powers, David M.W. (2011), "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation," International Journal of Machine Learning Technology, 2 (1), 37-63.

Patterson, Anthony (2005) "Processes, relationships, settings, products and consumers: the

case for qualitative diary research" Qualitative Market Research: An International Journal, 8

(2), 142-156.

Proudfoot, Judith, Gordon Parker, Dusan Hadzi Pavlovic, Vijaya Manicavasagar, Einat Adler,

and Alexis Whitton (2010). "Community attitudes to the appropriation of mobile phones for

monitoring and managing depression, anxiety, and stress." Journal of medical Internet

research, 12 (5).

Reis, Harry T., Shelly L. Gable, and Michael R. Maniaci (2014), "Methods for studying everyday experience in its natural context," in Handbook of Research Methods in Social and Personality Psychology, Harry T. Reis and Charles M. Judd, eds. New York: Cambridge University Press, 373.

Rönkä, Anna, Kaisa Malinen, Ulla Kinnunen, Asko Tolvanen, and Tiina Lämsä (2010),

"Capturing daily family dynamics via text messages: development of the mobile diary."

Community, Work & Family, 13 (1), 5-21.

Sharot, Trevor (1991), "Attrition and Rotation in Panel Surveys", Journal of the Royal

Statistical Society Series D (The Statistician), 40 (3), 325-331

Shiffman, Saul, Arthur A. Stone, and Michael R. Hufford (2008) "Ecological momentary

assessment." Annual Review of Clinical Psychology, 4, 1-32.

Smith, Aaron (2015). "US Smartphone Use in 2015"(accessed July 27, 2015), [available at

http://www.pewinternet.org/files/2015/03/PI_Smartphones_0401151.pdf]

Stadd, Allison (2013), "79% Of People 18-44 have Their Smartphones with Them 22 Hours a

Day", (accessed July 27, 2015), [available at

http://www.adweek.com/socialtimes/smartphones/480485].


Stone, A. A., S. Shiffman, J. E. Schwartz, J. E. Broderick, and M. R. Hufford (2003), "Patient compliance with paper and electronic diaries," Controlled Clinical Trials, 24, 182-199.

Sudman, Seymour, and Robert Ferber (2011) Consumer panels. Chicago: Marketing Classics

Press.

Toh, Rex S., and Michael Y. Hu (2009), "Toward a General Theory of Diary Panels," Psychological Reports, 105 (3), 1131-1153.

Khan, Uzma, Meng Zhu, and Ajay Kalra (2011), "When trade-offs matter: The effect of choice construal on context effects," Journal of Marketing Research, 48 (1), 62-71.

Wind, Yoram, and David Lerner (1979), "On the measurement of purchase data: surveys

versus purchase diaries", Journal of Marketing Research, 16 (1), 39-47.

Winer, Russell S. (1983), "Attrition bias in econometric models estimated with panel

data." Journal of Marketing Research, 20 (2), 177-186.

Zhang, Yao, Eric T. Bradlow, and Dylan S. Small (2013), "New measures of clumpiness for

incidence data." Journal of Applied Statistics, 40 (11), 2533-2548.


Table 1: Reporting statistics (42,380 respondent-reporting days).

| | Mean | Standard deviation | Minimum | Maximum |
| Number of days in the diary (per respondent) (date of last report - date of first report) | 27.7 | 8.3 | 14 | 46 |
| Total reports (per respondent) | 101.7 | 72.5 | 10 | 1031 |
| Number of viewing reports (per respondent) | 42.2 | 34.3 | 0 | 466 |
| Number of days with at least one report (per respondent) | 24.9 | 7.8 | 7 | 46 |
| Number of days with at least 1 viewing report (per respondent) | 17.8 | 8.7 | 0 | 42 |
| Number of daily reports (per respondent reporting day) | 4.1 | 3.2 | 1 | 101 |
| Number of daily viewing reports (per respondent reporting day) | 1.7 | 1.8 | 0 | 57 |

Table 2: Accuracy measures for alarmed vs. non-alarmed time periods.

| | Alarmed | Non-Alarmed | Difference (p-value) |
| % Recall | 73.7% | 57.6% | 16.1% (<.001) |
| % False Positive Rate | 9.8% | 1.0% | 8.8% (<.001) |
| % Precision | 91.7% | 98.3% | -6.6% (.005) |
| % False Omission Rate | 29.8% | 30.4% | -0.6% (0.93) |

Note: Sample is the half-hours that can be identified as alarmed (n=654) or not (n=598), where the difference in size arises from greater technical difficulties in identifying late vs. early alarmed cases.


Table 3: Accuracy measures for smartphone vs. non-smartphone owners

| | Smartphone owners | Non-smartphone owners | Difference (p-value) |
| % Recall | 63.7% | 69.5% | 5.7% (<.001) |
| % False Positive Rate | 8.5% | 3.8% | 4.7% (<.001) |
| % Precision | 91.9% | 96.5% | 4.6% (<.001) |
| % False Omission Rate | 37.4% | 32.6% | 4.8% (.007) |

Table 4: Match of diary reports to People Meter, n=3,927 entries for 151 respondents.

| Category | % of reports of metered-diary group |
| 1. Matching NPM record (on TV) | 66.8% |
| 2. Diary viewing not on metered TV | 5.3% |
| 3. No matching NPM record (on TV) | 27.9% |


Table 5: Main findings and design implications

Topic: Reporting activity levels
Empirical finding: 92% of the panelists completed 21 days of reporting. Reporting levels were consistent with activity levels even in the post-incentive period.
Design implications: 1. Long (3-6 week) mobile diary studies are feasible, with panelists maintaining high compliance.

Topic: Reporting activity pattern
Empirical finding: Reporting exhibits "random" pulsing. On some days respondents provide more comprehensive reports of their activity than on other days; on some days they do not report at all but do watch TV. Diary reporting is "lumpier" than viewing, but no systematic pattern was found in who pulses or when pulsing occurs.
Design implications: 1. Pretest to see whether pulsing exists in the specific context, and whether it shows a systematic pattern. 2. Match alarming to the natural reporting ebb to reduce pulsing. 3. Individual-level data may have data missing at random at the daily level.

Topic: Accuracy
Empirical finding: Individual-level accuracy shows Precision is 93% and Recall is 65%. Neither error is overly concentrated in a small number of individuals, and the two types of errors are not correlated within individual.
Design implications: 1. Qualifying completes on reporting levels over a longer window appears to work. 2. Such qualification limits the concentration of errors so that all respondents are usable.

Topic: Length of activity
Empirical finding: Longer viewing increases Recall. Viewings of less than four minutes have below 30% Recall, and 24-minute viewings have 77% Recall.
Design implications: 1. Mobile diaries might not be as accurate for capturing short activities as long activities. Pretests should check how effectively short activities are captured. 2. If calculating total length of activity, calibrate from self-reported activities to total activity duration.

Topic: Alarms
Empirical finding: Alarms improve Recall with only a small increase in False Positives. Alarmed prime time periods show 16 percentage points higher Recall than non-alarmed prime time periods. Of all responses, 33.7% are self-initiated.
Design implications: 1. Alarm regularly during times of day when the focal activities are frequently observed. 2. Allow self-initiated entries to capture activities during non-alarmed times. 3. Pretest to evaluate how many alarms are feasible without severe fatigue and drop-out.

Topic: Study length and incentives
Empirical finding: Reporting activity and accuracy remain high for the incentivized period. Of respondents, 39% completed at least one full extra week, but accuracy slightly decreases after the incentivized period. Post-incentive volunteers differ demographically and have higher activity levels than non-volunteers.
Design implications: 1. Long (over 3 weeks) mobile diary studies are feasible, incurring only minor fatigue. 2. Keep incentivizing for the entire study length to obtain high participation, representativeness, and accuracy.

Topic: Smartphone ownership
Empirical finding: Non-smartphone owners belong more to the youngest and oldest age groups and include fewer respondents from ethnic minorities, but their activity levels do not differ from smartphone owners. Non-smartphone owners have slightly higher accuracy than smartphone owners.
Design implications: 1. Non-owners perform mobile diary tasks and are more accurate. 2. Including non-smartphone owners may not be necessary for accuracy. Pretest whether non-smartphone owners differ from owners in reporting activity and accuracy.


Figure 1: Illustrations of mobile diary application questions

* At the end of the diary entry, the respondents were requested to repeat the questionnaire if they

were watching or communicating about more than one show.

Figure 2: An abbreviated schematic flow of the mobile diary viewing related questions.


Figure 3: The number of daily viewing reports (boxes) and the amount of measured daily

viewing time during primetime (vertical line) for two respondents.


Figure 4: Contingency matrix and accuracy measures of the mobile diary reports relative to the People Meter records (for active diary days).

| | According to Mobile Diary: Viewing | According to Mobile Diary: Not Viewing | |
| According to People Meter*: Viewing | A = 3670 | B = 1998 | % Recall (% Reporting) = A/(A+B) = 64.7% |
| According to People Meter*: Not Viewing | C = 289 | D = 3463 | % False Positive Rate = C/(C+D) = 7.7% |
| | % Precision = A/(A+C) = 92.7% | % False Omission Rate = B/(B+D) = 36.6% | |

Note: When no People Meter entry matches, we also search show listing services (see text for details).


Figure 5: Rank Ordered Respondents by % Recall and % False Positive Rate
(x-axis: rank-ordered respondents; y-axis: percent; series: Recall, False Positive Rate.)

Figure 6: % Recall vs. Minutes of Viewing in Half-Hour
(x-axis: minutes of viewing in the half-hour; y-axis: % Recall.)


Figure 7: Average People Meter (bars with vertical black lines for +/- 2 standard errors) vs. reporting (solid line) percentages per respondent by tenure in study
(x-axis: week in diary; y-axis: percent of per-respondent reports/programs viewed; series: NPM Mean, Diary Mean.)

Figure 8: % Recall and % False Positive Rate vs. week in diary
(x-axis: week in diary; y-axis: percent; series: Percent Recall, Percent False Positive.)


Figure 9: Nielsen NPM Ratings vs. Weighted Mobile Diary Ratings


Appendix A – Samples and Quotas

The sample was recruited to be representative of the population of U.S. TV viewers between the

ages of 15 and 54. Sample quotas were chosen using the distribution of age and gender in the

population, along with independent quotas for Hispanic origin and ethnicity. The resultant

sample spans all geographic regions and ethnicities. Table A1 presents the demographics of the

sample, versus the quota targets, as well as additional demographics about the participants.

Table A1: Respondents demographics. n=1702

| Variable | Category | Number of respondents | % of sample | Target |
| Gender & Age | M15-17 | 60 | 3.5% | 4.0% |
| | M18-24 | 102 | 6.0% | 9.0% |
| | M25-34 | 208 | 12.2% | 12.0% |
| | M35-44 | 223 | 13.1% | 12.0% |
| | M45-54 | 194 | 11.4% | 13.0% |
| | F15-17 | 72 | 4.2% | 4.0% |
| | F18-24 | 113 | 6.6% | 9.0% |
| | F25-34 | 309 | 18.2% | 12.0% |
| | F35-44 | 210 | 12.3% | 12.0% |
| | F45-54 | 211 | 12.4% | 13.0% |
| Ethnicity | White/Caucasian | 984 | 57.8% | |
| | Black/African-American | 203 | 20.9% | 12.0% |
| | Asian or Pacific Islander | 80 | 3.6% | |
| | Other | 112 | 4.7% | |
| | No response | 323 | 13.0% | |
| Hispanic Origin | Yes | 253 | 14.9% | 12.0% |
| | No | 1449 | 85.1% | 88.0% |
| US Geographic Region | North East | 266 | 15.6% | |
| | Mid West | 362 | 21.3% | |
| | South | 673 | 39.5% | |
| | West | 401 | 23.6% | |


Overall, the sample matches quite well with the age-gender quotas, excepting ages 18-24, which are under-represented, and women 25-34, who are over-represented. The sample percent of African-Americans is also higher than the target. However, in general, the sample is quite reasonable. Sample weights were constructed by the panel provider to adjust the small discrepancies in demographics to be nationally representative on age, gender, education, Hispanic origin, African-American ethnicity, and geographic region. We use these weights when comparing the mobile diary to the aggregate NPM data.
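The provider's weighting procedure is not detailed here; as a generic illustration only, the sketch below computes weights by iterative proportional fitting (raking) over two hypothetical margins:

```python
import numpy as np
import pandas as pd

# Hypothetical respondent-level sample and hypothetical population margins.
rng = np.random.default_rng(2)
sample = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=1702),
    "region": rng.choice(["NE", "MW", "S", "W"], size=1702),
})
targets = {
    "gender": {"M": 0.49, "F": 0.51},
    "region": {"NE": 0.18, "MW": 0.21, "S": 0.37, "W": 0.24},
}

weights = np.ones(len(sample))
for _ in range(50):  # iterate until the weighted margins settle
    for var, target in targets.items():
        for level, share in target.items():
            mask = (sample[var] == level).to_numpy()
            current = weights[mask].sum() / weights.sum()
            weights[mask] *= share / current

# Check: weighted gender shares now track the target margins.
for level, share in targets["gender"].items():
    mask = (sample["gender"] == level).to_numpy()
    print(level, round(weights[mask].sum() / weights.sum(), 3), "target", share)
```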

We note that the study also included oversampling a group of respondents referred to as

"superconnectors" because their general usage of social media related to TV programming was

higher than average. This over-sampling of superconnectors was not drawn from the metered-

diary group, so that it does not affect our primary analyses here. Further, in constructing the

sample weights, this disproportionate sampling was addressed by underweighting these

individuals and overweighting the non-superconnectors.

We also note that because in our study dropout was virtually non-existent for the

incentivized period, we were not concerned with comparing the sample over the incentivized

period. However, in general, this can be a concern for longitudinal studies like ours (see Toh and

Hu 2009).

Sample of Non-Smartphone Owners

In Table A2 we present the demographics for the smartphone-ownership vs. non-ownership subgroups. Looking at statistically significant differences, we find that non-smartphone owners include significantly more men (58% vs. 44% of the smartphone owners), are more concentrated in the youngest (15-17) and oldest (45-54) age groups, and include fewer African-Americans. No significant differences were found in geographic regions or in Hispanic origin.


Table A2: Respondents demographics for smartphone vs. non-smartphone owners. n=1702

| Variable | Category | % of smartphone owners (n=1386) | % of non-smartphone owners (n=316) |
| Gender & Age | M15-17 | 2% | 9% |
| | M18-24 | 6% | 5% |
| | M25-34 | 13% | 8% |
| | M35-44 | 13% | 12% |
| | M45-54 | 8% | 24% |
| | F15-17 | 3% | 9% |
| | F18-24 | 7% | 3% |
| | F25-34 | 21% | 7% |
| | F35-44 | 14% | 6% |
| | F45-54 | 11% | 17% |
| Ethnicity | White/Caucasian | 57% | 60% |
| | Black/African-American | 13% | 8% |
| | Asian or Pacific Islander | 5% | 3% |
| | Other | 6% | 8% |
| | No response | 19% | 21% |
| Hispanic Origin | Yes | 15% | 16% |
| | No | 85% | 84% |
| US Geographic Region | North East | 16% | 16% |
| | Mid West | 20% | 25% |
| | South | 40% | 35% |
| | West | 24% | 24% |

Sample of Mobile Diary Respondents with People Meters (metered-diary group)

Although the pool of metered-diary respondents was more limited in scope due to the limited availability of tenured People Meter members, Nielsen recruited this group to be demographically similar to the full sample. As Table A3 reports, the metered-diary group has a larger proportion of women over 25, whites, and Western geography, and a smaller proportion of young men relative to the full sample. None of these differences, however, are statistically significant.

Table A3: Respondents demographics, the metered-diary sub-sample. n=151.

| Variable | Category | Number of respondents | % of sample | Target |
| Gender & Age | M15-17 | 1 | 0.68% | 4.0% |
| | M18-24 | 2 | 1.37% | 9.0% |
| | M25-34 | 17 | 11.64% | 12.0% |
| | M35-44 | 18 | 12.33% | 12.0% |
| | M45-54 | 14 | 9.59% | 13.0% |
| | F15-17 | 1 | 0.68% | 4.0% |
| | F18-24 | 4 | 2.74% | 9.0% |
| | F25-34 | 30 | 20.55% | 12.0% |
| | F35-44 | 28 | 19.18% | 12.0% |
| | F45-54 | 31 | 21.23% | 13.0% |
| Ethnicity | White/Caucasian | 119 | 81.51% | |
| | Black/African-American | 11 | 7.53% | 12.0% |
| | Asian or Pacific Islander | 8 | 5.48% | |
| | Other | 8 | 5.48% | |
| Hispanic Origin | Yes | 20 | 13.70% | 12.0% |
| | No | 126 | 86.30% | 88.0% |
| US Geographic Region | North East | 23 | 15.8% | |
| | Mid West | 20 | 13.7% | |
| | South | 55 | 37.7% | |
| | West | 48 | 32.9% | |

Sample of Mobile Diary Respondents who continued after 21 days

Of all respondents, 61.5% kept reporting for more than the incentivized 21 reporting days. It is of interest to compare the demographics of this group to those who completed 21 days or less. The statistically significant differences are in gender, with 65% of the females continuing for more than 21 days vs. only 57% of the males. Also, panelists from the two youngest age groups tend not to go beyond the 21 days. No other significant differences were found.

Table A4: Respondents demographics, comparing those who completed 21 reporting days or less to those with more than 21 reporting days. n=1702.

| Variable | Category | % of <=21 days (n=654) | % of >21 days (n=1048) |
| Gender & Age | M15-17 | 4% | 3% |
| | M18-24 | 9% | 4% |
| | M25-34 | 14% | 11% |
| | M35-44 | 14% | 13% |
| | M45-54 | 10% | 12% |
| | F15-17 | 5% | 4% |
| | F18-24 | 8% | 6% |
| | F25-34 | 16% | 19% |
| | F35-44 | 9% | 14% |
| | F45-54 | 10% | 14% |
| Ethnicity | White/Caucasian | 57% | 58% |
| | Black/African-American | 11% | 13% |
| | Asian or Pacific Islander | 5% | 4% |
| | Other | 6% | 7% |
| | No response | 21% | 18% |
| Hispanic Origin | Yes | 13% | 16% |
| | No | 87% | 84% |
| US Geographic Region | North East | 16% | 16% |
| | Mid West | 22% | 21% |
| | South | 37% | 41% |
| | West | 25% | 23% |


Appendix B – Accuracy measures for additional subsets of the data

The accuracy measures in the results section were calculated for active diary days, where diary entries with no People Meter records are considered true positives if (1) they are reported as live viewing and, when evaluated against program listings, they match a live airing, or (2) they are reported as viewed in a way not measured by the People Meter (e.g., not on TV). The numbers show a reasonably high level of Precision and Recall and a relatively low level of errors.

For completeness, we present here the accuracy measures for two additional cases. The leftmost column in the table below displays the results including all the prime-time reports, from both active and non-active days. The middle column displays accuracy including only reports from active days, where all diary entries with no matching People Meter records are counted as false positives. The rightmost column is the one used in the paper and presented in Figure 4.

| | All data | Active days, all non-matches are false positives | Active days + refinement of false positives |
| A. True Positive | 3035 | 3023 | 3670 |
| B. False Negative | 3363 | 1998 | 1998 |
| C. False Positive | 936 | 936 | 289 |
| D. True Negative | 8467 | 3463 | 3463 |
| % Recall | 47.4% | 60.2% | 64.7% |
| % False Positive Rate | 10.0% | 21.3% | 7.7% |
| % Precision | 76.4% | 76.4% | 92.7% |
| % False Omission Rate | 23.6% | 36.6% | 36.6% |

To better explain the switch from the middle column to the rightmost column, we focus our attention on cell C, for which the People Meter indicates no viewing but the respondent reports viewing in the diary during prime-time hours. For the full data, the number of reports in this cell is 936. However, when looking at the diary entries, we can see that most of these should not be considered as false positives. First, approximately 14.4% of the diary reports in this cell were reported by respondents as either not viewed on TV or not viewed on TV in a way that the People Meter registers. In either case, this viewing is not necessarily "incorrect," and,

though we have no other data to further validate this viewing, on the surface it seems very likely

to be valid. Second, for live viewing reports, we compare the diary programs against the Tribune

listings of telecasts (and also use online sources for missing data). Although finding a

corresponding airing does not ensure the individual was in fact watching, again on the surface,

the viewing report seems very likely to be valid. We find that 75.9% of these live viewing cases

have a corresponding program airing for the same time window, and these programs account for

79.2% of the live viewing half-hours. This high percentage suggests that the People Meter may

be missing a meaningful proportion of viewing. Unfortunately, we do not have a similar way to

verify the delayed viewing cases, so in the refinement of cell C we simply keep these delayed viewing cases without matching People Meter records as false positives. This assumption

is likely to produce a conservative measure of our accuracy.
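For reference, a short sketch that recomputes the accuracy measures (using the definitions from Figure 4) from the contingency cells of the two active-day columns in the table above:

```python
def accuracy_measures(a, b, c, d):
    """a = true positives, b = false negatives, c = false positives,
    d = true negatives (definitions as in Figure 4)."""
    return {
        "Recall": a / (a + b),
        "False Positive Rate": c / (c + d),
        "Precision": a / (a + c),
        "False Omission Rate": b / (b + d),
    }

cases = {
    "Active days, all non-matches as false positives": (3023, 1998, 936, 3463),
    "Active days + refinement of false positives": (3670, 1998, 289, 3463),
}
for name, cells in cases.items():
    print(name, {k: f"{v:.1%}" for k, v in accuracy_measures(*cells).items()})
```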
