
ORIGINAL RESEARCH ARTICLE published: 13 November 2013

doi: 10.3389/fpsyg.2013.00798

Causal inference of asynchronous audiovisual speech

John F. Magnotti 1, Wei Ji Ma 2† and Michael S. Beauchamp 1*

1 Department of Neurobiology and Anatomy, University of Texas Medical School at Houston, TX, USA
2 Department of Neuroscience, Baylor College of Medicine, Houston, TX, USA

Edited by:

Luiz Pessoa, University of Maryland, USA

Reviewed by:

Wolfgang Einhauser, Philipps Universität Marburg, Germany
Eric Hiris, University of Northern Iowa, USA

*Correspondence:

Michael S. Beauchamp, Department of Neurobiology and Anatomy, University of Texas Medical School at Houston, 6431 Fannin St. Suite G.550, Houston, TX 77030, USA
e-mail: michael.s.beauchamp@uth.tmc.edu

†Present address:

Wei Ji Ma, Center for Neural Science and Department of Psychology, New York University, New York, NY, USA

During speech perception, humans integrate auditory information from the voice with visual information from the face. This multisensory integration increases perceptual precision, but only if the two cues come from the same talker; this requirement has been largely ignored by current models of speech perception. We describe a generative model of multisensory speech perception that includes this critical step of determining the likelihood that the voice and face information have a common cause. A key feature of the model is that it is based on a principled analysis of how an observer should solve this causal inference problem using the asynchrony between two cues and the reliability of the cues. This allows the model to make predictions about the behavior of subjects performing a synchrony judgment task, predictive power that does not exist in other approaches, such as post-hoc fitting of Gaussian curves to behavioral data. We tested the model predictions against the performance of 37 subjects performing a synchrony judgment task viewing audiovisual speech under a variety of manipulations, including varying asynchronies, intelligibility, and visual cue reliability. The causal inference model outperformed the Gaussian model across two experiments, providing a better fit to the behavioral data with fewer parameters. Because the causal inference model is derived from a principled understanding of the task, model parameters are directly interpretable in terms of stimulus and subject properties.

Keywords: causal inference, synchrony judgments, speech perception, multisensory integration, Bayesian observer

INTRODUCTION
When an observer hears a voice and sees mouth movements, there are two potential causal structures (Figure 1A). In the first causal structure, the events have a common cause (C = 1): a single talker produces the voice heard and the mouth movements seen. In the second causal structure, the events have two different causes (C = 2): one talker produces the auditory voice and a different talker produces the seen mouth movements. When there is a single talker, integrating the auditory and visual speech information increases perceptual accuracy (Sumby and Pollack, 1954; Rosenblum et al., 1996; Schwartz et al., 2004; Ma et al., 2009); most computational work on audiovisual integration of speech has focused on this condition (Massaro, 1989; Massaro et al., 2001; Bejjanki et al., 2011). However, if there are two talkers, integrating the auditory and visual information actually decreases perceptual accuracy (Kording et al., 2007; Shams and Beierholm, 2010). Therefore, a critical step in audiovisual integration during speech perception is estimating the likelihood that the speech arises from a single talker. This process, known as causal inference (Kording et al., 2007; Schutz and Kubovy, 2009; Shams and Beierholm, 2010; Buehner, 2012), has provided an excellent tool for understanding the behavioral properties of tasks requiring spatial localization of simple auditory beeps and visual flashes (Kording et al., 2007; Sato et al., 2007). However, multisensory speech perception is a complex and highly-specialized computation that takes place in brain areas distinct from those that perform audiovisual localization (Beauchamp et al., 2004). Unlike

spatial localization, in which subjects estimate the continuous variable of location, speech perception is inherently multidimensional (Ma et al., 2009) and requires categorical decision making (Bejjanki et al., 2011). Therefore, we set out to determine whether the causal inference model could explain the behavior of humans perceiving multisensory speech.

Manipulating the asynchrony between the auditory and visual components of speech dramatically affects audiovisual integration. Therefore, synchrony judgment tasks are widely used in the audiovisual speech literature and have been used to characterize both individual differences in speech perception in healthy subjects and group differences between healthy and clinical populations (Lachs and Hernandez, 1998; Conrey and Pisoni, 2006; Smith and Bennetto, 2007; Rouger et al., 2008; Foss-Feig et al., 2010; Navarra et al., 2010; Stevenson et al., 2010; Vroomen and Keetels, 2010). While these behavioral studies provide valuable descriptions of behavior, the lack of a principled, quantitative foundation is a fundamental limitation. In these studies, synchrony judgment data are fit with a series of Gaussian curves without a principled justification for why synchrony data should be Gaussian in shape. A key advantage of the causal inference model is that the model parameters, such as the sensory noise in each perceiver, can be directly related to the neural mechanisms underlying speech perception. This stands in sharp contrast to the Gaussian model's explanations based solely on descriptive measures (most often the standard deviation of the fitted curve). The causal inference model generates behavioral predictions based on


FIGURE 1 | Causal structure of audiovisual speech. (A) Causal diagram for audiovisual speech emanating from a single talker (C = 1) or two talkers (C = 2). (B) Difference between auditory and visual speech onsets showing a narrow distribution for C = 1 (navy) and a broad distribution for C = 2 (gray). The first x-axis shows the onset difference in the reference frame of physical asynchrony. The second x-axis shows the onset difference in the reference frame of the stimulus (audio/visual offset created by shifting the auditory speech relative to the visual speech). A recording of natural speech without any manipulation corresponds to zero offset in the stimulus reference frame and a positive offset in the physical asynchrony reference frame because visual mouth opening precedes auditory voice onset. (C) For any given physical asynchrony (Δ) there is a distribution of measured asynchronies (with standard deviation σ) because of sensory noise. (D) Combining the likelihood of each physical asynchrony (B) with sensory noise (C) allows calculation of the measured asynchrony distributions across all physical asynchronies. Between the dashed lines, the likelihood of C = 1 is greater than the likelihood of C = 2.

an analysis of how the brain might best perform the task, rather than seeking a best-fitting function for the behavioral data.

CAUSAL INFERENCE OF ASYNCHRONOUS AUDIOVISUAL SPEECH
The core of the causal inference model is a first-principles analysis of how the relationship between cues can be used to determine the likelihood of a single talker or multiple talkers. Natural auditory and visual speech emanating from the same talker (C = 1) contains a small delay between the visual onset and the auditory onset caused by the talker preparing the facial musculature for the upcoming vocalization before engaging the vocal cords. This delay results in the distribution of asynchronies having a positive mean when measured by the physical difference between the auditory and visual stimulus onsets (Figure 1B). When there are two talkers (C = 2), there is no relationship between the visual and auditory onsets, resulting in a broad distribution of physical asynchronies. Observers do not have perfect knowledge of the physical asynchrony, but instead their measurements are subject to sensory noise (Figure 1C). An observer's measured asynchrony therefore follows a distribution that is broader than the physical asynchrony distribution. Overlaying these distributions shows that there is a window of measured asynchronies for which C = 1 is more likely than C = 2 (Figure 1D). This region is the Bayes-optimal synchrony window and is used by the observer to make the synchronous/asynchronous decision; this window does not change based on the physical asynchrony, which is unknown to the observer.

During perception of a multisensory speech event, the measured onsets for the auditory and visual cues are corrupted by sensory noise (Ma, 2012); these measurements are subtracted to produce the measured asynchrony (x; Figure 2A). Critically, observers use this measured asynchrony, rather than the physical asynchrony, to infer the causal structure of the speech event. Because the sensory noise has zero mean, the physical asynchrony determines the mean of the measured asynchrony distribution. Thus, synchrony perception depends both on the physical asynchrony and the observer's sensory noise. For example, if the visual cue leads the auditory cue by 100 ms (physical asynchrony = 100 ms, the approximate delay for cues from the same talker), then the measured asynchrony is likely to fall within the Bayes-optimal synchrony window, and the observer is likely to respond synchronous (Figure 2B). By contrast, if the visual cue trails the auditory cue by 100 ms (physical asynchrony = −100 ms) then the measured asynchrony is unlikely to fall within the synchrony window and the observer is unlikely to report a common cause (Figure 2C). Calculating the likelihood that a measured asynchrony falls within the synchrony window for each physical asynchrony at a given level of sensory noise produces the predicted behavioral curve (Figure 2D). Because the sensory noise is modeled as changing randomly from trial to trial, the model is probabilistic, calculating the probability of how an observer will respond, rather than deterministic, with a fixed response for any physical asynchrony (Ma, 2012).
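The logic of Figures 1D and 2D can be sketched numerically: compare the prior-weighted likelihood of each causal structure over a grid of measured asynchronies to locate the synchrony window, then integrate the measurement distribution over that window for each physical asynchrony. The short R sketch below (R being the language used for the model fitting described in Materials and Methods) uses illustrative parameter values, not the fitted values reported later.

# Illustrative sketch of the CIMS prediction; all values in ms and arbitrary.
sigma  <- 100    # sensory noise
sig_c1 <- 65     # sd of asynchronies given one talker (C = 1)
sig_c2 <- 125    # sd of asynchronies given two talkers (C = 2)
mu_c2  <- -50    # mean of the C = 2 distribution (stimulus reference frame)
p_c1   <- 0.5    # prior probability of a common cause

# Bayes-optimal synchrony window: measured asynchronies x for which C = 1
# is more probable than C = 2 (the window does not depend on Delta)
x <- seq(-1000, 1000, by = 1)
post_c1 <- p_c1       * dnorm(x, 0,     sqrt(sigma^2 + sig_c1^2))
post_c2 <- (1 - p_c1) * dnorm(x, mu_c2, sqrt(sigma^2 + sig_c2^2))
window  <- range(x[post_c1 > post_c2])

# Predicted probability of a "synchronous" report for a physical asynchrony
# Delta: the chance that the noisy measurement lands inside the window
p_sync <- function(Delta) {
  pnorm(window[2], mean = Delta, sd = sigma) -
    pnorm(window[1], mean = Delta, sd = sigma)
}
Delta <- seq(-300, 500, by = 10)
plot(Delta, p_sync(Delta), type = "l",
     xlab = "Physical asynchrony (ms)", ylab = "p(synchronous)")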

MATERIALS AND METHODS
BEHAVIORAL TESTING PROCEDURE
Human subjects approval and subject consent were obtained for all experiments. Participants (n = 39) were undergraduates at Rice University who received course credit. All participants reported normal or corrected-to-normal vision and hearing.

Stimuli were presented on a 15′′ MacBook Pro laptop (2008 model) using MATLAB 2010a with the Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997) running at 1440 × 900 (width × height) resolution. Viewing distance was ∼40 cm. A lamp behind the participants provided low ambient lighting. Sounds were presented using KOSS UR40 headphones. The volume was set at a comfortable level for each individual participant.


FIGURE 2 | Synchrony judgments under the causal inference model.

(A) On each trial, observers obtain a measurement of the audiovisual asynchrony by differencing measurements of the auditory (magenta) and visual (green) onsets. Because of sensory noise, the measured asynchrony (x, blue) is different than the physical asynchrony (Δ, purple). (B) For a given physical asynchrony, Δ = 100 ms, there is a range of possible measured asynchronies (x, blue). The shaded region indicates values of x for which C = 1 is more probable than C = 2 (Figure 1D). The area of the shaded region is the probability of a synchronous percept, p(Sync). (C) For a different physical asynchrony, Δ = −100 ms, there is a different distribution of measured asynchronies, with a lower probability of a synchronous percept. (D) The probability of a synchronous percept for different physical asynchronies. Purple markers show the predictions for Δ = 100 ms and Δ = −100 ms.

Trials began with the presentation of a white fixation cross in a central position on the screen for 1.2 s, followed by presentation of the audiovisual recording of a single word (∼2 s), and then the reappearance of the fixation cross until the behavioral response was recorded. Participants were instructed to press the "m" key if the audio and visual speech were perceived as synchronous, and the "n" key if perceived as asynchronous.

STIMULI
The stimuli consisted of audiovisual recordings of spoken words from previous studies of audiovisual speech synchrony judgments (Lachs and Hernandez, 1998; Conrey and Pisoni, 2004) obtained by requesting them from the authors. The stimuli were all 640 × 480 pixels in size. The first stimulus set consisted of recordings of four words ("doubt," "knot," "loan," "reed") selected to have high visual intelligibility, as determined by assessing visual-only identification performance (Conrey and Pisoni, 2004). The temporal asynchrony of the auditory and visual components of the recordings was manipulated, ranging from −300 ms (audio ahead) to +500 ms (video ahead), with 15 total asynchronies (−300, −267, −200, −133, −100, −67, 0, 67, 100, 133, 200, 267, 300, 400, and 500 ms). The visual-leading half of the curve was over-sampled because synchrony judgment curves have a peak shifted to the right of 0 (Vroomen and Keetels, 2010). The second stimulus set contained blurry versions of these words (at the same 15 asynchronies), created by blurring the movies with a 100-pixel Gaussian filter using Final Cut Pro. For the third stimulus set, four words with low visual intelligibility were selected ("give," "pail," "theme," "voice") at the same 15 asynchronies. The fourth stimulus set contained visual-blurred versions of the low visual intelligibility words.

EXPERIMENTAL DESIGN
For the first experiment, stimuli from all four stimulus sets were presented, randomly interleaved, to a group of 16 subjects in one testing session for each subject. The testing session was divided into three runs, with self-paced breaks between runs. Each run contained one presentation of each stimulus (8 words × 15 asynchronies × 2 reliabilities = 240 total stimuli per run). Although the stimulus sets were presented intermixed, and a single model was fit to all stimuli together, we discuss the model predictions separately for each of the four stimulus sets.

In the second, replication experiment, the same task and stimuli were presented to a group of 23 subjects (no overlap with subjects from Experiment 1). These subjects completed one run, resulting in one presentation of each stimulus to each subject. Two subjects responded "synchronous" to nearly all stimuli (perhaps to complete the task as quickly as possible). These subjects were discarded, leaving 21 subjects.

MODEL PARAMETERS
The causal inference of multisensory speech (CIMS) model has two types of parameters: two subject parameters and four stimulus parameters (one of which is set to zero, resulting in three fitted parameters). The first subject parameter, σ, is the noise in the measurement of the physical asynchrony (while we assume that the measurement noise is Gaussian, other distributions could be used easily), and varies across stimulus conditions. Because the noisy visual and noisy auditory onset estimates are differenced, only a single parameter is needed to estimate the noise in the measured asynchrony. As σ increases, the precision of the observer's measurement of the physical asynchrony decreases.

The second subject parameter is pC=1, the prior probability of a common cause. This parameter is intended to reflect the observer's expectation of how often a "common cause" occurs in the experiment, and remains fixed across stimulus conditions. Holding other parameters constant, a higher pC=1 means that the observer will more often report synchrony. If observers have no systematic bias toward C = 1 or C = 2, we have pC=1 = 0.5, and the model has only one subject-level parameter.

The four stimulus parameters reflect the statistics of natural speech and are the mean and standard deviation of the C = 1 (μC=1, σC=1) and C = 2 distributions (μC=2, σC=2).

PHYSICAL ASYNCHRONY vs. AUDITORY/VISUAL OFFSET
In the literature, a common reference frame is to consider the manipulations made to speech recordings when generating experimental stimuli (Vroomen and Keetels, 2010). In this convention,


natural speech is assigned an audio/visual offset of zero. These two reference frames (physical vs. stimulus manipulation) differ by a small positive shift that varies slightly across different words and talkers. This variability in the C = 1 distribution is accounted for in the model with a narrow distribution of physical asynchronies. Talkers consistently begin their (visually-observed) mouth movements slightly before beginning the auditory part of the vocalization. This results in a narrow asynchrony distribution for C = 1 (small σC=1). In contrast, if the visually-observed mouth movements arise from a different talker than the auditory vocalization, we expect no relationship between them and a broad asynchrony distribution for C = 2 (large σC=2). We adopt the widely-used stimulus manipulation reference frame and define μC=1 as zero, which leads to a negative value for μC=2 (adopting the other reference frame does not change the results).

MODEL FITTING AND COMPARISON
All model fitting was done in R (R Core Team, 2012). The source code for all models is freely available on the authors' web site (http://openwetware.org/wiki/Beauchamp:CIMS). Only a single model was fit for each subject across all stimulus conditions. The input to the model fitting procedures was the number of times each physical asynchrony was classified as synchronous across all runs. All model parameters were obtained via maximization of the binomial log-likelihood function on the observed data.

For the CIMS model, we used a multi-step optimization approach. In the first step, we found the best-fitting subject parameters (pC=1 and σ) for each subject and stimulus set based on random initial values for the stimulus parameters (σC=1, σC=2, μC=2). In the second step, we found the best-fitting stimulus parameters based on the fitted pC=1 and σ values. These steps were repeated until the best-fitting model was obtained. Because the experimental manipulations were designed to affect sensory reliability, we fit a separate σ in each condition, resulting in a total of 8 free parameters for the CIMS model. This hierarchical fitting procedure was used because some parameters were consistent across conditions, allowing the fitting procedure to converge on the best fitting model more quickly. We refit the model using 256 initial positions for the stimulus parameters to guard against fitting to local optima. Visual inspection of individual fits confirmed the model was not fitting obviously sub-optimal subject parameters. Finally, we confirmed the ability of the fitting procedure to recover the maximum likelihood estimates for each parameter using simulated data.
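The authors' full fitting code is available at the URL above; the following is only a simplified, single-condition sketch of the same idea, fitting the two subject parameters by maximizing the binomial log-likelihood while holding the stimulus parameters at plausible fixed values (in the spirit of the reduced model used in Experiment 2). The function names and starting values are illustrative, not taken from the released code.

# Predicted probability of a "synchronous" response under the CIMS model,
# computed numerically from the Bayes-optimal synchrony window
cims_p_sync <- function(Delta, sigma, p_c1, sig_c1 = 65, sig_c2 = 125, mu_c2 = -50) {
  x <- seq(-1500, 1500, by = 1)
  in_win <- p_c1 * dnorm(x, 0, sqrt(sigma^2 + sig_c1^2)) >
            (1 - p_c1) * dnorm(x, mu_c2, sqrt(sigma^2 + sig_c2^2))
  w <- range(x[in_win])
  pnorm(w[2], Delta, sigma) - pnorm(w[1], Delta, sigma)
}

# Negative binomial log-likelihood of the observed counts
# Delta: physical asynchronies; k: "synchronous" counts; n: trials per asynchrony
neg_loglik <- function(par, Delta, k, n) {
  sigma <- exp(par[1])        # transform keeps sigma positive
  p_c1  <- plogis(par[2])     # transform keeps the prior in (0, 1)
  p <- pmin(pmax(cims_p_sync(Delta, sigma, p_c1), 1e-6), 1 - 1e-6)
  -sum(dbinom(k, n, p, log = TRUE))
}

fit_cims <- function(Delta, k, n) {
  fit <- optim(c(log(100), qlogis(0.5)), neg_loglik, Delta = Delta, k = k, n = n)
  list(sigma = exp(fit$par[1]), p_c1 = plogis(fit$par[2]), nll = fit$value)
}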

We compared the CIMS model with a curve-fitting approach taken in previous studies of audiovisual synchrony judgments, which we term the Gaussian model. In the Gaussian model, a scaled Gaussian probability density curve is fit to each subject's synchrony judgment curve. Each subject's synchrony judgment curve is therefore characterized by three parameters: the mean value, the standard deviation, and a scale parameter that reflects the maximum rate of synchrony perception. Because the Gaussian model does not make any predictions about the relationship between conditions, we follow previous studies and fit independent sets of parameters between stimulus sets. The Gaussian model contains a total of 12 free parameters across all conditions in this experiment.
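For comparison, a minimal sketch of the descriptive Gaussian model as defined here: a scaled Gaussian curve whose mean, standard deviation, and peak height are fit to one condition under the same binomial maximum-likelihood criterion. Function names and starting values are again illustrative.

# Scaled Gaussian synchrony curve: peak height "scale" at Delta = mu
gauss_p_sync <- function(Delta, mu, sd, scale) {
  scale * exp(-(Delta - mu)^2 / (2 * sd^2))
}

fit_gauss <- function(Delta, k, n) {
  nll <- function(par) {
    p <- gauss_p_sync(Delta, par[1], exp(par[2]), plogis(par[3]))
    p <- pmin(pmax(p, 1e-6), 1 - 1e-6)
    -sum(dbinom(k, n, p, log = TRUE))
  }
  fit <- optim(c(100, log(150), qlogis(0.9)), nll)
  list(mu = fit$par[1], sd = exp(fit$par[2]), scale = plogis(fit$par[3]),
       nll = fit$value)
}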

After fitting the models to the behavioral data, we compared them based on how well their predictions matched the observed data. We show the mean model predictions with the mean behavioral data to assess qualitative fit. Because these predicted data are averages, however, they are not a reasonable indicator of a model's fit to any individual subject. A model that overpredicts synchrony for some subjects and underpredicts for others may have a better mean curve than a model that slightly overpredicts more often than underpredicts. To ensure the models accurately reproduce individual-level phenomena, we assess model fit by aggregating error across individual-level model fits. In Experiment 1 we provide a detailed model comparison for each stimulus set using the Bayesian Information Criterion (BIC). BIC is a linear transform of the negative log-likelihood that penalizes models based on the number of free parameters and trials, with lower BIC values corresponding to better model fits. For each model we divide the penalty term evenly across the conditions, so that the CIMS model is penalized for 2 parameters per condition and the Gaussian model is penalized for 3 parameters per condition. To compare model fits across all stimulus conditions we considered both individual BIC across all conditions and the group mean BIC, calculated by summing the BIC across conditions for each subject, then taking the mean across subjects. In Experiment 2, we compare group mean BIC across all conditions. Conventional significance tests on the BIC differences between models were also performed. In the model comparison figures, error is the within-subject standard error, calculated as in Loftus and Masson (1994).
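Concretely, for a model fit by maximum likelihood, BIC = 2 × (negative log-likelihood) + k × log(N), with k the free parameters charged to the condition and N the number of trials; lower values indicate a better fit after the complexity penalty. A one-line helper (example numbers illustrative):

# BIC from a model's negative log-likelihood; lower is better
bic <- function(neg_loglik, k, n_trials) 2 * neg_loglik + k * log(n_trials)

# Example comparison for one condition of one subject (illustrative values):
# bic(nll_gauss, k = 3, n_trials = 180) - bic(nll_cims, k = 2, n_trials = 180)
# A positive difference favors the CIMS model.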

MODEL DERIVATION
In the generative model, there are two possible states of the world: C = 1 (single talker) and C = 2 (two talkers). The prior probability of C = 1 is pC=1. Asynchrony, denoted Δ, has a different distribution under each state. Both are assumed to be Gaussian, such that p(Δ|C = 1) = N(Δ; 0, σ1²) and p(Δ|C = 2) = N(Δ; μ2, σ2²). The observer's noisy measurement of Δ, denoted x, is also Gaussian, p(x|Δ) = N(x; Δ, σ²), where the variance σ² is the combined variance from the auditory and visual cues. This specifies the statistical structure of the task. In the inference model, the observer infers C from x. This is most easily expressed as a log posterior ratio,

d = log[p(C = 1|x) / p(C = 2|x)] = log[pC=1 / (1 − pC=1)] + log[N(x; 0, σ² + σ1²) / N(x; μ2, σ² + σ2²)].

The optimal decision rule is d > 0. If we assume that σ1 < σ2, the measurements satisfying this rule form an interval (the Bayes-optimal synchrony window), and the decision rule becomes

|x + μ2(σ² + σ1²) / (σ2² − σ1²)| < sqrt{ [2 log(pC=1 / (1 − pC=1)) + log((σ² + σ2²) / (σ² + σ1²)) + μ2² / (σ2² − σ1²)] / [1 / (σ² + σ1²) − 1 / (σ² + σ2²)] },

so that the window endpoints are c± = −μ2(σ² + σ1²) / (σ2² − σ1²) ± sqrt{·}, with sqrt{·} the right-hand side above. The probability of reporting a common cause for a given asynchrony is then

p(C = 1|Δ) = Normcdf(c+; Δ, σ) − Normcdf(c−; Δ, σ),

where Normcdf(c; Δ, σ) is the cumulative normal distribution with mean Δ and standard deviation σ evaluated at c.
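The closed-form rule can be written directly in R; this is a sketch of the derivation above, not the authors' released code, with the window endpoints taken from the reconstructed inequality and example parameter values that are illustrative only.

# Closed-form CIMS prediction: probability of reporting a common cause
# for a physical asynchrony Delta (all quantities in ms); assumes sig1 < sig2
p_common_cause <- function(Delta, sigma, p_c1, sig1, sig2, mu2) {
  A <- sigma^2 + sig1^2
  B <- sigma^2 + sig2^2
  center <- -mu2 * A / (sig2^2 - sig1^2)          # midpoint of the window
  half   <- sqrt((2 * log(p_c1 / (1 - p_c1)) + log(B / A) +
                    mu2^2 / (sig2^2 - sig1^2)) / (1 / A - 1 / B))
  pnorm(center + half, Delta, sigma) - pnorm(center - half, Delta, sigma)
}

# Example: predicted synchrony curve for a hypothetical observer
Delta <- seq(-300, 500, by = 10)
pred  <- p_common_cause(Delta, sigma = 100, p_c1 = 0.5,
                        sig1 = 65, sig2 = 125, mu2 = -50)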

RESULTS
BEHAVIORAL RESULTS FROM EXPERIMENT 1
The CIMS model makes trial-to-trial behavioral predictions about synchrony perception using a limited number of


parameters that capture physical properties of speech, the sensory noise of the subject, and the subject's prior assumptions about the causal structure of the stimuli in the experiment. The Gaussian model simply fits a Gaussian curve to the behavioral data. We tested the CIMS model and Gaussian model against behavioral data from subjects viewing movie clips of audiovisual speech with varying asynchrony, visual intelligibility (high/low) and visual reliability (reliable/blurred).

Visual reliable, high visual intelligibility words
We first compared synchrony judgments for 16 subjects with visual reliable, high visual intelligibility words. Synchronous responses were 0.90 or higher for visual-leading asynchronies from +67 to +133 ms, but dropped off for higher or lower asynchronies (Figure 3A). The general shape of the curve is consistent with previous reports of simultaneity judgments with these stimuli (Conrey and Pisoni, 2006).

A single CIMS model and Gaussian model were fit to each subject across all stimulus conditions and the predicted synchrony reports were averaged to produce mean predictions (Figure 3A). To provide a quantitative comparison of the model fits, we compared the BIC of both models for each subject (Figure 3B). The BIC measure was in favor of the CIMS model, with a mean difference of 5.8 ± 1.8 (SEM). A paired t-test showed that the difference was reliable [t(15) = 3.16, p = 0.006].

The better fit of the CIMS model is caused by its ability to predict a range of asynchronies that are perceived as nearly synchronous (rather than one peak) and an asymmetric synchrony judgment curve. These features are consequences of the model structure, not explicit parameters of the model. The presence of noise in the sensory system means that even when the physical asynchrony is identical to the mean of the common cause distribution there is still a chance the measured asynchrony will be outside the synchrony window. Having an asymmetric,

FIGURE 3 | Model fits to behavioral data for experiment 1 (high visual intelligibility words). (A) Black circles show the behavioral data from 16 subjects performing a synchrony judgment task (mean ± standard error) for each stimulus asynchrony with visual reliable, high visual intelligibility stimuli. Curves show the model predictions for the CIMS model (orange) and Gaussian model (blue). (B) Fit error measured with Bayesian Information Criterion (BIC) for the CIMS and Gaussian models; lower values indicate a better fit for the CIMS model (∗∗p = 0.006). Error bars show within-subject standard error (Loftus and Masson, 1994). (C) Mean proportion of synchrony responses and model predictions for visual blurred, high visual intelligibility stimuli. (D) Fit error for the CIMS and Gaussian models, showing better fit for the CIMS model (∗∗p = 0.007).

broad range of synchronies reported as nearly synchronous is predicted by the interaction of the observer's prior belief about the prevalence of a common cause in this experiment and sensory noise.

Visual blurred, high visual intelligibility words
Because blurring decreases the reliability of the visual speech, the CIMS model predicts that the sensory noise level should increase, resulting in changes in synchrony perception primarily at larger asynchronies. Despite the blurring, the peak of the synchrony judgment curve remained high (around 0.9 reported synchrony for +67 to +133 ms) showing that participants were still able to perform the task. However, the distribution had a flatter top, with participants reporting high synchrony values for a broad range of physical asynchronies that extended from 0 ms (no audio/visual offset) to +267 ms (Figure 3C). The drop-off in reported synchrony was more asymmetric than for unblurred stimuli, dropping more slowly for the visual-leading side of the curve. Comparing the model fits to the behavioral data, the CIMS model was supported (Figure 3D) over the Gaussian model [BIC difference: 4.1 ± 1.3, t(15) = 3.1, p = 0.007]. Blurring the visual speech makes estimation of the visual onset harder by adding uncertainty to the observer's estimate of the visual onset. For the CIMS model, this has the effect of increasing variability, leading to a widening of the predicted behavioral curves and an exaggeration of their asymmetry. In contrast, for the Gaussian model, increasing the standard deviation can only symmetrically widen the fitted curves.

Visual reliable, low visual intelligibility words
In the CIMS model, words with low visual intelligibility should decrease the certainty of the visual speech onset, corresponding to an increase in sensory noise. Decreasing visual intelligibility both widened and flattened the peak of the synchrony judgment curve (Figure 4A), resulting in a broad plateau from −67 ms to +133 ms. When the CIMS and Gaussian model fits were compared (Figure 4B), the CIMS model provided a better fit to the behavioral data [BIC difference: 9.9 ± 2.0, t(15) = 4.9, p < 0.001]. The CIMS model accurately predicts the plateau observed in the behavioral data, in which a range of small asynchronies are reported as synchronous with high probability. The Gaussian model attempts to fit this plateau through an increased standard deviation, but this resulted in over-estimating synchrony reports at greater asynchronies.

Visual blurred, low visual intelligibility words
The CIMS model predicts that visual blurring should be captured solely by a change in sensory noise. Blurring led to an increase in synchrony reports primarily for the larger asynchronies (Figure 4C). The height and location of the plateau of the curve was similar to the unblurred versions of these words (between 0.89 and 0.93 synchrony reports from −67 ms to +133 ms). Overall, blurring the low visual intelligibility words had generally the same effect as blurring the high visual intelligibility words: widening the synchrony judgment curve, but not changing the height or position of the curve's plateau. The CIMS model fit the behavioral data better (Figure 4D) than


FIGURE 4 | Model fits to behavioral data for Experiment 1 (low visual intelligibility words). (A) Black circles show the behavioral data with visual reliable, low visual intelligibility stimuli, curves show model predictions for CIMS (orange) and Gaussian (blue) models. (B) Fit error showing significantly better fit for the CIMS model (∗∗p < 0.001). (C) Mean proportion of synchrony responses and model predictions for visual blurred, low visual intelligibility stimuli. (D) Fit error showing significantly better fit for the CIMS model (∗∗p = 0.001).

the Gaussian model [BIC difference: 4.7 ± 1.2, t(15) = 3.9, p = 0.001]. The better fit resulted from the CIMS model's ability to predict an asymmetric effect of blurring and continued prediction of a wide range of asynchronies reported as synchronous with high probability.

Overall model testing
Next, we compared the models across all stimulus sets together. For the CIMS model, the parameters σC=1, σC=2, μC=2, pC=1 for each subject are fit across all conditions, placing constraints on how much synchrony perception can vary across conditions (the Gaussian model has no such constraints because a separate scaled Gaussian is fit for each condition). We compared the models across all conditions using the average of the total BIC (summed across stimulus sets) across subjects (Figure 5A). Despite the additional constraints of the CIMS model, the overall model test supported it over the Gaussian model [BIC difference: 24.5 ± 4.9, t(15) = 4.99, p < 0.001]. The direction of this difference was replicated across all 16 subjects (Figure 5B), although the magnitude showed a large range (range of BIC differences: 2–81).

It is tempting to compare the fit error across different stimulus conditions within each model to note, for instance, that the fit error for visual blurring is greater than without it, or that the models fit better with low visual intelligibility words. However, these comparisons must be made with caution because the stimulus level parameters are fit across all conditions simultaneously and are thus highly dependent. For instance, removing the visual intelligibility manipulation would change the parameter estimates and resulting fit error for the visual blurring condition. The only conclusion that can be safely drawn is that the CIMS model provides a better fit than the Gaussian model for all tested stimulus manipulations.

Interpreting parameters from the CIMS model
A key property of the CIMS model is the complete specification of the synchrony judgment task structure, so that model parameters may have a meaningful link to the cognitive and neural

FIGURE 5 | Model comparison across experiments. (A) Total fit error (Bayesian Information Criterion; BIC) across conditions averaged over the 16 subjects in experiment 1 showing better fit for the CIMS model (∗∗p < 0.001). Error bars are within-subject standard error (Loftus and Masson, 1994). (B) Difference in fit error (BIC) for each individual subject (across all conditions). (C) Total fit error across conditions averaged over the 21 subjects in experiment 2 showing better fit for the CIMS model (∗∗p < 10−15). (D) Difference in fit error for each individual subject (across all conditions).

FIGURE 6 | Model estimates of sensory noise across stimuli in experiment 1. (A) Correlation between CIMS model sensory noise (σ) estimates for visual blurred words with high visual intelligibility (high VI) and low VI. Each symbol represents one subject, the dashed line indicates equal sensory noise between the two conditions. There was a strong positive correlation (r = 0.92, p < 10−6). (B) Correlation between sensory noise estimates for visual reliable words with high and low VI (r = 0.95, p < 10−7). (C) Mean sensory noise across subjects [± within-subject standard error of the mean (Loftus and Masson, 1994)] for visual blurred words (green line) and visual reliable words (purple line) with low VI (left) or high VI (right).

processes that instantiate them. First, we examined how σ, the sensory noise parameter, changed across stimulus conditions and subjects (Figure 6).

The fitted value of σ is a measure of the sensory noise level in each individual and condition, and captures individual differences in the task. To demonstrate the within-subject relationship across conditions, we correlated the σ for the high and low


visual intelligibility conditions across subjects. We found a very high correlation (Figures 6A,B) for both the blurred stimuli (r = 0.92, p < 10−6) and the unblurred stimuli (r = 0.95, p < 10−7). This correlation demonstrates that subjects have a consistent level of sensory noise: some subjects have a low level of sensory noise across different stimulus manipulations, while others have a higher level. In our study, the subjects were healthy controls, leading to a modest range of σ across subjects (70–300 ms). Although we fit the model with a restricted range of stimuli (congruent audiovisual words with high or low visual reliability and intelligibility), subjects with high sensory noise might show differences from subjects with low sensory noise across a range of multisensory speech tasks, such as perception of the McGurk effect (Stevenson et al., 2012). In clinical populations, subjects with language impairments (such as those with ASD or dyslexia) might be expected to have higher sensory noise values and a larger variance across subjects.

Next, we examined the average value of σ in different conditions (Figure 6C). Blurring the visual speech should lead to an increase in sensory noise, as the blurred stimuli provide less reliable information about the onset time of the visual speech. Words with lower visual intelligibility should also cause an increase in sensory noise, as the visual speech information is more ambiguous and its onset harder to estimate. As expected, σ was higher for blurred stimuli (mean increase of 10 ms) and low visual intelligibility words (mean increase of 12 ms). A repeated measures ANOVA on the fitted σ values with visual reliability (reliable or blurred) and visual intelligibility (high or low) as factors showed a marginally reliable effect of visual reliability [F(1, 15) = 4.51, p = 0.051], a main effect of visual intelligibility [F(1, 15) = 7.16, p = 0.020] and no interaction.
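A repeated-measures ANOVA of this kind can be specified in R as sketched below; the data frame, its column names, and the simulated values are illustrative stand-ins, not the authors' data or analysis script.

# Simulated stand-in: one fitted sigma per subject x reliability x intelligibility
set.seed(1)
sigma_df <- expand.grid(
  subject         = factor(1:16),
  reliability     = factor(c("reliable", "blurred")),
  intelligibility = factor(c("high", "low"))
)
sigma_df$sigma <- 150 +
  10 * (sigma_df$reliability == "blurred") +       # blurring raises noise
  12 * (sigma_df$intelligibility == "low") +       # low intelligibility raises noise
  rnorm(nrow(sigma_df), sd = 30)                   # subject-level variability

# Two within-subject factors, one observation per cell per subject
fit <- aov(sigma ~ reliability * intelligibility +
             Error(subject / (reliability * intelligibility)),
           data = sigma_df)
summary(fit)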

An additional parameter of the CIMS model is pC=1, which represents the observer's prior belief that audio and visual speech events arise from a common cause. Higher values indicate a higher probability of inferring a common cause (and therefore of responding synchronous). Across all stimulus conditions, subjects' priors were biased toward reporting one cause [mean pC=1 = 0.58; t-test against 0.5; t(15) = 4.47, p < 0.001]. A prior biased toward reporting a common cause may be due to the presentation of a single movie clip in each trial and the same talker across trials. Having a high prior for C = 1 increases the probability of responding synchronous even for very high asynchronies, leading to the observed behavioral effect of non-zero reported synchrony even at very large asynchronies.

Finally, we examined model parameters that relate to the natural statistics of audiovisual speech. Across all participants, the standard deviation of the common and separate cause distributions were estimated to be σC=1 = 65 ± 9 ms (SEM) and σC=2 = 126 ± 12 ms. For consistency with the literature, we used the stimulus manipulation reference frame and fixed μC=1 at zero, resulting in a fitted value for μC=2 of −48 ± 12 ms (using the physical asynchrony reference frame would result in a value for μC=2 near zero and a positive value for μC=1).

EXPERIMENT 2
Results from the first experiment demonstrated that the CIMS model describes audiovisual synchrony judgments better than the

Gaussian model under manipulations of temporal asynchrony, visual blurring, and reduced visual intelligibility. One notable difference between the two models is the use of parameters in the CIMS model that are designed to reflect aspects of the natural statistics of audiovisual speech. If these stimulus parameters are reflective of natural speech statistics, they should be relatively consistent across different individuals tested with the same stimuli. To test this assertion, we fit the CIMS model to an independent set of 21 subjects using the mean values from experiment 1 for the stimulus parameters σC=1, σC=2, and μC=2 (μC=1 remained fixed at zero). We then compared the fits of this reduced CIMS model with the fits from the Gaussian model using the behavioral data from the 21 subjects. In experiment 2, the CIMS model has 7 fewer parameters per subject than the Gaussian model (5 vs. 12).

Behavioral results and model testing
Overall behavioral results were similar to Experiment 1. The BIC measure favored the CIMS model both for the group (BIC difference: 33.8 ± 1.3; Figure 5C) and in each of the 21 subjects (Figure 5D). A paired t-test on the BIC values confirmed that CIMS was the better fitting model [t(20) = 25.70, p < 10−15]. There is a noticeable difference in the average total BIC between experiments. Because the calculation of BIC scales with the number of trials, with 4 trials per subject in Experiment 2 and 12 trials per subject in Experiment 1, the magnitudes will necessarily be larger in Experiment 1 and cannot be directly compared.

The better fit for the CIMS model in this experiment shows that the model is reproducing essential features of synchrony perception with fewer parameters than the Gaussian model. If both models were simply curve-fitting, we would expect the model with more free parameters to perform better. Instead, the CIMS model makes explicit predictions that some parameters should remain fixed across conditions and provides an explanation for the shape of the synchrony judgment data.

DISCUSSION
The CIMS model prescribes how observers should combine information from multiple cues in order to optimally perceive audiovisual speech. Our study builds on previous examinations of causal inference during audio-visual multisensory integration but provides important advances. Previous work has demonstrated that causal inference can explain behavioral properties of audiovisual spatial localization using simple auditory beeps and visual flashes (Kording et al., 2007; Sato et al., 2007). We use the same theoretical framework, but for a different problem, namely the task of deciding if two speech cues are synchronous. Although the problems are mathematically similar, they are likely to be subserved by different neural mechanisms. For instance, audiovisual spatial localization likely occurs in the parietal lobe (Zatorre et al., 2002) while multisensory speech perception is thought to occur in the superior temporal sulcus (Beauchamp et al., 2004). Different brain areas might solve the causal inference problem in different ways, and these different implementations are likely to have behavioral consequences. For instance, changing from simple beep/flash stimuli to more complex speech stimuli can change the perception of simultaneity (Love et al., 2013;


Stevenson and Wallace, 2013). Because multisensory speech may be the most ethologically important sensory stimulus, it is critical to develop and test the framework of causal inference for multisensory speech perception.

The CIMS model shows how causal inference on auditory and visual signals can provide a mechanistic understanding of how humans judge multisensory speech synchrony. The model focuses on the asynchrony between the auditory and visual speech cues because of the prevalence of synchrony judgment tasks in the audiovisual speech literature and its utility in characterizing speech perception in healthy subjects and clinical populations (Lachs and Hernandez, 1998; Conrey and Pisoni, 2006; Smith and Bennetto, 2007; Rouger et al., 2008; Foss-Feig et al., 2010; Navarra et al., 2010; Stevenson et al., 2010; Vroomen and Keetels, 2010). A key feature of the CIMS model is that it is based on a principled analysis of how an optimal observer should solve the synchrony judgment problem. This feature allows it to make predictions (such as the stimulus-level parameters remaining constant across groups of observers) that can never be made by post-hoc curve-fitting procedures, Gaussian or otherwise. Hence, the model acts as a bridge between primarily empirical studies that examine subjects' behavior (Wallace et al., 2004; Navarra et al., 2010; Hillock-Dunn and Wallace, 2012) under a variety of different multisensory conditions and more theoretical studies that focus on Bayes-optimal models of perception (Kording et al., 2007; Shams and Beierholm, 2010).

Across manipulations of visual reliability and visual intelligibility, the model fit better than the Gaussian curve-fitting approach, even when it had many fewer parameters. Unlike the Gaussian approach, the parameters of the CIMS model are directly related to the underlying decision rule. These parameters, such as the subject's sensory noise, beliefs about the task, and structural knowledge of audiovisual speech, can be used to characterize individual and group differences in multisensory speech perception.

An interesting observation is that the individual differences in sensory noise across subjects (range of 70–300 ms) were much greater than the change in sensory noise within individuals caused by stimulus manipulations (∼30 ms). This means that results from only one condition may be sufficient to study individual differences in synchrony perception. In some populations, it is prohibitively difficult to collect a large number of trials in many separate conditions. A measure of sensory noise that is obtainable from only one condition could therefore be especially useful for studying these populations.

CAUSAL INFERENCE PREDICTS FEATURES OF AUDIOVISUAL SPEECH SYNCHRONY JUDGMENTS
The CIMS model explains synchrony perception as an inference about the causal relationship between two events. Several features of synchrony judgment curves emerge directly from the computation of this inference process. First, the presence of uncertainty in the sensory system leads to a broad distribution of synchrony responses rather than a single peak near μC=1. When an observer hears and sees a talker, the measured asynchrony is corrupted by sensory noise. The optimal observer takes this noise into account and makes an inference about the likelihood that the auditory and visual speech arose from the same talker, and therefore, are synchronous. This decision process can lead to an overall synchrony judgment curve with a noticeably flattened peak, as observed behaviorally.

Second, the rightward shift (toward visual-leading asynchronies) of the maximal point of synchrony is explained by the natural statistics of audiovisual speech coupled with noise in the sensory system. Because the mean of the common cause distribution (μC=1) is over the visual-leading asynchronies, small positive asynchronies are more consistent with a common cause than small negative asynchronies. This feature of the synchrony judgment curve is enhanced by the location of the C = 2 distribution at a physical asynchrony of 0 ms.

WHAT ABOUT THE TEMPORAL BINDING WINDOW AND THE MEAN POINT OF SYNCHRONY?
The Gaussian model is used to obtain measures of the temporal binding window and mean point of synchrony in order to compare individuals and groups. In our formulation of the CIMS model, we introduced the Bayes-optimal synchrony window. This synchrony window should not be confused with the temporal binding window. The temporal binding window is predicated on the idea that observers have access to the physical asynchrony of the stimulus, which cannot be correct: observers only have access to a noisy representation of the world. The CIMS model avoids this fallacy by defining a synchrony window based on the observer's noisy measurement of the physical asynchrony. The predicted synchrony reports from the CIMS model therefore relate to the probability that a measured asynchrony will land within the Bayes-optimal window, not whether a physical asynchrony is sufficiently small. This distinction is a critical difference between the generative modeling approach of the CIMS model and the curve-fitting approach of the Gaussian model.

In the CIMS model, the shape of the behavioral curve emerges naturally from the assumptions of the model, and is a result of interactions between all model parameters. In contrast, the mean point of synchrony in the Gaussian model defines a single value of the behavioral data. This poses a number of problems. First, the behavioral data often show a broad plateau, meaning that a lone "peak" mean point of synchrony fails to capture a prominent feature of the behavioral data. Second, the location of the center of the behavioral data is not a fixed property of the observer, but reflects the contributions of prior beliefs, sensory noise, and stimulus characteristics. By separately estimating these contributions, the CIMS model can make predictions about behavior across experiments.

MODIFICATIONS TO THE GAUSSIAN MODEL
The general form of the Gaussian model used in this paper has been used in many published studies on synchrony judgments (Conrey and Pisoni, 2006; Navarra et al., 2010; Vroomen and Keetels, 2010; Baskent and Bazo, 2011; Love et al., 2013). It is possible to modify the Gaussian model used in this paper to improve its fit to the behavioral data, for instance by fitting each side of the synchrony judgment curve with separate Gaussian curves


(Powers et al., 2009; Hillock et al., 2011; Stevenson et al., 2013). In the current study, the Gaussian model required 12 parameters per subject; fitting each half of the synchrony judgment curve would require 20 parameters per subject, more than twice the number of parameters in the CIMS model. Additionally, the CIMS model required only 4 parameters to characterize changes across experimental conditions. Although researchers may continue to add more flexibility to the Gaussian model to increase its fit to the behavioral data (Stevenson and Wallace, 2013), the fundamental problem remains: the only definition of model goodness is that it fits the behavioral data "better." In the limit, a model with as many parameters as data points can exactly fit the data. Such models are incapable of providing a deeper understanding of the underlying properties of speech perception because their parameters have no relationship with underlying cognitive or neural processes.

Researchers using the Gaussian model could also try to reduce the number of free parameters by fixing certain parameters across conditions, or using values estimated from independent samples. This approach is necessarily post-hoc, providing no rationale about which parameters should remain fixed (or allowed to vary) before observing the data.

CONCLUSION
The CIMS model affords a quantitative grounding for multisensory speech perception that recognizes the fundamental role of causal inference in general multisensory integration (Kording et al., 2007; Sato et al., 2007; Schutz and Kubovy, 2009; Shams and Beierholm, 2010; Buehner, 2012). Most importantly, the CIMS model represents a fundamental departure from curve-fitting approaches. Rather than focusing on mimicking the shape of the observed data, the model suggests how the data are generated through a focus on the probabilistic inference problem that underlies synchrony perception. The model parameters thus provide principled measures of multisensory speech perception that can be used in healthy and clinical populations. More generally, causal inference models make no assumptions about the nature of the stimuli being perceived, but provide strong predictions about multisensory integration across time and space. While the current incarnation of the CIMS model considers only temporal asynchrony, the theoretical framework is amenable to other cues that may be used to make causal inference judgments during multisensory speech perception, such as the spatial location of the auditory and visual cues, the gender of the auditory and visual cues, or the speech envelope.

ACKNOWLEDGMENTS
This research was supported by NIH R01NS065395 to Michael S. Beauchamp. The authors are grateful to Debshila Basu Mallick, Haley Lindsay, and Cara Miekka for assistance with data collection and to Luis Hernandez and David Pisoni for sharing the asynchronous video stimuli.

REFERENCES
Baskent, D., and Bazo, D. (2011). Audiovisual asynchrony detection and speech intelligibility in noise with moderate to severe sensorineural hearing impairment. Ear Hear. 32, 582–592. doi: 10.1097/AUD.0b013e31820fca23
Beauchamp, M. S., Argall, B. D., Bodurka, J., Duyn, J. H., and Martin, A. (2004). Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat. Neurosci. 7, 1190–1192. doi: 10.1038/nn1333
Bejjanki, V. R., Clayards, M., Knill, D. C., and Aslin, R. N. (2011). Cue integration in categorical tasks: insights from audio-visual speech perception. PLoS ONE 6:e19812. doi: 10.1371/journal.pone.0019812
Brainard, D. H. (1997). The psychophysics toolbox. Spat. Vis. 10, 433–436. doi: 10.1163/156856897X00357
Buehner, M. J. (2012). Understanding the past, predicting the future: causation, not intentional action, is the root of temporal binding. Psychol. Sci. 23, 1490–1497. doi: 10.1177/0956797612444612
Conrey, B., and Pisoni, D. B. (2006). Auditory-visual speech perception and synchrony detection for speech and nonspeech signals. J. Acoust. Soc. Am. 119, 4065–4073. doi: 10.1121/1.2195091
Conrey, B. L., and Pisoni, D. B. (2004). "Detection of auditory-visual asynchrony in speech and nonspeech signals," in Research on Spoken Language Processing Progress Report No. 26 (Bloomington, IN: Indiana University), 71–94.
Foss-Feig, J. H., Kwakye, L. D., Cascio, C. J., Burnette, C. P., Kadivar, H., Stone, W. L., et al. (2010). An extended multisensory temporal binding window in autism spectrum disorders. Exp. Brain Res. 203, 381–389. doi: 10.1007/s00221-010-2240-4
Hillock, A. R., Powers, A. R., and Wallace, M. T. (2011). Binding of sights and sounds: age-related changes in multisensory temporal processing. Neuropsychologia 49, 461–467. doi: 10.1016/j.neuropsychologia.2010.11.041
Hillock-Dunn, A., and Wallace, M. T. (2012). Developmental changes in the multisensory temporal binding window persist into adolescence. Dev. Sci. 15, 688–696. doi: 10.1111/j.1467-7687.2012.01171.x
Kording, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B., and Shams, L. (2007). Causal inference in multisensory perception. PLoS ONE 2:e943. doi: 10.1371/journal.pone.0000943
Lachs, L., and Hernandez, L. R. (1998). "Update: the Hoosier audiovisual multitalker database," in Research on Spoken Language Processing Progress Report No. 22 (Bloomington, IN), 377–388.
Loftus, G. R., and Masson, M. E. (1994). Using confidence intervals in within-subject designs. Psychon. Bull. Rev. 1, 476–490. doi: 10.3758/BF03210951
Love, S. A., Petrini, K., Cheng, A., and Pollick, F. E. (2013). A psychophysical investigation of differences between synchrony and temporal order judgments. PLoS ONE 8:e54798. doi: 10.1371/journal.pone.0054798
Ma, W. J. (2012). Organizing probabilistic models of perception. Trends Cogn. Sci. 16, 511–518. doi: 10.1016/j.tics.2012.08.010
Ma, W. J., Zhou, X., Ross, L. A., Foxe, J. J., and Parra, L. C. (2009). Lip-reading aids word recognition most in moderate noise: a Bayesian explanation using high-dimensional feature space. PLoS ONE 4:e4638. doi: 10.1371/journal.pone.0004638
Massaro, D. W. (1989). Testing between the TRACE model and the fuzzy logical model of speech perception. Cogn. Psychol. 21, 398–421. doi: 10.1016/0010-0285(89)90014-5
Massaro, D. W., Cohen, M. M., Campbell, C. S., and Rodriguez, T. (2001). Bayes factor of model selection validates FLMP. Psychon. Bull. Rev. 8, 1–17. doi: 10.3758/BF03196136
Navarra, J., Alsius, A., Velasco, I., Soto-Faraco, S., and Spence, C. (2010). Perception of audiovisual speech synchrony for native and non-native language. Brain Res. 1323, 84–93. doi: 10.1016/j.brainres.2010.01.059
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spat. Vis. 10, 437–442. doi: 10.1163/156856897X00366
Powers, A. R. 3rd., Hillock, A. R., and Wallace, M. T. (2009). Perceptual training narrows the temporal window of multisensory binding. J. Neurosci. 29, 12265–12274. doi: 10.1523/JNEUROSCI.3501-09.2009
R Core Team. (2012). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
Rosenblum, L. D., Johnson, J. A., and Saldana, H. M. (1996). Point-light facial displays enhance comprehension of speech in noise. J. Speech Hear. Res. 39, 1159–1170.
Rouger, J., Fraysse, B., Deguine, O., and Barone, P. (2008). McGurk effects in cochlear-implanted deaf subjects. Brain Res. 1188, 87–99. doi: 10.1016/j.brainres.2007.10.049
Sato, Y., Toyoizumi, T., and Aihara, K. (2007). Bayesian inference explains perception of unity and ventriloquism aftereffect: identification of common sources of audiovisual stimuli. Neural Comput. 19, 3335–3355. doi: 10.1162/neco.2007.19.12.3335
Schutz, M., and Kubovy, M. (2009). Causality and cross-modal integration. J. Exp. Psychol. Hum. Percept. Perform. 35, 1791–1810. doi: 10.1037/a0016455
Schwartz, J. L., Berthommier, F., and Savariaux, C. (2004). Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition 93, B69–B78. doi: 10.1016/j.cognition.2004.01.006
Shams, L., and Beierholm, U. R. (2010). Causal inference in perception. Trends Cogn. Sci. 14, 425–432. doi: 10.1016/j.tics.2010.07.001
Smith, E. G., and Bennetto, L. (2007). Audiovisual speech integration and lipreading in autism. J. Child Psychol. Psychiatry 48, 813–821. doi: 10.1111/j.1469-7610.2007.01766.x
Stevenson, R. A., Altieri, N. A., Kim, S., Pisoni, D. B., and James, T. W. (2010). Neural processing of asynchronous audiovisual speech perception. Neuroimage 49, 3308–3318. doi: 10.1016/j.neuroimage.2009.12.001
Stevenson, R. A., and Wallace, M. T. (2013). Multisensory temporal integration: task and stimulus dependencies. Exp. Brain Res. 227, 249–261. doi: 10.1007/s00221-013-3507-3
Stevenson, R. A., Wilson, M. M., Powers, A. R., and Wallace, M. T. (2013). The effects of visual training on multisensory temporal processing. Exp. Brain Res. 225, 479–489. doi: 10.1007/s00221-012-3387-y
Stevenson, R. A., Zemtsov, R. K., and Wallace, M. T. (2012). Individual differences in the multisensory temporal binding window predict susceptibility to audiovisual illusions. J. Exp. Psychol. Hum. Percept. Perform. 38, 1517–1529. doi: 10.1037/a0027339
Sumby, W. H., and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215. doi: 10.1121/1.1907309
Vroomen, J., and Keetels, M. (2010). Perception of intersensory synchrony: a tutorial review. Atten. Percept. Psychophys. 72, 871–884. doi: 10.3758/APP.72.4.871
Wallace, M. T., Roberson, G. E., Hairston, W. D., Stein, B. E., Vaughan, J. W., and Schirillo, J. A. (2004). Unifying multisensory signals across time and space. Exp. Brain Res. 158, 252–258. doi: 10.1007/s00221-004-1899-9
Zatorre, R. J., Bouffard, M., Ahad, P., and Belin, P. (2002). Where is 'where' in the human auditory cortex? Nat. Neurosci. 5, 905–909. doi: 10.1038/nn904

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Received: 20 August 2013; accepted: 10 October 2013; published online: 13 November 2013.
Citation: Magnotti JF, Ma WJ and Beauchamp MS (2013) Causal inference of asynchronous audiovisual speech. Front. Psychol. 4:798. doi: 10.3389/fpsyg.2013.00798
This article was submitted to Perception Science, a section of the journal Frontiers in Psychology.
Copyright © 2013 Magnotti, Ma and Beauchamp. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
