Auditory-visual comodulation ("BRUNO H. Repp")

Subject: Auditory-visual comodulation
From:    "BRUNO H. Repp"  <repp(at)lenny.HASKINS.YALE.EDU>
Date:    Thu, 2 Apr 1998 11:41:39 -0500

Ken Grant wrote:

> With all due respect for Bruno, Frost, and Zsiga, we have begun to replicate
> these studies about the possible influence of simultaneous lip movements on
> auditory detection because the original Repp et al. study had serious
> problems in several respects. First, let me tell you the outcome before I
> tell you why Repp et al. failed to find the "correct" result. We are running
> an adaptive speech detection experiment with three different sentence
> targets. The sentences are presented in a background of white noise under
> three conditions: auditory alone, auditory plus simultaneously presented
> matching lipread information, and auditory plus simultaneously presented
> mismatched lipread information. The task is a two-interval forced-choice
> procedure in which the subjects have to indicate the interval that contains
> the speech plus noise. We are using a 3-down, 1-up adaptive procedure
> tracking the 79% point on the psychometric function. The speech is held
> constant, whereas the noise is controlled by the adaptive track. The results
> show a 1-4 dB release from masking, or bimodal coherence masking protection,
> depending on the sentence. We are currently looking into the correlation, on
> a sentence-by-sentence basis, between the time course of lip opening and the
> rms amplitude fluctuations in the speech signals, both broadband and in
> selected spectral bands (especially the F2 region), as an explanation for
> the differences across sentences. These results indicate that cross-modal
> comodulation between visual and acoustic signals can reduce stimulus
> uncertainty in auditory detection and thereby reduce detection thresholds.
> These results will be reported at the upcoming ASA meeting in Seattle
> (June).
>
> Now, why did the Repp et al. study fail to see these results? First, their
> equipment was incapable of precise (within 1-3 ms) acoustic-visual
> alignments and allowed for as much as 100 ms desynchronization across the
> modalities. If simultaneous comodulation of sensory information across the
> senses is important for this effect to occur, then a misalignment of the A
> and V components will weaken the effect. Second, and perhaps most important,
> Repp et al. used a speech-modulated noise as the masker. It is well known
> that lipreading plus speech-modulated noise leads to improved speech
> intelligibility (over speechreading alone) and that speech-modulated noise
> carries many speech cues by itself, capable of informing subjects about
> phonetic features at levels well above chance. Therefore, when the Repp et
> al. subjects saw a moving face accompanied by a noise-alone trial, they
> naturally heard speech (the bias effect), because the noise was indeed
> speech-like in many respects. In our study we use a noise whose modulation
> properties differ from those of the visual and acoustic signals, whereas the
> visual and acoustic signals share common modulation properties. This is an
> essential characteristic of all CMR studies and, more recently, of the
> coherence masking protection (CMP) described by Peter Gordon. And third
> (and finally), Repp et al. used disyllabic words with similar stress
> patterns, whereas our experiment used sentences. The shorter stimuli create
> similar temporal expectations as to when in the utterance the detection will
> occur, whereas the longer, more diverse sentences create a situation of
> greater temporal uncertainty as to when in the listening interval the
> detection will occur. That temporal uncertainty is alleviated to varying
> degrees by the visual information, thus reducing thresholds for detection.
> Several variants of this experiment have been proposed in a new grant
> submitted to the McDonnell-Pew Foundation, in collaboration with brain
> imaging and modeling studies conducted at UCSF and UC-Berkeley.

These are very interesting results!
However, I disagree that there were "problems" with our study that led us not to find the "correct" results. The aims of our study were different. We did not investigate the effect of comodulation of visual and auditory input on detectability; rather, we were interested in effects of the lexical status of the words to be detected. Effects of lexicality are definitely top-down, whereas comodulation effects of the sort that Ken has demonstrated are arguably bottom-up, even though they require cross-modal integration of some kind. In our study, as Ken has pointed out, there was always a considerable degree of comodulation between auditory and visual inputs. Also, our use of signal-correlated noise was quite deliberate and not a "problem". Only the presence of some inaccuracies in temporal alignment may be considered a shortcoming of our study. However, the degree of synchrony present was sufficient to lead to very clear lexical bias effects, whereas there was no effect on sensitivity to the presence of speech in (speech-like) noise. It seems unlikely to me that there would have been an effect on sensitivity if the synchrony had been more accurate.

Does the enhancement effect that Ken has demonstrated go away when auditory and visual inputs are misaligned by +/-50 ms? I would be surprised if that were the case. His "mismatched" lipread information, I suppose, is completely out of synch with the auditory (masked) input.

Bruno H. Repp
Haskins Laboratories
270 Crown Street
New Haven, CT 06511-6695
Phone: (203) 865-6163 (10:00 a.m. - 6:30 p.m.)
FAX: (203) 865-8963
e-mail: repp(at)haskins.yale.edu
WWW:

This message came from the mail archive
maintained by:
DAn Ellis <>
Electrical Engineering Dept., Columbia University