
Auditory-visual comodulation



Ken Grant wrote:

>With all due respect for Bruno, Frost, and Zsiga, we have begun to
>replicate these studies about the possible influence of simultaneous lip
>movements on auditory detection because the original Repp et al. study had
>serious problems in several respects. First, let me tell you the outcome
>before I tell you why Repp et al. failed to find the "correct" result. We
>are running an adaptive speech detection experiment with three different
>sentence targets. The sentences are presented in a background of white
>noise under three conditions: auditory alone, auditory plus simultaneously
>presented matching lipread information, and auditory plus simultaneously
>presented mismatched lipread information. The task is a two-interval
>forced-choice procedure in which the subjects have to indicate the
>interval that contains the speech plus noise. We are using a 3-down, 1-up
>adaptive procedure tracking the 79% point on the psychometric function.
>The speech is held constant, whereas the noise is controlled by the
>adaptive track. The results show a 1-4 dB release from masking, or
>bimodal coherence masking protection, depending on the sentence. We are
>currently looking into the correlation, on a sentence-by-sentence basis,
>between the time course of lip opening and the rms amplitude fluctuations
>in the speech signals, both broadband and in selected spectral bands
>(especially the F2 region), as an explanation for the differences across
>sentences. These results indicate that cross-modal comodulation between
>visual and acoustic signals can reduce stimulus uncertainty in auditory
>detection and thereby lower detection thresholds. These results will be
>reported at the upcoming ASA meeting in Seattle (June).
>
>Now, why did the Repp et al. study fail to see these results? First,
>their equipment was incapable of precise (within 1-3 ms) acoustic-visual
>alignment and allowed as much as 100 ms of desynchronization across the
>modalities. If simultaneous comodulation of sensory information across
>the senses is important for this effect to occur, then a misalignment of
>the A and V components will weaken the effect. Second, and perhaps most
>important, Repp et al. used a speech-modulated noise as the masker. It is
>well known that lipreading plus speech-modulated noise leads to improved
>speech intelligibility (over speechreading alone) and that speech-modulated
>noise carries many speech cues by itself, capable of informing subjects
>about phonetic features at levels well above chance. Therefore, when the
>Repp et al. subjects saw a moving face accompanied by a noise-alone trial,
>they naturally heard speech (the bias effect), because the noise was
>indeed speech-like in many respects. In our study we use a noise whose
>modulation properties differ from those of the visual and acoustic
>signals, whereas the visual and acoustic signals share common modulation
>properties. This is an essential characteristic of all CMR studies and,
>more recently, of the coherence masking protection (CMP) described by
>Peter Gordon. Third, and finally, Repp et al. used disyllabic words with
>similar stress patterns, whereas our experiment used sentences. The
>shorter stimuli create similar temporal expectations as to when in the
>utterance the detection will occur, whereas the longer, more diverse
>sentences create greater temporal uncertainty as to when in the listening
>interval the detection will occur. That temporal uncertainty is alleviated
>to varying degrees by the visual information, thereby lowering detection
>thresholds. Several variants of this experiment have been proposed in a
>new grant submitted to the McDonnell-Pew Foundation, in collaboration with
>brain imaging and modeling studies conducted at UCSF and UC-Berkeley.

        These are very interesting results! However, I disagree that there
were "problems" with our study that led us not to find the "correct" results.
The aims of our study were different. We did not investigate the effect
of comodulation of visual and auditory input on detectability. Rather, we
were interested in effects of the lexical status of the words to be detected.
Effects of lexicality are definitely top-down, whereas comodulation effects
of the sort that Ken has demonstrated are arguably bottom-up, even though
they require cross-modal integration of some kind. In our study, as Ken
has pointed out, there was always a considerable degree of comodulation between
auditory and visual inputs. Also, our use of signal-correlated noise was
quite deliberate and not a "problem". Only the presence of some inaccuracies
in temporal alignment may be considered a shortcoming of our study.
However, the degree of synchrony present was sufficient to lead to very clear
lexical bias effects, whereas there was no effect on sensitivity to the
presence of speech in (speech-like) noise. It seems unlikely to me that
there would have been an effect on sensitivity if the synchrony had been
more accurate. Does the enhancement effect that Ken has demonstrated go
away when auditory and visual inputs are misaligned by +/-50 ms? I would
be surprised if that were the case. His "mismatched" lipread information,
I suppose, is completely out of synch with the auditory (masked) input.
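
        For list members who want to play with the procedure, the 3-down,
1-up rule Ken describes is the transformed up-down method of Levitt (1971),
which converges near the 79.4%-correct point. Below is a minimal sketch in
Python of such a track for a 2IFC detection task in which the speech level
is fixed and the noise level is adaptive; the step size, starting level,
stopping rule, and simulated listener are illustrative assumptions, not
the parameters of Ken's experiment.

import random

def three_down_one_up(trial_fn, start_noise_db=55.0, step_db=2.0,
                      max_reversals=12):
    # Track the ~79.4%-correct point by adjusting the noise level:
    # after 3 consecutive correct responses the noise goes up (harder),
    # after any incorrect response it goes down (easier).
    # trial_fn(noise_db) runs one 2IFC trial and returns True if the
    # listener chose the interval containing speech plus noise.
    noise_db = start_noise_db
    correct_run = 0
    last_direction = None            # +1 = noise up, -1 = noise down
    reversal_levels = []

    while len(reversal_levels) < max_reversals:
        if trial_fn(noise_db):
            correct_run += 1
            if correct_run < 3:
                continue             # no level change until 3 in a row
            correct_run = 0
            direction = +1           # harder: raise the noise
        else:
            correct_run = 0
            direction = -1           # easier: lower the noise

        if last_direction is not None and direction != last_direction:
            reversal_levels.append(noise_db)   # level at each reversal
        last_direction = direction
        noise_db += direction * step_db

    # Threshold estimate: mean noise level over the final reversals.
    final = reversal_levels[-8:]
    return sum(final) / len(final)

# Toy listener for testing; replace with the real trial presentation.
def simulated_trial(noise_db, speech_db=65.0):
    snr = speech_db - noise_db
    p_correct = 0.5 + 0.5 / (1.0 + 10.0 ** (-(snr + 10.0) / 4.0))
    return random.random() < p_correct

print(three_down_one_up(simulated_trial))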
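
        Similarly, the sentence-by-sentence analysis Ken mentions (the
correlation between the time course of lip opening and the rms amplitude
fluctuations in selected spectral bands) could be sketched as follows. The
band edges for the F2 region (roughly 1000-2500 Hz), the 30-Hz frame rate,
and the zero-lag Pearson correlation are my assumptions, not a description
of his analysis.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_rms_envelope(audio, fs, frame_rate=30.0, band=(1000.0, 2500.0)):
    # Frame-wise rms of the audio after band-pass filtering (F2 region),
    # one value per video frame so it lines up with the lip-opening track.
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, audio)
    hop = int(round(fs / frame_rate))
    n_frames = len(filtered) // hop
    frames = filtered[:n_frames * hop].reshape(n_frames, hop)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def lip_envelope_correlation(lip_opening, audio, fs, frame_rate=30.0):
    # Zero-lag Pearson correlation between the lip-opening time course
    # and the band-limited rms amplitude envelope.
    env = band_rms_envelope(audio, fs, frame_rate)
    n = min(len(env), len(lip_opening))
    return np.corrcoef(np.asarray(lip_opening)[:n], env[:n])[0, 1]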

Bruno H. Repp
Haskins Laboratories
270 Crown Street
New Haven, CT 06511-6695

Phone:   (203) 865-6163 (10:00 a.m. - 6:30 p.m.)
FAX:     (203) 865-8963
e-mail:  repp@haskins.yale.edu
WWW:     http://www.haskins.yale.edu/Haskins/STAFF/repp.html