Cross-modality comodulation and the release from masking (grant)


Subject: Cross-modality comodulation and the release from masking
From:    grant  <grant(at)NICOM.COM>
Date:    Wed, 8 Apr 1998 11:08:06 -0400

John Hershey wrote:

> Ken, I've been reading your recent postings to the auditory list with
> increasing interest. I am developing a model of audio-visual interaction
> in sound localization that parallels aspects of your work, especially:
>
> > ... We are currently looking into the correlation, on a
> > sentence-by-sentence basis, between the time course of lip opening and
> > the rms amplitude fluctuations in the speech signals, both broadband and
> > in selected spectral bands (especially the F2 region), as an explanation
> > for the differences across sentences. These results indicate that
> > cross-modal comodulation between visual and acoustic signals can reduce
> > stimulus uncertainty in auditory detection and reduce thresholds for
> > detection.
>
> I would like to read your work more closely. Would you send me references
> or postscript reprints if they're available?
>
> Here is an abstract of ours submitted to the 5th Annual Joint Symposium on
> Neural Computation, Saturday, May 16, 1998 (John Hershey & Javier Movellan).
>
> Title: Looking for sounds: using audio-visual mutual information to locate
> sound sources.
>
> Abstract:
>
> Evidence from psychophysical experiments shows that, in humans,
> localization of acoustic signals is strongly influenced by synchrony with
> visual signals (e.g., this effect is at work when sound coming from the
> side of your TV feels as if it were coming from the mouth of the actors).
> This effect, known as ventriloquism, suggests that speaker localization is
> a multimodal process and that the perceptual system is tuned to finding
> dependencies between audio and visual signals. In spite of this evidence,
> most systems for automatic speaker localization use only acoustic
> information, taking advantage of standard stereo cues.
>
> In this talk we present a progress report on a real-time audio-visual
> system for automatic speaker localization and tracking. The current
> implementation uses input from a single camera and a microphone and looks
> for regions of the visual landscape that correlate highly with the
> acoustic signal. These regions are tagged as highly likely to contain an
> acoustic source. We will discuss our experience with the system and its
> potential theoretical and practical applications.
>
> John Hershey
> Cognitive Science, UCSD
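[For readers who want a concrete picture of the kind of system the abstract
describes, here is a minimal sketch under stated assumptions. In place of the
mutual-information measure Hershey and Movellan name, it uses plain Pearson
correlation between each pixel's intensity time series and a per-frame audio
energy envelope, and treats high-correlation regions as candidate sound
sources. Array shapes and names are illustrative, not taken from their
implementation.]

import numpy as np

def audio_visual_correlation_map(frames, audio_energy):
    # frames: (T, H, W) grayscale video; audio_energy: (T,) per-frame RMS energy.
    T, H, W = frames.shape
    pix = frames.reshape(T, H * W).astype(float)
    pix -= pix.mean(axis=0, keepdims=True)      # zero-mean each pixel over time
    aud = audio_energy.astype(float)
    aud -= aud.mean()                           # zero-mean the audio envelope
    num = pix.T @ aud                           # covariance numerator per pixel
    den = np.sqrt((pix ** 2).sum(axis=0) * (aud ** 2).sum()) + 1e-12
    return (num / den).reshape(H, W)            # Pearson r per pixel, in [-1, 1]

# Usage: regions where the map is large (e.g., abs(corr) > 0.5) can be tagged
# as likely to contain the sound source; a real-time tracker would recompute
# the map over a sliding window of recent frames.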
Hi John,

The work you will be describing sounds interesting. I have an HTML version
of a condensed 2-page manuscript describing work that I will present in
Seattle at the ASA meeting in June. You can find the paper at the following
address:

ASACondensedMS.htm (http://members.aol.com/kwgrant/ASACondensedMS.htm)

I have just begun to make measurements of the correlation between lip
kinematics and the acoustic amplitude-envelope fluctuations derived from the
whole speech signal as well as from separate spectral regions of speech
(roughly corresponding to the F1, F2, and F3 regions). Depending on the
phonetic content of the sentence (e.g., /u/ obscures many of the lip
movements), the correlation may be strong or weak, but so far the area of lip
opening and the amplitude envelope in the F2 region (800-2500 Hz) appear to
be the most strongly correlated. This is very preliminary, so I don't know
how well it will hold up for different sentences and different talkers. But
it is consistent with what we know about the information conveyed by
speechreading: place of articulation is the primary cue derived from
speechreading, and place cues are carried primarily by F2 transitions.
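[For concreteness, here is one way such a sentence-level correlation could be
computed. This is a sketch under my own assumptions (filter order, frame
rate, function and variable names), not the analysis procedure used in the
work described above: band-pass the speech to the F2 region, take a
frame-by-frame RMS envelope at the video frame rate, and correlate it with
the measured area of lip opening.]

import numpy as np
from scipy.signal import butter, filtfilt

def band_rms_envelope(speech, fs, lo=800.0, hi=2500.0, frame_rate=30.0):
    # Band-pass to the F2 region, then take one RMS value per video frame.
    b, a = butter(4, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    band = filtfilt(b, a, speech)
    hop = int(round(fs / frame_rate))
    n_frames = len(band) // hop
    return np.array([np.sqrt(np.mean(band[i * hop:(i + 1) * hop] ** 2))
                     for i in range(n_frames)])

def lip_envelope_correlation(lip_area, speech, fs):
    # Pearson correlation between area of lip opening (sampled at the video
    # frame rate) and the F2-band amplitude envelope of the same sentence.
    env = band_rms_envelope(speech, fs)
    n = min(len(lip_area), len(env))
    return np.corrcoef(np.asarray(lip_area[:n], dtype=float), env[:n])[0, 1]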
Please keep me posted about your progress in this area, and don't hesitate to
contact me if you have any questions.

Ken

--
Ken W. Grant
Research Audiologist
Army Audiology & Speech Center, Research Section
Walter Reed Army Medical Center, Washington, DC 20307-5001
email: grant(at)nicom.com
tel (work): (202) 782-8596   fax: (202) 782-9228


This message came from the mail archive
http://www.auditory.org/postings/1998/
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University