Summary of ATR-Kyoto Workshop

Subject: Summary of ATR-Kyoto Workshop
Date:    Sat, 24 Sep 1994 19:16:42 -0400

To AUDITORY from Bill Hartmann, 23 September 1994

In response to the suggestion from Dan Ellis, I'll review the recent workshop called A BIOLOGICAL FRAMEWORK FOR SPEECH PERCEPTION AND PRODUCTION, held at the Advanced Telecommunication Research Institute International (ATR) in Kyoto, Japan on the 16th and 17th of September 1994. I've tried to represent the ideas fairly, but the reader should be warned that these summaries have not been cleared with the authors.

Written versions of the talks given at this meeting should be available, as an ATR report, about six weeks from now. The reader may request a copy from

Hideki Kawahara kawahara(at)
ATR Human Information Processing Research Labs.
2-2 Hikaridai, Seika-cho Soraku-gun, Kyoto 619-02, Japan
Phone: +81-7749-5-1020
Facsimile: +81-7749-5-1008

The conference was opened by welcoming addresses from Yo'ichi TOHKURA, President of the ATR Human Information Processing Research Laboratories, and by Kohei HABARA, Chairman of the Board. There followed fourteen technical presentations:

**(1)** "Impact of Biological Aspects of Speech Perception and Production on Future Communication Systems" Hideki KAWAHARA ATR Human Information Processing Research Labs. 2-2 Hikaridai, Seika-cho Soraku-gun, Kyoto 619-02, Japan [kawahara(at)]

Hideki stressed the importance of a holistic approach to human communication, one that combines information across modalities. The model should link speech production with speech perception, where perceived elements are dynamic and evolving, and should take proper account of the effects of early development vs later training. An impressive parallel between modalities is the strong correlation between auditory acuity and visual acuity across species. The contrasting roles of early development and later training are shown by R/L distinctions made by native Japanese speakers.
It is not too soon to start thinking about holistic perception models because computers capable of implementing them are expected early in the next century.

**(2)** "Beyond Sensory Processing: The effects of Learning, Memory, and Cross-Modal Integration on Speech Perception" Patricia K. KUHL Department of Speech and Hearing Sciences (WJ-10), University of Washington, Seattle, WA 98195 [pkkuhl(at)]

Unfortunately, Professor Kuhl was unable to attend. As she was the only woman on the program, her absence was particularly regrettable. Her work was represented by a video tape on the McGurk effect. The tape shows the influence of visual information on auditory perception of phonemes. On the audio track of the tape is recorded "Ba ba, Ba ba, ..." On the video track of the tape there is a human face (Professor Kuhl's) articulating the syllables "Ga ga, Ga ga, ..." synchronized with the audio. Watching and listening to the tape, the observer perceives "Da da, Da da, ..." or "Tha tha, Tha tha, ..." Since "Ba" is made at the front of the mouth and "Ga" is made in the back, the observer's perception is a compromise between audio and visual information. Opening and closing one's eyes as the tape rolls makes a dramatic contrast in what one "hears."

**(3)** "Learning to Recognize Speech in Noisy Environments" Martin COOKE Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, U.K.

Martin observed that speech in noise can be treated with standard auditory object grouping rules, such as common onset, but that this grouping leads to a fragmented speech pattern as the input for higher level processors. His talk investigated the effects of a fragmented spectrogram both as training and as test input in two automatic speech recognition systems, a Kohonen Net and a Hidden Markov Model. He showed that the latter system (the Hidden Markov Model) can actually produce better performance if weak features of a speech signal are omitted from the input.
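
To make the last point of item (3) concrete, here is a toy sketch of how omitting weak features can help a pattern matcher. It is only an illustration and not the recognizer from the talk: the templates, the observation, the reliability mask and the function names are all made up, and a plain masked distance stands in for the Kohonen Net and Hidden Markov Model.

```python
# Toy illustration: matching with and without the weak (noise-dominated) channels.
# All numbers and names are hypothetical; this is not the system from the talk.

def masked_distance(observation, template, reliable):
    """Mean squared difference over the channels flagged as reliable."""
    terms = [(o - t) ** 2
             for o, t, keep in zip(observation, template, reliable) if keep]
    if not terms:
        raise ValueError("at least one channel must be marked reliable")
    return sum(terms) / len(terms)

def classify(observation, templates, reliable):
    """Return the name of the template closest to the observation."""
    return min(templates,
               key=lambda name: masked_distance(observation, templates[name], reliable))

# Two made-up spectral templates (energy in four channels).
templates = {
    "a": [9.0, 8.0, 1.0, 1.0],
    "b": [9.0, 2.0, 1.0, 7.0],
}

# Pattern "a" heard in noise: the two weak channels are swamped by the noise floor.
observation = [9.1, 6.0, 6.0, 6.5]

all_channels = [True, True, True, True]
weak_omitted = [True, True, False, False]  # assumed mask: keep only the strong channels

with_all = classify(observation, templates, all_channels)
with_mask = classify(observation, templates, weak_omitted)
print("all channels:", with_all, "| weak channels omitted:", with_mask)
```

With every channel included, the noise in the weak channels pulls the match toward the wrong template; with those channels left out, the strong channels decide. In practice the hard part is deciding which channels are reliable, which this sketch simply assumes is known.
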
**(4)** "A Neurobiological Perspective on Speech Production" Vincent L. GRACCO Haskins Laboratories, 270 Crown St., New Haven, CT 06511 USA

Vincent Gracco described speech production as the result of coordinated activity that is distributed among different cortical and subcortical areas of the brain. Evidence for this model comes from neuroanatomic studies, electrical stimulation physiology, and observations of parallel deficits in neurological disorders. The flexibility of the neuromotor centers in coping with diverse contextual challenges indicates continual intervention from the high-level cognitive centers.

**(5)** "Somatoneural relation in the auditory-articulatory linkage" Kiyoshi HONDA ATR Human Information Processing Research Labs, 2-2 Hikaridai, Seika-cho Soraku-gun Kyoto 619-02, Japan [email: honda(at) ]

Kiyoshi Honda considered the relationship between speech perception and speech production, particularly the evolution of the peripheral mechanisms of production to accommodate the acoustical and neural requirements of the perceptual situation. The general somatoneural principle that brain organization must adjust to the shape of the body accounts for the formulation of a tight auditory-articulatory linkage enforced by analogous speech representations in the motor and sensory spaces.

**(6)** "Robust speech recognition based on human binaural perception" Richard M. STERN and Thomas M. Sullivan Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Rich Stern presented a front end for automatic speech recognition based upon human binaural hearing. Speech signals from multiple microphones are passed through bandpass filters and nonlinear rectification operations. The outputs are then cross-correlated within each frequency band. The system has been found to work better in noisy environments than a single channel system and better than delay-and-add beamforming.
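
The rectify-then-cross-correlate step of item (6) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CMU front end: the bandpass stage is omitted (the inputs are taken to be the outputs of one band from two microphones), half-wave rectification stands in for the unspecified nonlinearity, and the test signal, delay and function names are invented.

```python
import random

def half_wave_rectify(signal):
    """Keep positive samples, zero the rest (an assumed form of the nonlinearity)."""
    return [max(0.0, x) for x in signal]

def cross_correlate(left, right, max_lag):
    """Cross-correlation sum of left[i] * right[i + lag] for lags in [-max_lag, max_lag]."""
    n = len(left)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        # Stay max_lag samples away from both ends so every index is valid.
        for i in range(max_lag, n - max_lag):
            total += left[i] * right[i + lag]
        result[lag] = total
    return result

def estimate_delay(left, right, max_lag):
    """Lag (in samples) at which the rectified band signals correlate best."""
    corr = cross_correlate(half_wave_rectify(left), half_wave_rectify(right), max_lag)
    return max(corr, key=corr.get)

# Hypothetical test case: the same noise reaches the second microphone 5 samples later.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_delay = 5
left = source
right = [0.0] * true_delay + source[:-true_delay]

estimated = estimate_delay(left, right, max_lag=10)
print("estimated inter-microphone delay (samples):", estimated)
```

A full system would repeat this in every frequency band and pass the pattern of correlation values, rather than a single delay estimate, on to the recognizer.
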
**(7)** "The conceptual basis of modelling auditory processing in the brainstem" Ray MEDDIS Department of Human Sciences, Loughborough University, 14 Chester Close, Loughborough, LE11 3BD, U.K.

Ray gave an overview of peripheral physiology and the potential application of elements based on physiology in AI systems such as automatic speech recognizers. Features include nonlinearity in the cochlea and the generation of difference tones, limitations in neural synchrony, adaptation and limited dynamic range in the auditory nerve, and cochlear nucleus neurons (bushy cells, choppers and pausers) with a wide diversity of behaviors, tunings and time constants.

**(8)** "Auditory Figure/Ground Separation" Roy PATTERSON MRC Applied Psychology Unit, 15 Chaucer Road, Cambridge, U.K.

A train of damped sinusoids and the same train played backwards have the same power spectra, but they sound different. It follows that the perception of pitch and tone color cannot depend entirely on the power spectrum. The backwards signal, called "ramps," has the greater sine-tone character, so long as the damping time is not too long. An auditory image model, using strobed integration, agrees with these observations. [Note: this work appears in the JASA that arrived today: vol 96; pages 1409-1418, and 1419-1428.]

**(9)** "A Temporal Account of Complex Pitch" William A. YOST Parmly Hearing Institute, Loyola University of Chicago, 6525 N. Sheridan Rd. Chicago, IL 60626, USA

Whereas Roy Patterson (8) reported perceived differences in the absence of spectral differences, Bill Yost reported major spectral differences that are somehow not perceived. He used iterated rippled noise, made with two or three stages, leading to power spectra with sharp peaks at harmonics of the reciprocal of the delay time. Depending upon the details of the stage connections, there may or may not be spectral ripples between the peaks. These spectral ripples cannot be heard.
By contrast, variations producing smaller spectral changes are actually heard. The differences in the perceived pitch strength can be accounted for by a model based upon the height of the first peak in the autocorrelation function.

**(10)** "Extracting the fundamental frequencies of two concurrent sounds" Robert P. CARLYON MRC Applied Psychology Unit, 15 Chaucer Rd., Cambridge CB2 2EF, U.K.

Bob reviewed work showing that fundamental frequency differences between two tones are more easily perceived if the spectral components of the two tones are either all resolved or all unresolved. This result supports a two-mechanism model for complex-tone pitch perception. An extension of this work studied the ability to detect fundamental frequency changes in a complex tone target given another complex tone as a masker. Components of the target and masker were in the same spectral region. The data suggest that listeners can perform this task if the components of the tones are resolved, but not if they are unresolved.

**(11)** "Auditory model inversion for sound separation" Malcolm SLANEY Interval Research Inc., Palo Alto, CA, USA

An interesting test of the retention of information in perceptual models is to try to recreate the original acoustic waveform from displays known as cochleagrams and correlograms. To invert a cochleagram one inverts the automatic gain control process and then recovers a bipolar waveform from the output of a rectifying haircell, using the technique of convex projection. The waveform is finally recovered by inverse filtering each auditory channel and adding the outputs. Inverting a correlogram is more difficult because phase information is not originally present. The first step is to transform the correlogram into a spectrogram. Then one recovers the missing phase information from amplitude information that is redundant across neighboring auditory filters. The process involves iterating between the spectrogram and the calculated waveform.
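
The convex-projection idea in the cochleagram step of item (11) can be illustrated with a toy example. This is a sketch under my own assumptions, not the procedure from the talk: the "channel" is an ideal band of DFT bins, the haircell is a plain half-wave rectifier, and the test signal, band edges, iteration count and function names are invented. The estimate is projected alternately onto two convex sets: band-limited signals, and signals that agree with the rectified observation.

```python
import cmath
import math

def dft(x):
    """Direct O(n^2) discrete Fourier transform (adequate for a short demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]

def project_band(x, low_bin, high_bin):
    """Project onto signals confined to DFT bins low_bin..high_bin (and their mirrors)."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        folded = min(k, n - k)  # treat negative frequencies like positive ones
        if not (low_bin <= folded <= high_bin):
            spectrum[k] = 0.0
    return idft(spectrum)

def project_rectifier(x, rectified):
    """Project onto signals consistent with the half-wave rectified observation."""
    out = []
    for value, observed in zip(x, rectified):
        if observed > 0.0:
            out.append(observed)          # positive samples are known exactly
        else:
            out.append(min(value, 0.0))   # elsewhere the waveform must be <= 0
    return out

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Hypothetical band-limited channel signal (bins 7 and 9 of a 64-point frame).
n = 64
original = [math.sin(2 * math.pi * 7 * t / n)
            + 0.5 * math.sin(2 * math.pi * 9 * t / n + 0.3) for t in range(n)]
rectified = [max(0.0, v) for v in original]

estimate = rectified
for _ in range(30):
    estimate = project_band(estimate, 6, 10)
    estimate = project_rectifier(estimate, rectified)

initial_error = rms_error(rectified, original)
final_error = rms_error(estimate, original)
print("RMS error before:", initial_error, "after:", final_error)
```

Because the true waveform belongs to both sets, each projection can only move the estimate closer to it; how quickly the error shrinks depends on the band and on the signal.
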
**(12)** "The computation of loudness in the auditory continuity phenomenon" Stephen McADAMS, Marie-Claire Botte, Francois Banide, Xavier Durot, and Carolyn Drake Laboratoire de Psychologie Experimentale (CNRS), Universite Rene Descartes EPHE, 28 rue Serpente, F-75006 Paris and IRCAM, 1 place Stravinsky, F-75004 Paris [email: smc(at)]

Steve McAdams reported a quantitative check on the continuity illusion created by periodically adding an increment to a tone. According to Bregman's old-plus-new listening strategy the listener should hear a continuous tone with a loudness determined by the tone without increment and a pulsed tone with a loudness depending upon the size of the increment. Loudness matching experiments agree for the continuous sensation, but the loudness of the pulsed sensation does not agree with either a power or a pressure interpretation of the increment. [It looks as though a loudness interpretation of the increment would not agree either. wmh]

**(13)** "On the perceptual segregation of steady-state tones" William Morris HARTMANN Department of Physics, Michigan State University, East Lansing, MI, 48824, USA. [e-mail: hartmann(at)]

I presented the results of mistuned harmonic experiments that show the importance of neural synchrony in the process that detects a single mistuned harmonic in an otherwise periodic complex tone. Mistuned harmonic matching experiments find that performance decreases with increasing mistuned harmonic number in a way that precisely parallels the loss of synchrony observed in physiological recordings from eighth-nerve neurons. Further, mistuned harmonic detection experiments show a non-monotonic dependence on signal level that resembles the level dependence of multiple synchrony in the eighth nerve. The detection data also exhibit structure in their dependence on tone duration and mistuning, suggesting that synchrony anomalies, as measured by an autocorrelator, produce perceptual segregation.
Additional experiments suggest that synchrony anomalies are detected in tuned channels.

**(14)** "On the perceptual distance between speech segments" Oded GHITZA and M. Mohan Sondhi AT&T Bell Laboratories, Acoustics Research Department, Murray Hill, New Jersey, 07974, USA

Oded described the search for an objective measure of signal differences that correlates with the perceptual distance between speech diphones. In a diagnostic rhyme test, pairs of CVC words were subjected to nine different interchanges of spectral regions in time-frequency space. The difference metric was determined from listener errors as a function of these distortions. Next, an automatic speech recognizer was given the same test, and the distance parameters of the model were adjusted to lead to the same pattern of errors as found with human listeners.

**(Finally)** It was a most enjoyable workshop because of the vigorous exchange of ideas and the gracious hospitality of our hosts in Kyoto.

This message came from the mail archive
maintained by:
Dan Ellis
Electrical Engineering Dept., Columbia University