Summary of ATR-Kyoto Workshop
To AUDITORY from Bill Hartmann, 23 September 1994
In response to the suggestion from Dan Ellis, I'll review the recent workshop
A BIOLOGICAL FRAMEWORK FOR SPEECH PERCEPTION AND PRODUCTION
held at the Advanced Telecommunication Research Institute International
(ATR) in Kyoto, Japan on the 16th and 17th of September 1994.
I've tried to represent the ideas fairly, but the reader should be
warned that these summaries have not been cleared with the authors. Written
versions of the talks given at this meeting should be available, as an ATR
report, about six weeks from now. The reader may request a copy from
Hideki Kawahara email@example.com
ATR Human Information Processing Research Labs.
2-2 Hikaridai, Seika-cho Soraku-gun, Kyoto 619-02, Japan
Phone: +81-7749-5-1020 Facsimile: +81-7749-5-1008.
The conference was opened by welcoming addresses from Yo'ichi
TOHKURA, President of the ATR Human Information Processing Research Laboratories,
and from Kohei HABARA, Chairman of the Board.
There followed fourteen technical presentations:
**(1)** "Impact of Biological Aspects of Speech Perception and Production on
Future Communication Systems"
Hideki KAWAHARA
ATR Human Information Processing Research Labs. 2-2 Hikaridai, Seika-cho
Soraku-gun, Kyoto 619-02, Japan [firstname.lastname@example.org]
Hideki stressed the importance of a holistic approach to human
communication, which combines information across modalities. The model should
link speech production with speech perception, where perceived elements are
dynamic and evolving, and should take proper account of the effects of early
development vs later training. An impressive parallel between modalities is
the strong correlation between auditory acuity and visual acuity across species.
Contrasting roles of early development and later training are shown by R/L
distinctions made by native Japanese speakers. It is not too soon to start
thinking about holistic perception models because computers capable of
implementing them are expected early in the next century.
**(2)** "Beyond Sensory Processing: The effects of Learning, Memory, and
Cross-Modal Integration on Speech Perception"
Patricia K. KUHL
Department of Speech and Hearing Sciences (WJ-10), University of Washington,
Seattle, WA 98195 [email@example.com]
Unfortunately, Professor Kuhl was unable to attend. As she was the
only woman on the program, her absence was particularly regrettable. Her work
was represented by a video tape on the McGurk effect. The tape shows the
influence of visual information on auditory perception of phonemes. On the
audio track of the tape is recorded "Ba ba, Ba ba, ..." On the video track of
the tape there is a human face (Professor Kuhl's) articulating the syllables
"Ga ga, Ga ga, ..." synchronized with the audio. Watching and listening to the
tape, the observer perceives "Da da, Da da, ..." or "Tha tha, Tha tha, ..."
Since "Ba" is made at the front of the mouth and "Ga" is made in the back, the
observer's perception is a compromise between audio and visual information.
Opening and closing one's eyes as the tape rolls makes a dramatic contrast in
what one "hears."
**(3)** "Learning to Recognize Speech in Noisy Environments"
Department of Computer Science, University of Sheffield, Regent Court,
211 Portobello Street, Sheffield, S1 4DP, U.K.
Martin observed that speech in noise can be treated with standard
auditory object grouping rules, such as common onset, but that this grouping leads
to a fragmented speech pattern as the input for higher level processors. His
talk investigated the effects of a fragmented spectrogram both as training and
as test input in automatic speech recognition systems, a Kohonen Net and a
Hidden Markov Model. He showed that the latter system can actually produce
better performance if weak features of a speech signal are omitted from the
**(4)** "A Neurobiological Perspective on Speech Production"
Vincent L. GRACCO
Haskins Laboratories, 270 Crown St., New Haven, CT 06511 USA
Vincent Gracco described speech production as the result of coordinated
activity that is distributed among different cortical and subcortical areas
of the brain. Evidence for this model comes from neuroanatomic studies,
electrical stimulation physiology, and observations of parallel deficits in
neurological disorders. The flexibility of the neuromotor centers in coping
with diverse contextual challenges indicates continual intervention from the
high-level cognitive centers.
**(5)** "Somatoneural relation in the auditory-articulatory linkage"
Kiyoshi HONDA
ATR Human Information Processing Research Labs, 2-2 Hikaridai,
Seika-cho Soraku-gun Kyoto 619-02, Japan [email: firstname.lastname@example.org ]
Kiyoshi Honda considered the relationship between speech perception
and speech production, particularly the evolution of the peripheral mechanisms
of production to accommodate the acoustical and neural requirements of the
perceptual situation. The general somatoneural principle, that brain
organization must adjust to the shape of the body, accounts for the
formulation of a tight auditory-articulatory linkage enforced by analogous
speech representations in the motor and sensory spaces.
**(6)** "Robust speech recognition based on human binaural perception"
Richard M. STERN and Thomas M. Sullivan
Department of Electrical and Computer Engineering and School of Computer
Science, Carnegie Mellon University, Pittsburgh, PA 15213
Rich Stern presented a front end for automatic speech recognition based
upon human binaural hearing. Speech signals from multiple microphones are
passed through bandpass filters and nonlinear rectification operations. The
outputs are then cross-correlated within each frequency band. The system has
been found to work better in noisy environments than a single channel system
and better than delay-and-add beamforming.
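
The processing chain described above (bandpass filtering, rectification, then cross-correlation within each band) can be sketched roughly as follows. This is only an illustration, not the authors' system: the brick-wall FFT filter, the half-wave rectifier, the two band edges, the two-microphone setup, and the function names are all assumptions chosen to keep the example short.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude brick-wall bandpass via FFT masking (stand-in for an auditory filter)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def band_crosscorr(left, right, fs, bands, max_lag):
    """Per band: filter, half-wave rectify, then circular cross-correlation."""
    lags = np.arange(-max_lag, max_lag + 1)
    out = np.zeros((len(bands), len(lags)))
    for b, (lo, hi) in enumerate(bands):
        l = np.maximum(bandpass_fft(left, fs, lo, hi), 0.0)
        r = np.maximum(bandpass_fft(right, fs, lo, hi), 0.0)
        for k, lag in enumerate(lags):
            # Advance the right channel by `lag` samples and compare with the left.
            out[b, k] = np.dot(l, np.roll(r, -lag))
    return lags, out

# Hypothetical test signal: noise reaching the second microphone 5 samples late.
fs = 8000
rng = np.random.default_rng(0)
left = rng.standard_normal(4000)
right = np.roll(left, 5)

bands = [(200.0, 800.0), (800.0, 2000.0)]   # assumed band edges in Hz
lags, cc = band_crosscorr(left, right, fs, bands, max_lag=16)
estimated_delay = lags[np.argmax(cc, axis=1)]  # one delay estimate per band
```

Each row of `cc` peaks at the interchannel delay for that band, which is the kind of cue a binaural front end can use to favor a source from one direction.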
**(7)** "The conceptual basis of modelling auditory processing in the brainstem"
Department of Human Sciences, Loughborough University, 14 Chester Close,
Loughborough, LE11 3BD, U.K.
Ray gave an overview of peripheral physiology and the potential
application of elements based on physiology in AI systems such as automatic
speech recognizers. Features include nonlinearity in the cochlea and
the generation of difference tones, limitations in neural synchrony,
adaptation and limited dynamic range in the auditory nerve, and cochlear
nucleus neurons (bushy cells, choppers and pausers) with a wide diversity of
behaviors, tunings and time constants.
**(8)** "Auditory Figure/Ground Separation"
Roy PATTERSON
MRC Applied Psychology Unit, 15 Chaucer Road, Cambridge, U.K.
A train of damped sinusoids and the same train played backwards
have the same power spectra, but they sound different. It follows that the
perception of pitch and tone color cannot depend entirely on the power
spectrum. The backwards signal, called "ramps," has the greater sine-tone
character, so long as the damping time is not too long. An auditory image
model, using strobed integration, agrees with these observations. [Note:
this work appears in the JASA that arrived today: vol 96; pages 1409-1418, and
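
The claim that a damped sinusoid train and its time reversal share a power spectrum is easy to check numerically. The sketch below is an illustration only; the sampling rate, carrier frequency, envelope period, and half-life are assumed values, not the ones used in the talk.

```python
import numpy as np

fs = 16000            # sampling rate in Hz (assumed)
period = 400          # samples per envelope cycle, i.e. 25 ms (assumed)
n_cycles = 8
carrier = 1000.0      # carrier frequency in Hz (assumed)
half_life = 0.004     # envelope half-life in seconds (assumed)

t = np.arange(period) / fs
# One cycle: exponentially decaying envelope times a sine carrier.
damped_cycle = np.exp(-np.log(2) * t / half_life) * np.sin(2 * np.pi * carrier * t)
damped = np.tile(damped_cycle, n_cycles)
ramped = damped[::-1]     # the same train played backwards

# Time reversal of a real signal leaves the magnitude spectrum unchanged.
mag_damped = np.abs(np.fft.fft(damped))
mag_ramped = np.abs(np.fft.fft(ramped))

same_spectrum = np.allclose(mag_damped, mag_ramped)
same_waveform = np.allclose(damped, ramped)
```

Here `same_spectrum` comes out true while `same_waveform` is false, so any audible difference between the two sounds has to come from phase, that is, from temporal structure.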
**(9)** "A Temporal Account of Complex Pitch"
William A. YOST
Parmly Hearing Institute, Loyola University of Chicago,
6525 N. Sheridan Rd. Chicago, IL 60626, USA
Whereas Roy Patterson (8) reported perceived differences in the absence
of spectral differences, Bill Yost reported major spectral differences that
are somehow not perceived. He used iterated rippled noise, made with two or
three stages, leading to power spectra with sharp peaks at harmonics of the
reciprocal of the delay time. Depending upon the details of the stage
connections, there may or may not be spectral ripples between the peaks. These
spectral ripples cannot be heard. By contrast, variations producing smaller
spectral changes are actually heard. The differences in the perceived pitch
strength can be accounted for by a model based upon the height of the first
peak in the autocorrelation function.
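
A minimal sketch of iterated rippled noise and the autocorrelation measure follows. It assumes one particular stage wiring, in which each stage delays the output of the previous stage and adds it back with unit gain; the talk compared more than one wiring, and the delay, noise length, and function names here are hypothetical. For this wiring with ideal white noise, the normalized peak at the delay should approach n/(n+1) after n stages.

```python
import numpy as np

def iterated_rippled_noise(noise, delay, gain, n_iter):
    """Cascade of delay-and-add stages (circular delay keeps the example short)."""
    y = noise.copy()
    for _ in range(n_iter):
        y = y + gain * np.roll(y, delay)
    return y

def normalized_autocorr(x, max_lag):
    """Circular autocorrelation for lags 0..max_lag, normalized to 1 at lag 0."""
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x, np.roll(x, k)) / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
noise = rng.standard_normal(32768)
delay = 80                     # samples; 5 ms at an assumed 16 kHz rate

peak_lags, heights = [], []
for n_iter in (1, 2, 3):
    irn = iterated_rippled_noise(noise, delay, gain=1.0, n_iter=n_iter)
    ac = normalized_autocorr(irn, max_lag=200)
    peak_lags.append(int(np.argmax(ac[1:]) + 1))   # largest peak away from lag 0
    heights.append(float(ac[delay]))               # height of the first peak
```

The first peak sits at the delay in every case, and its height grows with the number of stages, which is the quantity the model above ties to pitch strength.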
**(10)** "Extracting the fundamental frequencies of two concurrent sounds"
Robert P. CARLYON
MRC Applied Psychology Unit, 15 Chaucer Rd., Cambridge CB2 2EF, U.K.
Bob reviewed work showing that fundamental frequency differences
between two tones are more easily perceived if the spectral components of the
two tones are either all resolved or all unresolved. This result supports a
two-mechanism model for complex-tone pitch perception. An extension of this
work studied the ability to detect fundamental frequency changes in a complex
tone target given another complex tone as a masker. Components of the target
and masker were in the same spectral region. The data suggest that listeners
can perform this task if the components of the tones are resolved, but not if
they are unresolved.
**(11)** "Auditory model inversion for sound separation"
Interval Research Inc., Palo Alto, CA, USA
An interesting test of the retention of information in perceptual
models is to try to recreate the original acoustic waveform from displays
known as cochleagrams and correlograms. To invert a cochleagram one inverts
the automatic gain control process and then recovers a bipolar waveform from
the output of a rectifying haircell, using the technique of convex projection.
The waveform is finally recovered by inverse filtering each auditory channel
and adding the outputs. Inverting a correlogram is more difficult because phase
information is not originally present. The first step is to transform the
correlogram into a spectrogram. Then one recovers the missing phase
information from amplitude information that is redundant across neighboring
auditory filters. The process involves iterating between the spectrogram and
the calculated waveform.
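
The convex projection step, recovering a bipolar waveform from a rectified one, can be illustrated with a toy example. This is a sketch under strong assumptions rather than the method presented: the "haircell" is an ideal half-wave rectifier, the channel is an ideal band defined on FFT bins, and the test signal and function names are invented for the illustration.

```python
import numpy as np

def project_bandlimited(x, keep):
    """Projection onto signals whose spectrum lies inside the channel's band."""
    spec = np.fft.rfft(x)
    spec[~keep] = 0.0
    return np.fft.irfft(spec, n=len(x))

def project_rectifier_consistent(x, rectified):
    """Projection onto signals that would rectify to the observed output."""
    known = rectified > 0
    out = np.minimum(x, 0.0)          # unknown samples must be non-positive
    out[known] = rectified[known]     # positive samples are known exactly
    return out

n = 512
t = np.arange(n)
# Hypothetical band-limited channel signal (components at bins 20, 27, 33).
original = (np.sin(2 * np.pi * 20 * t / n)
            + 0.6 * np.sin(2 * np.pi * 27 * t / n + 1.0)
            + 0.4 * np.cos(2 * np.pi * 33 * t / n))
rectified = np.maximum(original, 0.0)

keep = np.zeros(n // 2 + 1, dtype=bool)
keep[16:40] = True                    # assumed passband of the channel

estimate = rectified.copy()
for _ in range(200):
    estimate = project_bandlimited(estimate, keep)
    estimate = project_rectifier_consistent(estimate, rectified)

err_start = np.linalg.norm(rectified - original)
err_end = np.linalg.norm(estimate - original)
```

Because both constraint sets are convex and contain the original signal, alternating between them can never move the estimate farther from it, so `err_end` ends up below `err_start`.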
**(12)** "The computation of loudness in the auditory continuity phenomenon"
Stephen McADAMS, Marie-Claire Botte,
Francois Banide, Xavier Durot, and Carolyn Drake
Laboratoire de Psychologie Experimentale (CNRS), Universite Rene Descartes
EPHE, 28 rue Serpente, F-75006 Paris and IRCAM, 1 place Stravinsky,
F-75004 Paris [email: email@example.com]
Steve McAdams reported a quantitative check on the continuity illusion
created by periodically adding an increment to a tone. According to Bregman's
old-plus-new listening strategy the listener should hear a continuous tone
with a loudness determined by the tone without increment and a pulsed tone
with a loudness depending upon the size of the increment. Loudness matching
experiments agree for the continuous sensation, but the loudness of the pulsed
sensation does not agree with either a power or a pressure interpretation of
the increment. [It looks as though a loudness interpretation of the increment
would not agree either. wmh]
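
To make the two interpretations concrete, here is one plausible reading, offered as an assumption rather than as the authors' definitions: under the power interpretation the pulsed part carries the difference in power between the incremented and unincremented tone, and under the pressure interpretation it carries the difference in pressure amplitude. The function names are hypothetical.

```python
import math

def residual_level_power(base_db, increment_db):
    """Level of the 'new' part if powers subtract (increment_db must be > 0)."""
    total = 10 ** ((base_db + increment_db) / 10)
    base = 10 ** (base_db / 10)
    return 10 * math.log10(total - base)

def residual_level_pressure(base_db, increment_db):
    """Level of the 'new' part if pressure amplitudes subtract."""
    total = 10 ** ((base_db + increment_db) / 20)
    base = 10 ** (base_db / 20)
    return 20 * math.log10(total - base)

# Example: a 60 dB tone whose amplitude is periodically doubled.
base_db = 60.0
increment_db = 20 * math.log10(2)     # about 6 dB
level_power = residual_level_power(base_db, increment_db)
level_pressure = residual_level_pressure(base_db, increment_db)
```

With amplitude doubling, the pressure reading gives a pulsed component at the base level, while the power reading puts it 10 log10(3) dB above the base, so the two predictions are far enough apart for loudness matching to tell them apart.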
**(13)** "On the perceptual segregation of steady-state tones"
William Morris HARTMANN
Department of Physics, Michigan State University,
East Lansing, MI, 48824, USA. [e-mail: firstname.lastname@example.org]
I presented the results of mistuned harmonic experiments that show the
importance of neural synchrony in the process that detects a single mistuned
harmonic in an otherwise periodic complex tone. Mistuned harmonic matching
experiments find that performance decreases with increasing mistuned harmonic
number in a way that precisely parallels the loss of synchrony observed in
physiological recordings from eighth-nerve neurons. Further, mistuned
harmonic detection experiments show a non-monotonic dependence on signal level
that resembles the level dependence of multiple synchrony in the eighth nerve.
These experiments also exhibit structure in their dependence on tone duration and mistuning,
suggesting that synchrony anomalies, as measured by an autocorrelator, produce
perceptual segregation. Additional experiments suggest that synchrony
anomalies are detected in tuned channels.
**(14)** "On the perceptual distance between speech segments"
Oded GHITZA and M. Mohan Sondhi
AT&T Bell Laboratories, Acoustics Research Department,
Murray Hill, New Jersey, 07974, USA
Oded described the search for an objective measure of signal
differences that correlates with the perceptual distance between speech
diphones. In a diagnostic rhyme test, pairs of CVC words were subjected to
nine different interchanges of spectral regions in time-frequency space. The
difference metric was determined from listener errors as a function of these
distortions. Next, an automatic speech recognizer was given the same test and
distance parameters of the model were adjusted to lead to the same pattern of
errors as found with human listeners.
**(Finally)** It was a most enjoyable workshop because of the vigorous
exchange of ideas and the gracious hospitality of our hosts in Kyoto.