Some limitations of the pitch-based grouping
It is interesting to think about the evolutionary importance of the pitch sensation. In my opinion, selection pressures do not guarantee a perfect neural network for pitch extraction. I would like to point out some performance limits of human pitch sensation with respect to the task of auditory scene analysis.
It is widely accepted that pitch sensation is important in auditory grouping. The fact that vision has no comparable grouping principle leads me to think about the mechanical principle underlying pitch-based grouping in audition.
I think that the auditory system has evolved the ability to extract pitch because a number of natural oscillators produce harmonically related components. They are self-sustained oscillators with a one-dimensional attractor (a limit cycle) in phase space. However, some self-sustained oscillators do not produce harmonically related components, namely those whose attractor has a dimension higher than one. Pitch-based grouping cannot account for sounds produced by self-sustained oscillators with a strange attractor (chaotic oscillation) or a torus attractor.
Two-voice sounds, which have components at n*F1+m*F2 (where n and m are integers) and are produced by self-sustained oscillators with a two-dimensional torus attractor, are essential in bird acoustic communication. This is in sharp contrast to human acoustic communication, where such voices (biphonations) are considered pathological. This can be related to an important distinction between the syrinx and the larynx: the syrinx contains two independent sets of vibrating membranes under independent nervous control, whereas mammals appear to lack the fully independent anatomy and nervous control that would allow the vocal folds to have different characteristic vibration frequencies F1 and F2. [see, e.g., "Acoustic Communication" (edited by AM Simmons, AN Popper and RR Fay), Springer, 2003, pp. 85-86]
From the viewpoint of auditory scene analysis, birds should group the components at n*F1+m*F2 as a single entity because these components are likely produced by the same bird. So, the pitch sensation of birds might be two-dimensional. Their auditory system might extract F1 and F2 from a birdcall and then group all of the components at n*F1+m*F2.
Has the bird's auditory system evolved a sophisticated ability to group the components at n*F1+m*F2? I do not think so. However, this comparative perspective highlights a limitation of pitch-based grouping in human auditory processing: we cannot group a two-voice birdcall on the basis of harmonicity alone.
Pitch-based grouping works poorly for sounds produced by self-sustained oscillators with higher-dimensional attractors. Such oscillators do exist in everyday life.
Even if we restrict ourselves to sounds produced by self-sustained oscillators with one-dimensional attractors, pitch-based grouping sometimes works imperfectly. I have found three examples.
Example 1: Overtone singing.
Overtone singing is a vocal technique found in Central Asian cultures, by which one singer produces two pitches simultaneously. When listening to the performance, a high pitch of n*F0 can be perceived along with a low drone pitch of F0, because the formant centered at n*F0 has an extraordinarily narrow bandwidth. I have used a pitch model based on autocorrelation analysis to determine the pitch strength of n*F0, finding that the height of the corresponding peak increases as the formant bandwidth decreases. Autocorrelation functions of normal voices also show peaks corresponding to formants, but their heights are not comparable to that of the peak at 1/F0. The autocorrelation-based pitch model works very well here.
When listening to overtone singing, the auditory system extracts 'too many' pitches for grouping.
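The bandwidth dependence described above can be illustrated with a toy calculation. This is only a minimal sketch, not the pitch model used above: it assumes a harmonic complex whose component powers follow a single hypothetical resonance centered at n*F0, and it uses the fact that the normalized autocorrelation of a sum of sinusoids is a power-weighted sum of cosines. The function name, F0 = 150 Hz, n = 8, and the resonance shape are all assumptions chosen for illustration.

```python
import numpy as np

def formant_peak_height(bandwidth_hz, f0=150.0, n=8, num_harmonics=40):
    """Normalized autocorrelation at lag 1/(n*f0) for a harmonic complex
    whose component powers follow one resonance centered at n*f0."""
    k = np.arange(1, num_harmonics + 1)
    # Hypothetical formant shape: power gain of a simple resonance
    # with the given full bandwidth (an assumption, not a vocal-tract model).
    detune = (k * f0 - n * f0) / (bandwidth_hz / 2.0)
    power = 1.0 / (1.0 + detune ** 2)
    lag = 1.0 / (n * f0)
    # ACF of a sum of sinusoids = power-weighted sum of cosines.
    weighted = power * np.cos(2.0 * np.pi * k * f0 * lag)
    return float(np.sum(weighted) / np.sum(power))

# Peak height at the lag of the formant pitch for three bandwidths.
for bw in (400.0, 100.0, 20.0):
    print(bw, formant_peak_height(bw))
```

In this toy setting the value at lag 1/(n*F0) grows toward 1 as the bandwidth shrinks, because harmonic n comes to dominate the total power.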
Example 2: Natural periodic sounds with the predominance of upper odd harmonics.
A complex tone composed of three harmonics at 7f0, 9f0, and 11f0 could elicit three pitches: a prominent pitch at f0 and two weak pitches at 9f0/4 and 9f0/5. Natural periodic sounds in which upper odd harmonics predominate can be produced by a quasi-sinusoidally driven Duffing oscillator.
When listening to such sounds, the auditory system extracts 'too many' pitches for grouping.
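One way to see where the three pitch candidates might come from is to evaluate the autocorrelation of the 7f0 + 9f0 + 11f0 complex at the corresponding lags. This is a minimal sketch assuming equal component amplitudes and a plain (non-auditory) autocorrelation; the function name is hypothetical.

```python
import numpy as np

def acf_three_harmonics(lag_in_periods, harmonics=(7, 9, 11)):
    """Normalized ACF of an equal-amplitude complex tone.
    The lag is expressed in units of 1/f0."""
    h = np.asarray(harmonics, dtype=float)
    # Equal powers, so the ACF is the plain mean of the cosines.
    return float(np.mean(np.cos(2.0 * np.pi * h * lag_in_periods)))

# Lags 1/f0, 4/(9*f0), and 5/(9*f0) correspond to the pitches
# f0, 9*f0/4, and 9*f0/5 respectively.
for label, lag in [("f0", 1.0), ("9f0/4", 4.0 / 9.0), ("9f0/5", 5.0 / 9.0)]:
    print(label, round(acf_three_harmonics(lag), 3))
```

Under these assumptions the value is 1 at the lag of f0 and about 0.84 at the lags of 9f0/4 and 9f0/5, which is consistent with one prominent and two weaker candidates.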
Example 3: Natural periodic sounds with the predominance of lower even-numbered components.
The sound of an oscillator that has undergone period doubling can have weak odd-numbered components at lower frequencies. The pitch F0, which is extracted on the basis of the lower even-numbered components (the harmonics), is too high for grouping all components. A pitch sensation at F0/2 could accomplish this task, but the auditory system fails to perceive this pitch when the lower odd-numbered components (the subharmonics) are weak and masked by adjacent harmonics.
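The same toy autocorrelation picture gives a rough feeling for this case. The sketch below ignores masking entirely and only assumes unit-amplitude harmonics at k*F0 plus subharmonics of amplitude sub_amp at (k - 1/2)*F0; the function name and parameter values are hypothetical.

```python
import numpy as np

def acf_period_doubled(lag_in_f0_periods, sub_amp, num_harmonics=5):
    """Normalized ACF of harmonics at k*F0 (amplitude 1) plus
    subharmonics at (k - 1/2)*F0 (amplitude sub_amp).
    The lag is expressed in units of 1/F0."""
    k = np.arange(1, num_harmonics + 1, dtype=float)
    freqs = np.concatenate([k, k - 0.5])  # in units of F0
    power = np.concatenate([np.ones_like(k), np.full(k.shape, sub_amp ** 2)])
    weighted = power * np.cos(2.0 * np.pi * freqs * lag_in_f0_periods)
    return float(np.sum(weighted) / np.sum(power))

# Lag 1/F0 corresponds to the pitch F0, lag 2/F0 to the pitch F0/2.
for a in (0.5, 0.1):
    print(a, acf_period_doubled(1.0, a), acf_period_doubled(2.0, a))
```

At the lag of F0 the subharmonics contribute with a negative sign, so the value is (1 - a^2)/(1 + a^2) for subharmonic amplitude a, while the value at the lag of F0/2 is always 1. For weak subharmonics the two peaks are almost equally high, so a model that prefers the shorter lag would report F0 rather than F0/2.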
In summary, pitch-based grouping has two types of limitation: (1) the auditory system extracts 'too many' pitches for grouping, and (2) the components produced by the same object cannot be grouped on the basis of the extracted pitch.
In the first case, a periodic sound simply elicits multiple pitches. In the second case, ungrouped components may lead to a significant increase in roughness. That is why some birdcalls and pathological voices sound very rough.
Selection pressures do not guarantee a perfect pitch sensation that can accomplish the task of auditory scene analysis for all sounds. We need other "grouping rules" such as the rule of common onset. Note that only a portion of natural sounds is produced by self-sustained oscillators. So "grouping rules" other than harmonicity are needed in processing percussion-like sounds.
A mathematical question:
Consider the natural sounds composed of components at n*F1+m*F2.
Is it possible to formulate a computational model that can extract F1 and F2 and then group the components n*F1+m*F2?
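To make the question concrete, here is a minimal sketch of the grouping half only, assuming that F1 and F2 are already known; how to extract them is left open. Note that for incommensurate F1 and F2 the set of all n*F1+m*F2 with unbounded integers is dense, so any such model presumably has to bound the orders. The function name, the order bound max_order, the tolerance tol_hz, and the example frequencies are all hypothetical.

```python
from itertools import product

def label_components(freqs, f1, f2, max_order=3, tol_hz=1.0):
    """Map each frequency to the lowest-order (n, m) such that
    |n*f1 + m*f2 - f| <= tol_hz, or to None if no pair fits."""
    # Try low-order combinations first (order = |n| + |m|).
    orders = sorted(
        product(range(-max_order, max_order + 1), repeat=2),
        key=lambda nm: abs(nm[0]) + abs(nm[1]),
    )
    labels = {}
    for f in freqs:
        labels[f] = next(
            ((n, m) for n, m in orders if abs(n * f1 + m * f2 - f) <= tol_hz),
            None,
        )
    return labels

# Example: six components of a two-voice sound plus one intruder (900 Hz).
components = [230.0, 500.0, 730.0, 1000.0, 1230.0, 1460.0, 900.0]
labels = label_components(components, f1=500.0, f2=730.0)
grouped = [f for f, nm in labels.items() if nm is not None]
print(labels)
print(grouped)
```

With these assumed values, 230 Hz is labeled (-1, 1) and 1230 Hz is labeled (1, 1), while 900 Hz gets no label and stays outside the group.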
Any comment is much appreciated.
Thanks a lot
PhD Musicology, Humboldt University Berlin
--------- Original Message ---------
DATE: Fri, 16 Jan 2004 21:21:12
From: Dmitry Terez <terez@SOUNDMATHTECH.COM>
>I do think that correlation function has two fatal drawbacks...
>The first fatal drawback of correlation is the abundance of secondary
>peaks due to complex harmonic structure of a signal. For some real
>signals we are dealing with every day, such as speech, the secondary
>peaks in the correlation function due to speech formants (vocal tract
>resonances) are sometimes about the same height as the main peaks due
>to signal periodicity (pitch).
>I think that it would be strange if evolution resulted in such a
>suboptimal mechanism of perceiving sound periodicity.