voiced/unvoiced detection ("John K. Bates")


Subject: voiced/unvoiced detection
From:    "John K. Bates"  <jkbates(at)COMPUTER.NET>
Date:    Fri, 13 Nov 1998 13:59:06 -0500

Dear List,

Pierre Divenyi's posting that cited de Boer's ideas on voice pitch mentioned some key ideas like: "de Boer did not base his model on autocorrelation" and, "stochastic process in which various alternative (instantaneous) pitches may coexist." According to my non-conformist interpretation, these ideas fit in with a "granular" approach to acoustic processing. Granularity allows a better statistical interpretation for addressing the voiced/unvoiced problem. The problem is to find the best method for getting the statistics, and classic autocorrelation is not it.

In a 1995 publication ("A model of auditory perception" in Control and Dynamic Systems, Vol 69, Multidimensional Systems: Signal Processing and Modeling Techniques, Ed. by C.T. Leondes, Academic Press) I described a periodicity detector that was able to extract pitch as well as make voiced/unvoiced decisions. It operates by _assuming_ that all sounds are composed of stochastic, multiple, instantaneous elementary periodicities. An elementary periodicity consists of three similar events spaced by two equal intervals. The range of assumed periodicities covers seven octaves. Each periodicity is recognized as an independent event and collected in a running histogram.

If there is a continuous sequence of similar periodicities, such as a voice or a musical tone, it can be called a "pitch." If a sound is purely random, the histogram of periodicities will show a flat distribution over the histogram spectrum. Band-limited randomness in whispered speech forms clusters denoting formant periodicities. Thus, with or without a periodic glottal vibration, the formant periodicities may be identified. Examples of voiced and whispered speech are shown in the aforementioned chapter. This scheme produces voiced/unvoiced decisions along with the phonetic segmentation that I also described there. These phonetic segments provide a meaningful, non-arbitrary interval for collecting statistics of the periodic events.
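
A minimal sketch of the elementary-periodicity idea described above, not the detector from the 1995 chapter. It assumes the sound has already been reduced to a sorted list of event times in seconds, treats all events as "similar," and uses a plain (not running) histogram. The function names, the 5% interval tolerance, the 0.1 ms shortest period, and the 12 bins per octave are all illustrative assumptions.

```python
import math


def elementary_periodicities(event_times, tolerance=0.05,
                             min_period=1e-4, octaves=7):
    """Return the period of every event triple with two nearly equal intervals.

    event_times must be sorted in ascending order (seconds).
    tolerance, min_period, and octaves are assumed values, not from the source.
    """
    max_period = min_period * 2 ** octaves  # seven octaves above min_period
    periods = []
    n = len(event_times)
    for i in range(n):
        for j in range(i + 1, n):
            first = event_times[j] - event_times[i]
            if first < min_period:
                continue  # interval too short, try a later second event
            if first > max_period:
                break  # later events only give longer intervals
            for k in range(j + 1, n):
                second = event_times[k] - event_times[j]
                if second > first * (1 + tolerance):
                    break  # second interval already too long
                if abs(second - first) <= tolerance * first:
                    # Three events, two (nearly) equal intervals:
                    # record one elementary periodicity.
                    periods.append((first + second) / 2)
    return periods


def period_histogram(periods, min_period=1e-4, octaves=7,
                     bins_per_octave=12):
    """Collect periods into log-spaced bins covering the assumed range."""
    counts = [0] * (octaves * bins_per_octave)
    for p in periods:
        idx = math.floor(math.log2(p / min_period) * bins_per_octave)
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts
```

With a steady 100 Hz event train, for example `[0.01 * i for i in range(20)]`, every detected period lands in a single histogram bin, which is the "pitch" case described above.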
I first tried getting the V/UV decision by analyzing periodicity clusters in the histogram distribution. Effectively, the histogram becomes the statistical equivalent of the correlator: the variance of the periodicity distribution can indicate the degree of randomness. However, I found experimentally that this method was not reliable. Instead, it is better to use the past history of periodic sequences to predict future periodic events. The method I chose compares the number of successful predictions against the average number of hits. For speech, predictions of voicing are limited to the normal pitch range. Experimental results have been consistent over a variety of utterances. Fricatives, plosives, and whispers are labeled as unvoiced, except for the vowel /oo/, which has a formant in the pitch range.

In discussions on this granular approach, the main concern is its apparent incompatibility with current models of cochlear function. However, in view of the List's current quandary on unvoiced speech, the results I have obtained indicate that a granular-based approach to auditory modeling might be in order. In any case, this method may be useful in speech recognition.

John Bates
Time/Space Systems
79 Sarles Lane, Pleasantville, NY 10570
(914)-747-3143, jkbates(at)ieee.org


This message came from the mail archive
http://www.auditory.org/postings/1998/
maintained by:
Dan Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University