
Re: musical tones in speech


Honestly, with the techniques you have applied so far, it was not possible
to detect "musical tones in speech".

Let's take the example sentences in Fig.1 of my paper, which most readers of
this list can easily look at:

Perhaps we can agree on the following facts:

1) For such a sentence there are typically about 200 f0 values, but only 9
speech targets. Because the vast majority of the 200 f0 values are pitch
transitions between speech targets, it is obvious that the pitch of the 9
speech targets will be hopelessly lost in data noise if you use all 200 f0
values for a histogram.

2a) Such a sentence typically has about 5 periods of voicelessness, i.e. 6
contiguous voiced sections. If your software extracts the minimum and the
maximum of each such section, you have 12 f0 values. Unless your software
is highly sophisticated, you will get mainly extreme f0 values that are
caused by section onset or section offset. Such erratic data have no
relation to speech targets.

2b) If you have software that rules out all erratic f0 points, some of the
12 f0 values will agree with the 9 speech targets. But it is uncertain
which of them, and how many. You would still have too much noise in your
histogram to see a pattern in the pitch distribution of speech targets.

2c) If your software takes "breath groups" as units, each such sentence
will be only one unit. You would get 2 f0 values per sentence. Due to the
problems described under (2a) and (2b) above, it would be uncertain whether
either of the 2 f0 values agrees with one of the 9 speech targets.
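
To make the counts in points 1 to 2c concrete, here is a minimal sketch of
the arithmetic. The function name and its defaults are only illustrative;
the numbers are the typical figures quoted above, and the "share" line
additionally assumes that each speech target contributes a single f0 value
to the histogram.

```python
def candidate_counts(n_f0=200, n_targets=9, n_voiceless_gaps=5):
    """Count the f0 values each extraction strategy would put in a histogram."""
    # 5 voiceless periods split the sentence into 6 voiced sections.
    voiced_sections = n_voiceless_gaps + 1
    return {
        # Point 1: every f0 value goes into the histogram.
        "all_f0_values": n_f0,
        # Assumption: one f0 value per speech target.
        "target_share_of_all_values": n_targets / n_f0,
        # Point 2a: one minimum and one maximum per voiced section.
        "extrema_per_voiced_section": 2 * voiced_sections,
        # Point 2c: one minimum and one maximum per breath group (sentence).
        "extrema_per_breath_group": 2,
        "voiced_sections": voiced_sections,
    }


counts = candidate_counts()
for name, value in counts.items():
    print(f"{name}: {value}")
```

Under these assumptions fewer than 5 percent of the raw f0 values would be
target values, and neither the 12 nor the 2 extrema are guaranteed to
coincide with any of the 9 targets.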

Perhaps we can now agree that there is no other way than to hand-mark the
speech targets in the f0 contours first, and then extract them.
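
As a rough illustration of "mark first, then extract", the sketch below
reads off the f0 value of the analysis frame nearest to each hand-marked
time. It is not the procedure used in the paper or at the IPO: the function
name, the nearest-frame rule, and the toy contour values are all
assumptions made up for this example.

```python
from bisect import bisect_left


def f0_at_marks(times, f0, marks):
    """Return the f0 value of the frame nearest to each hand-marked time.

    times must be sorted in ascending order and have the same length as f0.
    """
    picked = []
    for t in marks:
        i = bisect_left(times, t)
        # Candidate frames: the one just before t and the one at or after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        picked.append(f0[nearest])
    return picked


# Toy contour (invented values, 10 ms frames) and two hand-marked targets.
times = [0.00, 0.01, 0.02, 0.03, 0.04]
f0 = [120.0, 124.0, 131.0, 127.0, 118.0]
marks = [0.021, 0.04]
targets = f0_at_marks(times, f0, marks)
print(targets)
```

Only the values picked this way would go into the histogram; the
transitional frames between the marks are left out.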

There is another vital point. As reported in the paper, the IPO researchers
presented the speakers with sentence material that was likely to elicit
clear and reproducible peaks and valleys in the pitch contour. In much other
speech material, peaks and valleys are often accidental. It is unknown at
present whether such material also reflects an influence of an absolute
memory of musical tones. It might, and it might not.


Alain de Cheveigné wrote in a later message:
"As to how such an artifact could arise, a possibility is that the software
that was used to choose targets quantized F0 values to semitones."

Alain, it does not seem to be a convincing strategy to present speculations
on flaws in the work of others, just because that work doesn't fit one's
own incomplete results.

A semitone resolution would have been insufficient both for the original
studies at the IPO and for my study. The resolution of the raw f0 values was
25 cents, which is very common and sufficient for the studies at issue.
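
For readers who want to see the difference between the two grids, here is a
minimal sketch using the standard definition of the cent (1200 cents per
octave, 100 cents per semitone). The reference frequency and the 30-cent
test value are arbitrary assumptions for illustration, not data from either
study.

```python
import math


def hz_to_cents(f_hz, ref_hz):
    """Interval from ref_hz up to f_hz, in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def quantize_cents(cents, step):
    """Round an interval in cents to the nearest multiple of step."""
    return step * round(cents / step)


ref_hz = 110.0                        # arbitrary reference (assumption)
f_hz = ref_hz * 2 ** (30 / 1200)      # a pitch 30 cents above the reference

cents = hz_to_cents(f_hz, ref_hz)
fine = quantize_cents(cents, 25)      # 25-cent grid
coarse = quantize_cents(cents, 100)   # semitone grid

print(f"raw: {cents:.1f} cents, 25-cent grid: {fine}, semitone grid: {coarse}")
```

The worst-case rounding error is half the step, i.e. 12.5 cents on a
25-cent grid against 50 cents on a semitone grid.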

----- Original Message -----
From: Alain de Cheveigne' <Alain.de.Cheveigne@IRCAM.FR>
Sent: Friday, May 11, 2001 4:15 PM
Subject: Re: musical tones in speech