Re: musical tones in speech


The questions you submitted to Bob Ladd were worth asking, and his answers
are worth reading, but they are irrelevant to the main question at this
point: how is it possible that "speech targets" have the distribution that
you claim if there is no trace of it in the F0 data they are derived from?

Maybe I should point out that my goal was not to extract breath groups or
"speech targets" to any degree of accuracy.  Rather, I extracted a set of
points that, as Bob Ladd seems to confirm, should overlap with the set of
"speech targets" one would obtain by extracting them as Bob Ladd did.  My
points should consequently be expected to have a remarkable distribution
if speech targets have a remarkable distribution.
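
To make the rough selection concrete, here is a minimal sketch of taking
the minimum and maximum of each contiguous voiced portion.  It is only an
illustration, not the code that was actually run: the function name and
the convention that unvoiced frames are marked with None are assumptions
made for the example, and no smoothing of local F0 perturbations is done.

```python
def voiced_run_extrema(f0):
    """Return a (min, max) pair in Hz for each contiguous voiced run.

    f0 is a sequence of F0 estimates in Hz; None marks an unvoiced
    frame (an assumed convention for this sketch).
    """
    extrema = []
    run = []  # F0 values of the voiced run currently being collected
    for value in f0:
        if value is not None:
            run.append(value)
        elif run:
            # An unvoiced frame closes the current voiced run.
            extrema.append((min(run), max(run)))
            run = []
    if run:
        # The track ended while still voiced.
        extrema.append((min(run), max(run)))
    return extrema
```

Calling voiced_run_extrema on a frame-by-frame F0 track gives one pair
per voiced portion; the pooled minima and maxima are the candidate points
whose distribution can then be examined.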

I take the fact that I did not find a remarkable distribution to indicate
that there was none.  As I don't expect Dutch to be any different from
Japanese, French or English in this respect, the least I can say is that I
am puzzled.
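
For readers who want to try the same kind of check on their own data, the
test for note-related structure described in my earlier message (quoted
below) can be sketched as follows.  This is a sketch under stated
assumptions, not the original analysis code: the function name, the
440 Hz reference and the equal-tempered scale are choices made for the
example.

```python
import math


def semitone_phase_histogram(f0_values, ref_hz=440.0, bins=4):
    """Count F0 values by their position within one semitone.

    Each value in Hz is converted to semitones relative to ref_hz
    (an assumed reference; 440 Hz is A4 in equal temperament), and
    only the fractional part of that value is histogrammed.
    """
    counts = [0] * bins
    for hz in f0_values:
        semitones = 12.0 * math.log2(hz / ref_hz)
        phase = semitones % 1.0  # fractional part of the semitone value
        # Clamp in case floating-point rounding yields phase == 1.0.
        index = min(int(phase * bins), bins - 1)
        counts[index] += 1
    return counts
```

If the values clustered on the notes of the scale, the counts would pile
up in the first and last bins, which lie just above and just below each
note.  Roughly equal counts in all bins correspond to the "essentially
flat" histogram I reported.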


>The following information is essential when comparing Alain de Cheveigné's
>(see end of mail) and my results on "musical tones in speech".
>I put three questions to Bob Ladd, who is not a regular reader of the list
>but a leading expert in intonational phonology (see address below). He
>answered as follows:
>1) Do "all glottal vibrations" in a section of speech have any relation to
>speech targets?
>No, that's the whole point.  Most of them are transitions.  Perhaps the
>clearest explanation of this notion and its empirical consequences is
>in Chapter 2 of Pierrehumbert and Beckman's "Japanese Tone Structure"
>(MIT Press, 1988), esp. section 2.2.1.
>2) Do "maxima or minima of contiguous voiced portions" (presumably extracted
>by software) have any relation to speech targets?
>Well, this is closer to what Jacques Terken and I were looking at, but
>(a) if the software is stupid, it will be misled by local F0
>perturbations (e.g. it will find the first glottal cycle after a
>voiceless stop as a local maximum), and (b) I'm puzzled by Alain's
>reference to "breath groups", since contiguous voiced portions the size
>of breath groups are bound to be interrupted by short stretches of
>voicelessness, unless of course his materials are carefully controlled
>segmentally.  But yes, if you did some intelligent pre-processing of the
>laryngograph signal you could take the automatically extracted maxima
>and minima as a first approximation to targets (or certainly a first
>approximation to the kinds of targets Jacques and I were looking at).
>3) Is a software extraction of speech targets at all possible today, or in
>the near future?
>No.  It's still in many ways a theoretical question, not an empirical one.
>Replying to my message from Tuesday, May 08, 2001, Alain de Cheveigné wrote
>on Thursday, May 10, 2001:
>"I happen to be working with several databases of speech recorded with a
>laryngograph signal (which allows accurate F0 to be estimated).  Together
>they contain 1.75 hours of speech of which half is voiced, pronounced by 38
>speakers (19 male, 19 female) of Japanese (30), English (4) and French (4).
>The data have been carefully labeled with an accurate period estimation
>method with sub-sample resolution, and the estimates checked visually.
>A histogram of F0 values with 1/4 semitone bins shows no obvious structure
>related to the musical scale.  A 4-bin histogram of values modulo one
>semitone is essentially flat.  The remarkable statistics of "hand-marked
>end- and turning points of the contour" are apparently not reflected in raw
>F0 contours."
>He added the same day upon my question,
>"At what points in a sentence did you extract f0 ?"
>as follows:
>"At all points for which there was regular glottal vibration.  By raw I mean
>that no "speech target" selection process was involved.
>I also tried doing statistics of maxima or minima of contiguous voiced
>portions (which roughly correspond to "breath groups") as a rough but
>plausible target selection process.  No sign of a note-related structure in
>the distribution of values."

