[Fwd: technical notes on data used by Martin Braun]

Dear listers,

I received the following e-mail from Bob Ladd, concerning the debate on
the results by Martin Braun. It seems all my questions can be answered.

Christian Kaernbach

-------- Original Message --------
From: D R Ladd <bob@ling.ed.ac.uk>
Subject: technical notes on data used by Martin Braun
To: nombraun@post.netlink.se,

Dear All,

I apologise for taking so long to answer various questions that were
raised in the course of last month's discussion of Martin Braun's
report of bias toward certain musical semitone values in the F0 values
produced in a corpus of spoken Dutch.  The questions were put by
Christian Kaernbach and Alain de Cheveigne, and they have come to me
because I was the originator of the Dutch corpus used by Braun.

I would be happy to have one of you post this message (or something
suitably edited) to the Auditory list.  I would also be happy to answer
any further questions you may have.

Bob Ladd
Dept. of Theoretical and Applied Linguistics
University of Edinburgh

Kaernbach's questions:

> Martin, when hand-marking targets on visually presented speech contours,
> did you (or your collaborators) specify a point on that contour or a
> range and some automatic algorithm determined the minimum or maximum in
> that range?

The identification of targets was done by myself and by Liz Shriberg (of
SRI in California), independently on separate visits to IPO in 1994 and
1995 respectively, based on procedures established by me.  As I recall,
I labelled 9 speakers and Shriberg 6.  The aim of the project was to
establish principles relating the scaling of pitch targets from one
speaker to another and from a given speaker's normal voice to their
"raised voice" range.  (See preliminary reports in Ladd and Terken 1995,
ICPhS Stockholm, and Shriberg et al. 1996, ICSLP Philadelphia.)  The
investigators had no prior association with Martin Braun, but provided
data files (not the speech files) to him at his later request.

The labelling was always done by hand, on the basis of F0 extraction
performed and displayed by GIPOS, the IPO speech analysis package.  The
basic labelling principle, which covers perhaps 60-80% of the values
associated with "high targets", was to take the analysis frame with the
highest extracted F0 value in the vicinity of the accented syllable -
normally late in the stressed vowel, in the immediately following
consonant, or occasionally in the following unstressed vowel.  The
systematicity of the alignment of F0 peaks with segmental landmarks in
Dutch has subsequently been established by Ladd, Mennen and Schepman in
JASA 2000 (107:2685-2696).  Cases where it was not possible to identify
a local maximum as just described were treated in various ways; details
can be supplied if necessary.
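
The basic principle for high targets can be illustrated with a short
Python sketch.  It is an illustration only, not the procedure itself
(the labelling was done by hand in GIPOS).  The function name
`highest_frame`, the window indices and the F0 values are all
hypothetical, and unvoiced frames are assumed to be marked as NaN.

```python
import math

def highest_frame(f0, start, end):
    """Return (index, value) of the highest voiced F0 frame in f0[start:end].

    f0 is a list of per-frame F0 values in Hz; unvoiced frames are NaN.
    Returns None if the window contains no voiced frame.
    """
    best = None
    for i in range(max(start, 0), min(end, len(f0))):
        value = f0[i]
        if math.isnan(value):
            continue  # skip unvoiced frames
        if best is None or value > best[1]:
            best = (i, value)
    return best

# Hypothetical contour; frames 2..8 stand in for the "vicinity" of an
# accented syllable as judged by a human labeller.
contour = [float("nan"), 180.0, 185.0, 192.0, 201.0,
           207.0, 204.0, float("nan"), 190.0, 182.0]
peak = highest_frame(contour, 2, 9)
print(peak)
```

The choice of window is the part that required human judgement; the
sketch only covers the final "take the highest frame" step.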

Identification of "low targets" was less straightforward (because of
irregular phonation, etc.), but again the basic principle was to select
the analysis frames that represented specified local minima.

Given this background, it is certainly very unlikely that anything
about our interests or our hypotheses would have led us - even
unconsciously - to bias the data in the direction reported by Braun.
Whether there is something in the technical details of our procedures
is a separate question (or set of questions), which I address next.  I
should point out that I was merely a user of GIPOS, not a designer or
programmer, and in fact was not a very technically sophisticated user,
in the sense that I have only a rough understanding of the mathematical
foundations of acoustic F0 extraction.  There are several questions
that I can only approximately answer on the basis of my current
knowledge (and memory of what I did 7 years ago).  In principle I
believe all these questions could be answered in detail if the
significance of Braun's findings appeared to depend on them, though
given the restructurings at IPO since 1994 I think it would be
difficult in practice to locate the individuals with the necessary
in-depth knowledge.  From what I can tell, however, knowing more about
the inner workings of GIPOS would not shed any further light on the
debate over Braun's work.

> How was the visual display presented? Was it a) always the same
> frequency range for F0 or was the range b) specifically adjusted for
> each speech segment in question following to what the algorithm of pitch
> contour extraction thought appropriate for presentation?

GIPOS makes it possible to choose different display ranges.  In general
the range was set wide enough to accommodate most male speech or most
female speech.  It may have been necessary to adjust the basic settings
for one or two speakers with particularly high or low voices; I don't
remember.  In any case there was no detailed adjustment of the display
for each utterance.

> If a): What was this range? Were its boundaries in full semitones or were
> they in quarter semitones, or some non-semitone value?

As I recall, the display range is specified in Hertz.

> If b): What could be possible range boundaries chosen by the algorithm?
> Were these defined in full semitones, in quarter semitones, or in an
> even finer resolution?
> Both a) or b): Were there any horizontal grid lines across the display,
> and were those on semitones, or what was their position?

I don't remember.  This could be established if necessary - it will be
a standard feature of GIPOS.  However, given the procedure described
above, the presence or absence of grid lines seems of little relevance.

> Was the pitch contour presented as a continuous line, or was it
> quantized in quarter semitones?

Well, it was displayed as a series of frame-by-frame values, not as a
continuous line. At the 16 kHz sampling frequency that we used, the
frame-by-frame values have a resolution of a quarter of a semitone.  As
noted above, we were picking out specific analysis frames to represent
linguistic target values, as far as possible on the sole basis of
whether they were local maxima or minima.
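
To make "a resolution of a quarter of a semitone" concrete, here is a
minimal Python sketch of what such a grid looks like: 48 equal steps
per octave.  The reference frequency of 100 Hz is an assumption made
for illustration; the anchor actually used by GIPOS is not known here,
and the function name is hypothetical.

```python
import math

STEPS_PER_OCTAVE = 48  # 12 semitones x 4 quarter-semitone steps

def quantize_quarter_semitone(f_hz, ref_hz=100.0):
    """Snap f_hz to the nearest quarter-semitone step relative to ref_hz.

    ref_hz is an ASSUMED anchor for illustration only; it is not taken
    from GIPOS.
    """
    steps = round(STEPS_PER_OCTAVE * math.log2(f_hz / ref_hz))
    return ref_hz * 2.0 ** (steps / STEPS_PER_OCTAVE)

# Arbitrary example frequencies in Hz.
for f in (118.3, 203.7, 251.0):
    q = quantize_quarter_semitone(f)
    print(f"{f:6.1f} Hz -> {q:6.1f} Hz")
```

On such a grid adjacent values differ by a factor of 2^(1/48), about
1.5%, so the step in Hz grows with frequency.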

> Answering to these questions would help me to know whether I should be
> sceptical or not. And then there was one important point in a message by
> Martin: Their data were from sentence material that was specifically
> chosen so as to show clear targets. That might make all the difference,
> i.e. it could well be that AP histograms are real for Martin's data and
> non-existent for Alain's data.

I think this is a very important point, but I would emphasise that it is
not merely a methodological question.  The claim of much recent research
on linguistic pitch (intonation as well as lexical tone) is that it
involves a string of phonological targets associated in well-defined
ways with the segmental string, and - importantly - that the F0 values in
between the targets are essentially nothing but transitions from one
"intended" value to another.  (In musical terms, speech pitch is mostly
portamento.)  Perhaps the clearest evidence for this claim is provided
by Pierrehumbert and Beckman (Japanese Tone Structure, MIT Press, 1988,
chapter 2, sec. 2.2.1), who show that most syllables in Japanese cannot
be analysed as either High or Low (as in traditional descriptions), but
rather as UNSPECIFIED for tone, and with a pitch value determined by
interpolation from one clear pitch target to another.  Similar evidence
can be inferred from Arvaniti et al. 1998 (J.Phonetics 26:3-25) and Ladd
et al 1999 (JASA 106: 1543-1554), who show that the DURATION and SLOPE
of accentual F0 rises in both Greek and English are quite variable,
depending on the segmental makeup of the accented syllable, but the F0
LEVEL AT THE BEGINNING AND ENDING OF THE RISE is extremely stable for a
given
speaker, regardless of the duration of the rise.  All this evidence
suggests that certain points in the F0 contour have some sort of
cognitive or linguistic salience while the rest do not.  In that case,
and if there is indeed an effect of the sort Martin Braun reports, we
would not expect to find the effect in a sample of F0 values that
includes ALL analysis frames or ALL glottal cycles, but only in a
sample of values based on putative phonological targets, like the
Ladd/Shriberg/Terken corpus.

To me, the most important reasons for skepticism about Braun's findings
would lie in the following two areas: (1) the resolution of the F0
extraction, and (2) the perceptual relevance of the F0 values chosen by
the procedures described above.

With regard to (1), this is a truly methodological issue.  However, it
seems to me that Braun's procedures remove some of the reasons for this
worry, because of the fact that the resolution of the F0 values in the
data is one quarter of a semitone.  That is, by using quarter-semitone
bins in his first histogram Braun is effectively doing nothing but
exploring the distribution of the discrete F0 values that it is
possible for GIPOS to report.  As far as I can see there is nothing in
the F0 extraction or rounding procedures that would lead to irregular
distributions of the sort Braun reports.  NB the values returned by
GIPOS are *not* on a scale anchored to any musical value; none of the
quarter-semitone values precisely correspond to semitone values
computed relative to 440 Hz.  (I suspect, knowing what I know about the
assumptions of IPO phoneticians in the 1970s and 1980s, that the values
are computed relative to 50 or 100 Hz, but I don't know this for
certain.  What is certain is that they are not relative to 440; in fact
on the de facto scale in our data 440 falls nearly in the middle of two
scale points, 436.5 and 442.8.  Again, the details could be established,
if this is really thought to be an issue.)
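
The two scale points quoted above can be checked with a few lines of
Python.  This is only an arithmetic check on the figures 436.5, 442.8
and 440 Hz; the helper name `semitones` is hypothetical and nothing
here is taken from GIPOS.

```python
import math

def semitones(f_low, f_high):
    """Interval from f_low up to f_high in semitones (12 per octave)."""
    return 12.0 * math.log2(f_high / f_low)

step = semitones(436.5, 442.8)    # spacing of the two reported scale points
offset = semitones(436.5, 440.0)  # position of 440 Hz above the lower point

print(f"step   = {step:.3f} semitones")
print(f"offset = {offset:.3f} semitones ({offset / step:.0%} of the step)")
```

The spacing comes out at roughly a quarter of a semitone, and 440 Hz
lies slightly past the midpoint between the two values, i.e. about as
far from either scale point as it could be.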

With regard to (2), I worry (also in my own work; refs. above) that the
local maxima and minima do not correspond to anything perceptually
relevant, even though they are demonstrably aligned and scaled
consistently.  Specifically, I think that listeners can probably
identify the perceived pitch level of a syllable or an accentual peak (in effect,
they can undo the portamento mentioned above), and I don't know whether
the perceived pitch level is the same as the observed acoustic F0 values
that Braun uses as the basis for his conclusions.  I think it might be
possible to get listeners to report the perceived pitch of an accentual
peak (e.g. play them a word like _Marina_ and ask them to set the values
of three pure tones to correspond to the perceived pitch of the three
consecutive syllables; would the perceived pitch of -ri- bear any
systematic relation to the acoustically observable F0 maximum at the end
of that syllable?).  If it turns out that the perceived pitch *is*
directly related to the acoustic F0 maximum, then I think Braun's results
would take on
considerable significance.  If (as I rather suspect) the perceived pitch
is related in a more complex way to the amount of F0 change on the
syllable, etc. etc., then the meaning of Braun's results is less clear.
But that does not constitute a basic methodological flaw in Braun's
work; rather it could provide the basis for a skeptical follow-up study.

de Cheveigne's questions:

> As a third request, are you aware of any factor in the preparation of
> speech targets that could have introduced small biases that might explain
> an over-representation of target F0 values close to notes on the musical
> scale?

As explained above, I am not aware of any such factor - and not because
I haven't tried to think of one!

> The sort of things that come to mind are:
> - were period estimates derived with sample- or subsample- resolution?  At
> what sampling rate?

As noted above, these were not values of successive pitch periods, but
values of successive analysis frames in acoustic F0 extraction, based
on speech sampled at 16 kHz.

> - were they quantized to semitone values, either in the extraction process
> or when plotted?

See above.  In effect, they were quantized to quarter-semitone values,
a scale that was not relative to 440 Hz.  I don't know whether this is
a function of the extraction process or the plotting process, but I
think the former.

> - did plotting software add graduations?  If so, at what positions?  Did
> window bounds map to particular values?

See above.

> - did target-editing software quantize targets to a semitone scale, or
> otherwise favor particular values?

There was no target-editing software.  As noted above, everything was
done by hand, on the basis of procedures that were set down to be as
replicable as possible.

> - anything else you can think of?

No.  I was as skeptical as you are when first told of these results, but
Martin Braun's responses seem to me to make it unlikely that this effect
is merely a methodological artefact.

> As a fourth request, are you aware of an algorithmic procedure that could
> be used to obtain an approximation to speech targets (such for example that
> it catches say 2 or 3 out of 4, and adds at most 1 or 2 spurious targets
> for every 4 correct).  The aim is to probe other databases for a
> scale-related distribution, without having to go through the manual
> labelling process (F0 contours are assumed to be correct, with unvoiced
> flagged as NaN), and with some confidence that the statistics reflect the
> same important information as speech targets.

This would not be easy, but I think the most reasonable approach would
be to follow procedures used in modern automatic speech recognition
systems:  take smoothed contours (so as not to be misled by local
perturbations caused by onset and offset of voicing, glottal stops and
other obstruents, octave errors in the case of acoustic F0 extraction,
etc.) and then use some sort of accent detection algorithm.  The data
you would then analyse statistically are the F0 minima and maxima
associated with the detected accents - only those points, i.e. perhaps
half a dozen values in an ordinary sentence.
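
A very rough Python sketch of the first part of this idea follows.
It covers only the smoothing and the search for local extrema; the
accent detection step is NOT implemented, so the extrema it returns
are merely candidates.  The running median, the window width of 5
frames, the function names and the example contour are all assumptions
made for illustration, with unvoiced frames flagged as NaN as in the
question.

```python
import math
from statistics import median

def smooth(f0, width=5):
    """Running median over voiced frames; NaN (unvoiced) frames stay NaN."""
    half = width // 2
    out = []
    for i, value in enumerate(f0):
        if math.isnan(value):
            out.append(value)
            continue
        window = [v for v in f0[max(0, i - half): i + half + 1]
                  if not math.isnan(v)]
        out.append(median(window))
    return out

def local_extrema(f0):
    """Return (maxima, minima) as lists of frame indices.

    Runs of equal values (plateaus) count once, at their middle frame.
    A run is an extremum only if voiced frames on both sides are lower
    (maximum) or higher (minimum).
    """
    maxima, minima = [], []
    n = len(f0)
    i = 1
    while i < n - 1:
        if math.isnan(f0[i]) or math.isnan(f0[i - 1]):
            i += 1
            continue
        j = i
        while j + 1 < n and f0[j + 1] == f0[i]:
            j += 1  # extend over a plateau of equal values
        if j + 1 < n and not math.isnan(f0[j + 1]):
            left, right = f0[i - 1], f0[j + 1]
            if f0[i] > left and f0[i] > right:
                maxima.append((i + j) // 2)
            elif f0[i] < left and f0[i] < right:
                minima.append((i + j) // 2)
        i = j + 1
    return maxima, minima

nan = float("nan")
# Hypothetical contour in Hz; 380.0 imitates an octave error.
contour = [nan, 180.0, 184.0, 190.0, 380.0, 200.0, 206.0, 204.0,
           196.0, 188.0, 184.0, 186.0, 192.0, 198.0, 203.0, nan]
maxima, minima = local_extrema(smooth(contour))
print(maxima, minima)
```

In this example the running median removes the octave-error spike, so
it is not reported as a maximum.  Deciding which of the remaining
extrema belong to accents would still need a separate detector.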

Please see my long comment above about why it is important to consider
only putative F0 targets, not all extracted F0 values or all pitch
period values as you did in your first reply to Martin Braun.