
intrarater reliability - anchoring stimuli

Dear list,

First of all, apologies for the long message, but I would really
appreciate any comments or thoughts on my two questions below.

I was wondering if anybody could give me some advice on two problems. The
first concerns the amount of data duplication needed to achieve a good
estimate of intrarater reliability in a perceptual experiment. The second
concerns the number of training stimuli needed in the anchoring/training
phase of the experiment.

I am about to start some tests on the perception of voice quality.
A panel of 6 expert listeners (i.e. voice therapists) will be asked to
rate the voice quality of a number of speech fragments on 12 to 15
perceptual parameters (e.g. roughness, breathiness etc.). The parameters
are rated on a 5-point equal-appearing interval scale.

The speech fragments consist of 156 sustained vowels (divided into 3
groups of different vowels), 52 fragments of conversational speech and 52
fragments of the Rainbow passage. The 3 different types of speech
fragments will be presented in separate listening sessions.

In order to calculate intrarater reliability (i.e. the self-consistency of
the listener), I need to duplicate some of the stimuli. The best way to
do this is to duplicate all the stimuli. However, given the large amount
of speech material in the tests, this would be very impractical. (The
listeners will not be prepared to sit through 12 hours or so of testing.)

The literature provides little guidance as to the minimum proportion of
stimuli that should be duplicated in order to achieve an accurate
reliability coefficient. Some studies report a duplication of
10% or less, some 30%, a few 50%, and the very odd study duplicates 100%.
But no justification is ever given for the chosen percentage.

(I must admit I haven't decided yet which statistic to use for the
reliability, but Pearson's r and intraclass correlation coefficients seem
to be widely used.)
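
To illustrate how the two statistics can differ on duplicated stimuli,
here is a minimal Python sketch. The ratings are made-up illustrative
values, not data from this study, and the function names are hypothetical.
The one-way random-effects ICC is only one of several ICC variants and is
chosen here purely as an assumption for the example.

```python
from math import sqrt

def pearson_r(first, second):
    """Pearson correlation between first and repeated ratings."""
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / sqrt(var_a * var_b)

def icc_one_way(first, second):
    """One-way random-effects ICC for two ratings per stimulus."""
    n, k = len(first), 2
    stim_means = [(a + b) / k for a, b in zip(first, second)]
    grand = sum(stim_means) / n
    # Between-stimulus and within-stimulus mean squares
    ms_between = k * sum((m - grand) ** 2 for m in stim_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2
        for a, b, m in zip(first, second, stim_means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings on the 0-4 scale: the second pass is always
# one scale point higher than the first.
first_pass = [0, 1, 2, 3]
second_pass = [1, 2, 3, 4]
r = pearson_r(first_pass, second_pass)
icc = icc_one_way(first_pass, second_pass)
```

In this made-up case Pearson's r is 1 even though no rating was
reproduced exactly, whereas the ICC is lower because it also penalises
the systematic shift between the two passes.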

Therefore, my first question is:

Given the large amount of speech material, what is the minimum proportion
of the data that should be duplicated?

(A complicating factor is also the use of conversation fragments.
Listeners will probably be aware of the duplication, if only because of
conversation content, and may remember their scores for that particular


The second question is related to the anchoring phase of the experiment.

It is common practice to provide listeners with anchoring stimuli before
the actual listening test. Usually the listeners are provided with
explicit anchors, i.e. the speech fragment is presented together with the
perceptual rating for a particular parameter.

In my experiment however, I have decided against the use of explicit
anchors, in order to avoid the introduction of a bias. (This is done
because the perceptual labels will become the baseline for acoustic
correlates). Instead, listeners will be presented with a random selection
of stimuli (which should include all values of the scale, including
extremes), and are supposed to create their own anchors on the basis of
these stimuli.

Again, very little information is available on how people reach the
decision on the number of anchoring samples.

It's actually more of a statistics problem. My question is:

If I have a set of 156 vowels and each vowel is rated on 12 parameters on
a scale from 0 to 4, how many vowels should be in my training set so that
I can say with 95% probability that both end points of the scale (i.e. 0
and 4) are included in that set for each parameter?
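
To make the question concrete, here is a minimal Python sketch for a
single parameter. It assumes the training set is drawn at random without
replacement, and that the number of vowels whose true rating is 0 and the
number whose true rating is 4 are known or can be guessed. Those counts
are not known in advance in this study, so the values passed in are pure
assumptions, and the function names are hypothetical.

```python
from math import comb

def prob_both_extremes(total, n_low, n_high, sample_size):
    """Probability that a random sample (without replacement) contains
    at least one low-extreme and at least one high-extreme stimulus."""
    all_samples = comb(total, sample_size)
    miss_low = comb(total - n_low, sample_size)    # no stimulus rated 0
    miss_high = comb(total - n_high, sample_size)  # no stimulus rated 4
    miss_both = comb(total - n_low - n_high, sample_size)
    # Inclusion-exclusion over the two ways of missing an extreme
    return 1 - (miss_low + miss_high - miss_both) / all_samples

def min_training_size(total, n_low, n_high, target=0.95):
    """Smallest sample size reaching the target probability, or None."""
    for size in range(1, total + 1):
        if prob_both_extremes(total, n_low, n_high, size) >= target:
            return size
    return None

# Assumed counts for illustration only: 10 of the 156 vowels rated 0
# and 10 rated 4 on a given parameter.
size_needed = min_training_size(156, 10, 10)
```

The rarer the extreme ratings, the larger the training set has to be.
Requiring the 95% level for all 12 parameters at once is stricter than
requiring it for each parameter separately, so the per-parameter target
would have to be set higher than 0.95.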

Apologies once again for the lengthy e-mail, and sincere thanks to those
who took the trouble to read until the end.

Any thoughts and comments will be very gratefully received!

Christel de Bruijn

Christel de Bruijn - PhD student
University of Sheffield
Department of Human Communication Sciences
31 Claremont Crescent
Sheffield S10 2TA
United Kingdom

phone: (+44) (0)114 22 22410
fax:   (+44) (0)114 27 30547
e-mail: christel@larynx.shef.ac.uk