Dear Laszlo,

Christine is, of course, correct. I would like to post 3 of many refs. For more, look in my book "Articulation and Intelligibility," published by Morgan & Claypool, 2005.

If you are finding differently for an ASR system, then that just shows that the HMM "Gain" is turned up way too high. By that I mean it's ignoring the input to some extent and looking for words that it can put together that make some sense, and thus have a combined low entropy (independent of other phones that it recognizes).

Let me give an example: What if the spoken utterance was:
"The make had a blue type"
and assume the recognizer got, at the phone level:
"the make had a blue type" (100% correct),
then the recognizer would report:
"the man had a blue tie"

Get it?
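
To make the "gain" point concrete, here is a toy sketch in Python. Everything in it is my own assumption for illustration: the probabilities are invented, the names (acoustic, language, lm_weight, decode, transcribe) are hypothetical, and it scores each word independently rather than running a real HMM/Viterbi search. It only shows how weighting the word prior too heavily can overrule phones that were recognized correctly.

```python
import math

# Hypothetical acoustic likelihoods P(audio | word): the phone-level
# evidence strongly favors the words that were actually spoken.
acoustic = {"make": 0.90, "man": 0.10, "type": 0.90, "tie": 0.10}

# Hypothetical language-model probabilities: the prior strongly favors
# the words that "make sense" in this sentence.
language = {"make": 0.01, "man": 0.99, "type": 0.02, "tie": 0.98}


def decode(candidates, lm_weight):
    """Pick the candidate with the highest combined log score.

    lm_weight plays the role of the "gain": 0 trusts the acoustics
    alone, larger values let the language model dominate.
    """
    def score(word):
        return math.log(acoustic[word]) + lm_weight * math.log(language[word])

    return max(candidates, key=score)


def transcribe(lm_weight):
    """Decode the two ambiguous slots of the example sentence."""
    first = decode(["make", "man"], lm_weight)
    second = decode(["type", "tie"], lm_weight)
    return f"the {first} had a blue {second}"


for weight in (0.0, 0.2, 1.0):
    print(weight, "->", transcribe(weight))
```

With these made-up numbers, weights of 0.0 and 0.2 keep the spoken words, while a weight of 1.0 reports "the man had a blue tie" even though the acoustic evidence favored "make" and "type".
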
What if your relative calls you on the phone, and leaves a message,
that gets transcribed
"a large ant was fried"
and then you listen to the message, and it is really
"your great aunt just died"
That wouldn't be too good, would it?
Maybe that is not a very good example. Examples are better when taken from a real system.

All the examples I have are my personal phone messages (text and wav files), and they have things in them I can't make public.
But they can be pretty funny sets of errors, I'll tell you!

Tell us what really happened, please. I don't care how off topic it is.
It's not off topic, IMO.

A comment: It is my opinion that ASR people will not report the phone scores because they don't want their funding sources to dry up. Typically these phone scores are quite low (compared to human scores, that is), being in the 50-75% range, with no noise. When the SNR gets "down" to +10 dB, things are falling apart, and at 0 dB SNR, the scores (in one case I know) are below chance. Yes, below chance!
Human phone error rates start at somewhere between 1.5% and 2% in quiet. At +10 dB SNR (AI~0.5), the Miller-Nicely phone error rate was about 10%. At 0 dB the AI is about 0.2 (Allen 2005, JASA, Fig. 6), which gives a phone error rate of about 30%. The 50% point is at about -6 dB SNR, and an AI of about 0.06.

We have unpublished results (in review) where we repeated some of this and found 2% error in quiet (consonants scored from CVs), 10% at -6 dB SNR, and 50% error at -18 dB SNR. However, we found that there are 3 sets of consonants, with one group of 5 consonants having very high error. These bias the average numbers way up. The rest of the sounds (11 of them) are much better than what I quote above. One group has an error of 0.5% in quiet (5 errors per 1000 presentations).

I have run on too long.

Please tell us more!

Jont Allen


The statement of the reviewer--that better phone recognition does not mean
better word recognition--is wrong.  It is possible that the reviewer could
support this statement with data from poorly conducted speech recognition
tests like, for example, those conducted with an inadequate number of speech
items, or when mean scores comprise scores of too few listeners.

Christine Rankovic