4aSC19. Pronunciation variability in the Switchboard corpus.

Session: Thursday Morning, December 5


Author: Sean A. Fulop
Location: Phonet. Lab., Dept. Linguist., UCLA, Los Angeles, CA 90095-1543
Author: Patricia A. Keating
Location: Phonet. Lab., Dept. Linguist., UCLA, Los Angeles, CA 90095-1543


This paper first describes a project on manual phonetic labeling of some 2075 words from the Switchboard corpus (a 3-million-word corpus of unscripted telephone conversations, recorded and orthographically transcribed by Texas Instruments and available from the Linguistic Data Consortium). Between 10 and 40 tokens of each of 72 lexical items were transcribed. The transcription system was an extension of TIMITBET, designed to allow a narrower transcription, particularly of consonants. Intertranscriber agreement was assessed using the Oregon Graduate Institute's metric for transcription accuracy, and a comparison with their results for their English telephone corpus will be provided. A number of phonological facts will then be elucidated from the transcriptions. To facilitate this, a database of contextual information has been created for each phoneme in the dictionary forms of the words; the database includes both lexical contextual factors and the actual context in which a given token appears. The scheme is similar to that of Withgott and Chen [Computational Models of American Speech (1993)], who used such a contextual database to develop phonological rules for TIMIT. Here, the phonological rules that emerge from Switchboard will be discussed and compared with those already found in TIMIT.
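
As a rough illustration of the kind of contextual database described above, the following Python sketch stores one record per dictionary phoneme, pairing its lexical context (neighboring dictionary phonemes) with its actual context in a given token (the transcribed phone and its transcribed neighbors). It is a minimal sketch under stated assumptions, not the authors' database: the record fields, the names `PhonemeRecord`, `build_records`, and `realization_counts`, the assumption that each token is already aligned to its dictionary form, and the example tokens of "and" are all hypothetical and are not taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PhonemeRecord:
    """One dictionary phoneme in one transcribed token (hypothetical schema)."""
    word: str
    phoneme: str                  # phoneme in the dictionary form
    position: int                 # index within the dictionary form
    prev_phoneme: Optional[str]   # lexical left context (None at word start)
    next_phoneme: Optional[str]   # lexical right context (None at word end)
    realized: Optional[str]       # transcribed phone (None if deleted)
    prev_realized: Optional[str]  # transcribed left neighbor (None at word start or if deleted)
    next_realized: Optional[str]  # transcribed right neighbor (None at word end or if deleted)


def build_records(
    word: str, aligned: List[Tuple[str, Optional[str]]]
) -> List[PhonemeRecord]:
    """Build records from (dictionary phoneme, transcribed phone) pairs.

    Assumes the alignment has already been done; a phone of None marks a
    deleted phoneme.
    """
    records = []
    last = len(aligned) - 1
    for i, (phoneme, phone) in enumerate(aligned):
        records.append(
            PhonemeRecord(
                word=word,
                phoneme=phoneme,
                position=i,
                prev_phoneme=aligned[i - 1][0] if i > 0 else None,
                next_phoneme=aligned[i + 1][0] if i < last else None,
                realized=phone,
                prev_realized=aligned[i - 1][1] if i > 0 else None,
                next_realized=aligned[i + 1][1] if i < last else None,
            )
        )
    return records


def realization_counts(records: List[PhonemeRecord]) -> Counter:
    """Count realizations keyed by (phoneme, lexical left context, realized phone)."""
    return Counter((r.phoneme, r.prev_phoneme, r.realized) for r in records)


# Invented tokens of "and" in TIMITBET-style symbols, for illustration only.
tokens = [
    [("ae", "eh"), ("n", "n"), ("d", None)],
    [("ae", "ae"), ("n", "n"), ("d", "d")],
    [("ae", "ix"), ("n", "n"), ("d", None)],
]
records = [r for t in tokens for r in build_records("and", t)]
counts = realization_counts(records)
print(counts[("d", "n", None)])  # deletions of /d/ after /n/ in the invented tokens
```

Tabulating realizations against context in this way is one plausible route from such records to candidate phonological rules, for example a rule deleting /d/ after /n/ if that pattern dominated the counts.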

