Eng. Dept., Cambridge Univ., Trumpington St., Cambridge CB2 1PZ, UK
Human beings evidently learn to produce and perceive speech ``simultaneously'': they learn to produce classes of sounds that existing speakers agree fall within the spoken language and, at the same time, learn to perceive these classes of sounds in other speakers. By contrast, in speech communication by machine, speech synthesis systems are designed quite separately from speech recognition systems, at present employing quite different techniques and, importantly, using labeled data. A structure for this acquisition of speech by machine (ASM) has been given for the simultaneous synthesis and recognition of speech from unlabeled human speech [F. Fallside, Speech Commun. (in press) (1992)]. It employs a synthesizer and a recognizer that are trained by a coupled minimization to produce, recognize, and label the human speech classes. The method is apparently general: it applies to any type of recognizer and synthesizer with adequate performance and to any level of speech representation (sub-word, word, and higher). A decomposition technique has been established that allows, in principle, the acquisition of speech to be built up for successively higher levels. The method offers one way of bringing together the specialist techniques of synthesis and recognition and, in particular, the use of prosody in each. Recent results will be given in the paper.
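As a rough illustration of what training a recognizer and a synthesizer by a coupled minimization on unlabeled data could look like, the toy sketch below may help. It is not the method of the cited paper, whose details are not given here: it swaps in ordinary k-means-style alternating minimization, treats each "sound" as a single number, and lets the recognizer and synthesizer share one set of class prototypes. All names (`recognize`, `synthesize`, `coupled_error`, `train`) and the synthetic data are assumptions made for illustration.

```python
import random


def recognize(x, prototypes):
    """Toy recognizer: return the index of the class prototype nearest to x."""
    return min(range(len(prototypes)), key=lambda k: (x - prototypes[k]) ** 2)


def synthesize(label, prototypes):
    """Toy synthesizer: return the sound produced for a given class label."""
    return prototypes[label]


def coupled_error(data, prototypes):
    """Mean squared mismatch between each heard sound and its resynthesis."""
    total = sum(
        (x - synthesize(recognize(x, prototypes), prototypes)) ** 2 for x in data
    )
    return total / len(data)


def train(data, n_classes, n_iters=20):
    """Alternately update labels and prototypes to reduce coupled_error."""
    # Deterministic start: evenly spaced order statistics of the data.
    ordered = sorted(data)
    prototypes = [
        ordered[(2 * k + 1) * len(ordered) // (2 * n_classes)]
        for k in range(n_classes)
    ]
    for _ in range(n_iters):
        # Recognition step: label every unlabeled sound with the current classes.
        labels = [recognize(x, prototypes) for x in data]
        # Synthesis step: move each class output to the mean of its members,
        # which minimizes the squared mismatch for the current labels.
        for k in range(n_classes):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:
                prototypes[k] = sum(members) / len(members)
    return prototypes


# Hypothetical unlabeled "speech": three sound classes centred at 0, 1 and 2.
rng = random.Random(0)
data = [rng.gauss(centre, 0.1) for centre in (0.0, 1.0, 2.0) for _ in range(50)]

prototypes = train(data, n_classes=3)
print("learned class prototypes:", sorted(prototypes))
print("coupled error:", coupled_error(data, prototypes))
```

In this sketch the class labels are never supplied; they emerge from the alternation, which is the only sense in which it mirrors the labeling of human speech classes described above. A real system would replace the scalar prototypes with actual synthesizer and recognizer models.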