Allen L. Gorin
AT&T Bell Labs., 600 Mountain Ave., Rm. 2C-440, Murray Hill, NJ 07974
This research on adaptive language acquisition investigates connectionist systems that learn the mapping from a message to a meaningful machine action through interaction with a complex environment. Thus far, the only input to these systems has been the message. In many cases, however, the action also depends on the state of the world, which motivates the study of systems with multisensory input. This work considers a task in which the machine receives both message and visual input. Specifically, the machine action is to focus its attention on one of many blocks of different colors and shapes in response to a message such as "Look at the red square." The machine selects the block by minimizing a time-varying potential function that correlates the message and the visual input. The visual input is factored through color and shape sensory primitive nodes in an information-theoretic connectionist network, which allows generalization between different objects that share a color or shape. The system runs in a conversational mode in which the user can provide clarifying messages and error feedback until the system responds correctly. In the course of performing its task, the system acquired a vocabulary of 389 words from approximately 1300 unconstrained natural language inputs collected from ten users. On average, the machine needed only 1.4 inputs to respond correctly.
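The selection step described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the weight formula (a smoothed log-ratio of P(node | word) to P(node)), the static rather than time-varying potential, the toy training messages, and all function names (`tokenize`, `train_weights`, `potential`, `focus`) are assumptions made for illustration. The one property it is meant to show is that factoring each block through separate color and shape nodes lets word associations learned on one object carry over to another object with the same color or shape.

```python
import math
from collections import Counter


def tokenize(message):
    """Lowercase the message and drop punctuation (hypothetical preprocessing)."""
    kept = "".join(ch for ch in message.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def train_weights(examples, smoothing=0.5):
    """Estimate word-to-primitive weights from (message, (color, shape)) pairs.

    Assumed weight form: log(P(node | word) / P(node)), with additive smoothing.
    """
    word_counts, node_counts, joint_counts = Counter(), Counter(), Counter()
    for message, primitives in examples:
        for node in primitives:
            node_counts[node] += 1
        for word in set(tokenize(message)):
            word_counts[word] += 1
            for node in primitives:
                joint_counts[(word, node)] += 1

    total = len(examples)
    weights = {}
    for word, n_word in word_counts.items():
        for node, n_node in node_counts.items():
            p_node = (n_node + smoothing) / (total + 2 * smoothing)
            p_node_given_word = (joint_counts[(word, node)] + smoothing) / (
                n_word + 2 * smoothing
            )
            weights[(word, node)] = math.log(p_node_given_word / p_node)
    return weights


def potential(message, block, weights):
    """Negative message/visual correlation: lower means a better match."""
    return -sum(
        weights.get((word, node), 0.0)
        for word in tokenize(message)
        for node in block
    )


def focus(message, blocks, weights):
    """Attend to the block that minimizes the potential."""
    return min(blocks, key=lambda block: potential(message, block, weights))


# Toy data (assumed): each message is paired with the (color, shape) of its target.
examples = [
    ("look at the red square", ("red", "square")),
    ("look at the blue circle", ("blue", "circle")),
    ("the red circle please", ("red", "circle")),
    ("focus on the blue square", ("blue", "square")),
]
blocks = [("red", "square"), ("red", "circle"), ("blue", "square"), ("blue", "circle")]

weights = train_weights(examples)
print(focus("Look at the red square.", blocks, weights))
```

With this toy data, the red square minimizes the potential for "Look at the red square." Because the weights attach to color and shape nodes rather than to whole objects, the same procedure can also pick out a color and shape combination that never appeared together in training. The conversational error feedback and the time variation of the potential are not modeled here.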