Steps Towards Embodied Language Acquisition
Chen Yu and
Dana H. Ballard
Department of Computer Science
University of Rochester, U.S.A.
Monday, 25 November 2002, 4 p.m. c.t., Hörsaal 5
Language is about symbols, and those symbols must be grounded
in the physical environment during human development.
Recently, there has been increasing awareness of the
essential role that inferences about speakers' referential
intentions play in grounding those symbols. Experiments have
shown that these inferences, as revealed in eye, head, and
hand movements, serve as an important driving force in
language learning at a relatively early age. The challenge
ahead is to develop formal models of language acquisition
that can shed light on the leverage provided by embodiment.
We present an implemented computational model of embodied
language acquisition that learns words from natural
interactions with users. The system can be trained in an
unsupervised mode in which users perform everyday tasks
while providing natural language descriptions of their
behaviors. We collect acoustic signals in concert with
user-centric multisensory information from non-speech
modalities, such as the user's perspective video, gaze
positions, head directions, and hand movements. A multimodal
learning algorithm is developed that first spots words in
continuous speech and then associates action verbs and
object names with their grounded meanings. The central idea
is to use non-speech contextual information to facilitate
word spotting, and to utilize the user's attention as a
deictic reference for discovering temporal correlations
among data from different modalities, from which lexical
items are built. We report the results of a series of
experiments that demonstrate the effectiveness of our
approach.
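The association step described above can be illustrated with a minimal sketch. The abstract does not give the actual algorithm, so the following is only a hypothetical illustration of the general idea: spotted word tokens (each with a timestamp) are paired with attention events (the referent the user is gazing at, with a start and end time), co-occurrences within a small temporal window are counted, and a word is added to the lexicon when it reliably co-occurs with one referent. The function name, data layout, window size, and 0.5 reliability threshold are all assumptions, not details from the talk.

```python
from collections import defaultdict

def build_lexicon(word_tokens, attention_events, window=1.0):
    """Hypothetical sketch: associate spotted words with attended referents
    by temporal co-occurrence.

    word_tokens:      list of (word, time_in_seconds)
    attention_events: list of (referent, start_time, end_time), e.g. derived
                      from gaze positions and head directions
    """
    counts = defaultdict(lambda: defaultdict(int))  # word -> referent -> count
    word_totals = defaultdict(int)                  # word -> total occurrences

    for word, t in word_tokens:
        word_totals[word] += 1
        for referent, start, end in attention_events:
            # A word counts as co-occurring if it is uttered while the
            # referent is attended, allowing `window` seconds of slack.
            if start - window <= t <= end + window:
                counts[word][referent] += 1

    lexicon = {}
    for word, referents in counts.items():
        referent, c = max(referents.items(), key=lambda kv: kv[1])
        # Keep the pairing only if the referent accompanies a majority
        # of the word's utterances (an assumed reliability criterion).
        if c / word_totals[word] > 0.5:
            lexicon[word] = referent
    return lexicon
```

For example, if "cup" is spoken twice while the user fixates a cup, and "ball" once during a ball fixation, the sketch maps each word to its attended object; the full system described in the talk would additionally segment the words out of continuous speech before this step.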