In addition to symbolic and subsymbolic representations and inference processes, signal-level, perception-analogous representations are possible. Two forms of coupling are applied in this hybrid arrangement: on the one hand, an interface from the signal-near representation to the concept level, and on the other, an integration of subsymbolic and symbolic inference processes. The integration of language and vision is based on the theory of mental models, according to which an integrative and coherent representation of objects, events, and facts plays an important role. Both modalities are available to the comprehension processes via a common representation level. Characteristic of this architecture is that processing does not take place only between immediately adjacent levels; direct connections from the mental representation to sensor-near levels are also possible. In this way, rapid interaction between the visual and language components is intended to be guaranteed.
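The two forms of coupling described above can be illustrated with a minimal sketch. All class and method names here are hypothetical assumptions for illustration, not the architecture's actual implementation: a sensor-near layer yields feature vectors, a symbolic layer holds concept-level facts, and a shared mental-model level both lifts signal-level results to the concept level and can reach back directly to the sensor-near layer.

```python
from dataclasses import dataclass, field

@dataclass
class SignalLayer:
    """Sensor-near, subsymbolic level: holds raw feature vectors."""
    features: dict = field(default_factory=dict)

    def classify(self, obj_id):
        # Subsymbolic inference: a trivial threshold stands in for a
        # learned classifier mapping features to a concept label.
        x = self.features[obj_id]
        return "cup" if x[0] > 0.5 else "plate"

@dataclass
class ConceptLayer:
    """Symbolic level: concept-level facts asserted about objects."""
    facts: set = field(default_factory=set)

    def assert_fact(self, obj_id, concept):
        self.facts.add((obj_id, concept))

class MentalModel:
    """Common representation level integrating vision and language.

    Couples the two levels in both directions: signal-level results are
    lifted to symbolic facts, and symbolic queries can fall back on a
    direct connection to the sensor-near layer."""

    def __init__(self, signal, concepts):
        self.signal = signal
        self.concepts = concepts

    def perceive(self, obj_id, features):
        # Coupling 1: interface from the signal-near representation
        # to the concept level.
        self.signal.features[obj_id] = features
        label = self.signal.classify(obj_id)
        self.concepts.assert_fact(obj_id, label)
        return label

    def verify(self, obj_id, concept):
        # Coupling 2: integrated inference -- check the symbolic fact
        # base first, then re-inspect the signal level directly.
        if (obj_id, concept) in self.concepts.facts:
            return True
        return obj_id in self.signal.features and \
            self.signal.classify(obj_id) == concept

model = MentalModel(SignalLayer(), ConceptLayer())
print(model.perceive("obj1", [0.9]))  # lifted to the symbolic level
print(model.verify("obj1", "cup"))    # answered from the fact base
```

The direct fallback in `verify` mirrors the claim that processing is not confined to immediately adjacent levels: a concept-level query can bypass intermediate stages and consult the sensor-near representation, which is what is meant to allow rapid interaction between the components.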