A Computational Model for the Alignment of Hierarchical Scene Representations in Human-Robot Interaction

The ultimate goal of human-robot interaction is to enable a robot to communicate seamlessly with a human in a natural, human-like fashion. Most work in this field concentrates on speech interpretation and gesture recognition, assuming that a propositional scene representation is already available. Less work has been dedicated to extracting the relevant scene structures that underlie these propositions. As a consequence, most approaches are restricted to place recognition or simple table-top settings and do not generalize to more complex room setups. In this paper, we propose a hierarchical spatial model that is empirically motivated by psycholinguistic studies. Using this model, the robot is able to extract scene structures from a time-of-flight depth sensor and adjust its spatial scene representation by taking verbal statements about partial scene aspects into account. Without assuming any prior model of the specific room, we show that the system aligns its sensor-based room representation to a semantically meaningful representation typically used by the human descriptor.
This work is presented at the International Joint Conference on Artificial Intelligence (IJCAI'09, Pasadena, CA, USA), and a brief summary is given as a poster at the International Computer Vision Summer School (ICVSS'09, Sicily, Italy). [paper, slides, poster]

This video presents the development of a model from a specific verbal scene description given by a subject in the study we conducted:
download video here (right mouse button: save link as)