Research
Multimodal Communication with Social Robots and Virtual Agents
In this work, we study how speech, gaze, and pointing are integrated during referential communication, and how such behavior can be captured, analyzed, and adapted for artificial agents.
This study presents a complete workflow for capturing, processing, and aligning multimodal data from human participants performing a referential task, including speech, eye gaze, and pointing behavior. It also shows how the resulting temporal and movement-based features can inform models of referential behavior in an artificial agent. Information about speech timing, gaze patterns, pointing actions, and gesture movement dynamics is obtained using motion-capture and eye-tracking technologies.
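As an illustration of the alignment step, the sketch below resamples one stream onto the timestamps of another by nearest-neighbor matching. It is a minimal example under stated assumptions, not the pipeline used in the study: the function names (`nearest_index`, `align_streams`), the `max_gap` tolerance, and the assumption that all streams already share a common clock are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose whichever neighbor is closer in time (ties go to the earlier one).
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(reference_times, stream_times, stream_values, max_gap):
    """Resample a stream onto `reference_times` by nearest timestamp.

    Assumes both time lists are sorted and expressed on the same clock.
    Reference samples with no stream sample within `max_gap` get None.
    """
    if not stream_times:
        return [None] * len(reference_times)
    aligned = []
    for t in reference_times:
        j = nearest_index(stream_times, t)
        if abs(stream_times[j] - t) <= max_gap:
            aligned.append(stream_values[j])
        else:
            aligned.append(None)
    return aligned


# Hypothetical example: motion-capture frames every 10 ms, sparser gaze samples.
mocap_ms = [0, 10, 20, 30]
gaze_ms = [1, 18, 200]
gaze_targets = ["cup", "cup", "box"]
aligned_gaze = align_streams(mocap_ms, gaze_ms, gaze_targets, max_gap=5)
```

Leaving unmatched frames as `None` rather than interpolating keeps gaps in the data visible during later analysis; whether that is appropriate depends on the sampling rates of the devices involved.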
Multimodal data are difficult to analyze because multiple data streams must be synchronized and annotated, which is effort- and time-intensive. My research focuses on developing pipelines that make multimodal data easier to annotate, analyze, and synchronize.
In this study, we introduce a human-in-the-loop automatic gaze annotation pipeline that maps gaze data onto regions of interest (ROIs) in videos using state-of-the-art deep learning models. The pipeline enables researchers to define target objects and refine ROIs through intuitive channels such as text or visual prompts.
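The core mapping step, assigning a gaze point to a region of interest, can be sketched as follows. This is a simplified illustration rather than the pipeline itself: it assumes each ROI is an axis-aligned bounding box in pixel coordinates for a single video frame, whereas a deep learning model may produce segmentation masks instead. The function name `label_gaze` and the rule that the smallest overlapping ROI wins are assumptions made for the example.

```python
def label_gaze(gaze_xy, rois):
    """Return the label of the ROI containing the gaze point, or None.

    `rois` maps a label to a box (x_min, y_min, x_max, y_max) in pixels.
    If several boxes contain the point, the one with the smallest area is
    chosen, on the assumption that it is the more specific target.
    """
    x, y = gaze_xy
    hits = []
    for label, (x0, y0, x1, y1) in rois.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            area = (x1 - x0) * (y1 - y0)
            hits.append((area, label))
    return min(hits)[1] if hits else None


# Hypothetical ROIs for one frame: a cup standing on a table.
frame_rois = {"table": (0, 0, 100, 100), "cup": (40, 40, 60, 60)}
```

In a human-in-the-loop setting, frames where `label_gaze` returns `None` or where boxes overlap heavily are natural candidates to flag for manual review.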
In this work, I investigate how large language models can be used to generate context-appropriate emotional expressions for social robots during live human–robot dialogue.
The system uses the ongoing conversation as context, predicts the robot’s emotional state in real time, and maps this prediction to facial expressions on the robot. Through a collaborative interaction study, we compared model-driven emotional expressions with no-emotion and mismatched-emotion conditions. The results showed that congruent LLM-generated expressions made the robot appear more human-like, emotionally appropriate, and engaging.
In this work, I explore how social robots can decide where and when to look during human–robot interaction. Instead of relying only on reactive gaze shifts, we developed a planning-based gaze control system that predicts gaze targets over a short future time window.
The system coordinates gaze behavior across conversational functions such as turn-taking, gaze aversion, referential gaze, and joint attention, while also improving eye–head coordination. A user study showed that this planning-based approach was preferred over a purely reactive system and was perceived as more interpretable and better at regulating interpersonal intimacy.
In this work, I study how people perceive emotional expressions in social robots, and whether the eye region alone is sufficient for recognizing emotions.
We conducted a user study comparing human faces with robot faces that varied in appearance and visible facial region. The results show that fully animated robot faces can communicate emotions effectively, but recognition becomes less accurate when only the eyes are visible. Under this constraint, more human-like robot faces support better emotion recognition, highlighting the importance of facial design for expressive and socially intuitive robots.