Title: Multimodal Deep Learning for Emotion Recognition and Expression Synthesis with Applications in Human-Robot Interaction
Authors: Φιλντίσης, Παναγιώτης Παρασκευάς
Μαραγκός, Πέτρος
Keywords: Emotion Recognition, Affect, Body Language, Context, Expression Synthesis, Audiovisual, Deep Learning, 3D Reconstruction, Visual Speech, 3D Morphable Models
Issue Date: 7-Nov-2022
Abstract: Affective computing is an exciting research area with the goal of equipping computers and robots with the capability of recognizing, expressing, modeling, and even "feeling" emotions. An interdisciplinary field, affective computing draws on computer science, mathematics, the cognitive sciences, and psychology. In this thesis, which is split into two major parts, we explore two aspects of affective computing, namely "emotion recognition" and "expression synthesis", since they constitute the most important aspects one needs to consider when building human-robot interaction systems. To this end, in the first part we study various information streams that carry valuable cues for recognizing human emotions, and we design deep learning architectures that efficiently combine information from these streams, with the ultimate goal of deploying the system for human-robot interaction, with an emphasis on child-robot interaction scenarios. While traditional approaches to emotion recognition have mostly focused on facial expressions and speech, we also take into account body language and context, and employ embeddings that accurately capture the semantic distances between discrete emotions. In the second part, we first enhance existing methods for audiovisual speech synthesis by giving them the capability to both combine emotions and express them at different intensity levels. Then, we design a deep learning-based architecture for expressive audiovisual speech synthesis that achieves a high level of realism and expressiveness, outperforming previous methods. Lastly, we present the first method for visual-speech-aware monocular perceptual 3D reconstruction in the wild. This work tackles the traditional bottleneck of collecting high-fidelity 3D ground-truth data and offers the field of affective computing an easier way to acquire expressive 3D facial data from monocular videos.
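As an illustrative aside (not taken from the thesis itself), one common way to combine emotion predictions from several information streams is late fusion: each modality produces its own per-class scores, which are then combined by weighted averaging. The modality names, weights, and emotion classes below are hypothetical, purely for the sketch:

```python
import numpy as np

def late_fusion(stream_scores, weights=None):
    """Combine per-modality emotion score vectors by weighted averaging.

    stream_scores: dict mapping a modality name (e.g. "face") to a score
    vector with one entry per emotion class.
    weights: optional dict of per-modality weights; defaults to uniform.
    """
    names = list(stream_scores)
    if weights is None:
        weights = {n: 1.0 for n in names}
    total = sum(weights[n] for n in names)
    fused = sum(weights[n] * np.asarray(stream_scores[n], dtype=float)
                for n in names) / total
    return fused

# Hypothetical scores over three classes: [happy, sad, neutral]
scores = {
    "face":   [0.7, 0.1, 0.2],
    "speech": [0.5, 0.3, 0.2],
    "body":   [0.6, 0.1, 0.3],
}
fused = late_fusion(scores)
predicted = int(fused.argmax())  # index of the winning emotion class
```

In practice, deep learning systems often learn the fusion jointly rather than averaging fixed scores, but the late-fusion baseline above is a standard point of comparison.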
Appears in Collections: Ph.D. Theses (Διδακτορικές Διατριβές)

Files in This Item:
File: Ph__D__Dissertation-5.pdf
Description: Doctoral dissertation text
Size: 53.1 MB
Format: Adobe PDF

Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.