Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18566
Title: Multimodal Deep Learning for Emotion Recognition and Expression Synthesis with Applications in Human-Robot Interaction
Authors: Φιλντίσης, Παναγιώτης Παρασκευάς
Μαραγκός, Πέτρος
Keywords: Emotion Recognition, Affect, Body Language, Context, Expression Synthesis, Audiovisual, Deep Learning, 3D Reconstruction, 3D Morphable Models, Visual Speech
Issue Date: 7-Nov-2022
Abstract: Affective computing is an exciting new research area with the goal of equipping computers and robots with the capability of recognizing, expressing, modeling, and even "feeling" emotions. An interdisciplinary field, affective computing draws resources from computer science, mathematics, cognitive sciences, and psychology. In this thesis, which is split into two major parts, we explore two aspects of affective computing, namely "emotion recognition" and "expression synthesis", since they constitute the most important aspects one needs to consider when building human-robot interaction systems. To this end, in the first part, we explore and study various information streams that contain valuable information for recognizing the emotions of a human, and design deep learning architectures that can efficiently combine information from these streams, with the ultimate goal of deploying the system for human-robot interaction, with an emphasis on child-robot interaction scenarios. While traditional approaches for emotion recognition have mostly focused on facial expressions and speech, we also take into account body language and context, and employ embeddings that accurately capture the semantic distances between discrete emotions. In the second part, we first enhance existing methods for audiovisual speech synthesis by giving them the capability to both combine emotions and express them at different intensity levels. Then, we design a deep learning-based architecture for expressive audiovisual speech synthesis which achieves a high level of realism and expressiveness, outperforming previous methods. Lastly, we present the first method for visual speech-aware perceptual monocular 3D reconstruction in the wild. This work tackles the traditional bottleneck of collecting high-fidelity 3D ground truth data and offers the field of affective computing an easier way to acquire expressive 3D facial data from monocular videos.
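
The multimodal recognition approach described in the abstract can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch example of late fusion over face, body, and context features for discrete emotion classification; the class name, feature dimensions, and seven-class output are assumptions chosen for illustration, not the architecture actually proposed in the thesis.

    import torch
    import torch.nn as nn

    class MultimodalEmotionNet(nn.Module):
        """Hypothetical late-fusion network: one encoder per cue
        (face, body, context), fused for discrete emotion classification."""

        def __init__(self, feat_dim=512, num_emotions=7):
            super().__init__()
            # One small encoder per input stream; in practice the features
            # would come from pretrained visual backbones (an assumption here).
            self.face_enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
            self.body_enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
            self.context_enc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
            # Fusion layer over the concatenated per-stream representations.
            self.fusion = nn.Sequential(nn.Linear(3 * 256, 256), nn.ReLU())
            self.classifier = nn.Linear(256, num_emotions)

        def forward(self, face, body, context):
            fused = self.fusion(torch.cat(
                [self.face_enc(face), self.body_enc(body), self.context_enc(context)],
                dim=-1))
            return self.classifier(fused)  # logits over discrete emotions

    # Example forward pass with random features for a batch of 4.
    model = MultimodalEmotionNet()
    logits = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 7])

Late fusion of this kind keeps each cue's encoder independent, so a stream can be added or ablated without retraining the others; the thesis abstract leaves the exact fusion strategy unspecified.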
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18566
Appears in Collections: Doctoral Dissertations - Ph.D. Theses

Files in This Item:
File                       Description                  Size     Format
Ph__D__Dissertation-5.pdf  Doctoral Dissertation Text   53.1 MB  Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.