Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19886
Full metadata record
DC Field: Value
dc.contributor.author: Ασπρογέρακας, Ιωάννης
dc.date.accessioned: 2025-11-04T08:07:14Z
dc.date.available: 2025-11-04T08:07:14Z
dc.date.issued: 2025-10-24
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19886
dc.description.abstract: Multimodal Emotion Recognition (MER) aims to model human affect by integrating complementary signals from language, vision, and audio. While deep learning methods have achieved impressive results through cross-modal fusion, most assume complete modality availability during training and inference, a condition rarely met in real-world deployments, where occlusions, noise, or sensor failures frequently cause missing modalities. Addressing this problem requires robust imputation strategies that can recover missing signals without sacrificing efficiency. In this work, we explore the design space of diffusion models for missing-modality imputation, building upon and extending the IMDER framework. We propose a decoupled two-stage training scheme in which modality-specific diffusion models are pre-trained independently and then integrated into the MER pipeline. This design avoids the instability of end-to-end IMDER training, where untrained diffusion models initially degrade classifier performance. In addition, we systematically compare stochastic differential equation (SDE) formulations, specifically Variance Preserving (VP) and Variance Exploding (VE) processes, evaluate alternative conditioning mechanisms with transformer-based backbones, and investigate multiple sampling strategies to balance efficiency and accuracy. Extensive experiments on CMU-MOSI and CMU-MOSEI demonstrate consistent improvements under both fixed and random missing protocols. Our quality-focused configuration achieves superior accuracy, with up to +2% F1 and +1.5% ACC2 over IMDER, while delivering 5× faster inference. Our speed-optimized configuration maintains competitive performance (+1% ACC2, +0.5% F1) with 15× faster inference, making it suitable for real-time MER applications.
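For reference, the VP and VE processes compared in the abstract are the standard forward SDE formulations from the score-based generative modeling literature; the notation below (noise schedule β(t), noise scale σ(t), Wiener process w) follows that literature and is not taken from the thesis itself:

```latex
% Variance Preserving (VP) forward SDE:
%   shrinks the signal while injecting noise, keeping variance bounded.
\mathrm{d}\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t
  + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}

% Variance Exploding (VE) forward SDE:
%   no drift term; the noise variance sigma^2(t) grows without bound.
\mathrm{d}\mathbf{x} = \sqrt{\frac{\mathrm{d}\bigl[\sigma^{2}(t)\bigr]}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w}
```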
dc.language: en
dc.subject: Diffusion Models
dc.subject: Multimodal Emotion Recognition
dc.subject: Stochastic Differential Equations
dc.subject: Multimodal Deep Learning
dc.subject: Deep Generative Modeling
dc.title: Efficient Incomplete Multimodal-Diffused Emotion Recognition
dc.description.pages: 161
dc.contributor.supervisor: Ποταμιάνος Αλέξανδρος
dc.department: Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics)
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: ioannisasprogerakas_thesis.pdf
Size: 28.44 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.