Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19875
Title: Occlusion-Robust Audiovisual Face Reconstruction with Temporal Modeling
Authors: Αγγελική Τσινούκα
Μαραγκός Πέτρος
Keywords: Audiovisual Face Reconstruction
Multimodal Learning
Talking Avatars
Face Modeling
Synthetic Occlusions
Issue Date: 28-Oct-2025
Abstract: Recent progress in deep learning for 3D face reconstruction and animation has enabled the creation of realistic digital humans capable of reproducing subtle expressions and natural speech-driven motion. These advances open new opportunities for applications across communication, entertainment, AR/VR, and education. Nevertheless, significant challenges remain. Conventional approaches often fail to fully capture the expressive dynamics of human faces, struggle with temporal consistency, and are not robust to real-world conditions such as partial occlusions or interfering background voices. As the demand for detailed digital avatars grows, especially in interactive systems, developing methods that combine realism, expressiveness, and robustness becomes increasingly crucial. This thesis addresses these challenges in 3D face reconstruction through the design of an audiovisual learning technique from input video, which we call FAVOR, extending the SMIRK framework. Our method applies synthetic occlusions to the training dataset to improve robustness and employs a lip-reading loss for supervision, guiding the model toward more accurate mouth movements. By combining multimodal signals with training strategies that reflect real-world variability, the proposed approach generates talking avatars that remain coherent and natural even when visual information is missing or corrupted. The system also ensures proper synchronization between speech and facial motion, reducing common artifacts. Extensive experimental analysis on videos with natural occlusions demonstrates that the proposed model achieves robust and temporally consistent results compared to single-modality methods, in both qualitative and quantitative evaluations. A user study further confirms the perceptual quality of the generated avatars, while an ablation study highlights the contribution of each component to the overall performance.
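The abstract mentions applying synthetic occlusions to training frames to improve robustness. As an illustration only, a minimal sketch of this kind of augmentation is shown below; the function name, the rectangular patch shape, and the black fill value are assumptions for demonstration and do not reflect FAVOR's actual implementation:

```python
import numpy as np

def apply_synthetic_occlusion(frame, rng, max_frac=0.4):
    """Mask a random rectangular patch of a video frame (H, W, C).

    Generic occlusion-augmentation sketch, NOT the thesis's method:
    patch geometry and fill value are illustrative assumptions.
    """
    h, w = frame.shape[:2]
    # Sample a patch covering at most max_frac of each dimension.
    ph = int(rng.integers(1, max(2, int(h * max_frac))))
    pw = int(rng.integers(1, max(2, int(w * max_frac))))
    # Sample a top-left corner so the patch fits inside the frame.
    y = int(rng.integers(0, h - ph + 1))
    x = int(rng.integers(0, w - pw + 1))
    occluded = frame.copy()
    occluded[y:y + ph, x:x + pw] = 0.0  # black patch as a stand-in occluder
    return occluded

rng = np.random.default_rng(0)
frame = np.ones((64, 64, 3), dtype=np.float32)
aug = apply_synthetic_occlusion(frame, rng)
```

During training, such an augmented frame would replace the clean input while the supervision targets (e.g. the lip-reading loss on the rendered mouth region) are still computed from the unoccluded signal, encouraging the model to infer the hidden facial regions from audio and temporal context.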
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: thesis_angeliki_tsinouka.pdf
Size: 35.49 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.