Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19066
Title: Text-driven Articulate Talking Face Generation
Authors: Μίλης, Γεώργιος
Μαραγκός, Πέτρος
Keywords: Talking Face Generation, Audiovisual Speech Synthesis, Text-to-Visual Speech, Photorealistic Talking Faces, Portrait Videos, Avatars, Multimodal Learning
Issue Date: 28-Mar-2024
Abstract: Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans, creating a new era of lifelike virtual experiences. These endeavors not only push the boundaries of audiovisual synthesis but also hold immense potential for applications spanning entertainment, communication, and education. The state of the art in talking face generation focuses mainly on audio-driven methods, which are conditioned on either real or synthetic audio. However, the ability to directly synthesize talking humans from text transcriptions is particularly beneficial for many applications and is expected to receive increasing attention, following the recent breakthroughs in large language models. A text-driven system can provide an animated avatar that utters a conversational agent's response, paving the way towards a more natural mode of human-machine interaction. Regarding text-driven generation, the predominant approach has been to employ a cascaded 2-stage architecture of a text-to-speech module followed by an audio-driven talking face generator. However, this ignores the highly complex interplay between the audio and visual streams that occurs during speaking. In this Diploma Thesis, we construct a text-driven audiovisual speech synthesizer that uses transformers for sequence modeling and does not follow the aforementioned cascaded approach. Instead, our method, which we call NEUral Text to ARticulate Talk (NEUTART), uses joint audiovisual modeling, as well as speech-informed 3D facial reconstructions and various perceptual losses for visual supervision. Notably, we incorporate a lipreading loss which adds realism to the speaker's mouth movements. The proposed model incorporates an audiovisual module that can generate 3D talking head videos with human-like articulation and synced audiovisual streams by design. Then, a photorealistic module leverages the power of generative adversarial networks to convert the 3D talking head into an RGB video. Our experiments on audiovisual datasets as well as in-the-wild videos reveal state-of-the-art generation quality both in terms of objective metrics and human evaluation, especially when assessing the realism of lip articulation. We also showcase the effectiveness of visual supervision for speech synthesis, since our experiments reveal that NEUTART produces more intelligible speech than a similar text-to-speech architecture.
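To illustrate the joint audiovisual modeling idea described in the abstract, the sketch below shows a minimal, hypothetical version of the approach: a shared transformer encoder over phoneme embeddings feeds two decoder heads, one predicting mel-spectrogram frames (audio) and one predicting 3D face parameters (visual), so both streams are generated from the same hidden states and remain synchronized by construction, with a placeholder lipreading-style perceptual term in the loss. This is not the NEUTART implementation; all module names, dimensions, and the `lipread_feat_fn` hook are illustrative assumptions.

# Minimal sketch (assumed architecture, not the thesis code) of joint
# text-to-audiovisual synthesis with a lipreading-style perceptual loss.
import torch
import torch.nn as nn


class JointAVSynthesizer(nn.Module):
    def __init__(self, n_phonemes=80, d_model=256, n_heads=4, n_layers=4,
                 n_mels=80, n_face_params=53):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Frame-level heads; a real system would also predict phoneme
        # durations and upsample from phonemes to output frames.
        self.audio_head = nn.Linear(d_model, n_mels)          # mel frames
        self.visual_head = nn.Linear(d_model, n_face_params)  # 3D face params

    def forward(self, phoneme_ids):
        h = self.encoder(self.phoneme_emb(phoneme_ids))
        return self.audio_head(h), self.visual_head(h)


def joint_loss(mel_pred, mel_gt, face_pred, face_gt, lipread_feat_fn=None,
               w_audio=1.0, w_visual=1.0, w_lipread=0.1):
    """Combined audiovisual loss. `lipread_feat_fn` is a stand-in for a
    pretrained lipreading feature extractor (perceptual supervision on
    mouth motion); it is a hypothetical hook, not a real library call."""
    loss = w_audio * nn.functional.l1_loss(mel_pred, mel_gt)
    loss = loss + w_visual * nn.functional.mse_loss(face_pred, face_gt)
    if lipread_feat_fn is not None:
        loss = loss + w_lipread * nn.functional.l1_loss(
            lipread_feat_fn(face_pred), lipread_feat_fn(face_gt))
    return loss


if __name__ == "__main__":
    model = JointAVSynthesizer()
    phonemes = torch.randint(0, 80, (2, 120))   # batch of 2 utterances
    mel_gt = torch.randn(2, 120, 80)
    face_gt = torch.randn(2, 120, 53)
    mel_pred, face_pred = model(phonemes)
    print(joint_loss(mel_pred, mel_gt, face_pred, face_gt))

In contrast, the cascaded baseline mentioned in the abstract would train the text-to-speech and audio-driven face models separately and connect them only at inference time, which is what the joint loss above is meant to avoid.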
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19066
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: GeorgiosMilis_Thesis.pdf
Size: 14.95 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.