Story Visualization via Masked Generative Transformers with Character Guidance and Caption Augmentation

Papadimitriou, Christos

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19054

Title:	Story Visualization via Masked Generative Transformers with Character Guidance and Caption Augmentation
Authors:	Papadimitriou, Christos; Stamou, Giorgos (Στάμου Γιώργος)
Keywords:	Story Visualization; Artificial Intelligence; Computer Vision; NLP; Transformers; Caption Augmentation; Character Guidance
Issue date:	2024
Abstract:	Story Visualization (SV) is a challenging Artificial Intelligence task at the intersection of Natural Language Processing (NLP) and Computer Vision. The task consists of generating a sequence of images that visualizes a given sequence of sentences. The sentences form a coherent narrative, and so should the images. The task was introduced in 2019 and has since been approached in several ways, including GANs, Transformers and diffusion models. In this thesis we tackle the task with an architecture based on MaskGIT, a relatively recent Transformer-based approach proposed for Text-to-Image synthesis. We are the first to employ this method for SV. Specifically, we form our baseline model by enhancing the original MaskGIT architecture with additional Cross-Attention sub-layers that allow the model to integrate information from past and future captions while generating an image. We build on top of this baseline in several different ways, in search of directions that improve performance. Some of our experiments, namely leveraging a pre-trained text encoder, attempting disentanglement in the latent space, using a Token-Critic and performing super-resolution in the latent space, do not yield better results. On the other hand, we identify three directions that prove beneficial. We find that adding SV-Layers to the Transformer improves its performance in all metrics. Additionally, we propose a successful, image-agnostic caption augmentation technique that uses an LLM. Finally, our Character Guidance method, based on both positive and negative prompting, directly affects the generation of the main characters in the images and results in major improvements across all metrics. We combine the promising approaches to arrive at our top-performing architecture: MaskGST-CG w/ aug. captions. We test our approach on Pororo-SV, the most widely adopted dataset for the task.
We evaluate our models using the most prominent metrics in previous literature (including FID, Char-F1, Char-Acc and BLEU-2/3). Our best model achieves SOTA results in terms of Char-F1, Char-Acc and BLEU-2/3, which speaks to the merit of our Character Guidance approach. Additionally, it outperforms all previous GAN and Transformer approaches in terms of FID.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19054
Appears in collections:	Diploma Theses

Files in this item:

File	Description	Size	Format
Diploma_Thesis_Christos_Papadimitriou.pdf		31.02 MB	Adobe PDF

All items on this site are protected by copyright.