Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19642
Full metadata record
DC Field | Value | Language
dc.contributor.author | Αραβανής, Τηλέμαχος | -
dc.date.accessioned | 2025-07-02T07:28:38Z | -
dc.date.available | 2025-07-02T07:28:38Z | -
dc.date.issued | 2025-07-01 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19642 | -
dc.description.abstract | Generative image models have seen significant advancements in the past few years, enabling the creation of highly realistic images from text prompts. However, while proficient at high-fidelity image generation and alignment with the text prompt, text-to-image generative models do not offer the user the desired degree of controllability through text alone. For this reason, personalization algorithms have been developed for these models to allow users to guide image generation in ways that reflect specific preferences, making the output more tailored and meaningful. More specifically, an envisioned research direction is the rendition of images that share the same visual interpretation of a text-specified style. Recent state-of-the-art personalization techniques for generative text-to-image models aim to achieve this by finetuning the model's backbone on a set of images that share common visual stylistic elements. To address the high computational cost of this optimization, more recent methods use the attention layers of the model during batched inference to transfer stylistic visual elements from a reference image to the other images in the batch. However, these stylistic alignment approaches often fail to effectively separate semantic content from stylistic elements, leading to content leakage from the reference image due to their uniform application across instances. We contend that the inherent variability in text-to-image models, stemming from input prompts and noise, necessitates an adaptive approach within these style alignment methods. To address this challenge, we exploit the explainability of the attention mechanism and propose a novel method that mitigates content leakage in a semantically coherent manner within the context of attention-based style alignment, while preserving stylistic consistency. Furthermore, to enhance adaptivity, we introduce a content leakage localization process during inference, allowing the stylistic alignment process to be tuned so that it faithfully transfers the desired style. Evaluation of our method across diverse image objects and styles demonstrates a significant improvement over state-of-the-art style alignment methods, removing the undesired effect of content leakage while maintaining the desired stylistic alignment. | en_US
dc.language | el | en_US
dc.subject | generative models | en_US
dc.subject | text-to-image generation | en_US
dc.subject | attention-based models | en_US
dc.subject | explainability | en_US
dc.subject | personalization | en_US
dc.subject | stylistic alignment | en_US
dc.subject | content leakage | en_US
dc.title | Text to image stylistic alignment via explainability of attention | en_US
dc.description.pages | 116 | en_US
dc.contributor.supervisor | Μαραγκός Πέτρος | en_US
dc.department | Τομέας Σημάτων, Ελέγχου και Ρομποτικής | en_US
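Note: the attention-based style transfer mentioned in the abstract (a reference image's attention context is exposed to the other images generated in the same batch) can be illustrated with a minimal, hypothetical sketch. The code below is not the thesis implementation; the function name, shapes, ref_index parameter, and the use of PyTorch are assumptions made purely for illustration of the general shared-attention idea.

# Minimal, hypothetical sketch of attention-based style sharing.
# Every image in the batch attends to its own keys/values *and* those of a
# designated reference image, so stylistic statistics of the reference are
# propagated to the rest of the batch. All names and shapes are illustrative.

import torch

def shared_style_attention(q, k, v, ref_index=0):
    """q, k, v: (B, H, N, D). The image at `ref_index` acts as the style
    reference; its keys/values are appended to every image's attention context."""
    B, H, N, D = q.shape
    k_ref = k[ref_index:ref_index + 1].expand(B, -1, -1, -1)   # (B, H, N, D)
    v_ref = v[ref_index:ref_index + 1].expand(B, -1, -1, -1)
    k_cat = torch.cat([k, k_ref], dim=2)                        # (B, H, 2N, D)
    v_cat = torch.cat([v, v_ref], dim=2)
    attn = torch.softmax(q @ k_cat.transpose(-2, -1) / D ** 0.5, dim=-1)
    return attn @ v_cat                                          # (B, H, N, D)

# Toy usage with random projections standing in for a diffusion model's
# self-attention inputs: batch of 4 images, 8 heads, 64 tokens, head dim 32.
q = torch.randn(4, 8, 64, 32)
k = torch.randn(4, 8, 64, 32)
v = torch.randn(4, 8, 64, 32)
out = shared_style_attention(q, k, v)   # (4, 8, 64, 32)

Because the reference keys/values are injected uniformly for every token of every image, semantic content of the reference can be copied along with its style; the thesis's contribution, per the abstract, is to localize and mitigate this content leakage adaptively using the explainability of the attention maps.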
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Description | Size | Format
Thesis_Aravanis-2.pdf |  | 122.54 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.