Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592
Full metadata record
DC Field: Value (Language)
dc.contributor.author: Κοκκίνης, Δημήτριος
dc.date.accessioned: 2025-04-15T16:39:27Z
dc.date.available: 2025-04-15T16:39:27Z
dc.date.issued: 2025-03-26
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592
dc.description.abstract (en_US): Generative AI has garnered significant interest due to its remarkable advances in guided image generation, which have enabled widespread application across various fields. Although text is a convenient conditioning signal, it lacks the inherent connection between the visual and auditory realms and cannot convey the full spectrum of information. To that end, a number of works have introduced audio-guided image generation; however, they are limited to simple sounds and struggle to depict an intricate acoustic scene. Interpreting auditory information is a complex process that requires reasoning to infer missing details. Recently, Large Language Models (LLMs) have exhibited reasoning abilities and have been integrated as the cognitive core of models with extended multi-modal abilities. These models, which combine multi-modal encoders with an LLM, can understand audio and reason about its content. This thesis first investigates the capability of Audio Large Language Models (ALLMs) to produce a coherent and visually detailed description of an acoustic scene. It further explores a novel approach to image generation that leverages ALLMs and introduces a structured framework designed to transform complex auditory inputs into meaningful and imaginative visual representations. The proposed pipeline integrates multiple components: an audio source separation model, an audio language model (ALM), an LLM, and an image generation model. The mixed audio is first decomposed into distinct sources; the ALM then interprets and translates the auditory information into textual descriptions, which are subsequently refined by the LLM to enhance contextual understanding. The resulting structured textual representation guides the image generation model, producing images that align semantically with the original audio input. The efficacy of the methodology is assessed through a combination of quantitative and qualitative measures. By leveraging the internal knowledge and linguistic reasoning of LLMs, this research aims to alleviate the limitations imposed by constrained audio datasets, inferring visual details that can be deduced from the input audio and producing a plausible description of the visual scene. (A minimal illustrative code sketch of this pipeline follows the metadata record.)
dc.language (en_US): en
dc.subject (en_US): Multi-Modal Large Language Models
dc.subject (en_US): Audio Language Models
dc.subject (en_US): Prompting
dc.subject (en_US): Universal Sound Separation
dc.subject (en_US): Stable Diffusion
dc.subject (en_US): Image Generation
dc.title (en_US): Auditory Insights into Visual Scenes: A Modular Approach Leveraging Audio Separation and Advanced Language Models
dc.description.pages (en_US): 130
dc.contributor.supervisor (en_US): Βουλόδημος Αθανάσιος
dc.department (en_US): Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών (Division of Computer Science)
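
The abstract describes a pipeline of four stages: universal sound separation, ALM captioning of each separated source, LLM refinement of the captions into a single scene prompt, and text-conditioned image generation. The Python sketch below illustrates how such stages could be wired together; it is not the thesis's implementation. The functions separate_sources, describe_audio, and refine_description are hypothetical placeholder stubs standing in for the separation model, the ALM, and the LLM; only the final step uses a real API (Hugging Face diffusers), assuming the commonly used runwayml/stable-diffusion-v1-5 checkpoint and a CUDA GPU.

    # Minimal sketch of the abstract's audio-to-image pipeline, assuming
    # hypothetical placeholder components; only the diffusers call is a real API.
    import torch
    from diffusers import StableDiffusionPipeline

    def separate_sources(mixed_audio_path: str) -> list[str]:
        """Hypothetical stand-in for the universal sound separation model:
        split a mixed recording into per-source clips."""
        return [mixed_audio_path]  # stub: treat the whole mix as one source

    def describe_audio(source_path: str) -> str:
        """Hypothetical stand-in for the Audio Language Model (ALM):
        caption a single separated source."""
        return "a dog barking near a busy street"  # stub caption

    def refine_description(captions: list[str]) -> str:
        """Hypothetical stand-in for the LLM: merge per-source captions
        into one coherent, visually detailed scene prompt."""
        return "A detailed photo of a scene with " + "; ".join(captions)

    def audio_to_image(mixed_audio_path: str):
        # 1) decompose the mixture, 2) caption each source, 3) refine the
        # captions into a scene prompt, 4) condition Stable Diffusion on it.
        sources = separate_sources(mixed_audio_path)
        captions = [describe_audio(s) for s in sources]
        prompt = refine_description(captions)
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")
        return pipe(prompt).images[0]

Calling audio_to_image("mix.wav") would return a PIL image generated from the (stubbed) scene prompt; in the actual framework, each stub would be replaced by the corresponding trained model.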
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Thesis_Dimitrios_Kokkinis.pdf (10.18 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.