Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592
Full metadata record
DC Field: Value (Language)
dc.contributor.author: Κοκκίνης, Δημήτριος
dc.date.accessioned: 2025-04-15T16:39:27Z
dc.date.available: 2025-04-15T16:39:27Z
dc.date.issued: 2025-03-26
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592
dc.description.abstract (en_US): Generative AI has garnered significant interest due to its remarkable advances in guided image generation, which have enabled widespread application across various fields. Although text is a convenient conditioning signal, it lacks the inherent connection between the visual and auditory realms and cannot convey the full spectrum of information. To that end, a number of works have introduced audio-guided image generation; however, they are limited to simple sounds and struggle to depict an intricate acoustic scene. Interpreting auditory information is a complex process that requires reasoning to infer missing details. Recently, Large Language Models (LLMs) have exhibited reasoning abilities and have been integrated as the cognitive core of models with extended multi-modal abilities. These models, which combine multi-modal encoders with an LLM, can understand audio and reason about its content. This thesis first investigates the capability of Audio Large Language Models (ALLMs) to produce a coherent and visually detailed description of an acoustic scene. It further explores a novel approach to image generation that leverages ALLMs and introduces a structured framework designed to transform complex auditory inputs into meaningful and imaginative visual representations. The proposed pipeline integrates multiple components: an audio source separation model, an audio language model (ALM), an LLM, and an image generation model. The mixed audio is first decomposed into distinct sources; the ALM then interprets and translates the auditory information into textual descriptions, which are subsequently refined by the LLM to enhance contextual understanding. The resulting structured textual representation guides the image generation model, producing images that align semantically with the original audio input. The efficacy of the methodology is assessed through a combination of quantitative and qualitative measures. By leveraging the internal knowledge and linguistic reasoning of LLMs, this research aims to alleviate the limitations imposed by constrained audio datasets, inferring visual details that can be deduced from the input audio and producing a plausible description of the visual scene. (A minimal illustrative code sketch of this pipeline follows the metadata record.)
dc.language (en_US): en
dc.subject (en_US): Multi-Modal Large Language Models
dc.subject (en_US): Audio Language Models
dc.subject (en_US): Prompting
dc.subject (en_US): Universal Sound Separation
dc.subject (en_US): Stable Diffusion
dc.subject (en_US): Image Generation
dc.title (en_US): Auditory Insights into Visual Scenes: A Modular Approach Leveraging Audio Separation and Advanced Language Models
dc.description.pages (en_US): 130
dc.contributor.supervisor (en_US): Βουλόδημος Αθανάσιος
dc.department (en_US): Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών (Division of Computer Science)
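
The abstract describes a pipeline of four stages: universal sound separation, ALM captioning of each separated source, LLM refinement of the captions into a single scene prompt, and text-conditioned image generation. The Python sketch below illustrates how such stages could be wired together; it is not the thesis's implementation. The functions separate_sources, describe_audio, and refine_description are hypothetical placeholder stubs standing in for the separation model, the ALM, and the LLM; only the final step uses a real API (Hugging Face diffusers), assuming the commonly used runwayml/stable-diffusion-v1-5 checkpoint and a CUDA GPU.

    # Minimal sketch of the abstract's audio-to-image pipeline, assuming
    # hypothetical placeholder components; only the diffusers call is a real API.
    import torch
    from diffusers import StableDiffusionPipeline

    def separate_sources(mixed_audio_path: str) -> list[str]:
        """Hypothetical stand-in for the universal sound separation model:
        split a mixed recording into per-source clips."""
        return [mixed_audio_path]  # stub: treat the whole mix as one source

    def describe_audio(source_path: str) -> str:
        """Hypothetical stand-in for the Audio Language Model (ALM):
        caption a single separated source."""
        return "a dog barking near a busy street"  # stub caption

    def refine_description(captions: list[str]) -> str:
        """Hypothetical stand-in for the LLM: merge per-source captions
        into one coherent, visually detailed scene prompt."""
        return "A detailed photo of a scene with " + "; ".join(captions)

    def audio_to_image(mixed_audio_path: str):
        # 1) decompose the mixture, 2) caption each source, 3) refine the
        # captions into a scene prompt, 4) condition Stable Diffusion on it.
        sources = separate_sources(mixed_audio_path)
        captions = [describe_audio(s) for s in sources]
        prompt = refine_description(captions)
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")
        return pipe(prompt).images[0]

Calling audio_to_image("mix.wav") would return a PIL image generated from the (stubbed) scene prompt; in the actual framework, each stub would be replaced by the corresponding trained model.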
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Thesis_Dimitrios_Kokkinis.pdf (10.18 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.