Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Κοκκίνης, Δημήτριος | - |
dc.date.accessioned | 2025-04-15T16:39:27Z | - |
dc.date.available | 2025-04-15T16:39:27Z | - |
dc.date.issued | 2025-03-26 | - |
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19592 | - |
dc.description.abstract | Generative AI has garnered significant interest due to remarkable advances in guided image generation, which have enabled widespread application across various fields. Although text is a straightforward conditioning modality, it lacks the inherent connection between the visual and auditory realms and cannot convey the full spectrum of information. To that end, a number of works have introduced audio-guided image generation; however, these are limited to simple sounds and struggle to depict intricate acoustic scenes. Interpreting auditory information is a complex process that requires reasoning to infer missing details. Recently, Large Language Models have exhibited reasoning abilities and have been integrated as the cognitive core of models with extended multi-modal capabilities. These models, consisting of multi-modal encoders coupled with an LLM, can understand audio and reason about its content. This thesis first investigates the capability of Audio Large Language Models (ALLMs) to produce coherent and visually detailed descriptions. It further explores a novel approach to image generation that leverages ALLMs and introduces a structured framework designed to transform complex auditory inputs into meaningful and imaginative visual representations. The proposed pipeline integrates multiple components, including an audio source separation model, an audio language model (ALM), a large language model (LLM), and an image generation model. The mixed audio is first decomposed into distinct sources; the ALM then interprets and translates the auditory information into textual descriptions, which are subsequently refined by the LLM to enhance contextual understanding. The resulting structured textual representation guides the image generation model, producing images that align semantically with the original audio input. The efficacy of the methodology is assessed through a combination of quantitative and qualitative measures. By leveraging the internal knowledge and linguistic reasoning of LLMs, this research aims to alleviate the limitations imposed by constrained audio datasets, inferring visual details that can be deduced from the input audio and producing a plausible description of the visual scene. | en_US |
dc.language | en | en_US |
dc.subject | Multi-Modal Large Language Models | en_US |
dc.subject | Audio Language Models | en_US |
dc.subject | Prompting | en_US |
dc.subject | Universal Sound Separation | en_US |
dc.subject | Stable Diffusion | en_US |
dc.subject | Image Generation | en_US |
dc.title | Auditory Insights into Visual Scenes: A Modular Approach Leveraging Audio Separation and Advanced Language Models | en_US |
dc.description.pages | 130 | en_US |
dc.contributor.supervisor | Βουλόδημος Αθανάσιος | en_US |
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US |
Appears in Collections: | Διπλωματικές Εργασίες - Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Thesis_Dimitrios_Kokkinis.pdf | | 10.18 MB | Adobe PDF |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.
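The abstract above describes a modular audio-to-image pipeline (source separation, an audio language model, an LLM refinement step, and an image generation model). The following is a minimal sketch of that flow, assuming hypothetical stub functions in place of the actual models used in the thesis; it illustrates only the order of the stages, not the real implementation.

```python
# A minimal sketch of the modular pipeline described in the abstract.
# Every function here is a hypothetical placeholder standing in for the
# real components (a universal sound separation model, an audio language
# model, an LLM, and a Stable Diffusion image generator).

from typing import List


def separate_sources(mixed_audio: bytes) -> List[bytes]:
    """Placeholder for a universal sound separation model: decomposes a
    mixed recording into individual source tracks."""
    return [mixed_audio]  # stub: pretend the mix contains a single source


def describe_audio(source: bytes) -> str:
    """Placeholder for an audio language model (ALM) that turns one audio
    source into a textual description of its content."""
    return "a dog barking near a busy street"  # stub output


def refine_description(descriptions: List[str]) -> str:
    """Placeholder for an LLM step that merges per-source descriptions
    into one coherent, visually detailed scene prompt."""
    return "A dog barking at passing cars on a busy city street at dusk."


def generate_image(prompt: str) -> bytes:
    """Placeholder for a text-to-image model (e.g. Stable Diffusion)
    conditioned on the refined scene description."""
    return prompt.encode()  # stub: return the prompt bytes as a fake image


def audio_to_image(mixed_audio: bytes) -> bytes:
    """End-to-end flow: separate -> describe -> refine -> generate."""
    sources = separate_sources(mixed_audio)
    descriptions = [describe_audio(s) for s in sources]
    scene_prompt = refine_description(descriptions)
    return generate_image(scene_prompt)


if __name__ == "__main__":
    image = audio_to_image(b"\x00" * 16)  # dummy audio bytes
    print(f"Generated {len(image)} bytes of image data")
```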