Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19595
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Στάμου, Πηνελόπη | - |
dc.date.accessioned | 2025-05-03T16:01:13Z | - |
dc.date.available | 2025-05-03T16:01:13Z | - |
dc.date.issued | 2025-03-21 | - |
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19595 | - |
dc.description.abstract | Vision-Language Models (VLMs) have demonstrated remarkable capabilities in complex visuo-linguistic tasks. An extensive body of work has explored how prompting techniques and fine-tuning methods can enhance their performance. However, modern multimodal LLMs still struggle with tasks that require complex reasoning, external knowledge, and human-aligned responses. In this work, we investigate the limitations of large-scale multimodal models on open-ended tasks that demand external knowledge and commonsense reasoning. Focusing on the Stanford Image Paragraph Captioning and OK-VQA datasets, we find that although these models demonstrate substantial cognitive, linguistic, and reasoning abilities, their performance deteriorates when they must handle complex tasks while simultaneously adhering to specific response formats. Our analysis reveals that state-of-the-art multimodal models generate paragraphs that surpass the reference annotations of existing datasets, yet still fall short of consistently high-quality paragraph generation. Similarly, they continue to struggle with knowledge-based, open-ended benchmarks such as OK-VQA. To boost performance on the latter, we employ a collaborative framework comprising three models: the Scout, an LVLM that takes an image as input and describes it in a paragraph; the Analyser, an LLM that generates an initial answer to the question from the image description; and the Resolver, an LLM that extracts and formats the final answer according to a set of predefined rules. Our framework yields improved performance over the single-agent baseline, indicating the effectiveness of a collaborative approach. | en_US |
dc.language | en | en_US |
dc.subject | Large Language Models | en_US |
dc.subject | Multimodal Large Language Models | en_US |
dc.subject | Multi-Agent Systems | en_US |
dc.subject | Knowledge-Based Visual Question Answering | en_US |
dc.subject | Image Paragraph Captioning | en_US |
dc.subject | Vision-Language Models | en_US |
dc.title | Enhancing Vision-Language Models: The Role of LLMs in Augmenting Performance and Reasoning | en_US |
dc.description.pages | 120 | en_US |
dc.contributor.supervisor | Βουλόδημος Αθανάσιος | en_US |
dc.department | Computer Science Division | en_US |
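The Scout → Analyser → Resolver pipeline described in the abstract can be sketched as a simple chain of three calls. This is a minimal illustration only: the stage functions below are stand-in stubs with hypothetical names and trivial logic, since the thesis's actual models, prompts, and formatting rules are not part of this record.

```python
# Hedged sketch of the three-agent framework from the abstract.
# Each function is a stub standing in for a model call (not the author's code).

def scout(image: str) -> str:
    """Stand-in for the Scout (an LVLM): describe the image in a paragraph."""
    return f"A paragraph describing the contents of {image}."

def analyser(description: str, question: str) -> str:
    """Stand-in for the Analyser (an LLM): draft an initial answer to the
    question using only the image description."""
    return f"Draft answer to '{question}' based on: {description}"

def resolver(draft: str) -> str:
    """Stand-in for the Resolver (an LLM): extract and format the final
    answer per predefined rules (here, trivially lowercased and stripped)."""
    return draft.strip().lower()

def answer_question(image: str, question: str) -> str:
    """Chain the agents: image -> description -> draft answer -> final answer."""
    description = scout(image)
    draft = analyser(description, question)
    return resolver(draft)

print(answer_question("photo.jpg", "What sport is shown?"))
```

In the thesis, each stage would be backed by a separate model; the sketch only shows how the collaborative decomposition routes information between them.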
Appears in Collections: | Diploma Theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Diploma Thesis Stamou Penelope.pdf | Diploma Thesis Stamou Penelope | 23.32 MB | Adobe PDF | View/Open |
All items in this repository are protected by copyright.