Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19595
Full metadata record
DC Field | Value | Language
dc.contributor.author | Στάμου, Πηνελόπη | -
dc.date.accessioned | 2025-05-03T16:01:13Z | -
dc.date.available | 2025-05-03T16:01:13Z | -
dc.date.issued | 2025-03-21 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19595 | -
dc.description.abstract | Vision-Language Models (VLMs) have demonstrated remarkable capabilities in complex visuo-linguistic tasks. An extensive body of work has explored how prompting techniques and fine-tuning methods can be used to enhance their performance. However, modern multimodal LLMs still struggle with tasks that require complex reasoning, external knowledge, and human-aligned responses. In this work, we investigate the limitations of large-scale multimodal models in handling open-ended tasks that demand external knowledge and commonsense reasoning. Focusing on the Stanford Image Paragraph Captioning and OK-VQA datasets, we find that although these models demonstrate substantial cognitive, linguistic, and reasoning abilities, their performance deteriorates when they must handle complex tasks while simultaneously adhering to specific response formats. Our analysis reveals that state-of-the-art multimodal models surpass existing datasets in paragraph generation but continue to face challenges in producing high-quality paragraphs. Similarly, they continue to struggle with knowledge-based, open-ended benchmarks such as OK-VQA. To boost their performance on the latter, we employ a collaborative framework comprising three models: the Scout, an LVLM that takes an image as input and describes it in a paragraph; the Analyser, an LLM that generates an initial answer to the question based on the image description; and the Resolver, an LLM that extracts and formats the final answer based on a set of predefined rules. Our framework yields improved performance over the single-agent baseline, indicating the effectiveness of a collaborative approach. | en_US
dc.language | en | en_US
dc.subject | Large Language Models | en_US
dc.subject | Multimodal Large Language Models | en_US
dc.subject | Multi-Agent Systems | en_US
dc.subject | Knowledge-Based Visual Question Answering | en_US
dc.subject | Image Paragraph Captioning | en_US
dc.subject | Vision-Language Models | en_US
dc.title | Enhancing Vision-Language Models: The Role of LLMs in Augmenting Performance and Reasoning | en_US
dc.description.pages | 120 | en_US
dc.contributor.supervisor | Βουλόδημος Αθανάσιος | en_US
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US
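The abstract describes a three-stage collaborative pipeline (Scout → Analyser → Resolver). The following is a minimal Python sketch of that control flow only: the function names and the stubbed model calls are hypothetical illustrations, not code from the thesis, and in a real system each stub would invoke an actual LVLM or LLM.

```python
# Hypothetical sketch of the Scout -> Analyser -> Resolver pipeline from the
# abstract. All three "model" functions are deterministic stubs standing in
# for real LVLM/LLM calls; the names are assumptions made for illustration.

def scout_describe(image: str) -> str:
    """Scout (LVLM): take an image and describe it in a paragraph. Stubbed."""
    return f"A paragraph describing the contents of {image}."

def analyser_answer(description: str, question: str) -> str:
    """Analyser (LLM): draft an initial answer from the description. Stubbed."""
    return f"Initial answer to '{question}' based on: {description}"

def resolver_format(draft: str, rules: tuple) -> str:
    """Resolver (LLM): extract and format the final answer per predefined rules."""
    # Toy extraction: pull the quoted span; a real Resolver would be an LLM.
    answer = draft.split("'")[1] if "'" in draft else draft
    return answer.lower().strip() if "lowercase" in rules else answer.strip()

def answer_question(image: str, question: str,
                    rules: tuple = ("lowercase", "concise")) -> str:
    description = scout_describe(image)              # stage 1: image -> paragraph
    draft = analyser_answer(description, question)   # stage 2: paragraph -> draft answer
    return resolver_format(draft, rules)             # stage 3: draft -> final format
```

The value of this decomposition, per the abstract, is that no single model must both reason over the image and satisfy the response-format constraints at once; each stage handles one concern.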
Appears in Collections: Diploma Theses - Theses

Files in this item:
File | Description | Size | Format
Diploma Thesis Stamou Penelope.pdf | Diploma Thesis Stamou Penelope | 23.32 MB | Adobe PDF


All items in this repository are protected by copyright.