Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19865
Title: Neuro-Symbolic AI for Visual Question Answering
Authors: Δελήμπασης, Λεωνίδας
Βουλόδημος Αθανάσιος
Keywords: Neuro-Symbolic Artificial Intelligence
Visual Question Answering
Scene Graph Generation
Interpretability
Explainable Artificial Intelligence
Deep Learning
Issue Date: 29-Oct-2025
Abstract: Vision-Language Models (VLMs) have demonstrated impressive performance in tasks such as image captioning and visual question answering (VQA). However, they still exhibit several limitations, including inadequate multi-step reasoning, limited fine-grained image understanding, and a lack of interpretability. These shortcomings become evident when state-of-the-art VLMs are evaluated on the LoRA VQA dataset. To address these issues, this thesis explores the development of a Neuro-Symbolic model that integrates neural networks with explicit, interpretable, and differentiable knowledge representation and reasoning capabilities. This hybrid approach enables the system to answer complex questions while training the neural components using logical rule satisfaction as the primary supervision signal, thereby reducing reliance on large-scale manual annotations. The proposed system consists of an object detector, a dense scene graph generator, and a large language model (LLM) that translates natural language questions into symbolic queries. These symbolic queries, generated using a predefined ontology, describe the reasoning steps required to answer questions about the visual content. Reasoning is performed through a fuzzy-logic-based computational graph constructed by connecting the symbolic query to relevant elements of the scene graph and knowledge base. This graph, built using the Logic Tensor Network framework, is fully differentiable, allowing the scene graph generator, including its object classifier, attribute predictor, and relationship predictor, to be trained using ground-truth answers to logical queries as supervision. Experimental results on the LoRA dataset show that the proposed approach improves question-answering accuracy by up to 49% over the Llama 3.2 11B Vision Instruct baseline, demonstrating the effectiveness of combining symbolic reasoning with neural architectures.
Although this work focuses on training the scene graph generator, identified in the literature as the primary performance bottleneck, the framework can be extended to refine the object detector and the question parser as well. Furthermore, while the current implementation targets the LoRA dataset, the method can be generalized and instantiated in new domains, offering a step toward traits associated with artificial general intelligence (AGI) systems.
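The fuzzy-logic reasoning step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation and does not use the Logic Tensor Network library: the toy scene graph, the predicate tables, the example query, and every function name below are assumptions made for illustration. The sketch uses the product t-norm for conjunction and a generalized mean as a smooth existential quantifier, one common choice in differentiable fuzzy logic. In a real setup these operations would run on tensors inside an autodiff framework so that the loss can update the scene graph generator.

```python
# Hypothetical toy scene graph: fuzzy truth values in [0, 1], of the kind a
# scene graph generator might output. All names and numbers are illustrative.
OBJECTS = ["o1", "o2", "o3"]
IS_CLASS = {("o1", "cup"): 0.9, ("o2", "cup"): 0.2, ("o3", "table"): 0.8}
HAS_ATTR = {("o1", "red"): 0.7, ("o2", "red"): 0.9}
REL = {("o1", "on", "o3"): 0.6, ("o2", "on", "o3"): 0.1}


def t_and(*values):
    """Product t-norm: fuzzy conjunction of truth values."""
    result = 1.0
    for value in values:
        result *= value
    return result


def exists(values, p=2.0):
    """Generalized (power) mean used as a smooth existential quantifier."""
    values = list(values)
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def query_red_cup_on_table():
    """Truth of: exists x, y. cup(x) and red(x) and table(y) and on(x, y)."""
    scores = []
    for x in OBJECTS:
        for y in OBJECTS:
            if x == y:
                continue
            scores.append(
                t_and(
                    IS_CLASS.get((x, "cup"), 0.0),
                    HAS_ATTR.get((x, "red"), 0.0),
                    IS_CLASS.get((y, "table"), 0.0),
                    REL.get((x, "on", y), 0.0),
                )
            )
    return exists(scores)


def answer_loss(truth, answer_is_yes):
    """Loss that is low when the query truth agrees with the ground-truth answer."""
    return 1.0 - truth if answer_is_yes else truth


if __name__ == "__main__":
    truth = query_red_cup_on_table()
    print(f"query truth: {truth:.4f}")
    print(f"loss if the ground-truth answer is 'yes': {answer_loss(truth, True):.4f}")
```

Because the query truth is a smooth function of the predicted class, attribute, and relationship scores, a ground-truth answer alone is enough to define a training signal for those scores, which is the supervision idea the abstract describes.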
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Description | Size | Format
thesis_NeSy_VQA.pdf | | 8.72 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.