Neuro-Symbolic AI for Visual Question Answering

Δελήμπασης, Λεωνίδας

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19865

Title:	Neuro-Symbolic AI for Visual Question Answering
Authors:	Δελήμπασης, Λεωνίδας; Βουλόδημος, Αθανάσιος
Keywords:	Neuro-Symbolic Artificial Intelligence; Visual Question Answering; Scene Graph Generation; Interpretability; Explainable Artificial Intelligence; Deep Learning
Issue date:	29-Oct-2025
Abstract:	Vision-Language Models (VLMs) have demonstrated impressive performance in tasks such as image captioning and visual question answering (VQA). However, they still exhibit several limitations, including inadequate multi-step reasoning, limited fine-grained image understanding, and a lack of interpretability. These shortcomings become evident when state-of-the-art VLMs are evaluated on the LoRA VQA Dataset. To address these issues, this thesis develops a Neuro-Symbolic model that integrates neural networks with explicit, interpretable, and differentiable knowledge representation and reasoning. This hybrid approach enables the system to answer complex questions while training the neural components with logical rule satisfaction as the primary supervision signal, thereby reducing reliance on large-scale manual annotations. The proposed system consists of an object detector, a dense scene graph generator, and a large language model (LLM) that translates natural language questions into symbolic queries. These symbolic queries, generated using a predefined ontology, describe the reasoning steps required to answer questions about the visual content. Reasoning is performed through a fuzzy-logic-based computational graph constructed by connecting the symbolic query to relevant elements of the scene graph and knowledge base. This graph, built using the Logic Tensor Network framework, is fully differentiable, allowing the scene graph generator, including its object classifier, attribute predictor, and relationship predictor, to be trained using ground-truth answers to logical queries as supervision. Experimental results on the LoRA dataset show that the proposed approach improves question-answering accuracy by up to 49% over the Llama 3.2 11B Vision Instruct baseline, demonstrating the effectiveness of combining symbolic reasoning with neural architectures.
Although this work focuses on training the scene graph generator, identified in the literature as the primary performance bottleneck, the framework is extensible to also refine the object detector and the question parser. Furthermore, while the current implementation targets the LoRA Dataset, the method can be generalized and instantiated in new domains, offering a step toward traits associated with artificial general intelligence (AGI) systems.
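The fuzzy-logic reasoning step described in the abstract can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the predicate names (`cup`, `table`, `on`), the scores, the product t-norm for conjunction, and the generalized p-mean as a smooth existential quantifier. In a real system the scores would be tensors produced by the scene graph generator inside an automatic-differentiation framework, so the loss could be backpropagated to its classifiers.

```python
from itertools import product

# Hypothetical scene-graph scores in [0, 1] for two detected objects.
# All names and values are illustrative, not from the thesis.
is_class = {
    "cup":   [0.9, 0.1],    # degree to which object i is a cup
    "table": [0.05, 0.8],   # degree to which object i is a table
}
relation = {
    "on": [[0.0, 0.7],      # on[i][j]: degree to which object i is on object j
           [0.1, 0.0]],
}

def f_and(*vals):
    """Product t-norm: fuzzy conjunction of truth degrees."""
    out = 1.0
    for v in vals:
        out *= v
    return out

def f_exists(vals, p=2):
    """Generalized p-mean, used here as a smooth existential quantifier."""
    vals = list(vals)
    return (sum(v ** p for v in vals) / len(vals)) ** (1.0 / p)

def query_truth(is_class, relation):
    """Truth degree of: exists x, y. cup(x) AND table(y) AND on(x, y)."""
    n = len(is_class["cup"])
    return f_exists(
        f_and(is_class["cup"][i], is_class["table"][j], relation["on"][i][j])
        for i, j in product(range(n), repeat=2)
    )

truth = query_truth(is_class, relation)
# If the ground-truth answer is "yes", training would push truth toward 1.
loss = 1.0 - truth
```

Because every operation is a product, sum, or power, the truth degree varies smoothly with each score, which is the property that lets the answer to a query act as a supervision signal for the predictors that produced those scores.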
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19865
Appears in collections:	Διπλωματικές Εργασίες - Theses

Files in this item:

File	Description	Size	Format
thesis_NeSy_VQA.pdf		8.72 MB	Adobe PDF

All items on this site are protected by copyright.