Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Title: Incorporating Visual Features for Image Retrieval via Scene Graphs
Authors: Georgoulopoulos, Dimitrios
Stamou, Giorgos
Keywords: Image-to-image retrieval
Graph Neural Networks
Graph Attention Networks
GATv2
Scene Graphs
Graph Similarity
Issue Date: 4-Ιου-2025
Abstract: Image-to-image retrieval involves finding the most relevant images given a query image. A practical way to capture high-level semantic information in images is through scene graphs, which include objects and their relationships. Of course, raw visual features of the objects and the overall image are also critical for retrieval. By augmenting scene graphs with visual features and processing them with graph neural networks, we can derive powerful representations that effectively address the image-to-image retrieval problem. However, existing datasets of human-annotated scene graphs and their corresponding images are often inconsistently labeled and cluttered with objects that are irrelevant to the image's main content. Moreover, attention-based graph neural networks typically overlook edge attributes—information about the relationships between objects—that can be crucial for distinguishing one image from another. In this thesis, we address these shortcomings in two ways. First, we introduce an Importance Prediction Module that leverages Transformer encoders and a multi-head attention mechanism with learned queries to find an importance score for each object and relation by combining semantic and visual cues. We then prune the scene graph, retaining only the most significant nodes and edges. Second, we propose an Edge-Aware GATv2 layer, which integrates edge features directly into both the attention computation and the messages exchanged between nodes. We train this model so that the resulting graph embeddings align with semantic similarity as measured by image captions. We evaluate our approach and model quantitatively—comparing it against standard GNN architectures, ablated variants, and alternative methods using retrieval and ranking metrics—and qualitatively, by presenting illustrative examples.
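The pruning step described above—scoring graph elements against a learned query and keeping only the top-ranked ones—can be illustrated with a minimal NumPy sketch. This is not the thesis's actual Importance Prediction Module (which uses Transformer encoders and multi-head attention over combined semantic and visual cues); the function name, the single-head scoring, and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def prune_scene_graph(node_feats, learned_query, W_k, keep_ratio=0.5):
    """Score each node by attention of its key against a learned query,
    then keep the top `keep_ratio` fraction of nodes (at least one)."""
    keys = node_feats @ W_k.T                                  # (n, d) projected keys
    scores = softmax(keys @ learned_query / np.sqrt(len(learned_query)))
    k = max(1, int(np.ceil(keep_ratio * len(node_feats))))
    keep = np.argsort(scores)[::-1][:k]                        # indices of top-k nodes
    return np.sort(keep), scores

# Toy example: 5 object nodes with 4-dim (semantic + visual) features.
rng = np.random.default_rng(0)
node_feats = rng.standard_normal((5, 4))
W_k = rng.standard_normal((4, 4))
learned_query = rng.standard_normal(4)
keep, scores = prune_scene_graph(node_feats, learned_query, W_k, keep_ratio=0.4)
```

In the thesis's setting the same idea would be applied to both nodes (objects) and edges (relations), with the query vectors learned end-to-end.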
Our filtered, multimodal scene graphs processed with Edge-Aware GATv2 produce embeddings that capture rich information about objects, their visual appearance, and their relationships. Across all metrics, our method outperforms competing GNN variants and retrieval techniques, demonstrating its effectiveness for image-to-image retrieval.
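The core idea of an edge-aware GATv2 update—injecting edge features into both the attention scores and the aggregated messages—can be sketched for a single node in plain NumPy. This is a simplified single-head illustration of the general technique, not the layer implemented in the thesis; the weight names (`W_l`, `W_r`, `W_e`) and dimensions are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def edge_aware_gatv2_update(h_i, neighbors, edge_feats, W_l, W_r, W_e, a):
    """One-node, one-head GATv2-style update with edge features:
       score_j = a . leaky_relu(W_l h_i + W_r h_j + W_e e_ij)  (edges in attention)
       out     = sum_j alpha_j * (W_r h_j + W_e e_ij)          (edges in messages)"""
    scores = np.array([
        a @ leaky_relu(W_l @ h_i + W_r @ h_j + W_e @ e)
        for h_j, e in zip(neighbors, edge_feats)
    ])
    alpha = softmax(scores)                                    # attention over neighbors
    return sum(al * (W_r @ h_j + W_e @ e)
               for al, h_j, e in zip(alpha, neighbors, edge_feats))

# Toy example: a node with 3 neighbors, 4-dim node and 2-dim edge features.
rng = np.random.default_rng(0)
d_in, d_edge, d_out = 4, 2, 3
W_l = rng.standard_normal((d_out, d_in))
W_r = rng.standard_normal((d_out, d_in))
W_e = rng.standard_normal((d_out, d_edge))
a = rng.standard_normal(d_out)
h_i = rng.standard_normal(d_in)
neighbors = [rng.standard_normal(d_in) for _ in range(3)]
edge_feats = [rng.standard_normal(d_edge) for _ in range(3)]
out = edge_aware_gatv2_update(h_i, neighbors, edge_feats, W_l, W_r, W_e, a)
```

Applying the nonlinearity before the dot product with `a` is what distinguishes GATv2's dynamic attention from the original GAT; the edge term `W_e e_ij` lets relation labels such as "riding" vs. "next to" change both which neighbors matter and what information they contribute.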
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Appears in Collections: Diploma Theses

Files in This Item:
File: Georgoulopoulos_Diploma_Thesis_updated.pdf
Size: 24.77 MB
Format: Adobe PDF

