Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Title: Incorporating Visual Features for Image Retrieval via Scene Graphs
Authors: Georgoulopoulos, Dimitrios
Stamou, Giorgos
Keywords: Image-to-image retrieval
Graph Neural Networks
Graph Attention Networks
GATv2
Scene Graphs
Graph Similarity
Issue Date: 4-Jul-2025
Abstract: Image-to-image retrieval involves finding the most relevant images given a query image. A practical way to capture high-level semantic information in images is through scene graphs, which include objects and their relationships. Of course, raw visual features of the objects and the overall image are also critical for retrieval. By augmenting scene graphs with visual features and processing them with graph neural networks, we can derive powerful representations that effectively address the image-to-image retrieval problem. However, existing datasets of human-annotated scene graphs and their corresponding images are often inconsistently labeled and cluttered with objects that are irrelevant to the image's main content. Moreover, attention-based graph neural networks typically overlook edge attributes—information about the relationships between objects—that can be crucial for distinguishing one image from another. In this thesis, we address these shortcomings in two ways. First, we introduce an Importance Prediction Module that leverages Transformer encoders and a multi-head attention mechanism with learned queries to compute an importance score for each object and relation by combining semantic and visual cues. We then prune the scene graph, retaining only the most significant nodes and edges. Second, we propose an Edge-Aware GATv2 layer, which integrates edge features directly into both the attention computation and the messages exchanged between nodes. We train this model so that the resulting graph embeddings align with semantic similarity as measured by image captions. We evaluate our approach and model quantitatively—comparing it against standard GNN architectures, ablated variants, and alternative methods using retrieval and ranking metrics—and qualitatively, by presenting illustrative examples.
Our filtered, multimodal scene graphs processed with Edge-Aware GATv2 produce embeddings that capture rich information about objects, their visual appearance, and their relationships. Across all metrics, our method outperforms competing GNN variants and retrieval techniques, demonstrating its effectiveness for image-to-image retrieval.
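The Importance Prediction Module scores each object and relation and keeps only the top-ranked ones. As a rough single-query illustration (the thesis uses Transformer encoders and multi-head attention with learned queries; the function name, dot-product scoring, and plain top-k cutoff here are simplifying assumptions, not the thesis's exact formulation), the pruning step might look like:

```python
import math

def importance_prune(features, query, k):
    """Score each graph element by dot-product attention against a
    learned query vector, then keep the indices of the top-k.

    features: list of element feature vectors (semantic + visual cues
              already fused upstream); query: a learned query vector;
              k: number of nodes/edges to retain.
    """
    # Raw attention logits: query . feature for every element.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax turns logits into importance weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Rank by importance and keep the top-k, in original graph order.
    ranked = sorted(range(len(features)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])
```

For example, with `query = [1.0, 0.0]` and features `[[5.0, 0.0], [0.1, 0.0], [3.0, 0.0]]`, `importance_prune(..., k=2)` keeps elements 0 and 2 and discards the low-scoring clutter element 1.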
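The Edge-Aware GATv2 layer feeds edge features into both the attention score and the messages exchanged between nodes. A minimal single-head sketch in plain Python (kept dependency-free for brevity; the weight names `W_src`, `W_dst`, `W_edge` and the exact placement of the edge term are assumptions about the architecture, not its definitive implementation):

```python
import math
import random

random.seed(0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(*vs):
    return [sum(t) for t in zip(*vs)]

def leaky_relu(v, slope=0.2):
    return [x if x > 0 else slope * x for x in v]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class EdgeAwareGATv2Layer:
    """One attention head; edge features enter both the attention
    logits and the messages, as the abstract describes."""

    def __init__(self, node_dim, edge_dim, out_dim):
        self.W_src = rand_matrix(out_dim, node_dim)   # transforms the target node
        self.W_dst = rand_matrix(out_dim, node_dim)   # transforms each neighbour
        self.W_edge = rand_matrix(out_dim, edge_dim)  # transforms the edge feature
        self.a = [random.uniform(-0.5, 0.5) for _ in range(out_dim)]

    def forward(self, h, edge_feats, neighbours):
        """h: node features; edge_feats[(i, j)]: feature of edge i->j;
        neighbours[i]: list of j with an edge i->j (self-loops included)."""
        out = []
        for i, h_i in enumerate(h):
            scores, msgs = [], []
            for j in neighbours[i]:
                z = vadd(matvec(self.W_src, h_i),
                         matvec(self.W_dst, h[j]),
                         matvec(self.W_edge, edge_feats[(i, j)]))
                # GATv2-style scoring: nonlinearity *before* the dot product.
                scores.append(sum(a * s for a, s in zip(self.a, leaky_relu(z))))
                # The edge feature also contributes to the message itself.
                msgs.append(vadd(matvec(self.W_dst, h[j]),
                                 matvec(self.W_edge, edge_feats[(i, j)])))
            # Softmax over the neighbourhood, then attention-weighted sum.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            total = sum(exps)
            alphas = [e / total for e in exps]
            out.append([sum(a * mv[d] for a, mv in zip(alphas, msgs))
                        for d in range(len(self.a))])
        return out
```

In the actual model the layer would use multiple heads with parameters trained end-to-end against the caption-similarity objective; this sketch only shows where the edge term enters relative to a standard GATv2 head.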
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Appears in Collections: Diploma Theses (Διπλωματικές Εργασίες)

Files in This Item:
File: Georgoulopoulos_Diploma_Thesis_updated.pdf (24.77 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.