Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Title: Incorporating Visual Features for Image Retrieval via Scene Graphs
Authors: Georgoulopoulos, Dimitrios
Stamou, Giorgos
Keywords: Image-to-image retrieval
Graph Neural Networks
Graph Attention Networks
GATv2
Scene Graphs
Graph Similarity
Issue Date: 4-Jul-2025
Abstract: Image-to-image retrieval involves finding the most relevant images given a query image. A practical way to capture high-level semantic information in images is through scene graphs, which include objects and their relationships. Of course, raw visual features of the objects and the overall image are also critical for retrieval. By augmenting scene graphs with visual features and processing them with graph neural networks, we can derive powerful representations that effectively address the image-to-image retrieval problem. However, existing datasets of human-annotated scene graphs and their corresponding images are often inconsistently labeled and cluttered with objects that are irrelevant to the image's main content. Moreover, attention-based graph neural networks typically overlook edge attributes—information about the relationships between objects—that can be crucial for distinguishing one image from another. In this thesis, we address these shortcomings in two ways. First, we introduce an Importance Prediction Module that leverages Transformer encoders and a multi-head attention mechanism with learned queries to compute an importance score for each object and relation by combining semantic and visual cues. We then prune the scene graph, retaining only the most significant nodes and edges. Second, we propose an Edge-Aware GATv2 layer, which integrates edge features directly into both the attention computation and the messages exchanged between nodes. We train this model so that the resulting graph embeddings align with semantic similarity as measured by image captions. We evaluate our approach and model quantitatively—comparing it against standard GNN architectures, ablated variants, and alternative methods using retrieval and ranking metrics—and qualitatively, by presenting illustrative examples.
Our filtered, multimodal scene graphs processed with Edge-Aware GATv2 produce embeddings that capture rich information about objects, their visual appearance, and their relationships. Across all metrics, our method outperforms competing GNN variants and retrieval techniques, demonstrating its effectiveness for image-to-image retrieval.
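The Importance Prediction Module scores each object and relation and keeps only the top-ranked ones. As a rough single-query illustration (the thesis uses Transformer encoders and multi-head attention with learned queries; the function name, dot-product scoring, and plain top-k cutoff here are simplifying assumptions, not the thesis's exact formulation), the pruning step might look like:

```python
import math

def importance_prune(features, query, k):
    """Score each graph element by dot-product attention against a
    learned query vector, then keep the indices of the top-k.

    features: list of element feature vectors (semantic + visual cues
              already fused upstream); query: a learned query vector;
              k: number of nodes/edges to retain.
    """
    # Raw attention logits: query . feature for every element.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    # Numerically stable softmax turns logits into importance weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Rank by importance and keep the top-k, in original graph order.
    ranked = sorted(range(len(features)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])
```

For example, with `query = [1.0, 0.0]` and features `[[5.0, 0.0], [0.1, 0.0], [3.0, 0.0]]`, `importance_prune(..., k=2)` keeps elements 0 and 2 and discards the low-scoring clutter element 1.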
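The Edge-Aware GATv2 layer feeds edge features into both the attention score and the messages exchanged between nodes. A minimal single-head sketch in plain Python (kept dependency-free for brevity; the weight names `W_src`, `W_dst`, `W_edge` and the exact placement of the edge term are assumptions about the architecture, not its definitive implementation):

```python
import math
import random

random.seed(0)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vadd(*vs):
    return [sum(t) for t in zip(*vs)]

def leaky_relu(v, slope=0.2):
    return [x if x > 0 else slope * x for x in v]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class EdgeAwareGATv2Layer:
    """One attention head; edge features enter both the attention
    logits and the messages, as the abstract describes."""

    def __init__(self, node_dim, edge_dim, out_dim):
        self.W_src = rand_matrix(out_dim, node_dim)   # transforms the target node
        self.W_dst = rand_matrix(out_dim, node_dim)   # transforms each neighbour
        self.W_edge = rand_matrix(out_dim, edge_dim)  # transforms the edge feature
        self.a = [random.uniform(-0.5, 0.5) for _ in range(out_dim)]

    def forward(self, h, edge_feats, neighbours):
        """h: node features; edge_feats[(i, j)]: feature of edge i->j;
        neighbours[i]: list of j with an edge i->j (self-loops included)."""
        out = []
        for i, h_i in enumerate(h):
            scores, msgs = [], []
            for j in neighbours[i]:
                z = vadd(matvec(self.W_src, h_i),
                         matvec(self.W_dst, h[j]),
                         matvec(self.W_edge, edge_feats[(i, j)]))
                # GATv2-style scoring: nonlinearity *before* the dot product.
                scores.append(sum(a * s for a, s in zip(self.a, leaky_relu(z))))
                # The edge feature also contributes to the message itself.
                msgs.append(vadd(matvec(self.W_dst, h[j]),
                                 matvec(self.W_edge, edge_feats[(i, j)])))
            # Softmax over the neighbourhood, then attention-weighted sum.
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            total = sum(exps)
            alphas = [e / total for e in exps]
            out.append([sum(a * mv[d] for a, mv in zip(alphas, msgs))
                        for d in range(len(self.a))])
        return out
```

In the actual model the layer would use multiple heads with parameters trained end-to-end against the caption-similarity objective; this sketch only shows where the edge term enters relative to a standard GATv2 head.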
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19721
Appears in Collections: Diploma Theses (Διπλωματικές Εργασίες)

Files in This Item:
File: Georgoulopoulos_Diploma_Thesis_updated.pdf (24.77 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.