Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869
Full metadata record
DC Field | Value | Language
dc.contributor.author | Ξηρος, Νικολας | -
dc.date.accessioned | 2025-10-31T13:09:00Z | -
dc.date.available | 2025-10-31T13:09:00Z | -
dc.date.issued | 2025-10-24 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869 | -
dc.description.abstract | Image encoders transform visual inputs into feature representations within a learned embedding space. Broadly, they fall into two main categories: unimodal self-supervised encoders, such as DINOv2, which achieve strong visual understanding and discrimination, and multimodal encoders, such as CLIP, which learn representations through a contrastive objective aligning images with language. Despite their success, both paradigms exhibit notable limitations. DINOv2 constructs a highly consistent and discriminative visual space but lacks linguistic grounding, while CLIP achieves cross-modal alignment yet suffers from weak intra-modal consistency, often reflected in distortions of its geometric structure. To address these issues, we propose two training approaches designed to constrain CLIP’s image encoder to follow the intra-modal structure of DINO. We evaluate the effectiveness of these methods across multiple dimensions, demonstrating notable improvements both in cross-modal retrieval and in shaping geometrically richer visual spaces. Specifically, in cross-modal retrieval, the DINO-Soft Targets method yields a 3% improvement in Text R@1 and a 2.8% improvement in Image R@1 compared to CLIP. Furthermore, analyzing the geometric spaces produced by our methodology, we observe a significantly lower proportion of CLIP-blind pairs relative to DINOv2, as well as similarity distributions within the visual space that are enriched by DINOv2’s geometry. Another interesting finding of our study is that, in certain cases, a smaller average distance between modalities in the unified space (the Modality Gap) does not necessarily imply better internal alignment. Additionally, we experimentally investigate the mechanisms that make this integration effective, such as the role of symmetric terms influencing both the visual and textual representations and the use of a lightweight projection of CLIP features into an intermediate space before applying the DINO-Soft loss, offering insights into how visual and multimodal representations can be effectively unified. Overall, our findings highlight a promising direction toward bridging the gap between unimodal and multimodal learning frameworks. | en_US
dc.language | en | en_US
dc.subject | Image Encoder | en_US
dc.subject | Multimodal Learning | en_US
dc.subject | Contrastive Learning | en_US
dc.subject | Language Alignment | en_US
dc.subject | Self-Supervised Learning | en_US
dc.title | Refining CLIP Visual and Language Representations via Self-Supervised Visual Structure | en_US
dc.description.pages | 102 | en_US
dc.contributor.supervisor | Ποταμιάνος Αλέξανδρος | en_US
dc.department | Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics) | en_US
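
The abstract above describes a DINO-Soft Targets objective in which CLIP image features are passed through a lightweight projection into an intermediate space and trained to match the intra-modal similarity structure of frozen DINOv2 features. A minimal PyTorch-style sketch of that idea follows; the record does not specify the exact formulation, so the function name, temperatures, projection head, and loss form (soft cross-entropy over pairwise similarity distributions) are assumptions for illustration only, not the thesis implementation.

# Hypothetical sketch of a "DINO-Soft Targets" auxiliary loss; names and values are assumed.
import torch
import torch.nn.functional as F

def dino_soft_targets_loss(clip_img_emb: torch.Tensor,
                           dino_img_emb: torch.Tensor,
                           proj: torch.nn.Module,
                           tau_student: float = 0.1,
                           tau_teacher: float = 0.05) -> torch.Tensor:
    """Push projected CLIP image features to reproduce the intra-modal
    similarity structure of frozen DINOv2 features via softened targets."""
    z = F.normalize(proj(clip_img_emb), dim=-1)        # student: CLIP features after a light projection
    with torch.no_grad():
        t = F.normalize(dino_img_emb, dim=-1)          # teacher: frozen DINOv2 features

    student_logits = z @ z.T / tau_student             # pairwise cosine similarities, temperature-scaled
    teacher_logits = t @ t.T / tau_teacher

    # Exclude self-similarity so each image's soft target comes from the rest of the batch.
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    student_logits = student_logits.masked_fill(eye, -1e4)
    teacher_logits = teacher_logits.masked_fill(eye, -1e4)

    teacher_probs = teacher_logits.softmax(dim=-1)     # soft targets carrying DINOv2 geometry
    student_log_probs = student_logits.log_softmax(dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    # Random tensors stand in for real encoder outputs; dimensions are illustrative.
    batch, d_clip, d_dino, d_proj = 8, 512, 768, 256
    proj = torch.nn.Linear(d_clip, d_proj)             # lightweight intermediate projection
    loss = dino_soft_targets_loss(torch.randn(batch, d_clip), torch.randn(batch, d_dino), proj)
    print(f"DINO-Soft auxiliary loss: {loss.item():.4f}")

In such a setup the sketch's loss would typically be added to the standard CLIP image-text contrastive objective with a weighting coefficient, which is one plausible reading of the "symmetric terms" mentioned in the abstract.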
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Size | Format
Thesis_Final_2.pdf | 6.36 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.