Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Ξηρος, Νικολας | - |
| dc.date.accessioned | 2025-10-31T13:09:00Z | - |
| dc.date.available | 2025-10-31T13:09:00Z | - |
| dc.date.issued | 2025-10-24 | - |
| dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869 | - |
| dc.description.abstract | Image encoders transform visual inputs into feature representations within a learned embedding space. Broadly, they fall into two main categories: unimodal self-supervised encoders, such as DINOv2, which achieve strong visual understanding and discrimination, and multimodal encoders, such as CLIP, which learn representations through a contrastive objective aligning images with language. Despite their success, both paradigms exhibit notable limitations. DINOv2 constructs a highly consistent and discriminative visual space but lacks linguistic grounding, while CLIP achieves cross-modal alignment yet suffers from weak intra-modal consistency, often reflected in distortions within its geometric structure. To address these issues, we propose two training approaches designed to enforce the intra-modal structure of DINO on CLIP’s image encoder. We evaluate the effectiveness of these methods across multiple dimensions, demonstrating notable improvements both in cross-modal retrieval and in shaping geometrically richer visual spaces. Specifically, in cross-modal retrieval, the DINO-Soft Targets method shows a 3% improvement in Text R@1 and a 2.8% improvement in Image R@1 compared to CLIP. Furthermore, analyzing the geometric spaces resulting from our methodology, we observe a significantly lower proportion of CLIP-blind pairs relative to DINOv2, as well as similarity distributions within the visual space that are enriched by DINOv2 geometry. Another interesting finding of our study is that, in certain cases, a smaller average distance between modalities in the unified space (Modality Gap) does not necessarily imply better internal alignment. Additionally, we experimentally investigate the mechanisms that make this integration effective, such as the role of symmetric terms influencing both the visual and textual representations and the use of a lightweight projection of CLIP features into an intermediate space before applying the DINO-Soft loss, offering insights into how visual and multimodal representations can be effectively unified. Overall, our findings highlight a promising direction toward bridging the gap between unimodal and multimodal learning frameworks. | en_US |
| dc.language | en | en_US |
| dc.subject | Image Encoder | en_US |
| dc.subject | Multimodal Learning | en_US |
| dc.subject | Contrastive Learning | en_US |
| dc.subject | Language Alignment | en_US |
| dc.subject | Self-Supervised Learning | en_US |
| dc.title | Refining CLIP Visual and Language Representations via Self-Supervised Visual Structure | en_US |
| dc.description.pages | 102 | en_US |
| dc.contributor.supervisor | Ποταμιάνος Αλέξανδρος | en_US |
| dc.department | Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics) | en_US |
| Appears in Collections: | Διπλωματικές Εργασίες - Theses | |
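
The abstract above refers to a DINO-Soft Targets objective that pushes CLIP's image encoder toward DINOv2's intra-modal structure, and to symmetric terms acting on both the visual and the textual representations. The record does not give the thesis's exact formulation; the snippet below is only a minimal PyTorch sketch of one common way such an idea can be realised, replacing CLIP's one-hot InfoNCE targets with soft targets derived from DINOv2 image-image similarities. The function name, temperatures, and target construction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def dino_soft_targets_loss(clip_img, clip_txt, dino_img,
                           tau_clip=0.07, tau_dino=0.1):
    """Contrastive loss with soft targets taken from DINOv2 similarities.

    clip_img, clip_txt: (B, d) CLIP image/text embeddings (trainable).
    dino_img:           (B, d') frozen DINOv2 embeddings of the same images.
    Names, temperatures, and the target construction are assumptions for
    illustration, not the thesis's exact formulation.
    """
    clip_img = F.normalize(clip_img, dim=-1)
    clip_txt = F.normalize(clip_txt, dim=-1)
    dino_img = F.normalize(dino_img, dim=-1)

    # In-batch image-to-text and text-to-image logits, as in CLIP.
    logits_i2t = clip_img @ clip_txt.t() / tau_clip
    logits_t2i = logits_i2t.t()

    # Soft targets: row-wise softmax over DINOv2 image-image similarities,
    # instead of CLIP's one-hot identity targets. The similarity matrix is
    # symmetric, so the same rows serve both retrieval directions.
    targets = F.softmax(dino_img @ dino_img.t() / tau_dino, dim=-1)

    # Symmetric cross-entropy, so both the visual and the textual
    # representations receive the DINO-shaped signal.
    loss_i2t = -(targets * F.log_softmax(logits_i2t, dim=-1)).sum(-1).mean()
    loss_t2i = -(targets * F.log_softmax(logits_t2i, dim=-1)).sum(-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```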
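The abstract also uses the Modality Gap and CLIP-blind pairs as geometric diagnostics. As a rough illustration, assuming the commonly used definitions (the gap as the distance between modality centroids; a CLIP-blind pair as two images with high CLIP similarity but low DINOv2 similarity), these quantities could be measured as sketched below; the thresholds and function names are hypothetical.

```python
import torch
import torch.nn.functional as F


def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of normalized image and text embeddings.

    One common definition of the modality gap; the abstract notes that a
    smaller gap does not necessarily imply better internal alignment.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return (img_emb.mean(dim=0) - txt_emb.mean(dim=0)).norm().item()


def count_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Count image pairs CLIP embeds as near-duplicates but DINOv2 separates.

    The criterion and the thresholds are illustrative assumptions.
    """
    clip_sim = F.normalize(clip_emb, dim=-1) @ F.normalize(clip_emb, dim=-1).t()
    dino_sim = F.normalize(dino_emb, dim=-1) @ F.normalize(dino_emb, dim=-1).t()
    # Consider each unordered pair once: upper triangle, excluding the diagonal.
    upper = torch.triu(torch.ones_like(clip_sim), diagonal=1).bool()
    blind = (clip_sim > clip_thresh) & (dino_sim < dino_thresh) & upper
    return int(blind.sum())
```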
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Thesis_Final_2.pdf | | 6.36 MB | Adobe PDF | View/Open |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.