Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869
Full metadata record
DC Field | Value | Language
dc.contributor.author | Ξηρος, Νικολας | -
dc.date.accessioned | 2025-10-31T13:09:00Z | -
dc.date.available | 2025-10-31T13:09:00Z | -
dc.date.issued | 2025-10-24 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869 | -
dc.description.abstract | Image encoders transform visual inputs into feature representations within a learned embedding space. Broadly, they fall into two main categories: unimodal self-supervised encoders, such as DINOv2, which achieve strong visual understanding and discrimination, and multimodal encoders, such as CLIP, which learn representations through a contrastive objective aligning images with language. Despite their success, both paradigms exhibit notable limitations. DINOv2 constructs a highly consistent and discriminative visual space but lacks linguistic grounding, while CLIP achieves cross-modal alignment yet suffers from weak intra-modal consistency, often reflected in distortions of its geometric structure. To address these issues, we propose two training approaches designed to constrain CLIP’s image encoder to follow the intra-modal structure of DINO. We evaluate the effectiveness of these methods across multiple dimensions, demonstrating notable improvements both in cross-modal retrieval and in shaping geometrically richer visual spaces. Specifically, in cross-modal retrieval, the DINO-Soft Targets method yields a 3% improvement in Text R@1 and a 2.8% improvement in Image R@1 compared to CLIP. Furthermore, analyzing the geometric spaces produced by our methodology, we observe a significantly lower proportion of CLIP-blind pairs relative to DINOv2, as well as similarity distributions within the visual space that are enriched by DINOv2’s geometry. Another interesting finding of our study is that, in certain cases, a smaller average distance between modalities in the unified space (the Modality Gap) does not necessarily imply better internal alignment. Additionally, we experimentally investigate the mechanisms that make this integration effective, such as the role of symmetric terms influencing both the visual and textual representations and the use of a lightweight projection of CLIP features into an intermediate space before applying the DINO-Soft loss, offering insights into how visual and multimodal representations can be effectively unified. Overall, our findings highlight a promising direction toward bridging the gap between unimodal and multimodal learning frameworks. | en_US
dc.language | en | en_US
dc.subject | Image Encoder | en_US
dc.subject | Multimodal Learning | en_US
dc.subject | Contrastive Learning | en_US
dc.subject | Language Alignment | en_US
dc.subject | Self-Supervised Learning | en_US
dc.title | Refining CLIP Visual and Language Representations via Self-Supervised Visual Structure | en_US
dc.description.pages | 102 | en_US
dc.contributor.supervisor | Ποταμιάνος Αλέξανδρος | en_US
dc.department | Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics) | en_US
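
The abstract above describes a DINO-Soft Targets objective in which CLIP image features are passed through a lightweight projection into an intermediate space and trained to match the intra-modal similarity structure of frozen DINOv2 features. A minimal PyTorch-style sketch of that idea follows; the record does not specify the exact formulation, so the function name, temperatures, projection head, and loss form (soft cross-entropy over pairwise similarity distributions) are assumptions for illustration only, not the thesis implementation.

# Hypothetical sketch of a "DINO-Soft Targets" auxiliary loss; names and values are assumed.
import torch
import torch.nn.functional as F

def dino_soft_targets_loss(clip_img_emb: torch.Tensor,
                           dino_img_emb: torch.Tensor,
                           proj: torch.nn.Module,
                           tau_student: float = 0.1,
                           tau_teacher: float = 0.05) -> torch.Tensor:
    """Push projected CLIP image features to reproduce the intra-modal
    similarity structure of frozen DINOv2 features via softened targets."""
    z = F.normalize(proj(clip_img_emb), dim=-1)        # student: CLIP features after a light projection
    with torch.no_grad():
        t = F.normalize(dino_img_emb, dim=-1)          # teacher: frozen DINOv2 features

    student_logits = z @ z.T / tau_student             # pairwise cosine similarities, temperature-scaled
    teacher_logits = t @ t.T / tau_teacher

    # Exclude self-similarity so each image's soft target comes from the rest of the batch.
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    student_logits = student_logits.masked_fill(eye, -1e4)
    teacher_logits = teacher_logits.masked_fill(eye, -1e4)

    teacher_probs = teacher_logits.softmax(dim=-1)     # soft targets carrying DINOv2 geometry
    student_log_probs = student_logits.log_softmax(dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    # Random tensors stand in for real encoder outputs; dimensions are illustrative.
    batch, d_clip, d_dino, d_proj = 8, 512, 768, 256
    proj = torch.nn.Linear(d_clip, d_proj)             # lightweight intermediate projection
    loss = dino_soft_targets_loss(torch.randn(batch, d_clip), torch.randn(batch, d_dino), proj)
    print(f"DINO-Soft auxiliary loss: {loss.item():.4f}")

In such a setup the sketch's loss would typically be added to the standard CLIP image-text contrastive objective with a weighting coefficient, which is one plausible reading of the "symmetric terms" mentioned in the abstract.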
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Size | Format
Thesis_Final_2.pdf | 6.36 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.