Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869
Title: Refining CLIP Visual and Language Representations via Self-Supervised Visual Structure
Authors: Ξηρος, Νικολας
Ποταμιάνος Αλέξανδρος
Keywords: Image Encoder
Multimodal Learning
Contrastive Learning
Language Alignment
Self-Supervised Learning
Issue Date: 24-Oct-2025
Abstract: Image encoders transform visual inputs into feature representations within a learned embedding space. Broadly, they fall into two main categories: unimodal self-supervised encoders, such as DINOv2, which achieve strong visual understanding and discrimination, and multimodal encoders, such as CLIP, which learn representations through a contrastive objective aligning images with language. Despite their success, both paradigms exhibit notable limitations. DINOv2 constructs a highly consistent and discriminative visual space but lacks linguistic grounding, while CLIP achieves cross-modal alignment yet suffers from weak intra-modal consistency, often reflected in distortions within its geometric structure. To address these issues, we propose two training approaches designed to make CLIP’s image encoder follow the intra-modal structure of DINO. We evaluate the effectiveness of these methods across multiple dimensions, demonstrating notable improvements both in cross-modal retrieval and in shaping geometrically richer visual spaces. Specifically, in cross-modal retrieval, the DINO-Soft Targets method yields a 3% improvement in Text R@1 and a 2.8% improvement in Image R@1 over CLIP. Furthermore, analyzing the geometric spaces produced by our methods, we observe a significantly lower proportion of CLIP-blind pairs relative to DINOv2, as well as similarity distributions within the visual space that are enriched by DINOv2’s geometry. Another interesting finding of our study is that, in certain cases, a smaller average distance between modalities in the unified space (the Modality Gap) does not necessarily imply better internal alignment. Additionally, we experimentally investigate the mechanisms that make this integration effective, such as the role of symmetric loss terms that influence both the visual and textual representations and the use of a lightweight projection of CLIP features into an intermediate space before applying the DINO-Soft loss, offering insights into how visual and multimodal representations can be effectively unified. Overall, our findings highlight a promising direction toward bridging the gap between unimodal and multimodal learning frameworks.
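The abstract describes the DINO-Soft Targets objective only at a high level. As a rough illustration of the idea, the sketch below combines the standard CLIP contrastive loss with a soft-target term that pushes CLIP image-image similarity distributions toward those of a frozen DINOv2 encoder, routed through a lightweight projection head. The function names, temperatures, projection head, and loss weight are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch (assumed formulation, not the thesis's exact method):
# distill DINOv2's intra-modal similarity structure into CLIP's image
# embeddings via soft targets, alongside the standard CLIP contrastive loss.
import torch
import torch.nn.functional as F

def dino_soft_targets_loss(clip_img, dino_img, proj, tau_clip=0.07, tau_dino=0.07):
    """KL divergence between CLIP and DINOv2 image-image similarity distributions.

    clip_img: (B, Dc) CLIP image features for the batch
    dino_img: (B, Dd) frozen DINOv2 features for the same images
    proj:     lightweight projection of CLIP features into an intermediate
              space before the soft-target loss (assumed component)
    """
    z = F.normalize(proj(clip_img), dim=-1)   # projected CLIP image features
    d = F.normalize(dino_img, dim=-1)         # frozen DINO teacher features

    sim_clip = z @ z.t() / tau_clip           # student similarity logits
    sim_dino = d @ d.t() / tau_dino           # teacher similarity logits

    # Soft targets: each image's similarity distribution over the batch,
    # excluding the trivial self-similarity on the diagonal.
    b = z.size(0)
    mask = ~torch.eye(b, dtype=torch.bool, device=z.device)
    log_p = F.log_softmax(sim_clip[mask].view(b, -1), dim=-1)
    q = F.softmax(sim_dino[mask].view(b, -1), dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

def clip_contrastive_loss(img, txt, tau=0.07):
    """Standard symmetric InfoNCE over matched image-text pairs."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Combined objective (the weight lambda_dino is an assumption):
# loss = clip_contrastive_loss(img_feat, txt_feat) \
#        + lambda_dino * dino_soft_targets_loss(img_feat, dino_feat, proj)
```

Under this reading, the contrastive term preserves cross-modal alignment while the soft-target term reshapes the intra-modal geometry of the image space toward DINOv2's; a symmetric variant could apply an analogous term to the text side, as the abstract's discussion of symmetric loss terms suggests.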
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869
Appears in Collections:Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Thesis_Final_2.pdf (6.36 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.