Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869
| Title: | Refining CLIP Visual and Language Representations via Self-Supervised Visual Structure |
| Authors: | Ξηρος, Νικολας; Ποταμιάνος, Αλέξανδρος |
| Keywords: | Image Encoder; Multimodal Learning; Contrastive Learning; Language Alignment; Self-Supervised Learning |
| Issue Date: | 24-Oct-2025 |
| Abstract: | Image encoders transform visual inputs into feature representations within a learned embedding space. Broadly, they fall into two main categories: unimodal self-supervised encoders, such as DINOv2, which achieve strong visual understanding and discrimination, and multimodal encoders, such as CLIP, which learn representations through a contrastive objective aligning images with language. Despite their success, both paradigms exhibit notable limitations. DINOv2 constructs a highly consistent and discriminative visual space but lacks linguistic grounding, while CLIP achieves cross-modal alignment yet suffers from weak intra-modal consistency, often reflected in distortions of its geometric structure. To address these issues, we propose two training approaches that constrain CLIP's image encoder to follow the intra-modal structure of DINO. We evaluate the effectiveness of these methods across multiple dimensions, demonstrating notable improvements both in cross-modal retrieval and in shaping geometrically richer visual spaces. Specifically, in cross-modal retrieval, the DINO-Soft Targets method yields a 3% improvement in Text R@1 and a 2.8% improvement in Image R@1 over CLIP. Furthermore, analyzing the geometric spaces produced by our methodology, we observe a significantly lower proportion of CLIP-blind pairs relative to DINOv2, as well as similarity distributions within the visual space enriched by DINOv2's geometry. Another interesting finding of our study is that, in certain cases, a smaller average distance between modalities in the unified space (Modality Gap) does not necessarily imply better internal alignment. Additionally, we experimentally investigate the mechanisms that make this integration effective, such as the role of symmetric terms influencing both the visual and textual representations and the use of a lightweight projection of CLIP features into an intermediate space before applying the DINO-Soft loss (see the sketch below), offering insights into how visual and multimodal representations can be effectively unified. Overall, our findings highlight a promising direction toward bridging the gap between unimodal and multimodal learning frameworks. |
| URI: | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19869 |
| Appears in Collections: | Διπλωματικές Εργασίες - Theses |
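The DINO-Soft Targets idea named in the abstract can be illustrated with a minimal sketch: the standard symmetric CLIP contrastive objective is combined with a soft-target term that pulls CLIP's image-image similarity distribution toward DINOv2's, after a lightweight projection into an intermediate space. This is one plausible reading of the abstract, not the thesis implementation; the function and parameter names (`dino_soft_targets_loss`, `proj`, `tau_clip`, `tau_dino`, `alpha`) are illustrative assumptions.

```python
# Hypothetical sketch of a DINO-Soft-style objective (assumed form, not the
# thesis code): CLIP's InfoNCE loss plus a KL term that distills DINOv2's
# intra-modal similarity structure into projected CLIP image features.
import torch
import torch.nn.functional as F

def dino_soft_targets_loss(clip_img, clip_txt, dino_img, proj,
                           tau_clip=0.07, tau_dino=0.07, alpha=0.5):
    """clip_img, clip_txt: (B, D) CLIP image/text embeddings.
    dino_img:              (B, D') frozen DINOv2 image embeddings (teacher).
    proj:                  lightweight projection head (e.g. nn.Linear) mapping
                           CLIP features into an intermediate space."""
    clip_img = F.normalize(clip_img, dim=-1)
    clip_txt = F.normalize(clip_txt, dim=-1)
    dino_img = F.normalize(dino_img, dim=-1)

    # Standard symmetric CLIP contrastive loss over image-text pairs.
    logits = clip_img @ clip_txt.t() / tau_clip
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_clip = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # Soft targets: DINOv2 image-image similarity distribution.
    targets = F.softmax(dino_img @ dino_img.t() / tau_dino, dim=-1)

    # Student: CLIP image features projected into an intermediate space.
    z = F.normalize(proj(clip_img), dim=-1)
    student = F.log_softmax(z @ z.t() / tau_clip, dim=-1)

    # KL divergence reshapes CLIP's intra-modal geometry toward DINOv2's.
    loss_soft = F.kl_div(student, targets, reduction="batchmean")

    return loss_clip + alpha * loss_soft
```

Under this reading, the symmetric contrastive terms preserve image-text alignment while the soft-target term only reshapes the visual geometry, which is consistent with the abstract's discussion of symmetric terms and the intermediate projection.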
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Thesis_Final_2.pdf | | 6.36 MB | Adobe PDF | View/Open |
All items on this site are protected by copyright.