Cross-modal Flow Matching for Domain Generalization

Κρητικός, Αντώνιος

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19929

Title:	Cross-modal Flow Matching for Domain Generalization
Authors:	Κρητικός, Αντώνιος; Βουλόδημος, Αθανάσιος
Keywords:	machine learning; computer vision; domain generalization; contrastive learning; modality gap; flow matching; optimal transport
Issue Date:	11-Nov-2025
Abstract:	Domain generalization (DG) requires models to perform robustly under domain shift, which in computer vision mainly takes the form of stylistic variation between samples. Despite intensive research, DG remains a major challenge, as models often overfit to domain-specific appearance cues and fail to capture class semantics. Many efforts have therefore explored the use of natural language, owing to its inherent domain invariance. However, current multimodal approaches, specifically those relying on cosine similarity for cross-modal contrastive alignment in a joint embedding space, suffer from the modality gap, a phenomenon in which image and text embeddings occupy separate regions of the space despite being semantically aligned. In this thesis, we address this residual gap by applying flow matching to learn a continuous transformation between unnormalized image and text embeddings of the same class in the joint Euclidean latent space. Unlike most prior work, which uses simple source distributions, we train a vector field that explicitly flows potentially domain-biased image embeddings to domain-invariant text embeddings. The resulting framework, CrossFlowDG, is tested with the efficient VMamba image encoder, which achieves linear complexity in contrast to the widely used quadratic-complexity transformer backbones. CrossFlowDG establishes state-of-the-art classification accuracy among similar methods across several challenging domains from relevant benchmarks. To further improve inference efficiency, and therefore deployability on edge devices with restricted compute, we propose an optimal transport-informed variant called OT-CrossFlowDG. This second framework incorporates optimal transport alignment to minimize the 2-Wasserstein distance between the modality distributions of each class in the Euclidean space. By running the Sinkhorn-Knopp algorithm to obtain approximate optimal couplings between image and text embeddings of the same class, and using the resulting barycentric targets as refined flow supervision, OT-CrossFlowDG reaches its peak performance in only one inference step, highlighting its efficiency in computationally constrained deployment scenarios.
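The pipeline the abstract describes (Sinkhorn couplings, barycentric targets, flow-matching supervision, one-step inference) can be illustrated with a minimal sketch on toy data. This is not the thesis implementation: every function name below is hypothetical, and the uniform marginals, squared Euclidean cost, cost normalization, value of `eps`, linear interpolation path, and the oracle standing in for a trained vector field are all assumptions made for illustration.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance, the ground cost behind the 2-Wasserstein distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between uniform marginals (assumed); returns the coupling."""
    n, m = len(cost), len(cost[0])
    # Assumption of this sketch: rescale costs to [0, 1] so exp() cannot underflow.
    scale = max(max(row) for row in cost) or 1.0
    K = [[math.exp(-c / (scale * eps)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_targets(plan, targets):
    """Send each source point to the coupling-weighted mean of the target points."""
    dim = len(targets[0])
    out = []
    for row in plan:
        mass = sum(row)
        out.append([sum(w * y[d] for w, y in zip(row, targets)) / mass
                    for d in range(dim)])
    return out

def flow_matching_pair(x0, x1, t):
    """Linear path (assumed): the point x_t and the velocity a model would regress to."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def fm_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)

def euler_one_step(x0, velocity_fn):
    """A single Euler step of size 1, from t = 0 to t = 1."""
    return [a + b for a, b in zip(x0, velocity_fn(x0, 0.0))]

# Toy embeddings for one class: 5 "image" points and 3 "text" points in 4 dimensions.
random.seed(0)
image_emb = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
text_emb = [[random.gauss(3.0, 0.1) for _ in range(4)] for _ in range(3)]

cost = [[sq_dist(x, y) for y in text_emb] for x in image_emb]
plan = sinkhorn(cost)
targets = barycentric_targets(plan, text_emb)

# Training-time supervision for the first image embedding at t = 0.5.
x_half, v_target = flow_matching_pair(image_emb[0], targets[0], 0.5)

# Inference with an oracle that returns the exact straight-line velocity,
# standing in for a trained vector field.
one_step = euler_one_step(image_emb[0], lambda x, t: v_target)
```

With straight-line targets such as these, an exact velocity field reaches the target in one Euler step, which gives some intuition for why single-step inference is plausible; the sketch says nothing about how closely a learned field approximates that ideal.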
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19929
Appears in Collections:	Διπλωματικές Εργασίες - Theses

Files in This Item:

File	Description	Size	Format
Antonios_Kritikos_Diploma_Thesis.pdf		901.18 kB	Adobe PDF