Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200
Title: Enhancing Contrastive Language-Vision Pre-training with Generative Dialogue
Authors: Τσαπραζλής, Ευθύμιος
Μαραγκός, Πέτρος
Keywords: Contrastive Learning
Self-supervised Learning
Multimodal Learning
Deep Learning
Neural Networks
Machine Learning
Issue Date: 18-Jul-2024
Abstract: Image-text models have become essential in machine learning, leading to the development of several state-of-the-art architectures such as CLIP, DALL-E, and Stable Diffusion. These foundational models can be applied to a wide range of tasks, often different from their original training objectives. This versatility arises from their ability to use each modality to infer knowledge about the other, allowing them to operate without task-specific data. However, these architectures are also very expensive to train, typically requiring millions of image-text pairs, so in practice they are usually fine-tuned rather than trained from scratch. They can also be used directly for inference, serving as the foundation on which more complex pipelines are built. Moreover, multimodal generative models have seen rapid adoption, offering exceptional image-captioning capabilities: they can generate meaningful descriptions and high-quality responses about visual concepts. We believe that a dialogue between a user and a generative model about a given image can offer an additional perspective on the image-text pair.

We first study the problem of training a third tower for a new modality given a pre-trained CLIP model; this additional component can be used to incorporate further modalities into the model pipeline. In our framework, called CLIP-3Modal, the new modality is a dialogue centered on the image, produced by a generative model such as BLIP-2. We evaluate our model on image and text retrieval and compare it against the standard image-text baseline.

Then, we set aside the third-tower approach and focus on fine-tuning the original CLIP to adapt to question-answer style textual inputs. We introduce DRAFT (Dual Representation Adaptive Fine-Tuning), a method based on contrastive learning and distribution alignment, designed to adapt CLIP-like models to out-of-distribution textual descriptions such as dialogue. We conducted extensive experiments and ablation studies to demonstrate the advantages of our method over the baseline model. Our experiments primarily focused on Visual Question Answering tasks, where our method significantly improved CLIP's performance.
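The third-tower idea can be summarized with a short sketch. The Python snippet below is illustrative only and is not the thesis implementation: the DialogueTower class, its dimensions, and the frozen CLIP embeddings it is trained against are assumptions chosen to show how a new dialogue encoder could be aligned to a pre-trained CLIP embedding space with a symmetric contrastive (InfoNCE) loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DialogueTower(nn.Module):
    # Hypothetical third encoder that maps dialogue text (e.g., produced by BLIP-2)
    # into the embedding space of a frozen, pre-trained CLIP model.
    def __init__(self, vocab_size=49408, width=512, embed_dim=512, layers=4, heads=8):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, width)
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.proj = nn.Linear(width, embed_dim)

    def forward(self, dialogue_tokens):
        x = self.encoder(self.token_embed(dialogue_tokens))
        return F.normalize(self.proj(x.mean(dim=1)), dim=-1)  # pooled, unit-norm embedding

def info_nce(a, b, temperature=0.07):
    # Symmetric contrastive loss between two batches of unit-norm embeddings.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def third_tower_loss(tower, dialogue_tokens, image_emb, text_emb):
    # Only the dialogue tower is trained; image_emb and text_emb come from the frozen CLIP encoders.
    d = tower(dialogue_tokens)
    return info_nce(d, image_emb) + info_nce(d, text_emb)

Because the original CLIP towers stay frozen in this sketch, the base model's image-text retrieval behaviour is preserved while the new tower learns to place dialogues near the matching images and captions.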
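DRAFT combines a contrastive term with a distribution-alignment term. The exact formulation from the thesis is not reproduced here; the snippet below is a hedged sketch in which the alignment term is assumed, for illustration, to be a KL divergence that keeps the fine-tuned model's image-dialogue similarity distribution close to that of the frozen pre-trained CLIP.

import torch
import torch.nn.functional as F

def draft_style_loss(image_emb, dialogue_emb, frozen_image_emb, frozen_dialogue_emb,
                     temperature=0.07, alignment_weight=1.0):
    # All embeddings are assumed L2-normalized with shape (batch, dim).
    # 1) Contrastive (InfoNCE) term on image-dialogue pairs from the fine-tuned model.
    logits = image_emb @ dialogue_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # 2) Distribution alignment (assumed form): match the fine-tuned similarity
    #    distribution to the one produced by the frozen pre-trained model
    #    via a row-wise KL divergence.
    log_p_new = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        frozen_logits = frozen_image_emb @ frozen_dialogue_emb.t() / temperature
        p_old = F.softmax(frozen_logits, dim=-1)
    alignment = F.kl_div(log_p_new, p_old, reduction="batchmean")

    return contrastive + alignment_weight * alignment

The weighting between the two terms and the choice of which encoder parameters to unfreeze would be hyperparameters of such a fine-tuning run.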
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: EfthymiosTsaprazlis_DiplomaThesis2024.pdf
Size: 8.16 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.