Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200
Full metadata record
DC Field                      Value                                                                 Language
dc.contributor.author         Τσαπραζλής, Ευθύμιος                                                  -
dc.date.accessioned           2024-07-19T10:31:02Z                                                  -
dc.date.available             2024-07-19T10:31:02Z                                                  -
dc.date.issued                2024-07-18                                                            -
dc.identifier.uri             http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200    -
dc.description.abstract       Image-text models have become essential in machine learning, leading to the development of several state-of-the-art architectures such as CLIP, DALL-E, and Stable Diffusion. These foundation models can be applied to a wide range of tasks, often different from their original training objectives. This versatility arises from their ability to use each modality to infer knowledge about the other, enabling them to function without task-specific data. However, these architectures are also very expensive to train, typically requiring millions of image-text pairs. Consequently, such large foundation models are usually fine-tuned rather than trained from scratch; they can also be used directly for inference, serving as the basis on which more complex pipelines are built. Moreover, multimodal generative models now provide strong image-captioning capabilities, generating meaningful descriptions and high-quality responses about visual concepts. We believe that a dialogue between a user and a generative model about a given image can offer an additional perspective on the image-text pair. We first study the problem of training a third tower for a new modality given a pre-trained CLIP model. This additional component can be used to incorporate other modalities into the model pipeline. In our framework, called CLIP-3Modal, we use a model such as BLIP-2 to provide a dialogue centered around the image. We evaluate our model on image and text retrieval and compare it against the standard image-text model. Then, we abandon the third-tower approach and focus on fine-tuning the original CLIP to adapt it to question-answer-style textual inputs. We introduce DRAFT (Dual Representation Adaptive Fine-Tuning), a method based on contrastive learning and distribution alignment, designed to adapt CLIP-like models to out-of-distribution textual descriptions such as dialogue. We conduct extensive experiments and ablation studies to demonstrate the advantages of our method over the baseline model. Our experiments focus primarily on Visual Question Answering tasks, where our method significantly improves CLIP's performance.    en_US
dc.language                   en                                                                    en_US
dc.subject                    Contrastive Learning                                                  en_US
dc.subject                    Self-supervised Learning                                              en_US
dc.subject                    Multimodal Learning                                                   en_US
dc.subject                    Deep Learning                                                         en_US
dc.subject                    Neural Networks                                                       en_US
dc.subject                    Machine Learning                                                      en_US
dc.title                      Enhancing Contrastive Language-Vision Pre-training with Generative Dialogue    en_US
dc.description.pages          114                                                                   en_US
dc.contributor.supervisor     Μαραγκός Πέτρος                                                       en_US
dc.department                 Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics)    en_US
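
Illustrative note: the abstract above names two technical components without giving their formulations, namely a third tower (CLIP-3Modal) trained contrastively alongside a pre-trained CLIP, and DRAFT, which combines contrastive learning with a distribution-alignment term to adapt CLIP-like models to dialogue-style text. The PyTorch sketch below is a rough illustration only; the function names, the pairwise-sum three-tower objective, and the MSE alignment penalty are assumptions made for illustration and are not taken from the thesis.

    # Illustrative sketch only; not the thesis's actual implementation.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of matched pairs (CLIP-style)."""
        emb_a = F.normalize(emb_a, dim=-1)
        emb_b = F.normalize(emb_b, dim=-1)
        logits = emb_a @ emb_b.t() / temperature              # (B, B) cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def three_tower_loss(image_emb, text_emb, dialogue_emb):
        """Hypothetical CLIP-3Modal-style objective: sum of pairwise contrastive
        losses, pulling the new dialogue tower into the shared image-text space."""
        return (clip_contrastive_loss(image_emb, text_emb) +
                clip_contrastive_loss(image_emb, dialogue_emb) +
                clip_contrastive_loss(text_emb, dialogue_emb))

    def draft_style_loss(image_emb, dialogue_emb, caption_emb, align_weight=1.0):
        """Hypothetical DRAFT-style objective: a contrastive term on image-dialogue
        pairs plus an alignment penalty keeping dialogue embeddings close to the
        caption embeddings seen in pre-training (MSE is a placeholder for the
        actual alignment criterion)."""
        contrastive = clip_contrastive_loss(image_emb, dialogue_emb)
        alignment = F.mse_loss(F.normalize(dialogue_emb, dim=-1),
                               F.normalize(caption_emb, dim=-1))
        return contrastive + align_weight * alignment

In a real setup these losses would presumably be computed on projected embeddings from the frozen CLIP image and text towers and the new dialogue encoder, with the frozen text tower supplying the caption embeddings used for alignment.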
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File                                         Description    Size       Format
EfthymiosTsaprazlis_DiplomaThesis2024.pdf                   8.16 MB    Adobe PDF

