Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200
Full metadata record
DC Field                      Value                                                                 Language
dc.contributor.author         Τσαπραζλής, Ευθύμιος                                                  -
dc.date.accessioned           2024-07-19T10:31:02Z                                                  -
dc.date.available             2024-07-19T10:31:02Z                                                  -
dc.date.issued                2024-07-18                                                            -
dc.identifier.uri             http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19200    -
dc.description.abstract       Image-text models have become essential in machine learning, leading to the development of several state-of-the-art architectures such as CLIP, DALL-E, and Stable Diffusion. These foundation models can be applied to a wide range of tasks, often different from their original training objectives. This versatility arises from their ability to use each modality to infer knowledge about the other, enabling them to function without task-specific data. However, these architectures are also very expensive to train, typically requiring millions of image-text pairs. Consequently, such large foundation models are usually fine-tuned rather than trained from scratch; they can also be used directly for inference, serving as the basis on which more complex pipelines are built. Moreover, multimodal generative models now provide strong image-captioning capabilities, generating meaningful descriptions and high-quality responses about visual concepts. We believe that a dialogue between a user and a generative model about a given image can offer an additional perspective on the image-text pair. We first study the problem of training a third tower for a new modality given a pre-trained CLIP model. This additional component can be used to incorporate other modalities into the model pipeline. In our framework, called CLIP-3Modal, we use a model such as BLIP-2 to provide a dialogue centered around the image. We evaluate our model on image and text retrieval and compare it against the standard image-text model. Then, we abandon the third-tower approach and focus on fine-tuning the original CLIP to adapt it to question-answer-style textual inputs. We introduce DRAFT (Dual Representation Adaptive Fine-Tuning), a method based on contrastive learning and distribution alignment, designed to adapt CLIP-like models to out-of-distribution textual descriptions such as dialogue. We conduct extensive experiments and ablation studies to demonstrate the advantages of our method over the baseline model. Our experiments focus primarily on Visual Question Answering tasks, where our method significantly improves CLIP's performance.    en_US
dc.language                   en                                                                    en_US
dc.subject                    Contrastive Learning                                                  en_US
dc.subject                    Self-supervised Learning                                              en_US
dc.subject                    Multimodal Learning                                                   en_US
dc.subject                    Deep Learning                                                         en_US
dc.subject                    Neural Networks                                                       en_US
dc.subject                    Machine Learning                                                      en_US
dc.title                      Enhancing Contrastive Language-Vision Pre-training with Generative Dialogue    en_US
dc.description.pages          114                                                                   en_US
dc.contributor.supervisor     Μαραγκός Πέτρος                                                       en_US
dc.department                 Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics)    en_US
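
Illustrative note: the abstract above names two technical components without giving their formulations, namely a third tower (CLIP-3Modal) trained contrastively alongside a pre-trained CLIP, and DRAFT, which combines contrastive learning with a distribution-alignment term to adapt CLIP-like models to dialogue-style text. The PyTorch sketch below is a rough illustration only; the function names, the pairwise-sum three-tower objective, and the MSE alignment penalty are assumptions made for illustration and are not taken from the thesis.

    # Illustrative sketch only; not the thesis's actual implementation.
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of matched pairs (CLIP-style)."""
        emb_a = F.normalize(emb_a, dim=-1)
        emb_b = F.normalize(emb_b, dim=-1)
        logits = emb_a @ emb_b.t() / temperature              # (B, B) cosine similarities
        targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def three_tower_loss(image_emb, text_emb, dialogue_emb):
        """Hypothetical CLIP-3Modal-style objective: sum of pairwise contrastive
        losses, pulling the new dialogue tower into the shared image-text space."""
        return (clip_contrastive_loss(image_emb, text_emb) +
                clip_contrastive_loss(image_emb, dialogue_emb) +
                clip_contrastive_loss(text_emb, dialogue_emb))

    def draft_style_loss(image_emb, dialogue_emb, caption_emb, align_weight=1.0):
        """Hypothetical DRAFT-style objective: a contrastive term on image-dialogue
        pairs plus an alignment penalty keeping dialogue embeddings close to the
        caption embeddings seen in pre-training (MSE is a placeholder for the
        actual alignment criterion)."""
        contrastive = clip_contrastive_loss(image_emb, dialogue_emb)
        alignment = F.mse_loss(F.normalize(dialogue_emb, dim=-1),
                               F.normalize(caption_emb, dim=-1))
        return contrastive + align_weight * alignment

In a real setup these losses would presumably be computed on projected embeddings from the frozen CLIP image and text towers and the new dialogue encoder, with the frozen text tower supplying the caption embeddings used for alignment.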
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File                                         Description    Size       Format
EfthymiosTsaprazlis_DiplomaThesis2024.pdf                   8.16 MB    Adobe PDF

