Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/17308
Title: Grounded Visual Question Answering Using Sequence-to-Sequence Modeling
Authors: Νικάνδρου, Μαρία-Βασιλική
Ποταμιάνος Αλέξανδρος
Keywords: visual question answering
deep learning
multimodal learning
Issue Date: 5-Jul-2019
Abstract: Visual Question Answering (VQA) is a task at the intersection of Natural Language Processing and Computer Vision. Given an image and an open-ended, natural-language question about it, the goal is to predict the correct answer. In recent years, VQA has attracted considerable interest from the research community because it goes beyond the preliminary step of extracting meaningful representations from each modality: it also requires the ability to reason over the inputs and infer their relations. Answering arbitrary visual questions is challenging, as it tests a wide range of skills spanning the linguistic and visual domains. One limitation of typical modern VQA systems is that they treat the task as a classification problem over a limited set of predefined answers. The goal of this Diploma Thesis is to alleviate this limitation by proposing a sequence generation model that can produce answers of arbitrary length. The proposed method is a sequence-to-sequence network conditioned on textual and visual information from the question and the image, respectively. The question encoder is a bidirectional Recurrent Neural Network augmented with a self-attention mechanism, while the image feature extractor is a pretrained Convolutional Neural Network. The answer is generated by greedy decoding, with a Recurrent Neural Network cell as the decoder. At each decoding step, the answer decoder attends to the visual features through a cross-modal, scaled dot-product attention mechanism. We conduct an ablation study that investigates the contribution of each module. Our results indicate that the feedback loop, which allows access to image features at each decoding step, is an effective conditioning mechanism. The proposed model is evaluated on the VQA-CP v2 dataset, which tests the capacity to reason over the visual input without depending on language-based statistical biases. Our model shows a significant improvement in prediction accuracy over baseline approaches. We also compare the proposed model against state-of-the-art VQA systems. The reported performance is comparable to that of existing models in the literature, showcasing the feasibility of free-form answer generation for open-ended VQA.
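The decoding procedure described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it is a pure-Python toy that assumes the decoder state is used directly as the attention query and that the image region features serve as both keys and values. The names `scaled_dot_product_attention`, `greedy_decode`, and `step_fn`, as well as the use of `None` as a start token, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Attend from one query vector over a set of key/value vectors.

    Scores are dot products divided by sqrt(d), where d is the query size.
    Returns the weighted sum of the values and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return context, weights


def greedy_decode(step_fn, init_state, image_feats, eos_id, max_len=10):
    """Generate tokens by taking the arg-max at every step.

    step_fn is a hypothetical stand-in for the recurrent decoder cell:
    it maps (state, previous_token, visual_context) to (new_state, logits).
    Assumption: image_feats act as both keys and values.
    """
    state, token, output = init_state, None, []
    for _ in range(max_len):
        # Feedback loop: the image features are revisited at every step.
        context, _ = scaled_dot_product_attention(state, image_feats, image_feats)
        state, logits = step_fn(state, token, context)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_id:
            break
        output.append(token)
    return output
```

Because the loop stops at the end-of-sequence token instead of choosing from a fixed answer list, the answer length is not fixed in advance, which is the property the abstract contrasts with classification-based VQA systems.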
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/17308
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Grounded Visual Question Answering Using Sequence-to-Sequence Modeling.pdf
Size: 5.25 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.