Grounded Visual Question Answering Using Sequence-to-Sequence Modeling

Νικάνδρου, Μαρία-Βασιλική

National Technical University of Athens

School of Electrical and Computer Engineering

Artemis is Live!

Welcome to our digital repository! The aim of Artemis is the systematic archiving and dissemination of the scientific work produced in the School of Electrical and Computer Engineering, National Technical University of Athens, Greece, using the technology of digital libraries.

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/17308

Title:	Grounded Visual Question Answering Using Sequence-to-Sequence Modeling
Authors:	Νικάνδρου, Μαρία-Βασιλική Ποταμιάνος Αλέξανδρος
Keywords:	visual question answering deep learning multimodal learning
Issue Date:	5-Jul-2019
Abstract:	Visual Question Answering (VQA) constitutes a task at the intersection of Natural Language Processing and Computer Vision. Given an open-ended, natural-language question and an image, the goal is to predict the correct answer. In recent years, VQA has attracted a lot of interest from the research community, as it transcends the preliminary of extracting meaningful representations from each modality. In addition to that, it requires the capability to reason over the inputs and to infer their relations. Answering arbitrary visual questions is a challenging problem, as it assesses a large range of skills extending across the linguistic and the visual domains. One limitation of typical, modern VQA systems is that they approach the task as a classification problem over a limited set of pre-defined answers. The goal of this Diploma Thesis is to alleviate the above problem, via proposing a sequence generation model that can produce answers of arbitrary length. The proposed method is a sequence-to-sequence network conditioned on textual and visual information from the question and the image respectively. The question encoder is a bidirectional Recurrent Neural Network augmented with a self-attention mechanism, while the image feature extractor is comprised of a pretrained Convolutional Neural Network. The answer is generated following a greedy decoding process, which uses a Recurrent Neural Network cell as the decoder. At each decoding step, the answer decoder attends to the visual features by employing a cross-modal, scaled dot-product attention mechanism. We conduct an ablation study that investigates the contribution of each module. Our results indicate that the feedback loop, which allows access to image features at each decoding step, is an effective conditioning mechanism. The proposed model is evaluated on the VQA-CP v2 dataset, which tests the capacity to reason over the visual input without depending on language-based statistical biases. Our model shows significant improvement of prediction accuracy compared to baseline approaches. We also measure the performance of the proposed model against state-of-the-art VQA systems. The reported performance is comparable to that of existing models in the literature, showcasing the feasibility of free-form answer generation for open-ended VQA.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/17308
Appears in Collections:	Διπλωματικές Εργασίες - Theses

Files in This Item:

File	Description	Size	Format
Grounded Visual Question Answering Using Sequence-to-Sequence Modeling.pdf		5.25 MB	Adobe PDF	View/Open

Show full item record