Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18376
Title: Discovering Approaches for Social Video Question Answering using Deep Learning
Authors: Σαρτζετάκη, Χριστίνα
Ποταμιάνος, Αλέξανδρος
Keywords: Deep Learning
Video Question Answering
Social reasoning
Social Video Question Answering
Compositional Attention Networks
MAC
Social cues
Eye-gaze detection
Emotion recognition
Natural Language Processing
Transformers
BERT
Issue Date: 14-Jun-2022
Abstract: Humans are social creatures; our survival and well-being depend on effective communication with others. This is achieved by perceiving and understanding information from multiple sensory modalities, as well as by reasoning and arriving at conclusions in order to respond accordingly. Social Video Question Answering is a Machine Learning task that tests the social reasoning abilities of an AI agent by how accurately it can answer questions about a given video. It can require sophisticated combinations of emotion recognition, language understanding, cultural knowledge, and logical and causal reasoning, on top of non-social layers of comprehension about physical events. In this Diploma Thesis, we focus on discovering different approaches for Social Video Question Answering that leverage Deep Learning methods, building on previous work in fields such as Computer Vision and Natural Language Processing. We take two distinct approaches in the course of our research.

In the first part of our work, we propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) [1] and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of the input modalities (visual, auditory, text) over multiple reasoning steps, using a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues through its mid-level fusion mechanisms. We apply MAC-X to Social Video Question Answering on the Social IQ dataset [2] and obtain a 2.5% absolute improvement in binary accuracy over the current state of the art.

In the second part of our work, we follow the direction of question answering over video captions, which we obtain by augmenting the dialogue transcripts with explicitly detected social cues, namely emotion and eye-gaze information. To the best of our knowledge, this is the first feature extraction pipeline designed specifically for social video, and it can serve as a general framework for leveraging social information in video. We experiment with different methods for generating natural language captions from an intermediate graph structure, and provide ablation studies for several BERT-like [3] language models and fine-tuning levels, as well as a hierarchical summary scheme based on question conditioning via extractive question answering. We apply our method to the Social IQ dataset [2] and obtain significant improvements over the baselines.
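To make the mid-level fusion idea above concrete, the sketch below shows a minimal MAC-style recurrent cell that, at each reasoning step, applies temporal attention over each modality's features and fuses the attended reads into a recurrent memory. It is not the thesis implementation: the module names, dimensions, GRU-based memory update, and single-query attention are illustrative assumptions only.

    # Minimal sketch (assumed, not the thesis code) of a MAC-style cell with
    # temporal attention per modality and mid-level fusion into a memory state.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalFusionCell(nn.Module):
        """One reasoning step: attend over each modality in time, then fuse."""
        def __init__(self, dim: int):
            super().__init__()
            self.query_proj = nn.Linear(dim, dim)   # question vector -> attention query
            self.fuse = nn.Linear(3 * dim, dim)     # fuse visual, audio, text reads
            self.update = nn.GRUCell(dim, dim)      # recurrent memory update (assumption)

        def attend(self, query, seq):
            # seq: (batch, time, dim); soft temporal attention conditioned on the query
            scores = torch.bmm(seq, self.query_proj(query).unsqueeze(2)).squeeze(2)
            weights = F.softmax(scores, dim=1)
            return torch.bmm(weights.unsqueeze(1), seq).squeeze(1)

        def forward(self, memory, question, visual, audio, text):
            reads = [self.attend(question, m) for m in (visual, audio, text)]
            fused = torch.relu(self.fuse(torch.cat(reads, dim=1)))
            return self.update(fused, memory)       # new memory state

    # Usage: iterate the cell for several reasoning steps over (e.g. LSTM-encoded) features.
    batch, time, dim, steps = 8, 20, 256, 4
    cell = MultimodalFusionCell(dim)
    memory = torch.zeros(batch, dim)
    question = torch.randn(batch, dim)
    visual, audio, text = (torch.randn(batch, time, dim) for _ in range(3))
    for _ in range(steps):
        memory = cell(memory, question, visual, audio, text)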
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18376
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: thesis_sartzetaki.pdf
Size: 24.19 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.