Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18376
Title: Discovering Approaches for Social Video Question Answering using Deep Learning
Authors: Σαρτζετάκη, Χριστίνα
Ποταμιάνος, Αλέξανδρος
Keywords: Deep Learning
Video Question Answering
Social reasoning
Social Video Question Answering
Compositional Attention Networks
MAC
Social cues
Eye-gaze detection
Emotion recognition
Natural Language Processing
Transformers
BERT
Issue Date: 14-Jun-2022
Abstract: Humans are social creatures; our survival and well-being depend on effective communication with others. This is achieved by perceiving and understanding information from multiple sensory modalities, as well as by reasoning and arriving at conclusions in order to respond accordingly. Social Video Question Answering is a Machine Learning task that tests the social reasoning abilities of an AI agent by how accurately it can answer questions about a given video. It can require sophisticated combinations of emotion recognition, language understanding, cultural knowledge, and logical and causal reasoning, on top of non-social layers of comprehension about physical events. In this Diploma Thesis, we focus on discovering different approaches for Social Video Question Answering that leverage Deep Learning methods, building on previous work in fields such as Computer Vision and Natural Language Processing. We take two distinct approaches in the course of our research.

In the first part of our work, we propose a novel deep architecture for the task of reasoning about social interactions in videos. We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) [1] and propose a multimodal extension (MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level fusion of the input modalities (visual, auditory, text) over multiple reasoning steps, using a temporal attention mechanism. We then combine MAC-X with LSTMs for temporal input processing in an end-to-end architecture. Our ablation studies show that the proposed MAC-X architecture can effectively leverage multimodal input cues through its mid-level fusion mechanisms. We apply MAC-X to Social Video Question Answering on the Social IQ dataset [2] and obtain a 2.5% absolute improvement in binary accuracy over the current state of the art.

In the second part of our work, we follow the direction of question answering over video captions, which we obtain by augmenting the dialogue transcripts with explicitly detected social cues, namely emotion and eye-gaze information. To the best of our knowledge, this is the first feature extraction pipeline designed specifically for social video, and it can serve as a general framework for leveraging social information in video. We experiment with different methods for generating natural language captions from an intermediate graph structure, and provide ablation studies for several BERT-like [3] language models and fine-tuning levels, as well as a hierarchical summary scheme based on question conditioning via extractive question answering. We apply our method to the Social IQ dataset [2] and obtain significant improvements over the baselines.
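To make the mid-level fusion idea above concrete, the sketch below shows a minimal MAC-style recurrent cell that, at each reasoning step, applies temporal attention over each modality's features and fuses the attended reads into a recurrent memory. It is not the thesis implementation: the module names, dimensions, GRU-based memory update, and single-query attention are illustrative assumptions only.

    # Minimal sketch (assumed, not the thesis code) of a MAC-style cell with
    # temporal attention per modality and mid-level fusion into a memory state.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalFusionCell(nn.Module):
        """One reasoning step: attend over each modality in time, then fuse."""
        def __init__(self, dim: int):
            super().__init__()
            self.query_proj = nn.Linear(dim, dim)   # question vector -> attention query
            self.fuse = nn.Linear(3 * dim, dim)     # fuse visual, audio, text reads
            self.update = nn.GRUCell(dim, dim)      # recurrent memory update (assumption)

        def attend(self, query, seq):
            # seq: (batch, time, dim); soft temporal attention conditioned on the query
            scores = torch.bmm(seq, self.query_proj(query).unsqueeze(2)).squeeze(2)
            weights = F.softmax(scores, dim=1)
            return torch.bmm(weights.unsqueeze(1), seq).squeeze(1)

        def forward(self, memory, question, visual, audio, text):
            reads = [self.attend(question, m) for m in (visual, audio, text)]
            fused = torch.relu(self.fuse(torch.cat(reads, dim=1)))
            return self.update(fused, memory)       # new memory state

    # Usage: iterate the cell for several reasoning steps over (e.g. LSTM-encoded) features.
    batch, time, dim, steps = 8, 20, 256, 4
    cell = MultimodalFusionCell(dim)
    memory = torch.zeros(batch, dim)
    question = torch.randn(batch, dim)
    visual, audio, text = (torch.randn(batch, time, dim) for _ in range(3))
    for _ in range(steps):
        memory = cell(memory, question, visual, audio, text)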
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18376
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: thesis_sartzetaki.pdf
Size: 24.19 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.