Automated Audio Captioning

Κουζέλης, Θοδωρής

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18393

Title:	Automated Audio Captioning
Authors:	Κουζέλης, Θοδωρής; Ποταμιάνος Αλέξανδρος
Keywords:	Deep Learning; Automated Audio Captioning
Issue date:	14-Ιου-2022
Abstract:	The purpose of this dissertation is to study Automated Audio Captioning. The aim of this task is to describe the content of an audio clip using natural language. It is a cross-modal translation task at the intersection of audio signal processing and natural language processing. Audio Captioning focuses on the audio events in a clip and their spatiotemporal relationships, and expresses them in natural language. It is a recent and largely unexplored task with great potential for practical applications. In this work, we model caption generation for a given audio clip as a sequence-to-sequence task, using a Transformer architecture. We show that adopting recent strategies from the related field of Audio Tagging allows us to significantly reduce the complexity of our model without affecting performance. To generate rich and varied descriptions, we investigate decoding algorithms that minimize the trade-off between semantic fidelity and diversity in captions. As a real-world application of Automated Audio Captioning, we propose a novel task in which, given the audio of a movie, a system aims to generate captions for salient sound events. Essentially, the aim of our proposed task is to automatically generate Subtitles for the Deaf and Hard of Hearing (SDH). Our proposed system detects the segments containing sound events using a pre-trained tagging model and generates a textual description of each using our Audio Captioning model. To improve the performance of our audio captioning model, we create a task-specific dataset from SDH subtitles and movies. Furthermore, we integrate the textual information of the tagging model into caption generation by building a model for text-guided audio captioning. Finally, we propose a novel metric to evaluate our results.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18393
Appears in collections:	Διπλωματικές Εργασίες - Theses

Files in this item:

File	Description	Size	Format
Thesis.pdf		12.76 MB	Adobe PDF

All items on this site are protected by copyright.