Self-supervised Music Audio Representation Learning and Domain Adaptation Across Diverse Music Datasets

Kanatas, Angelos Nikolaos

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19644

Title:	Self-supervised Music Audio Representation Learning and Domain Adaptation Across Diverse Music Datasets
Authors:	Kanatas, Angelos Nikolaos; Potamianos, Alexandros
Keywords:	Music Information Retrieval, Music Representation Learning, Computational Ethnomusicology, Self-supervised Learning, Continual Pre-Training, Domain Adaptation, Transfer Learning, Cross-Cultural Transfer, Model Merging, Task Arithmetic, Deep Learning, Foundation Models, Automatic Classification
Issue date:	1-Ιου-2025
Abstract:	This thesis focuses on self-supervised music audio representation learning and on the cross-cultural adaptation of music foundation models to diverse musical traditions. Recent advances in music foundation models have improved audio representation learning and brought these models to the forefront of music information retrieval (MIR). However, their effectiveness across diverse musical traditions remains limited, because they are trained primarily on Western-centric data and overlook the diversity of global musical cultures. To address this, we introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. We build it with a two-stage continual pre-training (CPT) strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Continually pre-training MERT-95M on a 650-hour multi-cultural data mix comprising Greek, Turkish, and Indian music traditions yields an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing the prior state of the art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges culturally specialized models in weight space. Task arithmetic performs on par with our multi-culturally trained model on non-Western auto-tagging tasks and shows no regression on Western datasets. Finally, we analyze cross-cultural transferability between models adapted to a single culture via CPT, showing that musical traditions differ in how well they transfer to others. This pattern correlates with acoustic token-level similarity among cultures, measured by the cosine distance and Jensen-Shannon divergence computed over EnCodec-extracted token distributions.
Our findings demonstrate that exposure to culturally diverse data through multi-cultural CPT enhances cross-cultural generalization and leads to improved overall performance. This study contributes to the development of more culturally aware foundation models for music that generalize across diverse, underrepresented musical traditions and enable world music understanding.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19644
Appears in collections:	Diploma Theses (Διπλωματικές Εργασίες - Theses)
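
The abstract mentions two techniques that are easy to illustrate in a few lines: merging culturally specialized models in weight space via task arithmetic, and comparing cultures through cosine distance and Jensen-Shannon divergence over token distributions. The sketch below is only an illustration under stated assumptions and is not taken from the thesis: model weights are represented as plain dictionaries of NumPy arrays, the `scale` coefficient is a hypothetical hyperparameter, and the tokens are assumed to be precomputed integer IDs from a single flat vocabulary (the record does not say how the EnCodec codebooks were handled). All function names are hypothetical.

```python
import numpy as np


def task_arithmetic(base, adapted_models, scale=1.0):
    """Merge models as base + scale * sum_i (adapted_i - base).

    `base` and every entry of `adapted_models` map parameter names to
    arrays of identical shape. `scale` is an assumed hyperparameter.
    """
    merged = {}
    for name, base_w in base.items():
        # One task vector per adapted model: the change relative to the base.
        task_vectors = [model[name] - base_w for model in adapted_models]
        merged[name] = base_w + scale * np.sum(task_vectors, axis=0)
    return merged


def token_histogram(tokens, vocab_size):
    """Normalized frequency of each token ID (assumes a flat vocabulary)."""
    counts = np.bincount(np.asarray(tokens), minlength=vocab_size).astype(float)
    return counts / counts.sum()


def cosine_distance(p, q):
    """1 - cosine similarity between two distributions."""
    return 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With these helpers, two cultures' token streams could be compared by building one histogram per culture and passing the pair to `cosine_distance` and `js_divergence`; lower values would indicate more similar token usage. Whether the thesis uses log base 2, per-codebook histograms, or a different scaling rule for merging is not stated in this record.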

Files in this item:

File	Description	Size	Format
angeloskanatas_thesis.pdf		11.69 MB	Adobe PDF

All items on this site are protected by copyright.