Self-supervised Music Audio Representation Learning and Domain Adaptation Across Diverse Music Datasets

Κανατάς, Άγγελος Νικόλαος

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19644

Title:	Self-supervised Music Audio Representation Learning and Domain Adaptation Across Diverse Music Datasets
Authors:	Κανατάς, Άγγελος Νικόλαος; Ποταμιάνος Αλέξανδρος
Keywords:	Music Information Retrieval, Music Representation Learning, Computational Ethnomusicology, Self-supervised Learning, Continual Pre-Training, Domain Adaptation, Transfer Learning, Cross-Cultural Transfer, Model Merging, Task Arithmetic, Deep Learning, Foundation Models, Automatic Classification
Issue Date:	1-Jul-2025
Abstract:	This thesis studies self-supervised music audio representation learning and the cross-cultural adaptation of music foundation models to diverse musical traditions. Recent advances in music foundation models have improved audio representation learning and brought these models to the forefront of music information retrieval (MIR). However, their effectiveness across diverse musical traditions remains limited, because they are trained primarily on Western-centric data and overlook the diversity of global musical cultures. To address this, we introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. We propose a two-stage continual pre-training (CPT) strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Continually pre-training MERT-95M on a 650-hour multi-cultural data mix comprising Greek, Turkish, and Indian music traditions yields an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing the prior state of the art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges culturally specialized models in weight space. Task arithmetic performs on par with our multi-culturally trained model on non-Western auto-tagging tasks and shows no regression on Western datasets. Finally, we analyze cross-cultural transferability between single-culture models adapted via CPT, showing that musical traditions differ in how well they transfer to one another. This pattern correlates with acoustic token-level similarity among cultures, measured by cosine distance and Jensen-Shannon divergence computed over EnCodec-extracted token distributions.
Our findings demonstrate that exposure to culturally diverse data through multi-cultural CPT enhances cross-cultural generalization and leads to improved overall performance. This study contributes to the development of more culturally aware foundation models for music that generalize across diverse underrepresented musical traditions and enable world music understanding.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19644
Appears in Collections:	Διπλωματικές Εργασίες - Theses
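
The abstract names two computations that are easy to illustrate in isolation: comparing token distributions with cosine distance and Jensen-Shannon divergence, and merging adapted models in weight space via task arithmetic. The following is a minimal pure-Python sketch of the textbook forms of both, not the thesis's implementation. All function names, the flat-list weight format, the base-2 logarithm, and the single `scale` coefficient are assumptions made for illustration; the actual tokenization (EnCodec), vocabulary, and merging coefficients are described only in the thesis itself.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab_size):
    """Normalized histogram of integer token ids over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts.get(i, 0) / total for i in range(vocab_size)]


def cosine_distance(p, q):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed), so it lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def task_arithmetic_merge(base, adapted_models, scale=1.0):
    """Standard task arithmetic: base + scale * sum_i (adapted_i - base).

    Weights are hypothetical dicts mapping parameter names to flat lists of floats.
    """
    merged = {}
    for name, base_w in base.items():
        delta = [
            sum(model[name][k] - base_w[k] for model in adapted_models)
            for k in range(len(base_w))
        ]
        merged[name] = [w + scale * d for w, d in zip(base_w, delta)]
    return merged
```

With these definitions, two cultures' token streams would each be passed through `token_distribution` and then compared with `cosine_distance` or `js_divergence`; lower values indicate more similar token usage. For merging, each single-culture model contributes its difference from the shared base checkpoint, and `scale` controls how strongly the summed differences are applied.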

Files in This Item:

File	Description	Size	Format
angeloskanatas_thesis.pdf		11.69 MB	Adobe PDF
