Evaluating the Effectiveness of Pretrained Audio Representations in Session-Based Music Recommender Systems

Μπότσας, Χαράλαμπος Σπυρίδων

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19922

Title:	Evaluating the Effectiveness of Pretrained Audio Representations in Session-Based Music Recommender Systems
Authors:	Μπότσας, Χαράλαμπος Σπυρίδων; Μαραγκός, Πέτρος
Keywords:	Music Recommender Systems; Music Information Retrieval; Representation Learning; Sequential Recommendation; Neural Networks
Issue Date:	13-Oct-2025
Abstract:	Recent advances in Music Information Retrieval (MIR) have shown that pretrained audio representations achieve competitive performance on a variety of downstream tasks. However, their potential in music recommender systems, and particularly in sequential recommendation, has not been fully explored. This work examines whether audio-based embeddings from pretrained models are useful in a session-based recommendation setting. Using Music4All, an openly available dataset of listening histories and audio segments, we generate listening sessions and extract embeddings from three pretrained models: MusiCNN, MERT, and a custom artist-based model. We then experiment with different strategies for integrating these embeddings into Transformer-based recommender architectures built on two sequential recommendation frameworks: Transformers4Rec and Represent-Then-Aggregate (RTA). Results show that while audio representations accelerate convergence, they do not consistently outperform randomly initialized baselines under the Transformers4Rec setup. In contrast, within the RTA framework, content-based embeddings yield modest but consistent improvements, yet fail to surpass other metadata-driven baselines. These opposite trends across the two frameworks can be partly attributed to differences in their task formulations and training objectives, highlighting that the benefit of pretrained audio representations depends on how well the downstream objective aligns with the semantics of the embedding space. This is further supported by the finding that fine-tuning the original embeddings on behavioral data improves their effectiveness, outperforming the out-of-the-box embeddings and demonstrating the value of lightweight domain adaptation. Finally, an ablation study reveals the crucial role of hyperparameter choice in realizing the full potential of the pretrained representations.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19922
Appears in Collections:	Διπλωματικές Εργασίες - Theses
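
The abstract mentions generating listening sessions from the Music4All listening histories but does not describe the procedure. One common approach, given here only as an illustrative sketch, starts a new session whenever the gap between a user's consecutive plays exceeds an inactivity threshold. The 30-minute threshold, the function name `split_sessions`, and the `(user_id, track_id, timestamp)` event layout are assumptions for illustration and are not taken from the thesis.

```python
from collections import defaultdict


def split_sessions(events, max_gap_seconds=1800):
    """Group (user_id, track_id, timestamp) events into per-user sessions.

    A new session starts when the gap between two consecutive plays by the
    same user exceeds max_gap_seconds. The 30-minute default is a
    hypothetical choice, not a value reported in the thesis.
    """
    by_user = defaultdict(list)
    for user_id, track_id, ts in events:
        by_user[user_id].append((ts, track_id))

    sessions = []
    for user_id in sorted(by_user):
        plays = sorted(by_user[user_id])  # chronological order per user
        current = [plays[0][1]]
        for (prev_ts, _), (ts, track_id) in zip(plays, plays[1:]):
            if ts - prev_ts > max_gap_seconds:
                # Inactivity gap exceeded: close the running session.
                sessions.append((user_id, current))
                current = []
            current.append(track_id)
        sessions.append((user_id, current))
    return sessions


# Toy listening history with timestamps in seconds (made-up data).
events = [
    ("u1", "t1", 0),
    ("u1", "t2", 200),
    ("u1", "t3", 5000),  # 4800 s after the previous play: new session
    ("u2", "t9", 100),
]
sessions = split_sessions(events)
```

Each resulting session is an ordered list of track identifiers, which is the input shape a sequential recommender typically expects; a real pipeline would likely also filter out very short sessions, which this sketch does not do.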

Files in This Item:

File	Description	Size	Format
Botsas_DiplomaThesis.pdf		10.77 MB	Adobe PDF
