Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19922
Full metadata record
DC Field | Value | Language
dc.contributor.author | Μπότσας, Χαράλαμπος Σπυρίδων | -
dc.date.accessioned | 2025-11-12T15:09:54Z | -
dc.date.available | 2025-11-12T15:09:54Z | -
dc.date.issued | 2025-10-13 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19922 | -
dc.description.abstract | Recent advances in Music Information Retrieval (MIR) have showcased the ability of pretrained audio representations to achieve competitive performance across various downstream tasks. However, their potential for music recommender systems, and particularly for sequential recommendation, has not been fully explored. The present work examines whether audio-based embeddings from pretrained models can prove useful in a session-based recommendation setting. Using Music4All, an openly available dataset of listening histories paired with audio segments, we generate listening sessions, extract embeddings from three pretrained models (MusiCNN, MERT, and a custom artist-based model), and experiment with different strategies for integrating them into Transformer-based recommender architectures built upon two sequential recommendation frameworks: Transformers4Rec and Represent-Then-Aggregate (RTA). Results show that while audio representations accelerate convergence, they do not consistently outperform randomly initialized baselines under the Transformers4Rec setup. In contrast, within the RTA framework, content-based embeddings yield modest but consistent improvements, yet fail to surpass other metadata-driven baselines. These contrasting trends across the two frameworks can be partly attributed to their different task formulations and training objectives, highlighting that the benefit of pretrained audio representations depends on how well the downstream objective aligns with the semantics of the embedding space. This is further supported by the finding that fine-tuning the original embeddings on behavioral data improves their effectiveness, outperforming the out-of-the-box embeddings and demonstrating the value of lightweight domain adaptation. Finally, an ablation study highlights the crucial role of proper hyperparameter choices in leveraging the full potential of the pretrained representations. | en_US
dc.language | en | en_US
dc.subject | Music Recommender Systems | en_US
dc.subject | Music Information Retrieval | en_US
dc.subject | Representation Learning | en_US
dc.subject | Sequential Recommendation | en_US
dc.subject | Neural Networks | en_US
dc.title | Evaluating the Effectiveness of Pretrained Audio Representations in Session-Based Music Recommender Systems | en_US
dc.description.pages | 117 | en_US
dc.contributor.supervisor | Μαραγκός Πέτρος | en_US
dc.department | Τομέας Σημάτων, Ελέγχου και Ρομποτικής | en_US
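To illustrate the kind of integration strategy the abstract describes, the following is a minimal, hypothetical sketch (not the thesis code) of initializing the item-embedding table of a small Transformer-based next-item recommender from precomputed audio embeddings; all names here (AudioSeqRec, audio_emb, the layer sizes) are assumptions made for illustration only.

    import torch
    import torch.nn as nn

    class AudioSeqRec(nn.Module):
        """Minimal next-item recommender whose item embeddings start from audio features."""
        def __init__(self, audio_emb, d_model=128, n_heads=4, n_layers=2, freeze=False):
            super().__init__()
            n_items, audio_dim = audio_emb.shape
            # Item-embedding table initialized from precomputed track embeddings
            # (e.g., extracted with a pretrained audio model); optionally kept frozen.
            self.item_emb = nn.Embedding.from_pretrained(audio_emb, freeze=freeze)
            self.proj = nn.Linear(audio_dim, d_model)      # audio space -> model space
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, n_items)        # scores over the track catalogue

        def forward(self, sessions):
            # sessions: (batch, seq_len) tensor of track ids from listening sessions
            h = self.proj(self.item_emb(sessions))
            h = self.encoder(h)
            return self.head(h[:, -1])                     # next-item logits

    # Toy usage with random vectors standing in for real audio embeddings.
    audio_emb = torch.randn(1000, 768)                     # 1000 tracks, 768-dim features
    model = AudioSeqRec(audio_emb)
    logits = model(torch.randint(0, 1000, (8, 20)))        # 8 sessions of 20 tracks each
    print(logits.shape)                                    # torch.Size([8, 1000])

Whether such pretrained embeddings help over a randomly initialized table, and whether they should be frozen or fine-tuned, is exactly the question the thesis evaluates empirically.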
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Description | Size | Format
Botsas_DiplomaThesis.pdf | | 10.77 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.