Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19654
Title: | Automatic Cover Song Generation from Audio Signal |
Authors: | Μαρκαντωνάτος, Γεράσιμος; Ποταμιάνος, Αλέξανδρος |
Keywords: | Music Information Retrieval; Domain Adaptation; Transfer Learning; Cross-Cultural Transfer; Cross-Instrumental Transfer; Deep Learning; Cover Song Generation |
Issue Date: | 1-Jul-2025 |
Abstract: | Cover song generation represents a challenging task in Music Information Retrieval, requiring systems to preserve the musical essence of original compositions while adapting them to specific instruments and styles. This thesis addresses two key limitations in the field: the scarcity of training data for non-pop musical genres and the lack of cover-generation models for instruments beyond piano. We present two dataset contributions: the GreekSong2Piano dataset, containing 659 Greek songs paired with piano covers across eight distinct genres (Rembetiko, Laiko, Entexno, etc.), and the Pop2Guitar dataset, with 40 song-guitar pairs for cross-instrument domain adaptation. These datasets enable systematic investigation of transfer learning approaches in low-resource scenarios. Our methodology employs a T5-based encoder-decoder Transformer architecture that treats cover generation as a sequence-to-sequence translation problem, converting audio spectrograms to symbolic MIDI representations. We systematically compare three training strategies: from-scratch training, partial fine-tuning, and full fine-tuning. Additionally, we introduce a novel sequential fine-tuning approach that performs multi-step domain adaptation from Western pop piano covers to Greek piano covers to guitar covers. Experimental results demonstrate clear advantages for transfer learning over from-scratch training. For Greek piano covers, fine-tuning strategies achieve up to a 21.0% improvement in Melody Chroma Accuracy compared to baseline models. The sequential fine-tuning approach shows particular promise for guitar generation, with the partially fine-tuned model achieving the highest similarity ratings (3.31±0.33) in user studies, approaching human performance (4.17±0.28). Our comprehensive evaluation framework combines objective metrics (melody similarity, cover song identification, embedding-based measures) with subjective user assessment, demonstrating strong correlation between computational measures and human perception. This work establishes a foundation for culturally aware and instrument-diverse music arrangement systems, contributing to both creative applications and computational understanding of musical translation across cultural and instrumental boundaries. |
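The abstract's sequence-to-sequence formulation and its "partial fine-tuning" strategy can be illustrated with a minimal sketch. The snippet below is not the thesis's code: it assumes a Hugging Face T5 backbone, a hypothetical linear spectrogram front-end (`spec_proj`), and placeholder values for the number of mel bins and the MIDI event-token vocabulary.

```python
# Minimal sketch (assumptions noted in comments): treating cover generation as
# spectrogram -> MIDI-event-token translation with a T5 backbone, and
# illustrating partial fine-tuning by freezing the pretrained encoder.
import torch
from torch import nn
from transformers import T5Config, T5ForConditionalGeneration

N_MEL_BINS = 229   # mel-spectrogram bins per frame (assumed)
VOCAB_SIZE = 1536  # size of the MIDI event-token vocabulary (assumed)

config = T5Config(vocab_size=VOCAB_SIZE, d_model=512,
                  num_layers=8, num_decoder_layers=8, num_heads=8)
model = T5ForConditionalGeneration(config)

# Linear front-end mapping each spectrogram frame to a d_model-sized embedding,
# so frames can be fed to T5 through `inputs_embeds`.
spec_proj = nn.Linear(N_MEL_BINS, config.d_model)

# Partial fine-tuning: freeze the encoder, train the decoder and the front-end.
for p in model.encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable + list(spec_proj.parameters()), lr=1e-4)

# One dummy training step on random data (batch of 2, 512 frames, 64 target tokens).
spectrogram = torch.randn(2, 512, N_MEL_BINS)
midi_tokens = torch.randint(0, VOCAB_SIZE, (2, 64))

outputs = model(inputs_embeds=spec_proj(spectrogram), labels=midi_tokens)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {outputs.loss.item():.3f}")
```

Full fine-tuning would simply leave all parameters trainable, and the sequential variant described in the abstract would repeat a fine-tuning pass on each successive domain (Western pop piano, then Greek piano, then guitar), each time initializing from the previous checkpoint.

The Melody Chroma Accuracy figure quoted above compares generated and reference melodies at the pitch-class level. A simplified, pure-NumPy stand-in is sketched below; the frame rate, the note representation, and the function names are assumptions, and the thesis's exact metric definition may differ (for example, it may follow mir_eval's melody evaluation).

```python
# Simplified stand-in (assumed definition) for a chroma-level melody accuracy:
# sample reference and generated melodies on a common frame grid and count the
# reference-voiced frames whose pitch classes (MIDI pitch mod 12) agree.
import numpy as np

FRAME_RATE = 50.0  # frames per second (assumed)

def melody_to_frames(notes, n_frames):
    """notes: list of (onset_sec, offset_sec, midi_pitch); -1 marks silence."""
    frames = np.full(n_frames, -1, dtype=int)
    for onset, offset, pitch in notes:
        lo = int(round(onset * FRAME_RATE))
        hi = min(int(round(offset * FRAME_RATE)), n_frames)
        frames[lo:hi] = pitch
    return frames

def chroma_accuracy(ref_notes, gen_notes, duration_sec):
    n_frames = int(round(duration_sec * FRAME_RATE))
    ref = melody_to_frames(ref_notes, n_frames)
    gen = melody_to_frames(gen_notes, n_frames)
    voiced = ref >= 0  # score only frames where the reference melody sounds
    match = (gen[voiced] >= 0) & (ref[voiced] % 12 == gen[voiced] % 12)
    return float(match.mean()) if voiced.any() else 0.0

# Toy example: the cover plays the right pitch classes one octave lower,
# so chroma accuracy is 1.0 even though exact pitches differ.
reference = [(0.0, 1.0, 60), (1.0, 2.0, 64), (2.0, 3.0, 67)]
generated = [(0.0, 1.0, 48), (1.0, 2.0, 52), (2.0, 3.0, 55)]
print(chroma_accuracy(reference, generated, duration_sec=3.0))  # -> 1.0
```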
URI: | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19654 |
Appears in Collections: | Διπλωματικές Εργασίες - Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
diploma_thesis_final_version.pdf | | 6.24 MB | Adobe PDF
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.