Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Μπάρλου, Όλγα | - |
dc.date.accessioned | 2024-10-29T07:56:30Z | - |
dc.date.available | 2024-10-29T07:56:30Z | - |
dc.date.issued | 2024-10-18 | - |
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347 | - |
dc.description.abstract | Music plays a fundamental role in human culture, serving as a universal language that transcends barriers and resonates deeply with people's emotions and experiences. As artificial intelligence continues to advance, there is growing interest in applying these technologies to creative domains, including lyric generation. While Large Language Models (LLMs) have shown promise in creative writing tasks, the potential of multimodal approaches to lyric generation remains largely unexplored. In this diploma thesis, we conduct a comprehensive evaluation of four distinct approaches to lyric generation, progressively incorporating different modalities. We begin with traditional text-to-text lyric generation using state-of-the-art LLMs. We then explore audio-enhanced text generation through two approaches: a Variational Autoencoder with a Transformer-like architecture, and a model with a projection layer that aligns music and text representations between the Whisper and OpenOrca models (an illustrative sketch of this projection idea appears after the metadata table). Following this, we implement a two-stage process using SALMONN for music tag extraction followed by Claude for lyric generation. Finally, we propose a novel multimodal pipeline combining SALMONN for movie scene description, Stable Diffusion for visualization, and LLaVA for final lyric generation. Our evaluation, based on both LLM-based metrics and human assessment, reveals several key findings. First, instruction-tuned LLMs demonstrate strong baseline performance even without domain-specific training. Second, adding the audio modality through music tag extraction significantly strengthens the correlation between generated lyrics and music. Third, our novel approach incorporating visual representations achieves the best balance between lyrical coherence and musical correlation. Interestingly, while few-shot prompting improved similarity metrics, it lowered performance on creative quality assessments. These findings suggest that thoughtfully integrated multimodal approaches can enhance lyric generation while preserving creative expression, though there is a delicate balance between guided generation and creative freedom. This research contributes new methodologies to music information retrieval and opens avenues for future exploration of multimodal approaches to creative AI applications. | en_US |
dc.language | en | en_US |
dc.subject | Multimodal Models | en_US |
dc.subject | Large Language Models | en_US |
dc.subject | Music Information Retrieval | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Generative AI | en_US |
dc.title | Multimodal Approaches to Automatic Lyric Generation | en_US |
dc.description.pages | 124 | en_US |
dc.contributor.supervisor | Ποταμιάνος, Αλέξανδρος | en_US |
dc.department | Division of Signals, Control and Robotics (Τομέας Σημάτων, Ελέγχου και Ρομποτικής) | en_US |
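The abstract's audio-enhanced approach hinges on a projection layer that maps Whisper's audio representations into the LLM's text embedding space. Below is a minimal PyTorch sketch of that idea, not the thesis's implementation: the dimensions (1280 for a Whisper-large-style encoder, 4096 for an OpenOrca-style LLM), the two-layer MLP design, and the `AudioProjection` name are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only; the thesis's actual
# dimensions and adapter design may differ.
WHISPER_DIM = 1280  # hypothetical Whisper encoder frame dimension
LLM_DIM = 4096      # hypothetical LLM token-embedding dimension

class AudioProjection(nn.Module):
    """Projects audio encoder states into the LLM's embedding space
    so they can be prepended to the text prompt's embeddings."""

    def __init__(self, audio_dim: int = WHISPER_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        # A small MLP is one common adapter choice between modalities.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.proj(audio_states)

# Dummy tensors stand in for real Whisper features and prompt embeddings.
projection = AudioProjection()
audio_states = torch.randn(1, 1500, WHISPER_DIM)  # mock encoder output
text_embeds = torch.randn(1, 32, LLM_DIM)         # mock embedded lyric prompt
llm_inputs = torch.cat([projection(audio_states), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 1532, 4096])
```

In adapter setups of this kind, the audio and text backbones are typically kept frozen while only the projection is trained, so the alignment is learned cheaply: the adapter learns to place musical features where the LLM expects token embeddings.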
Appears in Collections: | Διπλωματικές Εργασίες - Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
DiplomaThesisOlgaBarlou.pdf | | 7.47 MB | Adobe PDF |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.