Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Μπάρλου, Όλγα | - |
dc.date.accessioned | 2024-10-29T07:56:30Z | - |
dc.date.available | 2024-10-29T07:56:30Z | - |
dc.date.issued | 2024-10-18 | - |
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347 | - |
dc.description.abstract | Music plays a fundamental role in human culture, serving as a universal language that transcends barriers and resonates deeply with people's emotions and experiences. As artificial intelligence continues to advance, there is growing interest in applying these technologies to creative domains, including lyric generation. While Large Language Models (LLMs) have shown promise in creative writing tasks, the potential of multimodal approaches to lyric generation remains largely unexplored. In this diploma thesis, we conduct a comprehensive evaluation of four distinct approaches to lyric generation, progressively incorporating different modalities. We begin with traditional text-to-text lyric generation using state-of-the-art LLMs. We then explore audio-enhanced text generation through two approaches: a Variational Autoencoder with a Transformer-like architecture, and a model with a projection layer that aligns music and text representations between the Whisper and OpenOrca models (an illustrative sketch of this projection idea appears after the metadata table). Following this, we implement a two-stage process using SALMONN for music tag extraction followed by Claude for lyric generation. Finally, we propose a novel multimodal pipeline combining SALMONN for movie scene description, Stable Diffusion for visualization, and LLaVA for final lyric generation. Our evaluation, based on both LLM-based metrics and human assessment, reveals several key findings. First, instruction-tuned LLMs demonstrate strong baseline performance even without domain-specific training. Second, adding the audio modality through music tag extraction significantly strengthens the correlation between generated lyrics and music. Third, our novel approach incorporating visual representations achieves the best balance between lyrical coherence and musical correlation. Interestingly, while few-shot prompting improved similarity metrics, it lowered performance on creative quality assessments. These findings suggest that thoughtfully integrated multimodal approaches can enhance lyric generation while preserving creative expression, though there is a delicate balance between guided generation and creative freedom. This research contributes new methodologies to music information retrieval and opens avenues for future exploration of multimodal approaches to creative AI applications. | en_US |
dc.language | en | en_US |
dc.subject | Multimodal Models | en_US |
dc.subject | Large Language Models | en_US |
dc.subject | Music Information Retrieval | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Generative AI | en_US |
dc.title | Multimodal Approaches to Automatic Lyric Generation | en_US |
dc.description.pages | 124 | en_US |
dc.contributor.supervisor | Ποταμιάνος, Αλέξανδρος | en_US |
dc.department | Division of Signals, Control and Robotics (Τομέας Σημάτων, Ελέγχου και Ρομποτικής) | en_US |
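The abstract's audio-enhanced approach hinges on a projection layer that maps Whisper's audio representations into the LLM's text embedding space. Below is a minimal PyTorch sketch of that idea, not the thesis's implementation: the dimensions (1280 for a Whisper-large-style encoder, 4096 for an OpenOrca-style LLM), the two-layer MLP design, and the `AudioProjection` name are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only; the thesis's actual
# dimensions and adapter design may differ.
WHISPER_DIM = 1280  # hypothetical Whisper encoder frame dimension
LLM_DIM = 4096      # hypothetical LLM token-embedding dimension

class AudioProjection(nn.Module):
    """Projects audio encoder states into the LLM's embedding space
    so they can be prepended to the text prompt's embeddings."""

    def __init__(self, audio_dim: int = WHISPER_DIM, llm_dim: int = LLM_DIM):
        super().__init__()
        # A small MLP is one common adapter choice between modalities.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.proj(audio_states)

# Dummy tensors stand in for real Whisper features and prompt embeddings.
projection = AudioProjection()
audio_states = torch.randn(1, 1500, WHISPER_DIM)  # mock encoder output
text_embeds = torch.randn(1, 32, LLM_DIM)         # mock embedded lyric prompt
llm_inputs = torch.cat([projection(audio_states), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 1532, 4096])
```

In adapter setups of this kind, the audio and text backbones are typically kept frozen while only the projection is trained, so the alignment is learned cheaply: the adapter learns to place musical features where the LLM expects token embeddings.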
Appears in Collections: | Διπλωματικές Εργασίες - Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
DiplomaThesisOlgaBarlou.pdf | | 7.47 MB | Adobe PDF |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.