Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347
Full metadata record
dc.contributor.author: Μπάρλου, Όλγα
dc.date.accessioned: 2024-10-29T07:56:30Z
dc.date.available: 2024-10-29T07:56:30Z
dc.date.issued: 2024-10-18
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19347
dc.description.abstract:

Music plays a fundamental role in human culture, serving as a universal language that transcends barriers and resonates deeply with people's emotions and experiences. As artificial intelligence continues to advance, there is growing interest in applying these technologies to creative domains, including lyric generation. While Large Language Models (LLMs) have shown promise in creative writing tasks, the potential of multimodal approaches to lyric generation remains largely unexplored.

In this diploma thesis, we conduct a comprehensive evaluation of four distinct approaches to lyric generation, progressively incorporating different modalities. We begin with traditional text-to-text lyric generation using state-of-the-art LLMs. We then explore audio-enhanced text generation through two approaches: a Variational Autoencoder model with a Transformer-like architecture, and a model with a projection layer that aligns music and text representations between the Whisper and OpenOrca models. Following this, we implement a two-stage process using SALMONN for music tag extraction followed by Claude for lyric generation. Finally, we propose a novel multimodal pipeline combining SALMONN for movie scene description, Stable Diffusion for visualization, and LLaVA for final lyric generation.

Our evaluation, based on both LLM-based metrics and human assessment, reveals several key findings. First, instruction-tuned LLMs demonstrate strong baseline performance even without domain-specific training. Second, adding the audio modality through music tag extraction significantly enhances the correlation between generated lyrics and music. Third, our novel approach incorporating visual representations achieves the best balance between lyrical coherence and musical correlation. Interestingly, while few-shot prompting improved similarity metrics, it decreased performance in creative quality assessments.

These findings suggest that thoughtfully integrated multimodal approaches can enhance lyric generation while maintaining creative expression, though there is a delicate balance between guided generation and creative freedom. This research contributes new methodologies to music information retrieval and opens avenues for future exploration of multimodal approaches to creative AI applications.

(An illustrative sketch of the projection-layer alignment mentioned above follows at the end of this metadata record.)
dc.language: en
dc.subject: Multimodal Models
dc.subject: Large Language Models
dc.subject: Music Information Retrieval
dc.subject: Machine Learning
dc.subject: Generative AI
dc.title: Multimodal Approaches to Automatic Lyric Generation
dc.description.pages: 124
dc.contributor.supervisor: Ποταμιάνος Αλέξανδρος
dc.department: Τομέας Σημάτων, Ελέγχου και Ρομποτικής (Division of Signals, Control and Robotics)
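
On the projection-layer approach named in the abstract: the sketch below shows, under stated assumptions, what such an audio-to-text alignment module can look like. It is a minimal sketch, not the thesis implementation; the class name AudioToTextProjection, the 768-dimensional audio features (Whisper-style encoder output), and the 4096-dimensional text embeddings (OpenOrca-style LLM) are all illustrative assumptions.

```python
# Minimal sketch of a projection layer aligning audio features with an LLM's
# embedding space. Dimensions and names are illustrative assumptions, not the
# thesis implementation.
import torch
import torch.nn as nn

class AudioToTextProjection(nn.Module):
    """Maps audio-encoder frame features into an LLM's token-embedding space."""

    def __init__(self, audio_dim: int = 768, text_dim: int = 4096):
        super().__init__()
        # A single learned linear map; projected frames can then be prepended
        # to the text prompt as "soft" audio tokens.
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim) -> (batch, frames, text_dim)
        return self.proj(audio_feats)

if __name__ == "__main__":
    feats = torch.randn(2, 1500, 768)   # dummy Whisper-like encoder output
    out = AudioToTextProjection()(feats)
    print(out.shape)                    # torch.Size([2, 1500, 4096])
```

In setups of this kind, the audio encoder and the LLM are typically kept frozen and only the projection is trained on paired audio-text data, which keeps the alignment step lightweight.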
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: DiplomaThesisOlgaBarlou.pdf
Size: 7.47 MB
Format: Adobe PDF

