Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19186
Title: Large Language Models, Adapters and Perplexity Scores for Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection
Authors: Ανδρεάδης, Δημήτριος
Στάμου, Γιώργος
Keywords: Machine Generated Text Detection
Author Attribution
Pretrained Language Models
Large Language Models
Adapter Tuning
Prompt Tuning
Fixed-length Perplexity
Multilingual
Issue Date: 16-Jul-2024
Abstract: Large language models (LLMs) are becoming mainstream and easily accessible, ushering in an explosion of machine-generated content across channels such as news, social media, question-answering forums, and educational and even academic contexts. Recent LLMs, such as ChatGPT and GPT-4, generate remarkably fluent responses to a wide variety of user queries, and the articulate nature of these texts makes LLMs attractive for replacing human labor in many scenarios. However, this has also raised concerns about potential misuse, such as spreading misinformation and disrupting the education system. Since humans perform only slightly better than chance when distinguishing machine-generated from human-written text, automatic systems are needed to identify machine-generated text and mitigate its potential misuse. This need is addressed by Task 8 of SemEval-2024. In this thesis, we work towards this goal by addressing Subtasks A and B of that task. As a starting point, we experiment with fine-tuning pre-trained language models (PLMs) for machine-generated text detection (MGTD), examining the effect of hyperparameters on accuracy. We propose prompt tuning as an effective adapter technique that further boosts performance, and we apply these findings to the harder subtask of author attribution (AA). For the multilingual track of MGTD, we detect the source language of each text and then both translate the texts and use language adapters to test whether further improvements can be achieved. Beyond model and adapter tuning, we explore a complementary approach: using multiple PLMs, we compute fixed-length perplexity scores for detection. Overall, this thesis examines the potential of these methods for MGTD and AA and draws conclusions about their effectiveness.
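The fixed-length perplexity scores mentioned in the abstract can be illustrated with a minimal sketch (not taken from the thesis; the model name "gpt2", the window length, and the stride below are assumptions chosen only for the example): a pre-trained causal language model scores a text in windows of a fixed number of tokens, and the exponentiated average negative log-likelihood serves as the perplexity feature.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed model for illustration; the thesis combines scores from multiple PLMs.
    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def fixed_length_perplexity(text, max_length=512, stride=512):
        # Tokenize once, then score the text in fixed-length windows so that
        # every evaluation uses the same context size regardless of text length.
        input_ids = tokenizer(text, return_tensors="pt").input_ids
        nlls, n_tokens = [], 0
        for start in range(0, input_ids.size(1), stride):
            window = input_ids[:, start:start + max_length]
            if window.size(1) < 2:
                break
            with torch.no_grad():
                # With labels equal to the inputs, the model returns the mean token NLL.
                loss = model(window, labels=window).loss
            nlls.append(loss * (window.size(1) - 1))
            n_tokens += window.size(1) - 1
        # Perplexity = exp(total NLL / number of predicted tokens).
        return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

    print(fixed_length_perplexity("Large language models generate remarkably fluent text."))

A lower perplexity under a given PLM indicates text that is more predictable to that model, which is one signal a detector can exploit when combining scores from several PLMs.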
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19186
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Diploma_Thesis_Andreadis_Dimitrios.pdf (1.51 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.