Title: Large Language Models for Detection of Adversarial Attacks in Text Classification
Authors: Κώστας, Νικόλαος
Στάμου, Γιώργος
Keywords: Adversarial Attacks
Detection
Text Classification
Large Language Models
Natural Language Processing
Issue Date: 1-Nov-2024
Abstract: Adversarial attacks in natural language processing (NLP) pose a critical threat to the integrity of text classification models. By introducing subtle perturbations into input data, such attacks can significantly degrade model performance, misleading classifiers into incorrect predictions while leaving human judgment unaffected. In this thesis, we investigate the applicability of Large Language Models (LLMs) as detectors of such adversarial attacks. To this end, we develop a prompt engineering framework aimed at crafting natural language prompts that enable LLMs to perform this task effectively. We examine the effect of each applied prompting technique on model performance and draw conclusions about the models' potential competence at the task. Using the best-performing prompt, we evaluate the adversarial detection ability of multiple Large Language Models across different combinations of text classification datasets, adversarial attacks, and attacked models. To further assess this method's performance, we conduct a human evaluation and a sanity test for data contamination. In addition, we propose a second approach to adversarial text detection that utilizes the attacked language model itself: it inspects the classification assigned to each individual sentence of a text and compares it with the classification assigned to the entire text. After evaluating this approach under multiple scenarios as well, we combine our two methods into a unified approach, which we then compare with other state-of-the-art detection frameworks. Our experimental results demonstrate both the necessity of appropriate prompt engineering and the potential efficacy of LLM prompting in adversarial detection. Furthermore, its combination with the second proposed method, which is also effective on its own, yields competitive results and establishes our approach as a viable solution for plug-and-play detection of textual adversarial samples.
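
The second detection approach described in the abstract admits a compact illustration. What follows is a minimal sketch in Python, assuming a hypothetical black-box classify function that returns a label for any input string; the function names, the naive sentence segmentation, and the 0.5 disagreement threshold are illustrative assumptions rather than the thesis's actual implementation.

    # Minimal sketch of the sentence-level consistency check described above.
    # `classify` is a hypothetical stand-in for the attacked classifier; the
    # thesis's actual segmentation, decision rule, and threshold may differ.
    import re
    from typing import Callable, List

    def split_sentences(text: str) -> List[str]:
        # Naive sentence segmentation; a real system would use a proper tokenizer.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def is_adversarial(text: str,
                       classify: Callable[[str], str],
                       min_disagreement: float = 0.5) -> bool:
        # Label the full text, then each sentence in isolation; flag the sample
        # when a large fraction of sentence labels disagree with the full-text label.
        full_label = classify(text)
        sentences = split_sentences(text)
        if not sentences:
            return False
        disagreeing = sum(1 for s in sentences if classify(s) != full_label)
        return disagreeing / len(sentences) >= min_disagreement

The intuition is that word-level adversarial perturbations often flip the prediction for the full text while most individual sentences, taken alone, still receive the original label, so widespread disagreement between sentence-level and full-text labels signals a possible adversarial sample.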
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Nikolaos_Kostas-Diploma_Thesis.pdf
Size: 3.19 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.