Title: Large Language Models for Detection of Adversarial Attacks in Text Classification
Authors: Κώστας, Νικόλαος
Στάμου, Γιώργος
Keywords: Adversarial Attacks
Detection
Text Classification
Large Language Models
Natural Language Processing
Issue Date: 1-Nov-2024
Abstract: Adversarial attacks in natural language processing (NLP) pose a critical threat to the integrity of text classification models. By introducing subtle perturbations into input data, such attacks can significantly degrade model performance, misleading classifiers into incorrect predictions while leaving human judgment unaffected. In this thesis, we investigate the applicability of Large Language Models (LLMs) as detectors of such adversarial attacks. To this end, we develop a prompt engineering framework aimed at crafting natural language prompts that enable LLMs to perform this task effectively. We examine the effect of each applied prompting technique on model performance and draw conclusions about the models' potential competence at the task. Using the best-performing prompt, we evaluate the adversarial detection ability of multiple Large Language Models across different combinations of text classification datasets, adversarial attacks, and attacked models. To further assess this method's performance, we conduct a human evaluation and a sanity test for data contamination. In addition, we propose a second approach to adversarial text detection that utilizes the attacked language model itself: it inspects the classification assigned to each individual sentence of a text and compares it with the classification assigned to the entire text. After evaluating this approach under multiple scenarios as well, we combine our two methods into a unified approach, which we then compare with other state-of-the-art detection frameworks. Our experimental results demonstrate both the necessity of appropriate prompt engineering and the potential efficacy of LLM prompting in adversarial detection. Furthermore, its combination with the second proposed method, which is also effective on its own, yields competitive results and establishes our approach as a viable solution for plug-and-play detection of textual adversarial samples.
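
The second detection approach described in the abstract admits a compact illustration. What follows is a minimal sketch in Python, assuming a hypothetical black-box classify function that returns a label for any input string; the function names, the naive sentence segmentation, and the 0.5 disagreement threshold are illustrative assumptions rather than the thesis's actual implementation.

    # Minimal sketch of the sentence-level consistency check described above.
    # `classify` is a hypothetical stand-in for the attacked classifier; the
    # thesis's actual segmentation, decision rule, and threshold may differ.
    import re
    from typing import Callable, List

    def split_sentences(text: str) -> List[str]:
        # Naive sentence segmentation; a real system would use a proper tokenizer.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def is_adversarial(text: str,
                       classify: Callable[[str], str],
                       min_disagreement: float = 0.5) -> bool:
        # Label the full text, then each sentence in isolation; flag the sample
        # when a large fraction of sentence labels disagree with the full-text label.
        full_label = classify(text)
        sentences = split_sentences(text)
        if not sentences:
            return False
        disagreeing = sum(1 for s in sentences if classify(s) != full_label)
        return disagreeing / len(sentences) >= min_disagreement

The intuition is that word-level adversarial perturbations often flip the prediction for the full text while most individual sentences, taken alone, still receive the original label, so widespread disagreement between sentence-level and full-text labels signals a possible adversarial sample.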
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: Nikolaos_Kostas-Diploma_Thesis.pdf
Size: 3.19 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.