Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386
Full metadata record
DC Field | Value | Language
dc.contributor.author | Κώστας, Νικόλαος | -
dc.date.accessioned | 2024-11-06T10:09:48Z | -
dc.date.available | 2024-11-06T10:09:48Z | -
dc.date.issued | 2024-11-01 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386 | -
dc.description.abstract | Adversarial attacks in natural language processing (NLP) pose a critical threat to the integrity of text classification models. By generating subtle perturbations in input data, they can significantly impair model performance, misleading models into making incorrect predictions while leaving human judgment unaffected. In this thesis, we investigate the applicability of Large Language Models (LLMs) as detectors of such adversarial attacks. To this end, we develop a prompt engineering framework with the goal of crafting natural language prompts that enable LLMs to perform this task effectively. We investigate the effect of each applied prompting technique on model performance and draw conclusions about the models' potential competence at this task. After arriving at the best-performing prompt, we use it to evaluate the adversarial detection ability of multiple Large Language Models across different combinations of text classification datasets, adversarial attacks, and attacked models. To further evaluate this method's performance, we conduct a human evaluation and a sanity test for data contamination. In addition, we propose a second approach to adversarial text detection which utilizes the attacked language model itself, by inspecting the classification given to each individual sentence of a text and comparing it with the classification given to the entire text. After also evaluating this approach under multiple scenarios, we combine our two methods into a unified approach, which is then compared to other state-of-the-art detection frameworks. Our experimental results show both the necessity of appropriate prompt engineering and the potential efficacy of LLM prompting in adversarial detection. Furthermore, combining it with the second, also effective, proposed method yields competitive results and establishes our approach as a viable solution for plug-and-play detection of textual adversarial samples. | en_US
dc.language | en | en_US
dc.subject | Adversarial Attacks | en_US
dc.subject | Detection | en_US
dc.subject | Text Classification | en_US
dc.subject | Large Language Models | en_US
dc.subject | Natural Language Processing | en_US
dc.title | Large Language Models for Detection of Adversarial Attacks in Text Classification | en_US
dc.description.pages | 122 | en_US
dc.contributor.supervisor | Στάμου Γιώργος | en_US
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in this item:
File | Description | Size | Format
Nikolaos_Kostas-Diploma_Thesis.pdf | | 3.19 MB | Adobe PDF


All items on this site are protected by copyright.