Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386
Full metadata record
DC Field | Value | Language
dc.contributor.author | Κώστας, Νικόλαος | -
dc.date.accessioned | 2024-11-06T10:09:48Z | -
dc.date.available | 2024-11-06T10:09:48Z | -
dc.date.issued | 2024-11-01 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19386 | -
dc.description.abstract | Adversarial attacks in natural language processing (NLP) pose a critical threat to the integrity of text classification models. By generating subtle perturbations in input data, they can significantly impair model performance, misleading models into making incorrect predictions while leaving human judgment unaffected. In this thesis, we investigate the applicability of Large Language Models (LLMs) as detectors of such adversarial attacks. To this end, we develop a prompt engineering framework with the goal of crafting natural language prompts that enable LLMs to perform this task effectively. We investigate the effect of each applied prompting technique on model performance and draw conclusions about the models' potential competence at this task. After arriving at the best-performing prompt, we use it to evaluate the adversarial detection ability of multiple Large Language Models across different combinations of text classification datasets, adversarial attacks, and attacked models. To further evaluate this method's performance, we conduct a human evaluation and a sanity test for data contamination. In addition, we propose a second approach to adversarial text detection which utilizes the attacked language model itself, by inspecting the classification given to each individual sentence of a text and comparing it with the classification given to the entire text. After also evaluating this approach under multiple scenarios, we combine our two methods into a unified approach, which is then compared to other state-of-the-art detection frameworks. Our experimental results show both the necessity of appropriate prompt engineering and the potential efficacy of LLM prompting in adversarial detection. Furthermore, combining it with the second, also effective, proposed method yields competitive results and establishes our approach as a viable solution for plug-and-play detection of textual adversarial samples. | en_US
dc.language | en | en_US
dc.subject | Adversarial Attacks | en_US
dc.subject | Detection | en_US
dc.subject | Text Classification | en_US
dc.subject | Large Language Models | en_US
dc.subject | Natural Language Processing | en_US
dc.title | Large Language Models for Detection of Adversarial Attacks in Text Classification | en_US
dc.description.pages | 122 | en_US
dc.contributor.supervisor | Στάμου Γιώργος | en_US
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in this item:
File | Description | Size | Format
Nikolaos_Kostas-Diploma_Thesis.pdf | | 3.19 MB | Adobe PDF


All items on this site are protected by copyright.