Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
Title: Evaluating the Diagnostic Performance of Large Language Models in Complex Clinical Cases: A Comparative Study
Authors: Kopsidas, Timotheos
Nikita, Konstantina
Keywords: large language models
clinical diagnosis
differential diagnosis
prompt engineering
top-N accuracy
diagnostic reasoning
human–AI collaboration
evaluation frameworks
Issue Date: 4-Jun-2025
Abstract: This thesis investigates the diagnostic performance of large language models (LLMs) in complex clinical scenarios, focusing on differential diagnosis generation across medical specialties. As LLMs gain attention for clinical decision support, this study offers a structured evaluation of their accuracy and diagnostic efficiency in real-world scenarios. Eighty clinical cases were drawn from The New England Journal of Medicine (NEJM) clinicopathological case series. Four LLMs were assessed: two commercial models (GPT-4o and o3-mini) and two open-source models (Qwen-2.5-32B and Qwen-QWQ-32B). Each model received the same case input and system instructions and generated a primary diagnosis along with a ranked differential list in a zero-shot setup. To ensure consistency in scoring, a separate LLM, Gemini 2.0 Flash, was used to verify whether the correct diagnosis appeared in the output and at which rank. Evaluation metrics included Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a novel diagnostic efficiency score that considers both the rank of the correct diagnosis and the total number of suggestions. Comparative and statistical analyses were performed to assess the effects of reasoning capability, temperature variation, and model type (open-source vs. commercial) on performance. Results showed that reasoning-enabled models outperformed non-reasoning ones, and commercial models generally surpassed open-source alternatives, though direct comparison was limited by transparency gaps. Across multiple comparisons, OpenAI's o3-mini consistently demonstrated the best performance, achieving both high accuracy and focused differential lists. Moreover, temperature had no impact on diagnostic accuracy but did improve output consistency. Performance varied across specialties, indicating uneven generalization among models. This study contributes a reproducible methodology for LLM diagnostic evaluation and highlights the importance of moving beyond single-dimension accuracy metrics toward more holistic assessments of clinical reasoning. As LLMs approach clinical relevance, future work should prioritize benchmarking tools that capture reasoning quality, specialty-specific performance, and resource efficiency. Small, domain-adapted models and multimodal capabilities represent promising directions. Overall, while strict accuracy evaluations still offer valuable insight into model progress, responsible deployment will require deeper integration with clinical workflows and standards.
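
To make the scoring concrete, the following is a minimal Python sketch of the Top-N accuracy computation described in the abstract, together with one hypothetical form of a score that combines rank and list length. The data layout, the function names (top_n_accuracy, efficiency_score), and the exact efficiency formula are illustrative assumptions, not the thesis's actual code or definition.

def top_n_accuracy(ranks, n):
    """Fraction of cases whose correct diagnosis appears within the top n
    suggestions. Each entry in `ranks` is the 1-based position of the
    correct diagnosis in a model's differential list, or None if the
    correct diagnosis was absent from the output."""
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return hits / len(ranks)

def efficiency_score(rank, list_length):
    """Hypothetical efficiency score: rewards a correct diagnosis placed
    high (small rank) within a short differential list; a missed
    diagnosis scores 0. This is NOT the thesis's formula, only one
    plausible instance of a rank- and length-aware score."""
    if rank is None:
        return 0.0
    return (list_length - rank + 1) / list_length ** 2

# Example: ranks of the correct diagnosis across five cases
# (None means the model never suggested it).
ranks = [1, 3, None, 2, 7]
for n in (1, 3, 5, 10):
    print(f"Top-{n} accuracy: {top_n_accuracy(ranks, n):.2f}")
print(f"Efficiency (rank 2 in a list of 10): {efficiency_score(2, 10):.2f}")

Any such score trades off the two quantities the abstract names: how high the correct diagnosis is ranked, and how many suggestions the model needed to include it.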
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
Appears in Collections: Μεταπτυχιακές Εργασίες - M.Sc. Theses

Files in This Item:
File        Description                              Size     Format
Thesis.pdf  Master_TEAM_Timotheos Kopsidas - Thesis  2.17 MB  Adobe PDF


All items on this site are protected by copyright.