Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
Full metadata record
DC Field: Value [Language]
dc.contributor.author: Kopsidas, Timotheos
dc.date.accessioned: 2025-07-15T10:13:04Z
dc.date.available: 2025-07-15T10:13:04Z
dc.date.issued: 2025-07-04
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
dc.description.abstract: This thesis investigates the diagnostic performance of large language models (LLMs) in complex clinical scenarios, focusing on differential diagnosis generation across medical specialties. As LLMs gain attention for clinical decision support, this study offers a structured evaluation of their accuracy and diagnostic efficiency in real-world scenarios. Eighty clinical cases were drawn from The New England Journal of Medicine (NEJM) clinicopathological series. Four LLMs were assessed: two commercial models (GPT-4o and o3-mini) and two open-source models (Qwen-2.5-32B and Qwen-QWQ-32B). Each model received the same case input and system instructions and generated a primary diagnosis along with a ranked differential list in a zero-shot setup. To ensure consistency in scoring, a separate LLM, Gemini Flash 2.0, was used to verify whether the correct diagnosis appeared in the output and at which rank. Evaluation metrics included Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a novel diagnostic efficiency score that considers both the rank of the correct diagnosis and the total number of suggestions. Comparative and statistical analyses were performed to assess the effects of reasoning capability, temperature variation, and model type (open-source vs. commercial) on performance. Results showed that reasoning-enabled models outperformed non-reasoning ones, and commercial models generally surpassed open-source alternatives, though direct comparison was limited by transparency gaps. Among all models, OpenAI’s o3-mini consistently demonstrated the best performance across comparisons, achieving both high accuracy and focused differential lists. Moreover, temperature had no impact on diagnostic accuracy but improved output consistency. Performance varied across specialties, indicating uneven generalization among models. This study contributes a reproducible methodology for LLM diagnostic evaluation and highlights the importance of moving beyond single-dimension accuracy metrics toward more holistic assessments of clinical reasoning. As LLMs approach clinical relevance, future work should prioritize benchmarking tools that capture reasoning quality, specialty-specific performance, and resource efficiency. Small, domain-adapted models and multimodal capabilities represent promising directions. Overall, while strict accuracy evaluations still offer valuable insight into model progress, responsible deployment will require deeper integration with clinical workflows and standards. [en_US]
dc.language: en [en_US]
dc.subject: large language models [en_US]
dc.subject: clinical diagnosis [en_US]
dc.subject: differential diagnosis [en_US]
dc.subject: prompt engineering [en_US]
dc.subject: top-N accuracy [en_US]
dc.subject: diagnostic reasoning [en_US]
dc.subject: human–AI collaboration [en_US]
dc.subject: evaluation frameworks [en_US]
dc.title: Evaluating the Diagnostic Performance of Large Language Models in Complex Clinical Cases: A Comparative Study [en_US]
dc.description.pages: 69 [en_US]
dc.contributor.supervisor: Νικήτα Κωνσταντίνα [en_US]
dc.department: Άλλο (Other) [en_US]
Appears in Collections: Μεταπτυχιακές Εργασίες - M.Sc. Theses

Files in This Item:
File: Thesis.pdf
Description: Master_TEAM_Timotheos Kopsidas - Thesis
Size: 2.17 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.
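
The abstract above reports Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a diagnostic efficiency score that weighs both the rank of the correct diagnosis and the length of the differential list. The exact efficiency formula is not stated in this record, so the Python sketch below is only a minimal illustration under assumed definitions: the reciprocal-rank-times-inverse-list-length form, and names such as CaseResult, rank, and n_suggestions, are assumptions rather than the thesis's actual implementation.

"""Minimal sketch of the evaluation metrics described in the abstract.

Assumptions (not taken from the thesis itself): the formula used for the
"diagnostic efficiency score" is not given in the record, so the version
below -- reciprocal rank discounted by differential-list length -- is only
one plausible interpretation. Field names are illustrative.
"""
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    rank: Optional[int]   # 1-based rank of the correct diagnosis, None if it was absent
    n_suggestions: int    # total number of diagnoses the model listed


def top_n_accuracy(results: list[CaseResult], n: int) -> float:
    """Fraction of cases where the correct diagnosis appears within the top n."""
    hits = sum(1 for r in results if r.rank is not None and r.rank <= n)
    return hits / len(results)


def efficiency_score(r: CaseResult) -> float:
    """Hypothetical efficiency score: rewards a high rank AND a short list.

    Returns 0.0 when the correct diagnosis is missing. This formula is an
    assumption, not the one defined in the thesis.
    """
    if r.rank is None:
        return 0.0
    return (1.0 / r.rank) * (1.0 / r.n_suggestions)


if __name__ == "__main__":
    # Toy results for three hypothetical clinicopathological cases.
    results = [
        CaseResult(rank=1, n_suggestions=5),
        CaseResult(rank=4, n_suggestions=10),
        CaseResult(rank=None, n_suggestions=8),
    ]
    for n in (1, 3, 5, 10):
        print(f"Top-{n} accuracy: {top_n_accuracy(results, n):.2f}")
    mean_eff = sum(efficiency_score(r) for r in results) / len(results)
    print(f"Mean efficiency score: {mean_eff:.3f}")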