Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
Full metadata record
DC Field: Value [Language]
dc.contributor.author: Kopsidas, Timotheos
dc.date.accessioned: 2025-07-15T10:13:04Z
dc.date.available: 2025-07-15T10:13:04Z
dc.date.issued: 2025-07-04
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
dc.description.abstract: This thesis investigates the diagnostic performance of large language models (LLMs) in complex clinical scenarios, focusing on differential diagnosis generation across medical specialties. As LLMs gain attention for clinical decision support, this study offers a structured evaluation of their accuracy and diagnostic efficiency in real-world scenarios. Eighty clinical cases were drawn from The New England Journal of Medicine (NEJM) clinicopathological series. Four LLMs were assessed: two commercial models (GPT-4o and o3-mini) and two open-source models (Qwen-2.5-32B and Qwen-QWQ-32B). Each model received the same case input and system instructions and generated a primary diagnosis along with a ranked differential list in a zero-shot setup. To ensure consistency in scoring, a separate LLM, Gemini Flash 2.0, was used to verify whether the correct diagnosis appeared in the output and at which rank. Evaluation metrics included Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a novel diagnostic efficiency score that considers both the rank of the correct diagnosis and the total number of suggestions. Comparative and statistical analyses were performed to assess the effects of reasoning capability, temperature variation, and model type (open-source vs. commercial) on performance. Results showed that reasoning-enabled models outperformed non-reasoning ones, and commercial models generally surpassed open-source alternatives, though direct comparison was limited by transparency gaps. Among all models, OpenAI’s o3-mini consistently demonstrated the best performance across comparisons, achieving both high accuracy and focused differential lists. Moreover, temperature had no impact on diagnostic accuracy but improved output consistency. Performance varied across specialties, indicating uneven generalization among models. This study contributes a reproducible methodology for LLM diagnostic evaluation and highlights the importance of moving beyond single-dimension accuracy metrics toward more holistic assessments of clinical reasoning. As LLMs approach clinical relevance, future work should prioritize benchmarking tools that capture reasoning quality, specialty-specific performance, and resource efficiency. Small, domain-adapted models and multimodal capabilities represent promising directions. Overall, while strict accuracy evaluations still offer valuable insight into model progress, responsible deployment will require deeper integration with clinical workflows and standards. [en_US]
dc.language: en [en_US]
dc.subject: large language models [en_US]
dc.subject: clinical diagnosis [en_US]
dc.subject: differential diagnosis [en_US]
dc.subject: prompt engineering [en_US]
dc.subject: top-N accuracy [en_US]
dc.subject: diagnostic reasoning [en_US]
dc.subject: human–AI collaboration [en_US]
dc.subject: evaluation frameworks [en_US]
dc.title: Evaluating the Diagnostic Performance of Large Language Models in Complex Clinical Cases: A Comparative Study [en_US]
dc.description.pages: 69 [en_US]
dc.contributor.supervisor: Νικήτα Κωνσταντίνα [en_US]
dc.department: Άλλο (Other) [en_US]
Appears in Collections: Μεταπτυχιακές Εργασίες - M.Sc. Theses

Files in This Item:
File: Thesis.pdf
Description: Master_TEAM_Timotheos Kopsidas - Thesis
Size: 2.17 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.
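
The abstract above reports Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a diagnostic efficiency score that weighs both the rank of the correct diagnosis and the length of the differential list. The exact efficiency formula is not stated in this record, so the Python sketch below is only a minimal illustration under assumed definitions: the reciprocal-rank-times-inverse-list-length form, and names such as CaseResult, rank, and n_suggestions, are assumptions rather than the thesis's actual implementation.

"""Minimal sketch of the evaluation metrics described in the abstract.

Assumptions (not taken from the thesis itself): the formula used for the
"diagnostic efficiency score" is not given in the record, so the version
below -- reciprocal rank discounted by differential-list length -- is only
one plausible interpretation. Field names are illustrative.
"""
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    rank: Optional[int]   # 1-based rank of the correct diagnosis, None if it was absent
    n_suggestions: int    # total number of diagnoses the model listed


def top_n_accuracy(results: list[CaseResult], n: int) -> float:
    """Fraction of cases where the correct diagnosis appears within the top n."""
    hits = sum(1 for r in results if r.rank is not None and r.rank <= n)
    return hits / len(results)


def efficiency_score(r: CaseResult) -> float:
    """Hypothetical efficiency score: rewards a high rank AND a short list.

    Returns 0.0 when the correct diagnosis is missing. This formula is an
    assumption, not the one defined in the thesis.
    """
    if r.rank is None:
        return 0.0
    return (1.0 / r.rank) * (1.0 / r.n_suggestions)


if __name__ == "__main__":
    # Toy results for three hypothetical clinicopathological cases.
    results = [
        CaseResult(rank=1, n_suggestions=5),
        CaseResult(rank=4, n_suggestions=10),
        CaseResult(rank=None, n_suggestions=8),
    ]
    for n in (1, 3, 5, 10):
        print(f"Top-{n} accuracy: {top_n_accuracy(results, n):.2f}")
    mean_eff = sum(efficiency_score(r) for r in results) / len(results)
    print(f"Mean efficiency score: {mean_eff:.3f}")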