Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Kopsidas, Timotheos | - |
dc.date.accessioned | 2025-07-15T10:13:04Z | - |
dc.date.available | 2025-07-15T10:13:04Z | - |
dc.date.issued | 2025-07-04 | - |
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19727 | - |
dc.description.abstract | This thesis investigates the diagnostic performance of large language models (LLMs) in complex clinical scenarios, focusing on differential diagnosis generation across medical specialties. As LLMs gain attention for clinical decision support, this study offers a structured evaluation of their accuracy and diagnostic efficiency in real-world scenarios. Eighty clinical cases were drawn from The New England Journal of Medicine (NEJM) clinicopathological series. Four LLMs were assessed: two commercial models (GPT-4o and o3-mini) and two open-source models (Qwen-2.5-32B and Qwen-QWQ-32B). Each model received the same case input and system instructions and generated a primary diagnosis along with a ranked differential list in a zero-shot setup. To ensure consistency in scoring, a separate LLM, Gemini Flash 2.0, was used to verify whether the correct diagnosis appeared in the output and, if so, at which rank. Evaluation metrics included Top-N accuracy (Top-1, Top-3, Top-5, Top-10) and a novel diagnostic efficiency score that considers both the rank of the correct diagnosis and the total number of suggestions. Comparative and statistical analyses were performed to assess the effects of reasoning capability, temperature variation, and model type (open-source vs. commercial) on performance. Results showed that reasoning-enabled models outperformed non-reasoning ones, and commercial models generally surpassed open-source alternatives, though direct comparison was limited by transparency gaps. Among all evaluated models, OpenAI’s o3-mini consistently demonstrated the best performance across multiple comparisons, achieving both high accuracy and focused differential lists. Moreover, temperature adjustments had no impact on diagnostic accuracy, although they improved output consistency. Performance varied across specialties, indicating uneven generalization among models. This study contributes a reproducible methodology for LLM diagnostic evaluation and highlights the importance of moving beyond single-dimension accuracy metrics toward more holistic assessments of clinical reasoning. As LLMs approach clinical relevance, future work should prioritize benchmarking tools that capture reasoning quality, specialty-specific performance, and resource efficiency. Small, domain-adapted models and multimodal capabilities represent promising directions. Overall, while strict accuracy evaluations still offer valuable insight into model progress, responsible deployment will require deeper integration with clinical workflows and standards. | en_US |
dc.language | en | en_US |
dc.subject | large language models | en_US |
dc.subject | clinical diagnosis | en_US |
dc.subject | differential diagnosis | en_US |
dc.subject | prompt engineering | en_US |
dc.subject | top-N accuracy | en_US |
dc.subject | diagnostic reasoning | en_US |
dc.subject | human–AI collaboration | en_US |
dc.subject | evaluation frameworks | en_US |
dc.title | Evaluating the Diagnostic Performance of Large Language Models in Complex Clinical Cases: A Comparative Study | en_US |
dc.description.pages | 69 | en_US |
dc.contributor.supervisor | Νικήτα Κωνσταντίνα | en_US |
dc.department | Άλλο (Other) | en_US |
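
The abstract above describes two evaluation metrics: Top-N accuracy and a diagnostic efficiency score that weighs both the rank of the correct diagnosis and the total number of suggestions. The Python sketch below is illustrative only; the `CaseResult` structure and the specific efficiency formula (reciprocal rank discounted by list length) are assumptions, since the record does not give the thesis's exact definition.

```python
# Illustrative sketch of the evaluation metrics named in the abstract.
# NOTE: the efficiency-score formula below is a hypothetical stand-in,
# not the thesis's actual definition.

from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    correct_rank: Optional[int]  # 1-based rank of the correct diagnosis, None if absent
    list_length: int             # total number of diagnoses the model suggested


def top_n_accuracy(results: list[CaseResult], n: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the top n suggestions."""
    hits = sum(1 for r in results if r.correct_rank is not None and r.correct_rank <= n)
    return hits / len(results)


def efficiency_score(result: CaseResult) -> float:
    """Hypothetical efficiency score: rewards a high rank and a short differential list."""
    if result.correct_rank is None or result.list_length == 0:
        return 0.0
    return (1.0 / result.correct_rank) * (1.0 / result.list_length) ** 0.5  # assumed formula


# Example: three NEJM-style cases scored for one model.
results = [CaseResult(1, 5), CaseResult(4, 10), CaseResult(None, 8)]
for n in (1, 3, 5, 10):
    print(f"Top-{n} accuracy: {top_n_accuracy(results, n):.2f}")
print("Mean efficiency:", sum(efficiency_score(r) for r in results) / len(results))
```

In this toy run, Top-1 accuracy is 0.33 and Top-5 accuracy is 0.67; the efficiency score penalizes the case where the correct diagnosis was found only at rank 4 within a long list, which is the kind of rank-and-length trade-off the abstract's metric is meant to capture.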
Appears in Collections: | Μεταπτυχιακές Εργασίες - M.Sc. Theses |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Thesis.pdf | Master_TEAM_Timotheos Kopsidas - Thesis | 2.17 MB | Adobe PDF |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.