Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728
Full metadata record
DC Field | Value | Language
dc.contributor.author | Διβριώτης, Κωνσταντίνος | -
dc.date.accessioned | 2025-07-15T10:21:11Z | -
dc.date.available | 2025-07-15T10:21:11Z | -
dc.date.issued | 2025-07-02 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728 | -
dc.description.abstract | Large Language Models (LLMs) have emerged as powerful tools in Natural Language Processing, propelled by the ever-expanding scale of model sizes and training datasets. While such resources exist for high-resource languages, low-resource languages such as Greek remain significantly underrepresented in modern LLM research and development. In this thesis, we address this gap by constructing two foundational datasets for Greek LLM development: a pretraining dataset and an instruction tuning dataset. For pretraining, we collected and processed large volumes of conversational data from YouTube transcripts and formal, structured texts from publicly available PDF documents, mostly books and academic material. For instruction tuning, we translated existing high-quality instruction corpora using a custom translation pipeline, ensuring cultural relevance and context-aware conversation in Greek. Throughout the data creation process, we implemented a series of processing steps, including noise removal, formatting normalization, language filtering, and deduplication, leading to the development of a robust processing pipeline. The final datasets, comprising over 2.3 billion words and 6 billion tokens, mark a significant advancement toward training high-quality Greek LLMs. Our work contributes both reusable infrastructure and curated data to support future research and development in Greek NLP. | en_US
dc.language | en | en_US
dc.subject | Large Language Models | en_US
dc.subject | Μεγάλα Γλωσσικά Μοντέλα | en_US
dc.subject | Low-Resource Languages | en_US
dc.subject | Γλώσσες Χαμηλών Πόρων | en_US
dc.subject | Greek Dataset | en_US
dc.subject | Ελληνικό Σύνολο Δεδομένων | en_US
dc.subject | Pretraining | en_US
dc.subject | Προεκπαίδευση | en_US
dc.subject | Instruction Tuning | en_US
dc.subject | Εκπαίδευση Βάσει Οδηγιών | en_US
dc.subject | Data Curation | en_US
dc.subject | Επιμέλεια Δεδομένων | en_US
dc.subject | Processing Pipeline | en_US
dc.subject | Αγωγός Επεξεργασίας Δεδομένων | en_US
dc.title | Data Acquisition, Exploration and Preparation for LLM Training - The Case of the Greek Language | en_US
dc.description.pages | 86 | en_US
dc.contributor.supervisor | Στάμου Γιώργος | en_US
dc.department | Division of Computer Science (Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών) | en_US
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Description | Size | Format
Diploma Thesis - Konstantinos Divriotis.pdf | | 1.69 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.