Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728
Title: | Data Acquisition, Exploration and Preparation for LLM Training - The Case of the Greek Language |
Authors: | Διβριώτης, Κωνσταντίνος; Στάμου Γιώργος |
Keywords: | Large Language Models; Low-Resource Languages; Greek Dataset; Pretraining; Instruction Tuning; Data Curation; Processing Pipeline |
Issue Date: | 2-Jul-2025 |
Abstract: | Large Language Models (LLMs) have emerged as powerful tools in Natural Language Processing, propelled by the ever-expanding scale of model sizes and training datasets. While such resources exist for high-resource languages, low-resource languages such as Greek remain significantly underrepresented in modern LLM research and development. In this thesis, we address this gap by constructing two foundational datasets for Greek LLM development: a pretraining dataset and an instruction tuning dataset. For pretraining, we collected and processed large volumes of conversational data from YouTube transcripts and formal, structured texts from publicly available PDF documents, mostly books and academic material. For instruction tuning, we translated existing high-quality instruction corpora using a custom translation pipeline, ensuring cultural relevance and context-aware conversation in Greek. Throughout the data creation process, we implemented a series of processing steps, including noise removal, formatting normalization, language filtering, and deduplication, leading to the development of a robust processing pipeline. The final datasets, comprising over 2.3 billion words and 6 billion tokens, mark a significant advancement toward training high-quality Greek LLMs. Our work contributes both reusable infrastructure and curated data to support future research and development in Greek NLP. |
URI: | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728 |
Appears in Collections: | Διπλωματικές Εργασίες - Theses |
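The processing steps named in the abstract (noise removal, formatting normalization, language filtering, deduplication) can be illustrated with a minimal Python sketch. This is not the thesis pipeline: the function names, the URL-stripping rule, the character-ratio language heuristic, and the 0.5 threshold are all assumptions chosen for illustration, and exact-hash deduplication stands in for whatever method the thesis actually uses.

```python
import hashlib
import re
import unicodedata

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_CHARS = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Noise removal and formatting normalization (illustrative rules only)."""
    text = unicodedata.normalize("NFC", text)   # unify composed/decomposed accents
    text = URL_PATTERN.sub(" ", text)           # hypothetical noise rule: drop URLs
    return WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace


def greek_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Greek Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for ch in letters if GREEK_CHARS.match(ch))
    return greek / len(letters)


def run_pipeline(docs, min_greek_ratio=0.5):
    """Normalize, keep mostly-Greek documents, and drop exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        cleaned = normalize(doc)
        # Language filter: assumed threshold, not a value from the thesis.
        if not cleaned or greek_ratio(cleaned) < min_greek_ratio:
            continue
        # Exact deduplication on a case-folded hash of the cleaned text.
        digest = hashlib.sha256(cleaned.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(cleaned)
    return kept
```

A production pipeline would more likely rely on a trained language identifier and near-duplicate detection; the sketch only shows how the four stages compose in order.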
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Diploma Thesis - Konstantinos Divriotis.pdf | | 1.69 MB | Adobe PDF |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.