Data Acquisition, Exploration and Preparation for LLM Training - The Case of the Greek Language

Διβριώτης, Κωνσταντίνος

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728

Full metadata record

DC Field	Value	Language
dc.contributor.author	Διβριώτης, Κωνσταντίνος	-
dc.date.accessioned	2025-07-15T10:21:11Z	-
dc.date.available	2025-07-15T10:21:11Z	-
dc.date.issued	2025-07-02	-
dc.identifier.uri	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19728	-
dc.description.abstract	Large Language Models (LLMs) have emerged as powerful tools in Natural Language Processing, propelled by the ever-expanding scale of model sizes and training datasets. While such resources exist for high-resource languages, low-resource languages such as Greek remain significantly underrepresented in modern LLM research and development. In this thesis, we address this gap by constructing two foundational datasets for Greek LLM development: a pretraining dataset and an instruction tuning dataset. For pretraining, we collected and processed large volumes of conversational data from YouTube transcripts and formal, structured texts from publicly available PDF documents, mostly books and academic material. For instruction tuning, we translated existing high-quality instruction corpora using a custom translation pipeline, ensuring cultural relevance and context-aware conversation in Greek. Throughout the data creation process, we implemented a series of processing steps, including noise removal, formatting normalization, language filtering, and deduplication, leading to the development of a robust processing pipeline. The final datasets, comprising over 2.3 billion words and 6 billion tokens, mark a significant advancement toward training high-quality Greek LLMs. Our work contributes both reusable infrastructure and curated data to support future research and development in Greek NLP.	en_US
dc.language	en	en_US
dc.subject	Large Language Models	en_US
dc.subject	Μεγάλα Γλωσσικά Μοντέλα	en_US
dc.subject	Low-Resource Languages	en_US
dc.subject	Γλώσσες Χαμηλών Πόρων	en_US
dc.subject	Greek Dataset	en_US
dc.subject	Ελληνικό Σύνολο Δεδομένων	en_US
dc.subject	Pretraining	en_US
dc.subject	Προεκπαίδευση	en_US
dc.subject	Instruction Tuning	en_US
dc.subject	Εκπαίδευση Βάσει Οδηγιών	en_US
dc.subject	Data Curation	en_US
dc.subject	Επιμέλεια Δεδομένων	en_US
dc.subject	Processing Pipeline	en_US
dc.subject	Αγωγός Επεξεργασίας Δεδομένων	en_US
dc.title	Data Acquisition, Exploration and Preparation for LLM Training - The Case of the Greek Language	en_US
dc.description.pages	86	en_US
dc.contributor.supervisor	Στάμου Γιώργος	en_US
dc.department	Division of Computer Science (Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών)	en_US
Appears in Collections:	Διπλωματικές Εργασίες - Theses
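
The abstract names four processing steps (noise removal, formatting normalization, language filtering, and deduplication) without describing how they are implemented. The following is a minimal Python sketch of how such steps could be chained, not the thesis's actual pipeline. The function names, the 0.5 Greek-letter threshold, and the use of exact SHA-256 matching for deduplication are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_RE = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF]")


def normalize(text: str) -> str:
    """Noise removal and formatting normalization (illustrative choices)."""
    text = unicodedata.normalize("NFC", text)
    # Drop control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
    return text.strip()


def greek_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Greek letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for ch in letters if GREEK_RE.match(ch))
    return greek / len(letters)


def process(docs, min_greek_ratio=0.5):
    """Normalize, keep mostly-Greek documents, and drop exact duplicates.

    min_greek_ratio is a hypothetical threshold, not a value from the thesis.
    """
    seen = set()
    kept = []
    for doc in docs:
        clean = normalize(doc)
        if not clean or greek_ratio(clean) < min_greek_ratio:
            continue
        digest = hashlib.sha256(clean.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(clean)
    return kept


if __name__ == "__main__":
    sample = ["Καλημέρα   κόσμε", "Καλημέρα κόσμε", "Hello world", "   "]
    print(process(sample))
```

Normalizing before hashing lets documents that differ only in whitespace collapse to a single entry. A corpus at the scale reported in the abstract would more likely need a trained language identifier and near-duplicate detection, which this sketch does not attempt.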

Files in This Item:

File	Description	Size	Format
Diploma Thesis - Konstantinos Divriotis.pdf		1.69 MB	Adobe PDF