Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Full metadata record
DC Field | Value | Language
dc.contributor.author | Θεοδωρόπουλος, Νικήτας | -
dc.date.accessioned | 2024-11-13T10:57:25Z | -
dc.date.available | 2024-11-13T10:57:25Z | -
dc.date.issued | 2024-11-13 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409 | -
dc.description.abstract | Modern Language Models (LMs) have shown remarkable feats of linguistic proficiency and language understanding, often surpassing the abilities of humans in certain tasks. To achieve state-of-the-art performance, LMs require ever-increasing computational resources and training data, up to thousands of GPUs and trillions of tokens. In stark contrast, children can acquire a language natively by being exposed to no more than a hundred million words by 13 years of age, showing incredible sample efficiency. Motivated by this discrepancy, we investigate LM pre-training with a limited word budget, similar in size to the input human children receive during their development. For our experiments we adopt two configurations: a reduced setting with a maximum of 10 million words, and a larger setting with a maximum of 100 million words. Our method is rooted in data augmentation. We first use TinyStories, a dataset of short and simple stories typically understood by 3-4 year-old children, to train generative decoder transformer models. We vary the amount of data in the training set of the decoders, and extensively evaluate their language proficiency as well as the quality and diversity of their generations. Our evaluation demonstrates that these models can generate novel and grammatically correct stories even with a few million words of pre-training data. Next, we use the generative models to create a synthetic story dataset, which we combine with a diverse corpus of texts to train encoder transformer models. To assess the effect of our data augmentation methodology, we measure the general language understanding, grammatical knowledge, and world knowledge of the encoder models, and compare against a variety of baselines, with and without the use of synthetic data. We find that while synthetic data augmentation can offer some modest gains, overall it has a negative effect on final linguistic performance. Finally, we underscore the limitations of our proposed approach and highlight interesting avenues for future work. | en_US
dc.language | en | en_US
dc.subject | Language Models | en_US
dc.subject | Γλωσσικά Μοντέλα | en_US
dc.subject | Data Augmentation | en_US
dc.subject | Επαύξηση Δεδομένων | en_US
dc.subject | Language Pre-training | en_US
dc.subject | Γλωσσική Προεκπαίδευση | en_US
dc.subject | Small Language Models | en_US
dc.subject | Μικρά Γλωσσικά Μοντέλα | en_US
dc.subject | Generative Models | en_US
dc.subject | Παραγωγικά Μοντέλα | en_US
dc.subject | Story Generation | en_US
dc.subject | Παραγωγή Ιστοριών | en_US
dc.title | Leveraging Synthetic Story Data for Sample-Efficient Language Pre-training | en_US
dc.description.pages | 118 | en_US
dc.contributor.supervisor | Στάμου Γιώργος | en_US
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in this item:
File | Description | Size | Format
thesis_2.pdf |  | 3.97 MB | Adobe PDF


All items on this site are protected by copyright.