Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Title: Leveraging Synthetic Story Data for Sample-Efficient Language Pre-training
Authors: Θεοδωρόπουλος, Νικήτας
Στάμου, Γιώργος
Keywords: Language Models
Γλωσσικά Μοντέλα
Data Augmentation
Επαύξηση Δεδομένων
Language Pre-training
Γλωσσική Προεκπαίδευση
Small Language Models
Μικρά Γλωσσικά Μοντέλα
Generative Models
Παραγωγικά Μοντέλα
Story Generation
Παραγωγή Ιστοριών
Issue Date: 13-Nov-2024
Abstract: Modern Language Models (LMs) have shown remarkable feats of linguistic proficiency and language understanding, often surpassing human abilities on certain tasks. To achieve state-of-the-art performance, LMs require ever-increasing computational resources and training data, up to thousands of GPUs and trillions of tokens. In stark contrast, children can acquire a language natively by being exposed to no more than a hundred million words by 13 years of age, showing incredible sample efficiency. Motivated by this discrepancy, we investigate LM pre-training with a limited word budget, similar in size to the input human children receive during their development. For our experiments we adopt two configurations: a reduced setting with a maximum of 10 million words, and a larger setting with a maximum of 100 million words. Our method is rooted in data augmentation. We first use TinyStories, a dataset of short and simple stories typically understood by 3-4 year old children, to train generative decoder transformer models. We vary the amount of data in the training set of the decoders, and extensively evaluate their language proficiency as well as the quality and diversity of their generations. Our evaluation demonstrates that these models can generate novel and grammatically correct stories even with a few million words of pre-training data. Next, we use the generative models to create a synthetic story dataset, which we combine with a diverse corpus of texts to train encoder transformer models. To assess the effect of our data augmentation methodology, we measure the general language understanding, grammatical knowledge, and world knowledge of the encoder models, and compare against a variety of baselines trained with and without synthetic data. We find that while synthetic data augmentation can offer some modest gains, overall it has a negative effect on final linguistic performance. Finally, we underscore the limitations of our proposed approach and highlight interesting avenues for future work.
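
Note: The abstract describes a two-stage augmentation pipeline: sample synthetic stories from decoder models trained on TinyStories-style data, then mix the synthetic text with a real corpus under a fixed word budget before encoder pre-training. The Python sketch below illustrates only the general shape of such a pipeline and is not the thesis code; the model name "roneneldan/TinyStories-33M", the sampling parameters, and the 10-million-word budget are assumptions made for demonstration.

    # Illustrative sketch only (not the thesis implementation): sample synthetic
    # stories from a small decoder LM and mix them with real text under a word budget.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    def generate_synthetic_stories(model_name="roneneldan/TinyStories-33M",
                                   n_stories=100, max_new_tokens=256):
        """Sample short stories from a decoder trained on TinyStories-style data.
        The model name and sampling settings are assumptions for this sketch."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        prompt = tokenizer("Once upon a time", return_tensors="pt")
        stories = []
        for _ in range(n_stories):
            out = model.generate(**prompt, do_sample=True, top_p=0.95,
                                 max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.eos_token_id)
            stories.append(tokenizer.decode(out[0], skip_special_tokens=True))
        return stories

    def build_augmented_corpus(real_texts, synthetic_texts, word_budget=10_000_000):
        """Interleave real and synthetic texts until the word budget is exhausted
        (a hypothetical mixing scheme; the thesis may combine data differently)."""
        corpus, words = [], 0
        for real, synth in zip(real_texts, synthetic_texts):
            for text in (real, synth):
                n = len(text.split())
                if words + n > word_budget:
                    return corpus
                corpus.append(text)
                words += n
        return corpus

In a sketch like this, the ratio of real to synthetic text in the final corpus is the key design choice; the thesis compares encoder models trained with and without the synthetic portion to isolate its effect.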
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: thesis_2.pdf
Size: 3.97 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.