Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Title: Leveraging Synthetic Story Data for Sample-Efficient Language Pre-training
Authors: Θεοδωρόπουλος, Νικήτας
Στάμου Γιώργος
Keywords: Language Models
Data Augmentation
Language Pre-training
Small Language Models
Generative Models
Story Generation
Issue Date: 13-Nov-2024
Abstract: Modern Language Models (LMs) have shown remarkable feats of linguistic proficiency and language understanding, often surpassing the abilities of humans in certain tasks. To achieve state-of-the-art performance, LMs require ever-increasing computational resources and training data, up to thousands of GPUs and trillions of tokens. In stark contrast, children can acquire a language natively by being exposed to no more than a hundred million words by 13 years of age, showing incredible sample efficiency. Motivated by this discrepancy, we investigate LM pre-training with a limited word budget, similar in size to the input human children receive during their development. For our experiments we adopt two configurations: a reduced setting with a maximum of 10 million words, and a larger setting with a maximum of 100 million words. Our method is rooted in Data Augmentation. We first use TinyStories, a dataset of short and simple stories typically understood by 3-4 year old children, to train generative decoder transformer models. We vary the amount of data in the training set of the decoders, and extensively evaluate their language proficiency as well as the quality and diversity of their generations. Our evaluation demonstrates that these models can generate novel and grammatically correct stories even with only a few million words of pre-training data. Next, we use the generative models to create a synthetic story dataset, which we combine with a diverse corpus of texts to train encoder transformer models. To assess the effect of our data augmentation methodology, we measure the general language understanding, grammatical knowledge, and world knowledge of the encoder models, and compare against a variety of baselines, both with and without the use of synthetic data. We find that while synthetic data augmentation can offer some modest gains, overall it has a negative effect on final linguistic performance. Finally, we underscore the limitations of our proposed approach and highlight interesting avenues for future work.
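For illustration only (this is not code from the thesis), the following minimal Python sketch shows the general shape of the augmentation step described above: sampling synthetic stories from a small decoder-only LM via the Hugging Face transformers text-generation pipeline. The model name, prompt, and sampling parameters are placeholders; the thesis would instead use decoders trained on TinyStories under the stated word budgets.

    # Illustrative sketch of synthetic story generation for data augmentation.
    # Assumptions: "gpt2" stands in for a small decoder trained on TinyStories;
    # prompt and sampling settings are placeholders, not the thesis's values.
    from transformers import pipeline, set_seed

    set_seed(42)  # make sampling reproducible

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Once upon a time, a little fox"
    samples = generator(
        prompt,
        max_new_tokens=200,      # cap story length
        do_sample=True,          # sample (rather than greedy decode) for diversity
        top_p=0.95,              # nucleus sampling
        num_return_sequences=4,  # several candidate stories per prompt
    )

    synthetic_stories = [s["generated_text"] for s in samples]
    # In the described pipeline, stories like these would be mixed with a
    # diverse natural-text corpus (within the 10M / 100M word budget) to
    # pre-train encoder transformer models.

The choice of nucleus sampling here simply reflects the abstract's emphasis on generation diversity; any decoding strategy that yields novel, grammatical stories would serve the same role.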
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File           Description    Size       Format
thesis_2.pdf                  3.97 MB    Adobe PDF


All items in this repository are protected by copyright.