Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Title: Leveraging Synthetic Story Data for Sample-Efficient Language Pre-training
Authors: Θεοδωρόπουλος, Νικήτας
Στάμου Γιώργος
Keywords: Language Models
Data Augmentation
Language Pre-training
Small Language Models
Generative Models
Story Generation
Issue Date: 13-Nov-2024
Abstract: Modern Language Models (LMs) have shown remarkable feats of linguistic proficiency and language understanding, often surpassing the abilities of humans in certain tasks. To achieve state-of-the-art performance, LMs require ever-increasing computational resources and training data, up to thousands of GPUs and trillions of tokens. In stark contrast, children can acquire a language natively by being exposed to no more than a hundred million words by 13 years of age, showing incredible sample efficiency. Motivated by this discrepancy, we investigate LM pre-training with a limited word budget, similar in size to the input human children receive during their development. For our experiments we adopt two configurations: a reduced setting with a maximum of 10 million words, and a larger setting with a maximum of 100 million words. Our method is rooted in Data Augmentation. We first use TinyStories, a dataset of short and simple stories typically understood by 3-4 year old children, to train generative decoder transformer models. We vary the amount of data in the training set of the decoders, and extensively evaluate their language proficiency as well as the quality and diversity of their generations. Our evaluation demonstrates that these models can generate novel and grammatically correct stories even with only a few million words of pre-training data. Next, we use the generative models to create a synthetic story dataset, which we combine with a diverse corpus of texts to train encoder transformer models. To assess the effect of our data augmentation methodology, we measure the general language understanding, grammatical knowledge, and world knowledge of the encoder models, and compare against a variety of baselines, both with and without the use of synthetic data. We find that while synthetic data augmentation can offer some modest gains, overall it has a negative effect on final linguistic performance. Finally, we underscore the limitations of our proposed approach and highlight interesting avenues for future work.
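For illustration only (this is not code from the thesis), the following minimal Python sketch shows the general shape of the augmentation step described above: sampling synthetic stories from a small decoder-only LM via the Hugging Face transformers text-generation pipeline. The model name, prompt, and sampling parameters are placeholders; the thesis would instead use decoders trained on TinyStories under the stated word budgets.

    # Illustrative sketch of synthetic story generation for data augmentation.
    # Assumptions: "gpt2" stands in for a small decoder trained on TinyStories;
    # prompt and sampling settings are placeholders, not the thesis's values.
    from transformers import pipeline, set_seed

    set_seed(42)  # make sampling reproducible

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Once upon a time, a little fox"
    samples = generator(
        prompt,
        max_new_tokens=200,      # cap story length
        do_sample=True,          # sample (rather than greedy decode) for diversity
        top_p=0.95,              # nucleus sampling
        num_return_sequences=4,  # several candidate stories per prompt
    )

    synthetic_stories = [s["generated_text"] for s in samples]
    # In the described pipeline, stories like these would be mixed with a
    # diverse natural-text corpus (within the 10M / 100M word budget) to
    # pre-train encoder transformer models.

The choice of nucleus sampling here simply reflects the abstract's emphasis on generation diversity; any decoding strategy that yields novel, grammatical stories would serve the same role.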
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19409
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File           Description    Size       Format
thesis_2.pdf                  3.97 MB    Adobe PDF


All items in this repository are protected by copyright.