Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18209
Title: Structured Pruning for Deep Learning Language Models
Authors: Αχλάτης, Στέφανος Σταμάτης
Ποταμιάνος, Αλέξανδρος
Keywords: Deep Learning, Natural Language Processing, Transfer Learning, Transformers, BERT, Compression, Structured Pruning, Lottery Ticket Hypothesis, Fine-tuning
Issue Date: 9-Nov-2021
Abstract: In this Diploma Thesis, we study the compression of Deep Neural Networks, and more precisely structured pruning of Natural Language Processing models. Our conclusions fall into two main aspects:
• Studying pruning for its own sake: we propose an improved pruning strategy that considers both the pre-trained and the fine-tuned model.
• Studying pruning to better understand the model: we observe that during fine-tuning the model forgets prior knowledge, and to overcome this problem we analyze both the pre-trained and the fine-tuned model to obtain better pruning results.
Many studies on compressing neural networks apply techniques such as Magnitude Pruning, Structured Pruning, Quantization, Knowledge Distillation, and Neural Architecture Search. Structured Pruning, however, both compresses the network and reveals its fundamental structural components. We therefore focus on Structured Pruning of BERT-based models, the dominant models in Natural Language Processing. Previous work on Structured Pruning of BERT-based models considered only the final fine-tuned model, even though such models are produced by a fine-tuning process in which weight values are largely predetermined by the original pre-trained model and are only adapted to the end task. We thus suggest that pruning strategies should consider both the pre-trained and the final fine-tuned model, and that the head importance score should combine the importance of each head in the pre-trained and in the fine-tuned model. We examine how this idea can be implemented for BERT-base models and, to illustrate the impact of our approach, run experiments with BERT models pre-trained on different corpora and fine-tuned on datasets from different domains. We further study our approach through the Lottery Ticket Hypothesis, where we find that deriving initialization pruning masks from both the pre-trained and the fine-tuned model outperforms the approach that considers only the fine-tuned model. We also propose an improved application of the Lottery Ticket Hypothesis to structured pruning, which we name "Iterative Structural Pruning". Finally, we examine the generalization ability of our methodology across modalities by studying the speech model wav2vec 2.0; this matters because Transformer-based architectures achieve strong results in other modalities such as speech and vision. With this thesis, we hope to open new directions for exploring pruning mechanisms, the Lottery Ticket Hypothesis, and fine-tuning techniques.
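As a rough illustration of the idea described in the abstract (blending head-importance scores from the pre-trained and the fine-tuned model before structured head pruning), the sketch below uses PyTorch and the Hugging Face transformers library. It is not the thesis's actual implementation: the blending weight ALPHA, the pruning fraction, and the random placeholder importance tensors are assumptions for illustration; in practice the importance scores would come from a gradient- or mask-based estimator on real data.

```python
# Minimal sketch: combine head-importance scores from the pre-trained and the
# fine-tuned model, then remove the globally least important attention heads.
# ALPHA, PRUNE_FRACTION, and the random importance tensors are placeholders.
import torch
from transformers import BertModel

NUM_LAYERS, NUM_HEADS = 12, 12   # BERT-base
PRUNE_FRACTION = 0.3             # assumed fraction of heads to remove
ALPHA = 0.5                      # assumed weight between pre-trained and fine-tuned scores

def combined_head_importance(imp_pretrained, imp_finetuned, alpha=ALPHA):
    """Min-max normalize each (layers x heads) score matrix and blend them."""
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * norm(imp_pretrained) + (1 - alpha) * norm(imp_finetuned)

def heads_to_prune(importance, prune_fraction=PRUNE_FRACTION):
    """Return {layer: [head indices]} for the globally least important heads."""
    k = int(importance.numel() * prune_fraction)
    least_important = torch.argsort(importance.flatten())[:k]
    plan = {}
    for idx in least_important.tolist():
        layer, head = divmod(idx, importance.size(1))
        plan.setdefault(layer, []).append(head)
    return plan

# Placeholder importance scores; replace with real importance estimates.
imp_pre = torch.rand(NUM_LAYERS, NUM_HEADS)
imp_ft = torch.rand(NUM_LAYERS, NUM_HEADS)

plan = heads_to_prune(combined_head_importance(imp_pre, imp_ft))
model = BertModel.from_pretrained("bert-base-uncased")
model.prune_heads(plan)  # transformers' built-in structured head pruning
print(plan)
```

The design choice here is only that the two score matrices are normalized to a common scale before blending, so that neither the pre-trained nor the fine-tuned model dominates the ranking by magnitude alone.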
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18209
Appears in Collections: Διπλωματικές Εργασίες (Diploma Theses)

Files in This Item:
File                 Description    Size      Format
thesisAchlatis.pdf                  5.1 MB    Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.