Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19900

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Λαγός, Αναστάσιος | - |
| dc.date.accessioned | 2025-11-06T21:05:04Z | - |
| dc.date.available | 2025-11-06T21:05:04Z | - |
| dc.date.issued | 2025-11-06 | - |
| dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19900 | - |
| dc.description.abstract | Large Language Models (LLMs) rely on the Transformer architecture to capture dependencies among input tokens. As model sizes and input contexts continue to grow, the ability to handle extended sequences efficiently becomes increasingly critical to maintaining performance and scalability. However, the computational and memory demands of the standard attention mechanism grow as O(n²) with respect to sequence length, which prevents models from scaling further. A proposed solution, the sparse approximation of attention, aims to reduce the overall complexity while maintaining model quality. However, when implemented on accelerator platforms such as GPUs, sparse attention implementations often exhibit performance degradation caused by properties inherent to sparsity, such as irregular memory access patterns. This work analyzes common bottlenecks in the implementation of the Sparse Matrix-Matrix Multiplication (SpMM) kernel and its subsequent optimization, comparing performance against NVIDIA's cuSPARSE library. The best kernel at low sparsity achieves a 57% speedup but demonstrates a 49% slowdown at high sparsities; a different kernel design achieves a 64% speedup but suffers a 250% slowdown at low sparsities. The study therefore concludes that tailoring kernel designs to the specific parameters of each problem yields significantly better results than applying a one-size-fits-all approach. | en_US |
| dc.language | en | en_US |
| dc.subject | Neural Networks | en_US |
| dc.subject | Deep Learning | en_US |
| dc.subject | Reinforcement Learning | en_US |
| dc.subject | Large Language Models | en_US |
| dc.subject | Transformers | en_US |
| dc.subject | Multi-head Self-Attention | en_US |
| dc.subject | Sparse Attention | en_US |
| dc.subject | Sparse Matrix-Matrix Multiplication | en_US |
| dc.subject | Parallel Computing Systems | en_US |
| dc.subject | GPU Programming | en_US |
| dc.subject | Kernels | en_US |
| dc.subject | CUDA | en_US |
| dc.subject | Optimization | en_US |
| dc.title | Accelerating SpMM for Multi-head Self-Attention: Kernel-Level Design and Performance Analysis | en_US |
| dc.description.pages | 61 | en_US |
| dc.contributor.supervisor | Γκούμας Γεώργιος | en_US |
| dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US |
| Appears in Collections: | Διπλωματικές Εργασίες - Theses | |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.
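For illustration, the following is a minimal sketch of a naive CSR-based SpMM CUDA kernel of the general kind discussed in the abstract above. It is not the thesis' kernel and not cuSPARSE code; the function name `spmm_csr_naive`, the storage layout (CSR for the sparse operand A, row-major for the dense operands B and C), and the one-thread-per-output-element mapping are assumptions chosen for clarity.

```cuda
// Minimal illustrative sketch (not the thesis' kernel): naive CSR SpMM
// computing C = A * B, where A is sparse (CSR) and B, C are dense, row-major.
// Names, layout, and launch mapping are assumptions made for this example.
#include <cuda_runtime.h>

__global__ void spmm_csr_naive(int num_rows, int num_cols_B,
                               const int*   __restrict__ row_ptr,  // CSR row offsets, length num_rows + 1
                               const int*   __restrict__ col_idx,  // CSR column indices of non-zeros
                               const float* __restrict__ values,   // CSR non-zero values
                               const float* __restrict__ B,        // dense, num_cols_A x num_cols_B
                               float*       __restrict__ C)        // dense, num_rows x num_cols_B
{
    // One thread computes one output element C[row][col].
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows || col >= num_cols_B) return;

    float acc = 0.0f;
    // Walk the non-zeros of this row; the indirect accesses into B through
    // col_idx are the kind of irregular memory traffic the abstract refers to.
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        acc += values[k] * B[col_idx[k] * num_cols_B + col];

    C[row * num_cols_B + col] = acc;
}

// Example host-side launch (hypothetical sizes):
//   dim3 block(16, 16);
//   dim3 grid((num_cols_B + 15) / 16, (num_rows + 15) / 16);
//   spmm_csr_naive<<<grid, block>>>(num_rows, num_cols_B,
//                                   row_ptr, col_idx, values, B, C);
```

The scattered reads of B through `col_idx` depend entirely on the sparsity pattern, which is one reason a kernel design that performs well at one sparsity level can degrade sharply at another, consistent with the speedup/slowdown trade-offs reported in the abstract.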