Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19900
Full metadata record
DC Field | Value | Language
dc.contributor.author | Λαγός, Αναστάσιος | -
dc.date.accessioned | 2025-11-06T21:05:04Z | -
dc.date.available | 2025-11-06T21:05:04Z | -
dc.date.issued | 2025-11-06 | -
dc.identifier.uri | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19900 | -
dc.description.abstract | Large Language Models (LLMs) rely on the Transformer architecture to capture dependencies among input tokens. As model sizes and input contexts continue to grow, the ability to handle extended sequences efficiently becomes increasingly critical to maintaining performance and scalability. However, the computational and memory demands of the standard attention mechanism grow as O(n²) with respect to sequence length, impeding further scaling. A proposed solution, the sparse approximation of attention, aims to reduce the overall complexity while maintaining model quality. However, when implemented on accelerator platforms such as GPUs, sparse attention implementations often exhibit performance degradation caused by the irregularity that sparsity introduces. This work analyzes common bottlenecks in the implementation of the Sparse Matrix-Matrix Multiplication (SpMM) kernel and optimizes it accordingly, comparing performance against NVIDIA's cuSPARSE library (an illustrative cuSPARSE baseline is sketched after this metadata record). The best kernel at low sparsity achieves a 57% speedup but exhibits a 49% slowdown at high sparsities; a different kernel design achieves a 64% speedup but suffers a 250% slowdown at low sparsities. The study therefore concludes that tailoring kernel designs to the specific parameters of each problem yields significantly better results than applying a catch-all approach. | en_US
dc.language | en | en_US
dc.subject | Neural Networks | en_US
dc.subject | Deep Learning | en_US
dc.subject | Reinforcement Learning | en_US
dc.subject | Large Language Models | en_US
dc.subject | Transformers | en_US
dc.subject | Multi-head Self-Attention | en_US
dc.subject | Sparse Attention | en_US
dc.subject | Sparse Matrix-Matrix Multiplication | en_US
dc.subject | Parallel Computing Systems | en_US
dc.subject | GPU Programming | en_US
dc.subject | Kernels | en_US
dc.subject | CUDA | en_US
dc.subject | Optimization | en_US
dc.title | Accelerating SpMM for Multi-head Self-Attention: Kernel-Level Design and Performance Analysis | en_US
dc.description.pages | 61 | en_US
dc.contributor.supervisor | Γκούμας Γεώργιος | en_US
dc.department | Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών | en_US
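The abstract benchmarks custom SpMM kernels against NVIDIA's cuSPARSE library. As a point of reference, the minimal sketch below shows how a cuSPARSE SpMM baseline of the form C = S · V (a sparse attention-score matrix S in CSR format times a dense value matrix V) can typically be set up and timed with CUDA events; the matrix sizes, the block-diagonal sparsity pattern, and all values are illustrative placeholders, not the configuration used in the thesis.

// Minimal cuSPARSE SpMM baseline sketch: C = S * V.
// Toy dimensions and placeholder data only; not the thesis code.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Toy 4x4 sparse matrix S (block-diagonal pattern, CSR) and dense 4x2 matrix V.
    const int64_t m = 4, k = 4, n = 2, nnz = 8;
    std::vector<int>   hRowPtr = {0, 2, 4, 6, 8};
    std::vector<int>   hColInd = {0, 1, 0, 1, 2, 3, 2, 3};
    std::vector<float> hVal(nnz, 1.0f);
    std::vector<float> hV(k * n, 1.0f), hC(m * n, 0.0f);

    // Allocate and populate device buffers.
    int *dRowPtr, *dColInd; float *dVal, *dV, *dC;
    cudaMalloc(&dRowPtr, hRowPtr.size() * sizeof(int));
    cudaMalloc(&dColInd, nnz * sizeof(int));
    cudaMalloc(&dVal,    nnz * sizeof(float));
    cudaMalloc(&dV,      k * n * sizeof(float));
    cudaMalloc(&dC,      m * n * sizeof(float));
    cudaMemcpy(dRowPtr, hRowPtr.data(), hRowPtr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dColInd, hColInd.data(), nnz * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dVal,    hVal.data(),    nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV,      hV.data(),      k * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC,      hC.data(),      m * n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe S (sparse, CSR) and V, C (dense, row-major) for the generic SpMM API.
    cusparseHandle_t handle;  cusparseCreate(&handle);
    cusparseSpMatDescr_t matS;
    cusparseDnMatDescr_t matV, matC;
    cusparseCreateCsr(&matS, m, k, nnz, dRowPtr, dColInd, dVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matV, k, n, n, dV, CUDA_R_32F, CUSPARSE_ORDER_ROW);
    cusparseCreateDnMat(&matC, m, n, n, dC, CUDA_R_32F, CUSPARSE_ORDER_ROW);

    float alpha = 1.0f, beta = 0.0f;
    size_t bufSize = 0;  void* dBuf = nullptr;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matS, matV, &beta, matC, CUDA_R_32F,
                            CUSPARSE_SPMM_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    // Time the library SpMM with CUDA events; a custom kernel would be
    // wrapped in the same harness for a like-for-like comparison.
    cudaEvent_t start, stop;  cudaEventCreate(&start);  cudaEventCreate(&stop);
    cudaEventRecord(start);
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matS, matV, &beta, matC, CUDA_R_32F,
                 CUSPARSE_SPMM_ALG_DEFAULT, dBuf);
    cudaEventRecord(stop);  cudaEventSynchronize(stop);
    float ms = 0.0f;  cudaEventElapsedTime(&ms, start, stop);
    printf("cuSPARSE SpMM took %.3f ms\n", ms);

    // Clean up descriptors and device memory.
    cusparseDestroySpMat(matS);  cusparseDestroyDnMat(matV);  cusparseDestroyDnMat(matC);
    cusparseDestroy(handle);
    cudaFree(dRowPtr); cudaFree(dColInd); cudaFree(dVal);
    cudaFree(dV); cudaFree(dC); cudaFree(dBuf);
    return 0;
}

Compiled with nvcc and linked against -lcusparse, the same event-based timing harness can wrap a hand-written SpMM kernel, which is how speedup and slowdown figures relative to cuSPARSE are usually measured.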
Appears in Collections:Διπλωματικές Εργασίες - Theses

Files in This Item:
File | Description | Size | Format
main.pdf |  | 2.08 MB | Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.