Please use this identifier to cite or link to this item:
http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/20107

| Title: | Fine-grained token scheduling for efficient pipeline parallel LLM inference serving |
| Authors: | Καβαλλάρης, Χρήστος; Σούντρης, Δημήτριος |
| Keywords: | LLMs, LLM Inference, Pipeline Parallelism, Scheduling, Latency, Throughput |
| Issue Date: | 16-Mar-2026 |
| Abstract: | Large Language Models (LLMs) are increasingly deployed in production systems, creating a growing demand for efficient distributed inference strategies. Pipeline parallelism has emerged as a predominant approach for partitioning model computation across multiple devices. While LLM Inference remains far from hardware-effective due to the autoregressive decoding process, pipeline-parallel execution introduces further inefficiencies, as workload imbalance between pipeline stages can create idle periods known as pipeline bubbles. In this work, we closely examine the LLM inference process and existing scheduling approaches for pipeline parallel execution. Based on insights derived from this analysis, we propose a scheduling algorithm for pipeline-parallel inference designed to construct balanced microbatches, mitigate pipeline bubbles, and improve resource utilization. Experimental evaluation demonstrates that the proposed approach achieves higher throughput while maintaining lower latency compared to existing scheduling strategies. |
| URI: | http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/20107 |
| Appears in Collections: | Διπλωματικές Εργασίες - Theses |
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| diploma_thesis.pdf | | 1.6 MB | Adobe PDF | View/Open |
Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.