Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/20107
Title: Fine-grained token scheduling for efficient pipeline parallel LLM inference serving
Authors: Καβαλλάρης, Χρήστος
Σούντρης Δημήτριος
Keywords: LLMs, LLM Inference, Pipeline Parallelism, Scheduling, Latency, Throughput
Issue Date: 16-Mar-2026
Abstract: Large Language Models (LLMs) are increasingly deployed in production systems, creating a growing demand for efficient distributed inference strategies. Pipeline parallelism has emerged as a predominant approach for partitioning model computation across multiple devices. LLM inference already falls well short of peak hardware efficiency because of its autoregressive decoding process, and pipeline-parallel execution introduces further inefficiencies: workload imbalance between pipeline stages creates idle periods known as pipeline bubbles. In this work, we closely examine the LLM inference process and existing scheduling approaches for pipeline-parallel execution. Based on insights derived from this analysis, we propose a scheduling algorithm for pipeline-parallel inference that constructs balanced microbatches, mitigates pipeline bubbles, and improves resource utilization. Experimental evaluation demonstrates that the proposed approach achieves higher throughput while maintaining lower latency than existing scheduling strategies.
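To illustrate the idea of balanced microbatches described in the abstract, the sketch below uses a greedy longest-first partition of requests by token count. This is only an illustrative heuristic under assumed inputs (per-request token counts, a fixed microbatch count), not the scheduling algorithm proposed in the thesis itself:

```python
import heapq

def balance_microbatches(token_counts, num_microbatches):
    """Greedily assign each request (represented by its token count)
    to the currently lightest microbatch, placing the largest requests
    first. Roughly equal per-microbatch work means pipeline stages
    finish at similar times, shrinking pipeline bubbles."""
    # Min-heap of (current_total_tokens, microbatch_index)
    heap = [(0, i) for i in range(num_microbatches)]
    heapq.heapify(heap)
    batches = [[] for _ in range(num_microbatches)]
    # Longest-processing-time order gives a tighter balance
    for tokens in sorted(token_counts, reverse=True):
        load, idx = heapq.heappop(heap)
        batches[idx].append(tokens)
        heapq.heappush(heap, (load + tokens, idx))
    return batches

# Example: 8 requests of varying lengths packed into 4 microbatches
batches = balance_microbatches([512, 128, 256, 64, 384, 192, 96, 448], 4)
loads = [sum(b) for b in batches]
```

With the example inputs, the per-microbatch token loads end up within a few percent of each other, whereas naive round-robin assignment can leave one microbatch carrying far more work than the rest.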
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/20107
Appears in Collections:Διπλωματικές Εργασίες - Theses

Files in This Item:
File: diploma_thesis.pdf
Size: 1.6 MB
Format: Adobe PDF


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.