Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19913
Title: Elastic Resource Management in Microservices Architectures: A Hybrid Approach Combining Reinforcement Learning, Supervised Learning and Critical Path Extraction
Authors: Τσικριτέας, Παναγιώτης
Κοζύρης Νεκτάριος
Keywords: Kubernetes
Resource Allocation
Elasticity
Cloud Computing
Machine Learning
Reinforcement Learning
DeathStarBench
FIRM
CRISP
Microservices Architectures
Issue Date: 7-Nov-2025
Abstract: The increasing computational demands and scalability limitations of monolithic architectures have driven the adoption of microservices-based architectures, where applications are composed of loosely coupled, deployable services. To manage that complexity in an automated way, platforms like Kubernetes have become the standard in both industry and academia due to their resilience, scalability and versatile support for orchestrating these architectures, especially with the introduction of autoscalers such as the Horizontal Pod Autoscaler (HPA). However, these scalers rely on simplistic threshold-based heuristics, which do not respond well to the complex workload patterns that microservices-based systems encounter. To address this limitation, the main objective of this thesis is to propose a resource management pipeline that combines supervised learning with probabilistic calibration, to identify the critical components from the system’s trace data, and reinforcement learning, to allocate Kubernetes resources effectively. This work uses CRISP to extract the constantly changing critical paths of the system, which guide the decision-making of the reinforcement learning agent. To evaluate the use of this extracted information, this thesis compares three reinforcement learning agents: Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO) and Synchronous Advantage Actor-Critic (A2C), with the PPO agent trained for two different numbers of episodes. The agents are evaluated on their ability to manage Kubernetes resources efficiently, using metrics such as end-to-end latency at different percentiles (50th, 95th, 99th) and the average number of Pods deployed in the cluster, and each agent’s limitations are highlighted based on these metrics. The evaluation results indicate that long-term training is necessary in such complex systems for an agent to learn to allocate resources optimally.
Notably, even with limited training, the agents achieved strong performance both relative to each other and relative to the Kubernetes HPA (KHPA) baseline, which highlights the value of such agents in Kubernetes clusters.
URI: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19913
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
File: diploma_thesis_panagiotis_tsikriteas.pdf
Size: 5.22 MB
Format: Adobe PDF

