Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18563
Full metadata record
DC Field: Value [Language]
dc.contributor.author: Χρυσομέρης, Παναγιώτης
dc.date.accessioned: 2022-11-30T15:35:18Z
dc.date.available: 2022-11-30T15:35:18Z
dc.date.issued: 2022-11-15
dc.identifier.uri: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18563
dc.description.abstract: In recent years, the number of applications that utilize Artificial Intelligence (AI) has been growing rapidly and is expected to grow further in the future. To satisfy this ever-increasing demand for Machine Learning (ML) driven applications, Cloud providers offer inference serving systems as online services (ML-as-a-Service), which end-users can query to take advantage of "out-of-the-box" ML solutions without having to install software or provision their own servers. Typically, ML inference serving requests are accompanied by performance requirements, also known as Quality-of-Service (QoS) constraints, Service-Level Objectives (SLOs) or Service-Level Agreements (SLAs), which correspond to latency constraints set by the respective application. A key challenge for Cloud providers is therefore to guarantee such requirements while also maximizing the resource efficiency of their infrastructures, thus reducing operational costs. Satisfying these contradictory optimization goals is particularly challenging due to i) the high diversity of available ML inference serving solutions and ii) the performance variability caused by the unpredictability of user requests during the day. On top of that, Cloud providers tend to co-locate applications on shared physical servers to maximize the resource utilization of their infrastructure, which in turn degrades performance due to resource interference effects.

In this diploma thesis, we propose an interference- and resource-aware predictive scheduling framework for ML inference engines that is capable of efficiently utilizing CPU resources to satisfy QoS constraints. Our framework accounts for the effects of resource interference in the Cloud by leveraging low-level system metrics to predict the Queries per Second (QPS) that an inference engine will achieve, given the current load and resource utilization, and to select the appropriate parallelism level for deployment. We also introduce a model-less approach to the scheduling framework, which navigates the trade-off space of diverse ML model-variants for a specific inference task, on behalf of developers, to meet the application-specific objective with minimum resource utilization (see the illustrative sketch after this record). We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks today.

We evaluate our scheduling framework using a set of inference engines from the MLPerf Inference Benchmark Suite. Experimental results show that, while using a moderate amount of CPU resources that depends on the target QoS and resource load, our framework violates QoS constraints on average 1.8x less often than a max-CPU-utilization inference serving system and 3.1x less often than a min-CPU-utilization one, with performance variability more tightly concentrated around the target QoS, across a variety of interference scenarios and QoS constraints. Moreover, as the QoS constraints change, the model-less scheduling framework retains overall performance and resource utilization similar, on average, to the best-performing, most efficient inference engine in each case, resulting on average in 1.5x fewer QoS violations and 1.4x lower CPU utilization compared to the model-specific scheduling framework. [en_US]
dc.language: en [en_US]
dc.subject: Cloud computing [en_US]
dc.subject: Inference [en_US]
dc.subject: Scheduling [en_US]
dc.subject: Interference-aware [en_US]
dc.subject: Resource management [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Kubernetes [en_US]
dc.subject: Χρονοδρομολόγηση (Scheduling) [en_US]
dc.subject: Παρεμβολές (Interference) [en_US]
dc.subject: Διαχείριση πόρων (Resource management) [en_US]
dc.subject: Μηχανική Μάθηση (Machine Learning) [en_US]
dc.title: Interference and Resource Aware Predictive Inference Serving on Cloud Infrastructures [en_US]
dc.description.pages: 134 [en_US]
dc.contributor.supervisor: Σούντρης Δημήτριος [en_US]
dc.department: Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών (Division of Computer Science) [en_US]
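
The abstract above outlines the core scheduling decision: use low-level system metrics to predict the QPS each candidate deployment would achieve under the current interference, then pick the model variant and parallelism level that meets the QoS target with minimum CPU resources. Below is a minimal Python sketch of that decision loop, under stated assumptions: Config, predict_qps, choose_config, and the llc_miss_rate metric key are all hypothetical names, the QPS formula is a toy stand-in for the learned predictor described in the thesis, and the actual framework applies its choice through Kubernetes rather than printing it.

# Hypothetical sketch of the interference-aware, model-less selection loop.
# All names and the toy QPS formula are illustrative, not the thesis API.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Config:
    model_variant: str   # e.g. a quantized vs. full-precision variant for the same task
    num_threads: int     # CPU parallelism level of the inference engine
    cpu_cores: float     # CPU resources the deployment would request

def predict_qps(cfg: Config, metrics: Dict[str, float]) -> float:
    """Stand-in for the learned predictor trained on low-level system
    metrics; here a toy formula penalizes throughput as interference
    (approximated by a cache-miss-rate metric) increases."""
    interference_penalty = 1.0 - min(metrics.get("llc_miss_rate", 0.0), 0.5)
    return cfg.num_threads * 120.0 * interference_penalty  # toy per-thread QPS

def choose_config(candidates: List[Config],
                  target_qps: float,
                  metrics: Dict[str, float]) -> Optional[Config]:
    """Among configurations predicted to meet the QoS target, pick the
    one requesting the fewest CPU cores; None if none is feasible."""
    feasible = [c for c in candidates if predict_qps(c, metrics) >= target_qps]
    return min(feasible, key=lambda c: c.cpu_cores, default=None)

if __name__ == "__main__":
    candidates = [
        Config("resnet50-int8", num_threads=2, cpu_cores=2.0),
        Config("resnet50-int8", num_threads=4, cpu_cores=4.0),
        Config("resnet50-fp32", num_threads=8, cpu_cores=8.0),
    ]
    metrics = {"llc_miss_rate": 0.2}  # sampled from the co-located node
    print(choose_config(candidates, target_qps=400.0, metrics=metrics))

Choosing the cheapest feasible configuration, rather than the fastest one, mirrors the thesis goal of meeting the application-specific QoS objective with minimum resource utilization.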
Appears in Collections: Διπλωματικές Εργασίες - Theses

Files in This Item:
thesis_chrysomeris.pdf (5.7 MB, Adobe PDF)


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.