Performance Variation in Digital Systems: Workload Dependent Modeling and Mitigation

Noltsis, Michalis

National Technical University of Athens

School of Electrical and Computer Engineering

Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18205

Title:	Performance Variation in Digital Systems: Workload Dependent Modeling and Mitigation
Authors:	Noltsis, Michalis; Soudris, Dimitrios
Keywords:	Variability; Reliability; Failure Mechanisms; Performance
Issue date:	14-Jan-2021
Abstract:	The computer industry is witnessing unprecedented demand for functionality and performance, and it continuously adopts silicon components with smaller form factors and feature sizes. This aggressive downscaling of hardware components has given rise to several failure mechanisms that degrade the system's operation and threaten its dependability. Such failure mechanisms can result from the naturally occurring variation in the attributes of circuit elements during fabrication, or can be attributed to aging and gradual wearout of the hardware, as well as to other variability effects related to space particles and power/ground line voltage variation. The inherently stochastic nature of these failure phenomena contributes to the so-called performance variation of digital systems, in the sense that system behavior and response cannot be fully deterministic and have a dynamic component. In the software layer, computation- and data-intensive applications, user interaction, and quality-of-service conditions also generate persistently varying and unpredictable workloads, aggravating this effect. As software applications become ever more complex and resource-hungry (especially due to the continued "virtualization" that leads to the ubiquitous use of run-time threads and dynamic memory allocation), and since the shrinking of transistor and interconnect dimensions is not expected to end in this decade, we can assume that we have already entered an era of inevitable, strongly dynamic performance variation. In this highly dynamic operating context, ensuring dependability and meeting timing constraints is challenging. The goal of this research is to study existing methodologies that mitigate performance variation and to develop related schemes that can ultimately guarantee timing deadlines.
For this reason, we first briefly discuss the dominant failure mechanisms that create defects in the silicon layer and deteriorate the reliability of the system. A thorough review of the prior art on reliability mitigation is also important for understanding the current state-of-the-art mitigation approaches and methodologies. We specifically explore the research domain of parametric reliability, namely the mitigation techniques that protect the system from extreme fluctuations of operating parameters, especially regarding timing. Then, the aforementioned reliability threats need to be captured and modeled, and their impact on the system's performance described and estimated. Hence, we employ existing tools, and suggest new ones, to develop a complete framework that effectively evaluates the failure probability of electronic components, focusing especially on the SRAM buffers of NoC routers. While the reliability community widely uses time-consuming Monte Carlo experiments to assess this metric, our framework is founded on the analytics-based MPFP methodology, which we have studied and improved for the case of typical 6T memory cells. We later present a realistic case study of a closed-loop PID controller that mitigates performance variation with a reactive DVFS response. This concept has been discussed before but was only illustrated on small benchmarks; in particular, the integration of the approach on an embedded platform and its extension to manage dynamic workloads have not been shown earlier. This scheme is compared against a version of a Linux CPU frequency governor in terms of energy consumption and timing response. Moreover, we suggest another flavor of this scheme to perform thermal management. Again, this controller is implemented purely in hardware and illustrated with a realistic case study.
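To make the idea of a closed-loop PID controller with a reactive DVFS response more concrete, the following is a minimal sketch in Python. It is not the controller developed in the thesis; the frequency levels, the gains, the error definition (relative deadline overrun), and all function names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical frequency levels in MHz; a real platform exposes its own set.
FREQ_LEVELS_MHZ = [400, 800, 1200, 1600, 2000]


@dataclass
class PIDController:
    """Textbook discrete-time PID; gains are illustrative, not tuned."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float = 1.0) -> float:
        """Return the control output for the current error sample."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def nearest_level_at_or_above(freq_mhz: float) -> int:
    """Snap a requested frequency to the lowest available level that meets it."""
    for level in FREQ_LEVELS_MHZ:
        if level >= freq_mhz:
            return level
    return FREQ_LEVELS_MHZ[-1]  # saturate at the highest level


def simulate(cycles_per_task, deadline_s, gains=(0.5, 0.1, 0.0)):
    """Run the loop over per-task cycle counts; return (freq_mhz, exec_time_s) pairs."""
    pid = PIDController(*gains)
    freq = FREQ_LEVELS_MHZ[0]
    trace = []
    for cycles in cycles_per_task:
        exec_time = cycles / (freq * 1e6)  # seconds at the current frequency
        trace.append((freq, exec_time))
        # Assumed error signal: relative overrun (positive means the deadline was missed).
        error = (exec_time - deadline_s) / deadline_s
        # Reactive actuation: scale the current frequency by the PID output.
        freq = nearest_level_at_or_above(freq * (1.0 + pid.update(error)))
    return trace


if __name__ == "__main__":
    # Three identical tasks of 10M cycles with a 10 ms deadline.
    for freq_mhz, exec_time_s in simulate([1e7] * 3, deadline_s=0.01):
        print(f"{freq_mhz} MHz -> {exec_time_s * 1e3:.2f} ms")
```

In this sketch the controller reacts only after observing an overrun, which is what distinguishes a reactive scheme from a proactive one. Because the actuator is quantized to a few levels, such a loop can oscillate between adjacent levels; anti-windup and gain tuning are deliberately left out.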
Next, the aforementioned PID controller is improved to operate at a finer granularity, at the thread node level. The concepts of performance and deadline vulnerability factor are introduced to support the formulation of a discrete-time control problem, while the basis of this new approach is the system scenario methodology; this methodology, along with related terms and definitions, is studied in detail. Finally, we propose proactive DVFS actuations at the thread node level using dynamic scenarios to guarantee timeliness in a cost-efficient manner. Simulation results show significant energy gains compared to previous frequency guardband methods, while experimental results on the hardware platform substantiate the effectiveness of our scheme.
URI:	http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/18205
Appears in collections:	Ph.D. Theses

Files in this item:

File	Description	Size	Format
Noltsis_PHDthesis_NTUA_v2.pdf		4.64 MB	Adobe PDF

All items on this site are protected by copyright.