Full metadata record
DC Field | Value | Language
dc.contributor.author | Νόλτσης, Μιχάλης | -
dc.description.abstract | The computer industry is witnessing an unprecedented demand for more functionality and performance and is continuously adopting silicon components with smaller form factors and feature sizes. This aggressive downscaling of hardware components has brought about several failure mechanisms that degrade the system’s operation, threatening its dependability. Such failure mechanisms can result from the naturally occurring variation in the attributes of circuit elements during fabrication, or can be attributed to the aging and gradual wearout of the hardware and to other variability effects related to space particles and power/ground line voltage variation. The inherently stochastic nature of these failure phenomena contributes to the so-called performance variation of digital systems, in the sense that system behavior and response cannot be fully deterministic and have a dynamic component. In the software layer, computation- and data-intensive applications, user interaction and quality of service conditions also generate persistently varying and unpredictable workloads, aggravating this effect. As software applications are becoming ever more complex and resource-hungry (especially due to the continued “virtualization” that leads to the ubiquitous use of run-time threads and dynamic memory allocation), and since the shrinking of transistor and interconnect dimensions is not expected to end in this decade, we can assume that we have already entered an era of inevitable, strongly dynamic performance variation. In this highly dynamic context of system operation, ensuring dependability and meeting timing constraints is challenging. The goal of the current research is to study existing methodologies that mitigate performance variation and to develop related schemes that can ultimately guarantee timing deadlines.
For this reason, we first briefly discuss the dominant failure mechanisms that create defects in the silicon layer and deteriorate the reliability of the system. A thorough review of the prior art on reliability mitigation is also important in order to understand the current, state-of-the-art mitigation approaches and methodologies. We specifically explore the research domain of parametric reliability, namely the mitigation techniques that protect the system from extreme fluctuations of operation parameters, especially regarding timing. Then, the aforementioned reliability threats need to be captured and modeled, while their impact on the system’s performance should be described and estimated. Hence, we employ existing tools, and suggest new ones, in order to develop a complete framework that effectively evaluates the failure probability of electronic components, focusing especially on the SRAM buffers of NoC routers. While the reliability community widely uses time-consuming Monte Carlo experiments to assess this metric, our framework is founded on the analytics-based MPFP methodology, which we have studied and improved for the case of typical 6T memory cells. We later present a realistic case study of a closed-loop PID controller that mitigates performance variation with a reactive DVFS response. This concept has been discussed before but was only illustrated on small benchmarks; in particular, the integration of the approach on an embedded platform and its extension to manage dynamic workloads have not been shown earlier. This scheme is compared against a Linux CPU frequency governor in terms of energy consumption and timing response. Moreover, we suggest another flavor of this scheme to perform thermal management. Again, this controller is implemented in pure hardware and illustrated with a realistic case study.
Next, the aforementioned PID controller is improved to operate at a finer granularity, at the thread node level. The concepts of performance vulnerability factor and deadline vulnerability factor are introduced to support the formulation of a discrete-time control problem, while the basis of this new approach is the system scenario methodology; this methodology, along with related terms and definitions, is studied in detail. Finally, we propose proactive DVFS actuations at the thread node level using dynamic scenarios to guarantee timeliness in a cost-efficient manner. Simulation results show significant energy gains compared to previous frequency guardband methods, while experimental results on the hardware platform substantiate the effectiveness of our scheme. | en_US
dc.subject | Failure Mechanisms | en_US
dc.title | Performance Variation in Digital Systems: Workload Dependent Modeling and Mitigation | en_US
dc.contributor.supervisor | Σούντρης Δημήτριος | en_US
dc.department | Division of Information Technology and Computers | en_US
Appears in Collections: Ph.D. Theses

Files in This Item:
File | Description | Size | Format
Noltsis_PHDthesis_NTUA_v2.pdf | | 4.64 MB | Adobe PDF

Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.