Please use this identifier to cite or link to this item: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/9098
Title: Communication Performance Prediction On Large-scale Systems
Authors: Nikela Papadopoulou
Keywords: performance modeling
predictive modeling
communication time
hpc applications
mpi
supercomputers
clusters
statistical learning
machine learning
Issue Date: 16-Oct-2017
Abstract: On the path to exascale, supercomputers will grow to host hundreds of millions of cores and various complex heterogeneous processing elements, yet even today, users fail to leverage the existing compute power of large-scale systems, as large classes of typical HPC applications are bound by non-scalable communication phases. The ability to predict the communication time of parallel applications can assist users, compilers, runtime systems and schedulers with decision-making for optimal resource utilization, performance optimizations, power saving and resilience.This thesis presents a methodology for predictive communication modeling of HPC applications. Communication time depends on a complex set of parameters, relevant to the application, the system architecture, the runtime configuration and runtime conditions. To handle this complexity, we follow an empirical modeling approach. We define features that can be extracted from the application, the process mapping and the allocation shape ahead of execution, deploy a single benchmark to sweep over the parameter space and develop predictive models for communication time on three large-scale computing systems, Vilje, Piz Daint and ARIS, using different subsets of our features, statistical and machine-learning methods and training sets. We compare the predictive performance of our models on various communication patterns and applications, for multiple problem sizes, executions and runtime configurations, ranging from a few dozen to a few thousand cores. Our methodology is successful across all tested communication patterns on all systems and exhibits high prediction accuracy and goodness-of-fit. Our models are applicable just-in-time ahead of the execution of an HPC application, and, as we demonstrate in this thesis, their high accuracy make them suitable for communication-aware decision making, towards the optimization of resource utilization on large-scale systems.
URI: http://artemis-new.cslab.ece.ntua.gr:8080/jspui/handle/123456789/9098
Appears in Collections:Διδακτορικές Διατριβές - Ph.D. Theses

Files in This Item:
File SizeFormat 
PD2017-0030.pdf11.32 MBAdobe PDFView/Open


Items in Artemis are protected by copyright, with all rights reserved, unless otherwise indicated.