A Performance Model and Optimization Strategies for Automatic GPU Code Generation of PDE Systems Described by a Domain-Specific Language

Identifier

etd-08082016-164729

Degree

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

Document Type

Dissertation

Abstract

Stencil computations are a class of algorithms that operate on multi-dimensional arrays and update array elements using their nearest neighbors. This type of computation forms the basis for computer simulations across almost every field of science, such as computational fluid dynamics, weather simulation, and earthquake prediction. Its mostly regular data access patterns potentially enable it to take advantage of the high computational throughput and data bandwidth of GPUs. However, manual GPU programming is time-consuming and error-prone, and it requires in-depth knowledge of GPU architecture and programming. This is especially true when the target is high performance. To overcome the difficulties of manual programming, a number of stencil frameworks have been developed to automatically generate GPU code from user-written stencil code, usually expressed in a Domain Specific Language. The previous stencil frameworks prove that it is feasible to automatically generate GPU code, but they also expose a set of new challenges in generating highly optimized GPU code for real stencil applications that may consist of large calculations. This dissertation is based on the Chemora stencil code auto-generation framework and aims to deal with the large calculations found in real stencil applications. Large stencil calculations usually consist of dozens of grid functions with a variety of stencil patterns, resulting in an extremely large number of possible ways to generate code. This dissertation introduces two algorithms to be used in generating highly tuned code. First, we propose to map a calculation into one or more kernels. An algorithm is proposed to optimize the kernel mapping by improving thread-level parallelism and minimizing off-chip memory accesses. Next, we propose an efficiency-based buffering algorithm, which operates by scoring a change in buffering strategy for a grid function (GF) using estimated performance and resource usage. With the algorithm, a near-optimal solution can be found in (b-1)N(N+1)/2 steps, instead of b^N steps, for a calculation with N GFs and b buffering strategies. The current Chemora framework supports six (i.e. b = 6) buffering strategies. Finally, an analytic performance model is proposed to predict the execution time of a given way of generating code. In addition, we wrote a set of micro-benchmarks to explore and measure some performance-critical GPU microarchitecture features and parameters for better performance modeling.
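To give a sense of scale for the two step counts quoted above, the following minimal sketch evaluates (b-1)N(N+1)/2 and b^N for a few values of N, taking b = 6 for the six buffering strategies. The function names and the sample values of N are illustrative assumptions and are not taken from the dissertation; the code only evaluates the two formulas and does not implement the buffering algorithm itself.

```python
def efficiency_based_steps(n_gfs: int, n_strategies: int) -> int:
    """Step count quoted in the abstract: (b - 1) * N * (N + 1) / 2."""
    # N * (N + 1) is always even, so integer division is exact.
    return (n_strategies - 1) * n_gfs * (n_gfs + 1) // 2


def exhaustive_steps(n_gfs: int, n_strategies: int) -> int:
    """Size of the full search space: b ** N."""
    return n_strategies ** n_gfs


if __name__ == "__main__":
    b = 6  # six buffering strategies, as stated in the abstract
    for n in (5, 10, 20, 40):  # hypothetical numbers of grid functions
        print(f"N={n:>2}: efficiency-based={efficiency_based_steps(n, b):>5}, "
              f"exhaustive={exhaustive_steps(n, b):.3e}")
```

For N = 10 grid functions, the first formula gives 275 steps, while the exhaustive count is 6^10 = 60,466,176.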

Date

2016

Document Availability at the Time of Submission

Release the entire work immediately for access worldwide.

Committee Chair

Koppelman, David

DOI

10.31390/gradschool_dissertations.1917

This document is currently not available here.
