Doctor of Engineering (DEng)


Electrical and Computer Engineering

Document Type



In this dissertation, we explore multiple designs for a Distributed Transactional Memory framework for GPU clusters. Using Transactional Memory, we relieve the programmer of many concerns, including 1) how to move data between many discrete memory spaces; 2) how to ensure data correctness when shared objects may be accessed by multiple devices; 3) how to prevent catastrophic warp divergence caused by atomic operations; 4) how to prevent catastrophic warp divergence caused by long-latency off-device communications; and 5) how to ensure Atomicity, Consistency, Isolation, and Durability for programs with irregular memory accesses. Each of these concerns can be daunting on its own to programmers who lack expert knowledge of the GPU's architectural quirks, including its use of SIMD, its weak memory model, and its lack of direct access to a NIC. The goal of this work is to significantly reduce the programming effort required to realize performant GPU applications despite workload characteristics that are not favorable to the underlying architecture. Using our automatic concurrency control system, CUDA-DTM, programmers can convert in an afternoon some traditional applications into GPU applications that would otherwise have taken months to develop and debug.

We analyze the performance and workload flexibility of CUDA-DTM, the first-ever Distributed Transactional Memory framework written in CUDA for large-scale GPU clusters. Transactional Memory has become an attractive concurrency control scheme for GPU applications with irregular memory access patterns due to its ability to avoid serializing threads while still maintaining programmability and preventing deadlocks. CUDA-DTM extends the existing GPU Software Transactional Memory model to allow individual threads across many GPUs to initiate accesses to a Partitioned Global Address Space (PGAS), using a proposed scheme for GPU-to-GPU communication built on CUDA-Aware MPI.
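
As a rough illustration of what a partitioned global address space implies, a global address can be thought of as resolving to an owning device plus an offset within that device's partition. The sketch below is purely hypothetical: the block partitioning, the `PARTITION_SIZE` constant, and the `owner_of` and `is_local` helpers are assumptions made for illustration and are not taken from CUDA-DTM.

```python
from typing import Tuple

# Hypothetical block-partitioned address space; not CUDA-DTM's actual layout.
PARTITION_SIZE = 1024  # assumed number of elements owned by each GPU


def owner_of(global_addr: int) -> Tuple[int, int]:
    """Map a global address to (owning GPU rank, offset in that partition)."""
    if global_addr < 0:
        raise ValueError("global address must be non-negative")
    rank, offset = divmod(global_addr, PARTITION_SIZE)
    return rank, offset


def is_local(global_addr: int, my_rank: int) -> bool:
    """True when the access needs no off-device communication."""
    rank, _ = owner_of(global_addr)
    return rank == my_rank
```

Under this assumed layout, an access is cheap when `is_local` holds and otherwise requires communication with the owning GPU, which is why the locality of accesses matters for performance.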

CUDA-DTM allows programmers to treat individual GPU threads as though they were as flexible and independent as CPU threads, using a run-time system that automatically resolves conflicts, prevents warp divergence, facilitates data movement between host and device memory spaces, and preserves data integrity. While CUDA-DTM will ensure applications run correctly and to completion without deadlocks or livelocks, there is no free lunch: programmers must be aware of the underlying architecture, as well as the locality of PGAS memory accesses, to achieve the "100x" speedup that is so enticingly advertised by existing GPU literature. In fact, achieving the best-performing implementation for a given workload would likely require replacing each of the CUDA-DTM components with an application-specific design. However, using CUDA-DTM, programmers can rapidly test the suitability of GPU clusters for a given workload and study the performance of a working application to decide whether it is worth investing the significant programming effort required to write a distributed CUDA application.

To reduce CUDA-DTM's sensitivity to memory access partition locality, we relax CUDA-DTM's consistency model using BifurKTM (BKTM), a read-optimized Transactional Memory model, and a distributed shared memory system that allows GPU threads to access read-only copies of shared data elements with bounded staleness. BKTM uses K-Opaque Approximate Consistency to reduce contention, allowing shared objects to be read and written by transactions simultaneously. BKTM uses a combination of Control Flow and Data Flow DTM: transactions that modify shared memory use Control Flow to avoid cache maintenance overheads and allow communications to be removed from the critical path; read-only transactions instead use the Data Flow model, in which shared objects are loaded in advance into a local cache, allowing read-only transactions to hide long latencies from remote communications.
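
The split between the two transaction paths, together with the bounded-staleness idea, can be sketched as follows. This is a minimal illustration under stated assumptions, not BKTM's implementation: the version counter, the bound `K`, and the `CachedCopy`, `can_read_locally`, and `route` names are all hypothetical, and how a reader learns the latest version is left outside the sketch.

```python
from dataclasses import dataclass

K = 2  # assumed staleness bound, counted in versions (hypothetical value)


@dataclass
class CachedCopy:
    value: int
    version: int  # owner's version of the object when this copy was fetched


def can_read_locally(copy: CachedCopy, latest_version: int, k: int = K) -> bool:
    """Accept the cached copy if it lags the latest version by at most k updates."""
    return 0 <= latest_version - copy.version <= k


def route(is_read_only: bool, copy: CachedCopy, latest_version: int) -> str:
    """Pick a path: updates go to the owner, bounded-stale reads stay local."""
    if not is_read_only:
        return "forward-to-owner"  # Control Flow path for updating transactions
    if can_read_locally(copy, latest_version):
        return "local-cache"       # Data Flow path served from the local copy
    return "refresh-cache"         # copy is too stale; fetch a newer one first
```

For example, with `K = 2`, a copy fetched at version 5 would still be readable when the latest version is 6 or 7, but would need refreshing at version 8.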

Committee Chair

Dr. Lu Peng

Available for download on Wednesday, October 28, 2026