Tuesday, February 25, 2020

Tech Book Face Off: Programming Massively Parallel Processors Vs. Professional CUDA C Programming

After getting an introduction to GPU programming with CUDA by Example, I wanted to dig in deeper and get to know the real ins and outs of CUDA programming. That desire quickly led to the selection of books for this Tech Book Face Off. The first book is definitely geared to be a college textbook, and as I spent years learning from books like this, I felt comfortable taking a look at Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu. The second book is targeted more at the working professional, as the title suggests: Professional CUDA C Programming by John Cheng, Max Grossman, and Ty McKercher. I was surprised by both books, and not in the same way. Let's see how they do at teaching CUDA programming.


Programming Massively Parallel Processors

The polite way to critique this book is to say, it's somewhat verbose and repetitive, but if you can get past that, it has a lot to offer in the way of example CUDA programs that show how to optimize code for the GPU architecture. A slightly less polite way to say that would be that while this book does offer some good code examples, the writing leaves much to be desired, and much better books are out there that cover the same material. The honest assessment is that this book is just a mess. Half the book could be cut and the other half rewritten to better explain things with clearer, non-circular definitions. The only good thing about the book is the code examples, and many of those examples are also redundant, filling the pages of the book with lines of code that the reader has seen multiple times before. This book could have been a third the length and covered the same amount of material.

Even though that last bit was a pretty harsh review, we should still explore what's in the book, if only to see how the breadth of material compares to Professional CUDA C Programming. The first chapter is the normal introduction to the book's material, describing the architecture of a GPU and discussing how parallel programming with this architecture is so different than programming on a CPU. The verbosity of this chapter alone should have been a clue that this book would drag on and on, but I was willing to give it a chance. The next chapter introduces our first real CUDA program with a vector addition kernel. We're still getting started with CUDA C at this point, so I chalk up the authors' overly detailed explanations to taking extra care with novice readers. We end up walking through all of the parts of a working CUDA program, explaining everything in excruciating detail.

The third chapter covers how to work more efficiently with threads and loading data into GPU memory from the CPU with a more complex example of calculating image blur. We also get our first exposure to thread synchronization, something that must be thoroughly understood to program GPUs effectively. This chapter is also where I start to realize how nutty some of the explanations are getting. Here's just one example of them describing how arrays are laid out in memory:
A two-dimensional array can be linearized in at least two ways. One way is to place all elements of the same row into consecutive locations. The rows are then placed one after another into the memory space. This arrangement, called row-major layout, is depicted in Fig. 3.3. To improve readability, we will use Mj,i to denote the M element at the jth row and the ith column. Pj,i is equivalent to the C expression M[j][i] but is slightly more readable. Fig. 3.3 illustrates how a 4×4 matrix M is linearized into a 16-element one-dimensional array, with all elements of row 0 first, followed by the four elements of row 1, and so on. Therefore, the one-dimensional equivalent index for M in row j and column i is j*4 +i. The j*4 term skips all elements of the rows before row j. The i term then selects the right element within the section for row j. The one-dimensional index for M2,1 is 2*4 +1 =9, as shown in Fig. 3.3, where M9 is the one-dimensional equivalent to M2,1. This process shows the way C compilers linearize two-dimensional arrays.
Wow. I'm not sure how a reader that needs this level of detail for understanding how a matrix is arranged in memory is going to understand the memory hierarchy and synchronization issues of GPU programming. This explanation is just too much for a book like this. Readers should already have some knowledge of standard C programming, including multi-dimensional array memory layout, before attempting CUDA programming. I can't imagine learning both at the same time going very well. As for readers who already know how all of this stuff works, they could easily skip every other paragraph and skim the rest to make trudging through these explanations more tolerable.
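
For what it's worth, the index arithmetic in that quoted passage fits in a few lines of C. The following is only a sketch: the helper names row_major_index and matches_c_layout are made up for illustration and do not come from the book, and the 4×4 size simply mirrors the quoted example.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major linear index of element (row, col) in a matrix that has
   `width` columns. Hypothetical helper name, not taken from the book. */
size_t row_major_index(size_t row, size_t col, size_t width)
{
    assert(col < width);          /* the column must lie inside the row */
    return row * width + col;     /* skip `row` full rows, then step `col` */
}

/* Returns 1 if the C compiler places m[row][col] of a 4x4 float array at
   the byte offset predicted by the row-major formula, 0 otherwise. */
int matches_c_layout(size_t row, size_t col)
{
    static float m[4][4];
    assert(row < 4 && col < 4);
    size_t byte_offset =
        (size_t)((const char *)&m[row][col] - (const char *)m);
    return byte_offset == row_major_index(row, col, 4) * sizeof(float);
}
```

With these helpers, row_major_index(2, 1, 4) evaluates to 9, the same value the quoted passage derives for M2,1, and matches_c_layout(2, 1) confirms that a C compiler stores m[2][1] at that offset.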

The next chapter is on how to manage memory and arrange data access to optimize memory usage and bandwidth. We find that memory management is just as important as thread management, if not more so, for making optimal use of the GPU computing resources, and the book solidifies this understanding through an extended optimization example of a matrix multiplication kernel.

At this point we've learned the fundamentals of GPU programming, so the next chapter moves into more advanced topics in performance optimization with the memory hierarchy and the compute core architecture. Then, chapter six covers number format considerations between integers and single- and double-precision floating point representations. The authors' definition of representable numbers struck me as exceptionally cringe-worthy here:
The representable numbers of a representation format are the numbers that can be exactly represented in the format.
This is but one example of their impenetrable and useless definitions. More often than not, I found that if I hadn't already known what they were talking about, their discussions would provide no further illumination.
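
A concrete case conveys the idea better than that definition does. The sketch below assumes that float is an IEEE 754 single-precision type with a 24-bit significand, which is the usual case but is not guaranteed by the C standard. The function name is hypothetical, and the example is mine, not the book's.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the integer n survives a round trip through float unchanged,
   i.e. n is exactly representable in single precision; 0 otherwise.
   Assumes an IEEE 754 single-precision float (24-bit significand). */
int is_exact_in_float(int32_t n)
{
    assert(sizeof(float) == 4);   /* sanity check for the assumption above */
    volatile float f = (float)n;  /* volatile forces rounding to float */
    return (int64_t)f == (int64_t)n;
}
```

Under that assumption, is_exact_in_float(16777216) returns 1 because 16777216 is 2^24, while is_exact_in_float(16777217) returns 0: the odd integer just above 2^24 needs a 25th significand bit, so it gets rounded to a neighboring representable value.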

Now we get into the halfway decent part of the book, the extended example chapters on parallel patterns. Each of these chapters works through a different example kernel of a particular problem that comes up often in parallel programming, and they introduce additional features of GPU programming that can assist in solving these problems in a more optimal way. The contents of these chapters are as follows:
  • Chapter 7: Convolution
  • Chapter 8: Prefix Sum (Accumulator)
  • Chapter 9: Parallel Histogram Calculation
  • Chapter 10: Sparse Matrix Computation
  • Chapter 11: Merge Sort
  • Chapter 12: Graph Search
As long as you skim the descriptions of the problems and solutions, and focus on understanding the code yourself, these chapters are quite useful examples of how to write performant parallel programs with CUDA. However, the authors continue to suffer from what seems to be a misinterpretation of the phrase "a picture is worth a thousand words." For every diagram they use, they also include a thousand words or more of explanation, describing the diagrams ad nauseam.

The next chapter covers how to kick off kernels from within other kernels in order to enable dynamic parallelism. Up until this point, all kernels have been launched from the host (CPU) code, but it is possible to have kernels launch other kernels to take advantage of additional parallelism while the code is executing on the GPU, an effective feature for some algorithms. Then, the next three chapters are fairly useful application case studies. Like the parallel pattern example chapters, these chapters use CUDA code to show how to take advantage of more advanced features of the GPU, and how to put together everything we've learned so far to optimize some example parallel programs. The applications described are for non-Cartesian MRI, molecular visualization and analysis, and machine learning neural networks, so nice, interesting topics for GPU programming.

The last five chapters were either more drudgery or topics I wasn't interested in, so I skipped them and called it quits for this long and tedious book. For completeness, those chapters are on how to think when parallel programming (so a pep talk on what to think about from authors who couldn't clearly describe much else in the book), multi-GPU programming, OpenACC (another GPU programming framework, like CUDA), still more performance considerations, and a summary chapter.

I couldn't bring myself to keep reading chapters that wouldn't amount to anything, so I put down the book after finishing the last chapter on application case studies. I found that chapters seven through sixteen contained most of the useful information in the book, but the introduction to CUDA programming was too verbose and confusing. There are much better books out there for learning that part of CUDA programming. Case in point: CUDA by Example or the next book in this review.

Professional CUDA C Programming

In contrast to the last book, this one surprised me with how readable it was. The authors did an excellent job of presenting concepts in CUDA programming in a clear, direct, and succinct manner. They also did this without resorting to humor, which can sometimes work if the author is an excellent writer, but it often feels forced and lame when done poorly. It's better to stick to clear descriptions and tight writing, as these authors did quite well. I was actually disappointed that I didn't read this book first, instead saving it until last, because it did the best job of explaining all of the CUDA programming concepts while covering essentially the same material as Programming Massively Parallel Processors and certainly more than CUDA by Example.

The first chapter is the obligatory introduction to CUDA with the requisite Hello, World program showing how to run code on the GPU. Right away, we can see how well-written the descriptions are with this discussion of how parallel programming is different than sequential programming:

When implementing a sequential algorithm, you may not need to understand the details of the computer architecture to write a correct program. However, when implementing algorithms for multicore machines, it is much more important for programmers to be aware of the characteristics of the underlying computer architecture. Writing both correct and efficient parallel programs requires a fundamental knowledge of multicore architectures.
We need to be prepared to think differently about problems when parallel programming, and we're going to have to learn the architecture of the underlying hardware to make full use of it. That leads us right into chapter 2, where we learn about the CUDA programming model and how to organize threads on the device, but it doesn't end there. Throughout the book we're learning more and more about the NVIDIA GPU architecture (specifically the older Fermi and Kepler architectures, since those were available at the time of the book's writing) in order to take full advantage of its compute power. I like how the authors grounded their discussions in specific GPU architectures and showed how the architecture was evolving from one generation to the next. I'm sure the newer Pascal, Volta, and Turing architectures provide more advanced and flexible features, but the book builds a great foundation. Chapter 2 also contains the clearest definition of a kernel that I've seen yet:
A kernel function is the code to be executed on the device side. In a kernel function, you define the computation for a single thread, and the data access for that thread. When the kernel is called, many different CUDA threads perform the same computation in parallel.
This explanation is the essence of the paradigm shift from sequential to parallel programming, and it's important to understand the effect it has on the code that you write and how it runs on the hardware. In addition to the excellent writing, each chapter has some nice exercises at the end. That's not normally something you find in programming books like this. Exercises seem to be left to textbooks, like Programming Massively Parallel Processors, which had them as well, but in Professional CUDA C Programming they're better conceived and more relevant.
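
To make the quoted kernel definition concrete, here is a rough CPU-side sketch in plain C of the "one thread, one element" idea, using vector addition as the computation. It is an emulation, not CUDA code and not code from either book: the function names are hypothetical, and the three index arguments stand in for the CUDA built-ins blockIdx.x, blockDim.x, and threadIdx.x.

```c
#include <assert.h>
#include <stddef.h>

/* Body of a hypothetical vector-addition kernel, written from the point of
   view of ONE thread. In CUDA C the three index arguments would come from
   the built-ins blockIdx.x, blockDim.x, and threadIdx.x. */
void vec_add_thread(size_t block_idx, size_t block_dim, size_t thread_idx,
                    const float *a, const float *b, float *c, size_t n)
{
    size_t i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n) {             /* threads past the end of the data do nothing */
        c[i] = a[i] + b[i];
    }
}

/* CPU stand-in for a kernel launch: runs the per-thread body once for every
   (block, thread) pair, one after another. On a GPU these invocations would
   execute in parallel. */
void launch_vec_add(size_t block_dim, const float *a, const float *b,
                    float *c, size_t n)
{
    assert(block_dim > 0);
    size_t grid_dim = (n + block_dim - 1) / block_dim;  /* round up */
    for (size_t blk = 0; blk < grid_dim; ++blk) {
        for (size_t t = 0; t < block_dim; ++t) {
            vec_add_thread(blk, block_dim, t, a, b, c, n);
        }
    }
}
```

The point of the sketch is the shape of vec_add_thread: it contains no loop over the data, only the work for a single index plus a bounds check, because the launch supplies the parallelism. Calling launch_vec_add with a block size of 4 on five-element arrays runs eight thread bodies, and the last three skip the addition.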

The next chapter covers the CUDA execution model, or how the code runs on the real hardware. Here is where we learn how to optimize CUDA programs to take advantage of all of those independent compute cores on the GPU, and this chapter even gets into dynamic parallelism earlier in the book rather than waiting and treating it as a special topic like the last book did.

Chapter 4 covers global memory and chapter 5 looks at shared and constant memory. Understanding the trade-offs of each of these memories is important to getting the maximum performance out of the GPU because most often these programs are memory-bound, not compute-bound. Like everything else, the authors do an excellent job explaining the memory hierarchy and how those trade-offs affect CUDA programs. The examples used throughout the book are simple so that the reader doesn't get bogged down trying to understand unrelated algorithm details. The more complex examples may be thought-provoking, but simple examples do a good job of showcasing the specifics of the topic at hand.

Chapter 6 addresses streams and events, which are used to overlap computation with data transfer. Using streams can partially, or in some cases completely, hide the time it takes to get the data into the GPU memory. Chapter 7 explains more optimization techniques by using CUDA instruction-level primitives to directly control how computations are performed on the GPU. These instructions trade some accuracy for speed, and they should be used only if the accuracy is not critical to the application. The authors do a good job of explaining all of the implications here.

The last three chapters weren't as interesting to me, not because I was tired of the book this time, but because they were about the same topics that I skipped in the other CUDA books: OpenACC, multi-GPU programming, and the CUDA development process. The rest of the book was excellent, and far better than the other two CUDA books I read. The writing is clear with plenty of diagrams for better understanding of each topic, and the book organization is done well. If you're interested in GPU programming and want to read one book, this one is it.


Between these two CUDA books, the choice is obvious. Programming Massively Parallel Processors was a bloated dud. It may be worth it just for the large set of example programs it contains, but there are other options coming down the pipeline for that kind of cookbook that may be better. Professional CUDA C Programming was better in every way, and really the book to get for learning CUDA programming. The authors did a great job of explaining complex topics in GPU architecture with concise, understandable writing, relevant diagrams, and appropriate exercises for practice. It's exactly the kind of book I want for learning a new programming language, or in this case, programming paradigm. If you're at all interested in CUDA programming, it's worth checking out.