If you’re here, you’ve probably identified a serious performance bottleneck in your code, and need a way around it. That’s where my team was about a year ago. Our application is a large raster calculator that either renders the raster into a picture, or summarizes the dataset represented by the raster depending on what request the user has made. Additionally, we can’t pre-compute the results as there is a rather complex JSON-based Domain Specific Language (DSL) that lets the user describe, at a granular level, what dataset they want us to render or summarize. All of this has to be done in real-time during an API request. Ultimately we decided to parallelize the calculations with CUDA because during testing it performed significantly better than the other options with our test datasets.
Ultimately, shelling out to CUDA was significantly easier than I expected. It came down to 4 steps.
- Write CUDA kernel
- Wrap CUDA kernel in a C function
- Package into shared library
- Call C function from cgo
For those of you that want to play along at home, a working minimal repo with all of the code samples I’m using in this post can be found here
Terminology
Before we begin, I’d like to take a moment and clarify the terminology I’m going to be using for this post.
host: refers to the computer the GPU is attached to
device: the GPU itself
kernel: function executed in parallel on the GPU
CUDA Basics
Footguns
Beyond the fact that you’re working in C++ (or C or Fortran), there’s really only two jarring things about working with CUDA (in the easy case anyway).
- You’re now working with two address spaces
- Some functions are only callable from either the host side or device side. Your functions get colored unless you specifically annotate them to tell the compiler to make them available on both sides. Variables share a similar fate.
Ok, so what does any of that mean?
Well, the first one means that both the host and device have their own address space. You can have pointers to either address space, but dereferencing them on the wrong side will, most likely, cause your program to abort.
The second means that you have to tell the nvidia compiler (nvcc) where your function is callable from and where variables will be referenced from. There are four options.
- Functions with no attributes, or with the
__host__
attribute are callable from the host only. This includes basically all of the C++ standard library, along with STL data structure methods. - Functions with the
__global__
attribute (kernels) exist on the device, and are callable from either the host or the device. These functions operate in parallel. - Functions marked with the
__device__
attribute exist on the device and are only callable from the device. - Functions with both the
__host__
and__device__
attributes exist on both sides and are callable from either side. Unlike__global__
functions, these are more like the traditional serial functions we know and love.
Writing the kernel
Ok, with all of that out of the way, it’s time to write some code. We need one of those fancy __global__
functions to run in parallel on the device. Here’s an example of one that adds one number in an array to another number.
__global__ void add_kernel(double *a, double *b, size_t len) {
size_t i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < len) {
a[i] = a[i] + b[i];
}
}
Wait, what are those magical blockIdx
, blockDim
, and threadIdx
variables? Those are globals that the nvidia compiler makes available for you to figure out what unit of work you need to do. See, your kernel function is run in parallel in groups of threads known as a “block”. A block can contain up to 1024 threads. blockIdx
is which block you’re running on, blockDim
is the size of the block, and threadIdx
is which thread this is. Using those three numbers, we can figure out which element of the array we’re supposed to be operating on. We’re accessing the single dimension we have located at the x
property, but you can have more dimensions (populated under y
and z
properties), but their use is out of scope for this article.
An additional important note, we need to compare our index to the length of the arrays we’re working on. A whole number of blocks will run, so if your array isn’t an exact multiple of the block size, you could end up overwriting memory you didn’t intend to.
Calling the kernel
Now that we have the kernel, we need to call it. We can’t quite call it like a normal function. You remember those magical blockIdx
and blockDim
variables? When we call the kernel, we need to tell the compiler how big of blocks to use, as well as how many blocks we want to run. To do this, CUDA uses syntax like this:
add_kernel<<<num_blocks, block_size>>>(dev_a, dev_b, len);
I’ve uses integers for num_block
and block_size
in my example code since we’re just working with arrays, but you can actually supply a struct of 3 integers called a dim3
. Supplying this will populate the y
and z
dimensions of the block we talked about before.
Alright, so we’ve covered writing and calling the kernels. We’ve got to be just about done, right? Well, not quite. We still have to get the data we want to operate on onto the device itself. The general pattern when calling a kernel is:
- allocate memory on the device
- copy memory onto the device
- call kernel
- copy result back to the host from the device
- free memory on device
You can see a full example of calling a kernel here.
Creating the Library
Creating the library is a relatively simple process. All you need to do is wrap your function in an extern "C"
block so that go can call it. Technically, you could just put all of your code in the wrapper function itself, but this is cleaner to me, especially when you have many functions you’re trying to make available on the go side.
The only real choice here is between a shared or a static library (if you’re on linux, these will produce a .so
or a .a
file on respectively). A static library will be compiled into your go binary, and will be easier to deploy, but there may be legal issues with statically compiling in certain code, especially if you’re working on a corporate application. If you choose the shared library, the .so
file must exist in your LD_LIBRARY_PATH
at runtime.
Examples for setting both of these up using cmake exist here. An important note is that, if you choose the static library, you need to tell the compiler to resolve device symbols (set_target_properties(examplestatic PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
), otherwise you will get errors at link time.
Calling C From Golang
Ok, that’s the hard part done. Seriously, the rest is just normal cgo calls as if we weren’t using CUDA at all.
To call c code from cuda, we need a magic comment at the top of our go file (but after the package
declaration) that’s the rough equivalent of a header file in C. It’s going to look something like:
package static
/*
#cgo LDFLAGS: -lexamplestatic -L${SRCDIR}/../example/build -lcuda -lcudart -lm
#include <stdlib.h>
int add_wrapper(double *a, double *b, size_t len);
*/
import "C"
The first -l
option will be the name of your library without the preceding lib
and without the trailing .so
or .a
. The -L
option will be a path to the directory the .so
or .a
file is in.
If you have errors building, make sure there isn’t a blank line between the comment and the import "C"
. This is an error.
All we have to do now is call the C function from go, I like to make a little wrapper function to return errors instead of C-style error codes. That would look a little like this:
func cudaAdd(a, b []float64) error {
if res := C.add_wrapper((*C.double)(&a[0]), (*C.double)(&b[0]), C.size_t(len(a))); res != 0 {
return fmt.Errorf("got bad error code from C.add %d", int(res))
}
return nil
}
All we’re doing is passing the address of the first element of each slice. You may have some concern about casting a pointer to a float64
to a pointer to a C double
, as there aren’t a ton of guarantees about floating point format in C or C++, but nvidia adheres to the IEEE-754 floating point standard, which is the same standard go uses for its floating point numbers.
Bonus section: Testing
We’ve covered a lot, but I want to sneak in just one more topic. Writing raw CUDA (or C++ in general) can lead to tremendous performance gains, but working with CUDA can be unweildy and the exact semantics aren’t always obvious at a glance. Writing unit tests is imperative to have any confidence in your library. I’ve added an example of testing this code into my repo. Once you’ve wrapped the CUDA kernel in a C++ function, you can use any C/C++ testing library. I’ve opted for GoogleTest in this example project.
In order to provide a prod-like test environment, it is vital… nay, essential… that the tests suite in your CI/CD process use actual GPUs. Accept no substitute for true nvidia hardware. Without such hardware in your pipeline, subsequent commits will introduce bugs.
I think we’re finally done here. I hope I saved at least one of you several hours of debugging some error message you’re getting. See you next time.