CUDA Programming Model#

Overview

Questions:

  • What is thread hierarchy in CUDA?

  • How can threads be organized within blocks and grids?

  • How can the data be transferred between host and device memory?

  • How can we measure the wall-time of an operation in a program?

Objectives:

  • Learning the basics of device memory management

  • Understanding the concept of thread hierarchy in the CUDA programming model

  • Familiarity with the logistics of a typical CUDA program

Our Hello World from the previous lesson lacks two important aspects of a CUDA program that are crucial for CUDA programmers: thread hierarchy and memory management. In the next few examples, we will illustrate how threads can be organized within blocks of threads and grids of blocks in order to control the amount of parallelization in a CUDA program. Finally, we will employ CUDA’s main memory management functionalities to perform an array summation operation on a GPU device.

1. Parallelizing Loops: A Prelude to Thread Hierarchy#

Before getting into the formalities and complexities of the thread hierarchy, let us use a simple example to show its potential impact and usefulness in CUDA programming: parallelizing loops. The following code implements and calls the cpuPrinter() function to print the loop indices on the host.

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    return(EXIT_SUCCESS);
}

Copy the preceding code block into a text file and save it as cpu_printer.cu. Then, open a terminal and compile and run the program using the following commands

$ nvcc cpu_printer.cu -o cpu_printer
$ ./cpu_printer

Running these commands generates the following output

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

In the code block above, we fixed the number of loop iterations (N = 8) in a preprocessor directive and assumed that it is equal to the number of available threads. In the previous lesson, we demonstrated how to refactor a host function into a device kernel. The difference here is that, according to our assumption, our kernel should run 8 times and print the same set of indices as the host function call, cpuPrinter().

Our general strategy for refactoring the loops in the host function calls into device kernels consists of two steps: (i) write the kernel for an individual thread (which in this case is responsible for performing one iteration), and (ii) launch the kernel with a total number of threads equal to the number of loop iterations. Let us call our refactored kernel gpuPrinter<<<>>>() and store the modified code in the gpu_printer_sb.cu script. The refactored code should look like the following

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);

    printf("\n");
}

__global__ void gpuPrinter(void) {
    int idx = threadIdx.x;
    printf("GPU Prints Idx: %d\n", idx);        /* Write the kernel for individual threads */
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    gpuPrinter<<<1,N>>>();      /*  Launch the kernel for many threads */
                                /*  CUDA will raise an error if N > 1024 */
    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}

Compiling and running this code

$ nvcc gpu_printer_sb.cu -o gpu_printer_sb
$ ./gpu_printer_sb

gives us the desired result

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3
GPU Prints Idx: 4
GPU Prints Idx: 5
GPU Prints Idx: 6
GPU Prints Idx: 7

Pay attention to the two arguments in the kernel launch’s execution configuration: the number of blocks in the grid and the number of threads in each block, respectively. Since the gpuPrinter<<<1,N>>>() kernel is supposed to print the same set of loop indices as that printed by cpuPrinter(N), the simplest approach in this case is to use one block containing N threads.

This strategy works well until the number of loop iterations exceeds a threshold value. This threshold is set by the hardware limit on the maximum number of threads per block, which is 1024. Beyond that point, one must distribute the threads over multiple blocks in the execution configuration in order to match the number of loop iterations with the number of available threads.
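If you want to verify this limit on your own hardware, it can be queried at runtime; the following is a minimal sketch (not one of the lesson’s example files) that reads the maxThreadsPerBlock field of the cudaDeviceProp structure returned by cudaGetDeviceProperties():

#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);        /* Properties of device 0 */

    printf("Maximum threads per block: %d\n", deviceProp.maxThreadsPerBlock);
    printf("Maximum block dimensions:  (%d, %d, %d)\n",
           deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);

    return(EXIT_SUCCESS);
}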

Note

All thread blocks in a grid have the same number of threads.

Let us continue with our example and, for the time being, keep the same number of loop iterations but distribute the threads among multiple blocks. Copy and paste the following code block into an empty text file and save it as gpu_printer_mb_local.cu.

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);

    printf("\n");
}

__global__ void gpuPrinter(void) {
    int idx = threadIdx.x;                                  /* Note that the local index (threadIdx.x) is used */
    printf("GPU Prints Idx: %d\n", idx);                    /* Write the kernel for individual threads */
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    gpuPrinter<<<2,N/2>>>();      /*  Launch the kernel with two blocks of threads */
                                  
    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}

After compilation and execution, we get

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3
GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3

The output illustrates that the local thread indices (threadIdx.x) on the GPU restart from zero going from one thread block to the next. In the schematic shown below, the thread and block indexing patterns are demonstrated for a case where 15 threads are equally distributed among three blocks.

[Figure (_images/1D-block.png): thread and block indexing pattern for 15 threads equally distributed among three blocks]

Note

Each block has a unique zero-based index in a grid and each thread has a unique zero-based index within each block.

In order to reproduce the same set of indices printed by the CPU function, we need to translate the local thread indices into their global counterparts. We use the following formula for this conversion

$$ \tag{1}\label{EQ:LOCALTOGLOBAL} (\text{globalThreadIdx})_q = \text{threadIdx}.q + \text{blockIdx}.q \times \text{blockDim}.q \qquad \quad \text{where} \qquad q = x, y, z $$
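For instance, with two blocks of four threads each (blockDim.x = 4), the thread with threadIdx.x = 2 in the second block (blockIdx.x = 1) is assigned the global index

$$ \text{globalThreadIdx} = 2 + 1 \times 4 = 6 $$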

We now employ Eq. \ref{EQ:LOCALTOGLOBAL} in our code to convert the local thread indices to their global variant. After copying and pasting the following code block in a new text file, save it as gpu_printer_mb_global.cu.

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);

    printf("\n");
}

__global__ void gpuPrinter(void) {
    int idx = threadIdx.x +  blockIdx.x * blockDim.x;       /* The local thread index (threadIdx.x) on the right-hand
                                                               side is shifted by an offset value
                                                               (blockIdx.x * blockDim.x) to translate it to
                                                               a global index */
    printf("GPU Prints Idx: %d\n", idx);                    /* Write the kernel for individual threads */
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    gpuPrinter<<<2,N/2>>>();      /*  Organizing eight threads into two blocks of four threads each */
                                  
    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}

Compiling and running the aforementioned script yields the expected result provided below

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

GPU Prints Idx: 4
GPU Prints Idx: 5
GPU Prints Idx: 6
GPU Prints Idx: 7
GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3

The output indicates that, although the same set of indices has been printed by both the host function and the device kernel, the order of the indices is not preserved for the latter. The reason is that the sequential task of performing the same operation within a loop is now distributed among multiple threads running (almost) at the same time, and there is no guarantee on their relative order when they are dispatched by the streaming multiprocessors.

In the aforementioned examples, we assumed the number of loop iterations was equal to the total number of threads. What happens if we dispatch more threads in the execution configuration than there are loop iterations? How about the opposite case, where the number of available threads is smaller than the number of elements in the target data structure? We use two techniques presented by Mark Harris in his blog post to deal with these two cases, respectively: monolithic kernels and grid-stride loops.

1.1. Monolithic Kernels#

The first scenario is the case where the number of threads organized through the kernel’s execution configuration and dispatched at kernel launch is larger than the number of elements in the target data structure. In this case, the extra threads will step beyond the memory addresses corresponding to the allocated data structure and attempt to access unallocated memory locations. This may lead to incorrect results and undefined behavior. In order to avoid accessing unallocated memory addresses, we add a conditional statement, if (idx < nlim), to our kernel implementation, where idx is the thread’s global index. Making this change in our last code and storing it in a script named gpu_printer_monolithic.cu, we should have

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);

    printf("\n");
}

__global__ void gpuPrinter(int nlim) {
    int idx = threadIdx.x +  blockIdx.x * blockDim.x;
    
    if(idx < nlim)                 /* Make sure the global index does not go beyond the limit */
        printf("GPU Prints Idx: %d\n", idx);
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    gpuPrinter<<<4,N>>>(N);      /*  Launch the kernel with 32 threads (4 blocks with 8 threads) */
                                /*  The number of dispatched threads (32) is greater than N */
    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}

After compiling and running the code above, we get

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3
GPU Prints Idx: 4
GPU Prints Idx: 5
GPU Prints Idx: 6
GPU Prints Idx: 7

It might be interesting to mention that the name monolithic kernel reflects our initial assumption that a single large grid, with at least as many threads as there are elements in the data structure, covers the whole data set in one pass, with the conditional statement guarding the extra threads.
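In practice, the grid size of a monolithic kernel is usually computed from the data size rather than hard-coded, so that the guard covers any N. The following is a minimal sketch of this pattern for a hypothetical, larger data size and a block size of 256 threads (both values are illustrative, not taken from the lesson’s files); the rounding-up formula it uses reappears later as Eq. (2).

#include <stdlib.h>
#include <stdio.h>

#define N 100000                          /* Hypothetical data size */

__global__ void gpuPrinter(int nlim) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < nlim)                       /* Guard against the excess threads in the last block */
        printf("GPU Prints Idx: %d\n", idx);
}

int main(int argc, char **argv) {
    int threadsPerBlock = 256;                                        /* Chosen block size */
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  /* Rounds up to 391 blocks */

    gpuPrinter<<<blocksPerGrid, threadsPerBlock>>>(N);
    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}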

1.2. Grid-Stride Loops#

In a situation where the number of available threads is less than what is necessary for a one-to-one mapping between threads and data elements, each thread within a kernel might have to work on multiple data elements. To handle this case with a grid-stride loop, we wrap the kernel operation in a for() loop in which the thread index is incremented by the total number of dispatched threads (the grid stride) for every round of work. For example, if our data structure has 50 elements but we have dispatched 25 threads in our kernel launch, each thread has to operate on two different data elements. In the first iteration, thread 0 works on element 0, thread 1 on element 1, …, and thread 24 on element 24. The for loop then increments each thread’s index by the grid stride (the total number of dispatched threads, here 25). Thus, in the second iteration, thread 0 operates on element 25 (0 + 25), thread 1 on element 26 (1 + 25), …, and thread 24 on element 49 (24 + 25). Let us get back to the previous example and see how it changes in this scenario. Copy the following code block and save it in a script named gpu_printer_grid_stride_loop.cu.

#include <stdlib.h>
#include <stdio.h>

#define N 8

void cpuPrinter(int nlim) {
    for (int idx = 0; idx < nlim; idx++)
        printf("CPU Prints Idx: %d\n", idx);

    printf("\n");
}

__global__ void gpuPrinter(int nlim) {
    int idx = threadIdx.x +  blockIdx.x * blockDim.x;
    int stride = 2;                                 /* Manually initialized. Not good! */
    
    for(int i = idx; i < nlim; i+=stride)           /* grid-stride loop */
        printf("GPU Prints Idx: %d\n", i);
}

int main(int argc, char **argv) {
    cpuPrinter(N);

    gpuPrinter<<<1, 2>>>(N);      /*  Launch the kernel with 2 threads, which is less than 8 */

    cudaDeviceSynchronize();

    return(EXIT_SUCCESS);
}

After compilation and execution, once again we get the desired output

CPU Prints Idx: 0
CPU Prints Idx: 1
CPU Prints Idx: 2
CPU Prints Idx: 3
CPU Prints Idx: 4
CPU Prints Idx: 5
CPU Prints Idx: 6
CPU Prints Idx: 7

GPU Prints Idx: 0
GPU Prints Idx: 1
GPU Prints Idx: 2
GPU Prints Idx: 3
GPU Prints Idx: 4
GPU Prints Idx: 5
GPU Prints Idx: 6
GPU Prints Idx: 7

Note that in this example, we have adopted very small sizes for our data structure (8) and threads (2) for simplicity and clarity of demonstration.

In this example, each thread performs the print operation four times. Note, however, that we initialized the stride value inside the kernel by hard-coding it. This is not ideal, since we may not know all the information needed to compute the stride before runtime. In the next section, we provide a formal introduction to the CUDA thread hierarchy and demonstrate how CUDA provides kernels with all the necessary information about the thread organization at runtime; as a preview, a sketch of the same kernel with a runtime-computed stride is shown below.
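The following minimal sketch (assuming the same <<<1, 2>>> launch as above) computes the stride from the built-in variables gridDim.x and blockDim.x, which are introduced formally in the next section:

__global__ void gpuPrinter(int nlim) {
    int idx    = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = gridDim.x * blockDim.x;            /* Total number of dispatched threads */

    for (int i = idx; i < nlim; i += stride)        /* Grid-stride loop with a runtime stride */
        printf("GPU Prints Idx: %d\n", i);
}

Launched with gpuPrinter<<<1, 2>>>(N), this computes the same stride of 2 as the hard-coded version, but it adapts automatically to any execution configuration.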

2. Thread Hierarchy in CUDA#

CUDA exposes a two-level thread hierarchy, consisting of blocks of threads and grids of blocks, to the programmer in order to allow for thread organization on GPU devices.

[Figure (_images/grid_blocks.png): a grid composed of thread blocks, each block containing multiple threads]

As the figure demonstrates, each grid is often constructed from many thread blocks. Each block is a group of threads invoked by a kernel to perform a specific task in parallel. Each thread in a block has its own private local memory space. However, threads in a block can cooperate to perform the same task in parallel thanks to the block’s shared memory space, which makes data visible to all threads in the block for the lifetime of that block. The cooperation between threads not only takes the form of sharing data through the block-local shared memory space but can also be realized as block-level thread synchronization.

Within the aforementioned two-level thread hierarchy, each thread can be identified with two parameters:

  • threadIdx: which refers to the thread index within each block

  • blockIdx: which stands for the block index within each grid

Both the threadIdx and blockIdx identifiers are built-in structure variables of the integer-based vector type uint3, assigned to each thread by the CUDA runtime. These variables are initialized when a kernel is executed, which makes them accessible inside that kernel. The components of the threadIdx and blockIdx structures, i.e., threadIdx.x, threadIdx.y, and threadIdx.z as well as blockIdx.x, blockIdx.y, and blockIdx.z, allow for a three-dimensional organization of threads and blocks in CUDA. The dimensions of grids of blocks and blocks of threads can be accessed via the following CUDA built-in variables, respectively

  • gridDim: which refers to the dimensions of the grid of blocks

  • blockDim: which indicates the dimensions of a block of threads

The blockDim and gridDim variables are structures of dim3 type with x, y, z fields for Cartesian components.

Let’s write a simple kernel that shows how blocks of threads and grids of blocks can be organized and identified in a CUDA program:

#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>

__global__ void printThreadID() {
    /* For each thread, the kernel prints
     * the threadIdx, blockIdx, blockDim,
     * and gridDim, respectively.
     */
    printf("threadIdx:(%d, %d, %d), \
            blockIdx:(%d, %d, %d), \
            blockDim:(%d, %d, %d), \
            gridDim:(%d, %d, %d)\n", \
            threadIdx.x, threadIdx.y, threadIdx.z, \
            blockIdx.x, blockIdx.y, blockIdx.z, \
            blockDim.x, blockDim.y, blockDim.z, \
            gridDim.x, gridDim.y, gridDim.z);

}

int main(int argc, char **argv)
{
    /* Array size */
    int numArray  = 6;

    /* Number of threads per block */
    int numThreadsPerBlock = 2;

    /* Organizing grids and blocks */
    dim3 block(numThreadsPerBlock);
    dim3 grid((numArray + block.x - 1) / block.x);

    /* Let the user know that the dimensions will be printed from the host */
    printf("Printing from the host!\n");

    /* Print the grid and block dimensions from the host */
    printf("[grid.x, grid.y, grid.z]:    [%d, %d, %d]\n", grid.x, grid.y, grid.z);
    printf("[block.x, block.y, block.z]: [%d, %d, %d]\n\n", block.x, block.y, block.z);

    /* Indicate that the dimensions will now be printed from the device */
    printf("Printing from the device!\n");

    /* Print the grid and block dimensions from the device */
    printThreadID<<<grid, block>>>();

    /* Performing house-keeping for the device */
    cudaDeviceReset();

    return(EXIT_SUCCESS);
}

Running this code will generate the following output:

Printing from the host!
[grid.x, grid.y, grid.z]:    [3, 1, 1]
[block.x, block.y, block.z]: [2, 1, 1]

Printing from the device!
threadIdx:(0, 0, 0),             blockIdx:(0, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)
threadIdx:(1, 0, 0),             blockIdx:(0, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)
threadIdx:(0, 0, 0),             blockIdx:(2, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)
threadIdx:(1, 0, 0),             blockIdx:(2, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)
threadIdx:(0, 0, 0),             blockIdx:(1, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)
threadIdx:(1, 0, 0),             blockIdx:(1, 0, 0),             blockDim:(2, 1, 1),             gridDim:(3, 1, 1)

Now, let’s get back to our code and analyze it step by step in order to understand the mechanistic details of thread organization in the CUDA programming model. First, you might have noticed that we have included the cuda_runtime.h header file in addition to stdio.h and stdlib.h, which provide access to the printf() function and status macros in C, respectively. The CUDA runtime API manages kernel loads, parameter passing, and configuration before kernel execution. The CUDA runtime consists of two main parts: (i) a C-style function interface (cuda_runtime_api.h), and (ii) a C++-style interface (cuda_runtime.h) built upon the C APIs as wrapper extensions for programming convenience. As long as our code is compiled with nvcc, the compiler manages the inclusion of the CUDA runtime API headers for us. So you can even try removing the cuda_runtime.h header from the code, and it will still compile without any issues. The structure of the CUDA runtime API is detailed in the CUDA Toolkit documentation.

The next part of our code defines the printThreadID() kernel implementation, which consists of a single call to the formatted print function, printf(). Our code demonstrates that there are two different sets of grid and block identification variables: (i) user-defined variables of type dim3 that are defined and visible only on the host side,

dim3 block(numBlocks);
dim3 grid((numArray + block.x - 1) / block.x);

and (ii) the built-in identification variables (threadIdx and blockIdx, of type uint3) and dimension variables (blockDim and gridDim, of type dim3), which are visible on the device and therefore accessible within the kernel function.

We previously mentioned that structures of type dim3 have three fields, but in this case only one value has been passed to the block and grid object constructors. As such, the other two components are automatically initialized to 1 (see the output above). It is important to note that the grid dimension in each direction (i.e., x, y, z) is determined by the number of data elements and the number of threads per block in that direction through the following formula:

$$ \tag{2}\label{EQ:THRDORG} \text{grid}.q = \left\lfloor\frac{\text{numberOfElements} + \text{block}.q - 1}{\text{block}.q}\right\rfloor, \qquad \quad \text{where} \qquad q = x, y, z $$
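Plugging in the values used in the code above (numArray = 6 elements and block.x = 2 threads per block) and keeping in mind that the division is an integer division, we get

$$ \text{grid}.x = \left\lfloor \frac{6 + 2 - 1}{2} \right\rfloor = \left\lfloor \frac{7}{2} \right\rfloor = 3 $$

which matches the grid dimensions [3, 1, 1] printed from the host.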

In the next part of our code, we then access the block and grid dimension variables within the main function to print them to the screen from the host.

printf("[grid.x, grid.y, grid.z]:    [%d, %d, %d]\n", grid.x, grid.y, grid.z);
printf("[block.x, block.y, block.z]: [%d, %d, %d]\n\n", block.x, block.y, block.z);

After that, the grid and block objects can be passed to the kernel execution configuration as arguments:

printThreadID<<<grid, block>>>();

The kernel launch triggers the initialization of the built-in identification variables (threadIdx and blockIdx, of type uint3) and dimension variables (blockDim and gridDim, of type dim3) by the CUDA runtime; these are visible and accessible to the kernel functions on the device side.

__global__ void printThreadID() {
    /* For each thread, the kernel prints
     * the threadIdx, blockIdx, blockDim,
     * and gridDim, respectively.
     */
    printf("threadIdx:(%d, %d, %d), \
            blockIdx:(%d, %d, %d), \
            blockDim:(%d, %d, %d), \
            gridDim:(%d, %d, %d)\n", \
            threadIdx.x, threadIdx.y, threadIdx.z, \
            blockIdx.x, blockIdx.y, blockIdx.z, \
            blockDim.x, blockDim.y, blockDim.z, \
            gridDim.x, gridDim.y, gridDim.z);
}

When each active thread runs the kernel, it has access to the aforementioned pre-initialized identification indices. Therefore, the kernel function printThreadID() can print the thread identifiers to the screen. Note that, in order to improve readability, we have used backslashes (\) to split the long argument list of the printf() function call into multiple lines of code. Finally, we call the cudaDeviceReset() function, which destroys all allocations on the device and resets its state within the current process.

In the following, we present some of the most frequently used functions from the CUDA runtime API that will be used in our array summation case study.

3. Basics of the Device Memory Management in CUDA#

In our array summation example (and in many other scientific applications in general), we will follow a typical pattern in CUDA programming, which can be formulated as a series of steps:

  1. Transferring the data from host to device

  2. Kernel execution on the device

  3. Moving the results back from device to host

As we mentioned previously, most CUDA programs have at least two code domains: (i) the host code domain, which runs on the host (CPU and its memory), and (ii) the device code domain, which is executed on the device (GPU and its memory). Because data processing in each domain happens on a different architecture with its own memory, a specific strategy is required for memory management and data transfer between the two processing units. As such, CUDA provides convenient runtime APIs that allow the user to allocate or deallocate device memory and transfer data between host and device memories.

C/C++          CUDA             Description
-----------    -------------    -----------------------------
malloc()       cudaMalloc()     Allocate uninitialized memory
memset()       cudaMemset()     Initialize memory
free()         cudaFree()       Deallocate memory
memcpy()       cudaMemcpy()     Copy memory

As the table above demonstrates, CUDA adopts a convenient naming convention for its C/C++ function and syntax extensions, making it easier for the programmer to manage memory on GPU devices; NVIDIA uses lowerCamelCase (Java-style) names for its CUDA C/C++ extension APIs. Here, the cudaMalloc() function with the following syntax

__host__ __device__ cudaError_t cudaMalloc(void** devPtr, size_t size)

allocates size bytes of linear memory on the device and returns, in *devPtr, a pointer to the allocated memory. As mentioned previously, the __host__ and __device__ qualifiers used together indicate that the function can be called from both host and device code.

Note

All CUDA runtime API functions (except kernel launches) return an error value of the enumerated type cudaError_t.
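Because of this, it is common practice to wrap runtime calls in a small error-checking helper. The sketch below is our own convenience macro (CHECK is not part of the CUDA API); it converts the returned cudaError_t into a human-readable message via cudaGetErrorString() and aborts on failure.

#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Convenience macro: check the cudaError_t returned by a runtime call */
#define CHECK(call)                                                     \
do {                                                                    \
    cudaError_t error = (call);                                         \
    if (error != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",                    \
                __FILE__, __LINE__, cudaGetErrorString(error));         \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
} while (0)

A call such as CHECK(cudaMalloc((void**)&d_A, nBytes)); then fails loudly instead of silently returning an error code.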

With the memory being allocated on the device, the cudaMemcpy() function, with the following signature,

__host__ cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)

can be adopted to transfer count bytes of data from the source memory, pointed to by src, to the destination memory address, pointed to by dst. The direction of the data transfer is inferred from the value of the variable kind, of the enumerated type cudaMemcpyKind, which can take one of the following values:

  • cudaMemcpyHostToHost

  • cudaMemcpyHostToDevice

  • cudaMemcpyDeviceToHost

  • cudaMemcpyDeviceToDevice

  • cudaMemcpyDefault

CUDA recommends passing cudaMemcpyDefault to cudaMemcpy() function calls, in which case the transfer direction is automatically inferred from the pointer values src and dst. Note that cudaMemcpyDefault should only be used when unified virtual addressing (UVA) is supported.
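Putting the pieces together, a minimal sketch of the allocate-copy-free pattern for a single array might look as follows (the array size and pointer names here are illustrative and not taken from the lesson’s example):

#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    const int n = 1024;
    size_t nBytes = n * sizeof(float);

    /* Allocate and initialize an array on the host */
    float *h_data = (float *)malloc(nBytes);
    for (int i = 0; i < n; i++)
        h_data[i] = (float)i;

    /* Allocate the corresponding array on the device */
    float *d_data;
    cudaMalloc((void **)&d_data, nBytes);

    /* Host-to-device transfer */
    cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);

    /* ... a kernel would normally operate on d_data here ... */

    /* Device-to-host transfer */
    cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);

    /* Deallocate device and host memory */
    cudaFree(d_data);
    free(h_data);

    return(EXIT_SUCCESS);
}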

Note

Most kernel launches we consider in this tutorial are asynchronous: control flow returns to the host immediately after the kernel is launched. However, some function calls, such as cudaMemcpy(), are synchronous: the host application blocks until the function completes its task.

4. Summation of Arrays on GPUs#

Copy the following code into an empty text file, rename it to gpuVectorSum.cu and save it.

/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include <sys/time.h>
#include <math.h>
#include <cuda_runtime.h>

/*************************************************/
inline double chronometer() {
    struct timezone tzp;
    struct timeval tp;
    int tmp = gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size) {
    /* Generating float-type random numbers 
     * between 0.0 and 1.0
     */
    time_t t;
    srand( (unsigned int) time(&t) );

    for (int i = 0; i < size; i++) {
        inputArray[i] = ( (float)rand() / (float)(RAND_MAX) ) * 1.0;
    }

    return;
}
/*-----------------------------------------------*/
void arraySumOnHost(float *A, float *B, float *C, const int size) {
    for (int i = 0; i < size; i++) {
        C[i] = A[i] + B[i];
    }
}
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) { 
        C[idx] = A[idx] + B[idx];
    }
}
/*-----------------------------------------------*/
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size) {
    double tolerance = 1.0E-8;
    bool isEqual = true;

    for (int i = 0; i < size; i++) {
        if (fabs(hostPtr[i] - devicePtr[i]) > tolerance) {
            isEqual = false;
            printf("Arrays are NOT equal because:\n");
            printf("at %dth index: hostPtr[%d] = %5.2f \
            and devicePtr[%d] = %5.2f;\n", \
            i, i, hostPtr[i], i, devicePtr[i]);
            break;
        }
    }

    if (isEqual) {
        printf("Arrays are equal.\n\n");
    }
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n", \
    deviceProp.name, deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A     = (float *)malloc(vecSizeInBytes);
    h_B     = (float *)malloc(vecSizeInBytes);
    hostPtr = (float *)malloc(vecSizeInBytes);
    devicePtr  = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr,  0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host*/
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n", \
    grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on 
     * CPU and GPU yield the same results 
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}

After saving the code, it can be compiled and run using the following two commands

$ nvcc gpuVectorSum.cu -o gpuVectorSum
$ ./gpuVectorSum

with the output similar to the following

Kicking off ./gpuVectorSum

GPU device GeForce GTX 1650 with index (0) is set!

Vector size: 16777216 floats (64 MB)

Elapsed time for dataInitializer: 0.757348 second(s)
Elapsed time for arraySumOnHost: 0.062009 second(s)
Elapsed time for arraySumOnDevice <<< 16384, 1024 >>>: 0.001885 second(s)

Arrays are equal.

In this case, the wall-time measurements indicate that arraySumOnDevice() executes roughly 30 times faster than arraySumOnHost() (about 0.0019 s versus 0.062 s). Since the arrays we used are very small compared to what makes such measurements relevant and useful from a practical perspective, we do not intend to overstate the importance of these timings and the observed savings. Another issue with this code is its length. In the next lesson, we are going to restructure it by breaking it into multiple source files, when we talk a little bit about the nvcc compiler.

Now, it is time to focus on our code and start analyzing it from the beginning.

4.1. Header Files and Function Definitions#

Our vector addition code involves a few additional headers: (i) stdbool.h for the true and false boolean type definitions, (ii) time.h for the time_t type and time() function definitions, (iii) sys/time.h for the timeval and timezone structure definitions, and (iv) math.h for the fabs() function used in the equality check.

In the next section of our code, separated by commented star-shaped lines, there are four C/C++ function definitions [chronometer(), dataInitializer(), arraySumOnHost(), and arrayEqualityCheck()] and one CUDA kernel implementation arraySumOnDevice().

The chronometer() function uses the timezone and timeval structures and the gettimeofday() function, included by sys/time.h, to define a utility that allows us to measure the wall-clock time between any two points in our code.

Note

The gettimeofday() function is supported by GCC compilers and might not work on Windows. However, there are multiple ways to perform the timing task within C/C++. For further information, see here or take a glance at Advanced Linux Programming Book (section 8.7 gettimeofday: Wall-Clock Time).
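As one portable alternative (a sketch, not part of the lesson’s code), the C++ std::chrono facilities can be used, since nvcc compiles .cu files as C++:

#include <chrono>

/* Returns the wall-clock time in seconds, similar in spirit to chronometer() */
inline double chronometerChrono() {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    return std::chrono::duration<double>(now).count();
}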

The dataInitializer() function accepts a pointer to the allocated memory of an input array on the host (here, A or B) together with its size, and fills the array with float-type random numbers between zero and one.

The arraySumOnHost() function, as its name suggests, performs an element-by-element summation of the two arrays, A and B, on the host and stores the result in array C. The kernel function arraySumOnDevice() works similarly to its CPU counterpart, arraySumOnHost(), except in its logistics: although the arraySumOnDevice() kernel is written for a single thread, identified by its thread index variable idx, when the kernel is launched with a particular execution configuration and thread organization, hundreds or thousands of active threads concurrently perform the same summation operation on individual elements of arrays A and B and store the results in array C. This time, the A, B, and C vectors are allocated in the device’s global memory.

The arrayEqualityCheck() function performs an element-by-element comparison of the resulting vectors returned from arraySumOnHost() and arraySumOnDevice() functions on the host side.

4.2. Structure of the Program#

The next section of our code involves function calls in the main() driver function. After fetching the name of the program executable from the command line and printing it to the screen, CUDA runtime functions are adopted to set the device and then print its name. Setting the device index will be useful when the system has multiple devices installed. In a miscellaneous lesson, we will describe how to choose the best (most powerful) available GPU if the host system has multiple GPU accelerator devices installed.

Summation of the arrays on the host side starts by specifying the array size using the left-shift bitwise operator and declaring pointers to the memory addresses of arrays A and B, which we allocate on the host using the malloc() function. In order to distinguish between pointers that refer to memory addresses on the host and on the device, we prepend the prefixes h_ (for host) or d_ (for device) to the variable names. Each of the arrays h_A and h_B is then passed to the dataInitializer() function to be filled with random float numbers. After the initialization step, the arraySumOnHost() function sums arrays A and B on the host and stores the resulting array in another allocated memory region on the host, pointed to by the hostPtr variable. We use chronometer() to time the initialization and summation steps and print those timings to the screen.
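To make the array size concrete: 1 << 24 shifts the value 1 left by 24 bits, giving 2^24 = 16,777,216 elements; at 4 bytes per float, this amounts to 67,108,864 bytes, i.e., the 64 MB reported in the output.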

The next part of the code is similar to what we did for the summation of arrays A and B on the host, but this time we adopt the CUDA C APIs instead of those of C/C++. For example, for memory allocation on the device, we use cudaMalloc() instead of malloc().

Remember that earlier in this lesson we mentioned that each CUDA program follows a series of typical steps in order to perform a specific task. After memory allocation for arrays d_A and d_B on the device, we need to initialize them by copying the contents of the corresponding arrays, h_A and h_B, from the host to their device counterparts using cudaMemcpy(). This completes the first of the three steps in a typical CUDA program: transferring data from host to device (HtoD). Note that cudaMemcpy() is synchronous, meaning that the flow of execution on the host stops and waits for cudaMemcpy() to finish its job.

The next step is to execute the arraySumOnDevice() kernel on the device. In order to do that, we need to organize our threads into blocks and blocks into grids from the host, so that we can access them on the device within the kernel, as discussed in the previous section. Thread organization is one of the most important aspects of the CUDA programming model and has a major impact on the performance of our code. In our intermediate and advanced tutorials, we will constantly keep an eye on this factor and try to optimize performance by finding the best possible configuration of threads in blocks and grids through a profiling-driven approach. As Eq. \eqref{EQ:THRDORG} suggests, fixing the number of threads per block (numThreadsInBlocks) for each block dimension determines the number of blocks in the grid for that dimension. Once we have defined the block and grid variables on the host, we pass them to the execution configuration <<<grid, block>>> as arguments in order to launch the kernel on the device. Compared with synchronous function calls such as cudaMemcpy(), where the host has to wait until execution is completed, kernel launches are asynchronous and the execution flow returns to the host right after the kernel is launched. Since we intend to measure the execution wall-time of the arraySumOnDevice() kernel from the host, we need to tell the host to wait for the kernel to complete its task. This is done by calling cudaDeviceSynchronize() from the host right after the kernel launch.
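As an aside, host-side wall-clock timing around cudaDeviceSynchronize() is not the only option; the CUDA runtime also provides event objects for timing device work. A minimal sketch (not used in our code, reusing the grid, block, and device pointers defined earlier in main()) would be:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);                      /* Wait until the stop event has been recorded */

    float msElapsed = 0.0f;
    cudaEventElapsedTime(&msElapsed, start, stop);   /* Elapsed time in milliseconds */

    cudaEventDestroy(start);
    cudaEventDestroy(stop);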

When the kernel is launched, thousands of threads gain access to data and resources on the GPU and perform the array summation concurrently. This is the motivation behind the concept of Single Instruction Multiple Threads (SIMT), which NVIDIA has coined for this type of architecture.

Note

The SIMT architecture is very similar to the Single Instruction Multiple Data (SIMD) variant in Flynn’s Taxonomy. However, their main difference is that in SIMD, all elements of an array should be operated upon simultaneously, whereas in SIMT architecture, multiple threads in the same group, while performing the same instruction, execute the operation independently on their own private data.

The result of the array summation on the GPU is stored in another array, d_C, in the device’s global memory. The third step of a CUDA program is to transfer the resulting data back from device to host (DtoH). Before transferring the data, we check for errors in the last CUDA runtime call using cudaGetLastError(); note that in this code its return value is not actually inspected (see the sketch below). We then transfer the results back to the host using cudaMemcpy() and store them in the host memory pointed to by the devicePtr variable. Now that the summation results from the CPU and GPU are stored in the host variables hostPtr and devicePtr, respectively, we can compare them using the arrayEqualityCheck() function. The final stage of our program performs housekeeping by deallocating the memory on both the device and the host.
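To actually act on the value returned by cudaGetLastError(), one could capture and inspect the error, for example (a sketch):

    cudaError_t lastError = cudaGetLastError();
    if (lastError != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(lastError));
        exit(EXIT_FAILURE);
    }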

Key Points

  • CUDA programming model

  • Device memory management

  • Thread hierarchy

  • Typical operations in a CUDA program