CUDA GPU Compilation Model#

Overview

Questions:

  • What is the NVCC compiler and why do we need it?

  • Can multiple GPU and CPU source code files be simultaneously compiled with NVCC?

  • How does NVCC distinguish between the host and device code domains and handle the compilation process?

  • How can runtime errors be handled during the execution of a CUDA program?

Objectives:

  • Understanding the basic mechanism of NVCC compilation phases

  • Learning about the multiple-source-file compilation mode of the NVCC compiler

  • Mastering the basics of error handling in a CUDA program using C/C++ wrapper macros

1. NVIDIA’s CUDA Compiler#

NVIDIA’s CUDA compiler (NVCC) is distributed as part of the CUDA Toolkit and is based upon the popular LLVM open-source compiler infrastructure. Each CUDA program is a combination of host code, written in standard C/C++ with some CUDA API extensions, and GPU device kernel functions. The nvcc compiler driver separates the host code from the device code. The host code is pre-processed and compiled with a host C++ compiler supported by nvcc. The nvcc driver also pre-processes and compiles the device kernel functions, using NVIDIA’s proprietary assemblers and compilers, and then embeds the compiled GPU kernels as fatbinary images into the host object files. Finally, during the linking stage, the CUDA runtime libraries are added to handle kernel launches as well as memory management and data transfers. The exact details of the compilation phases are beyond the scope of this tutorial; the interested reader is referred to the CUDA Toolkit documentation on the parallel thread execution (PTX) compiler API and instruction set architecture (ISA) for further details.
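
Although we will not walk through the individual phases here, nvcc can show them to you: the --dryrun flag lists the phase commands without executing them, and --keep preserves the intermediate files (such as the generated PTX). For example, for a source file such as the gpuVectorSum.cu program from the previous episode (the exact output varies across toolkit versions):

$ nvcc --dryrun gpuVectorSum.cu -o gpuVectorSum
$ nvcc --keep gpuVectorSum.cu -o gpuVectorSum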

2. Compiling Separate Source Files using NVCC#

In the [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) example, we criticized the length of the source code. Therefore, we need to break it into smaller source files according to the logical structure of our code. Before CUDA 5.0, splitting CUDA source code into separate files and compiling them was not possible; the only available option was whole-code compilation. Since we are using CUDA 11.2 here, this is no longer a problem. Let us now start breaking the code into separate source files by copying the C function signatures and pasting them into an empty file. Name this file cCode.h and add the necessary #include pre-processor directives and header guards to it. The resulting header file’s content should be the same as the following code block

/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H

#include <time.h>
#include <sys/time.h>

/*************************************************/
/* Returns the wall-clock time in seconds; declared static inline
 * so that each translation unit gets its own local copy */
static inline double chronometer() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);

#endif // CCODE_H

Next, copy the C function definitions from the gpuVectorSum.cu file into a new file, add the necessary #include pre-processor directives, and save it as cCode.c. The resulting source file should look like the following

/*================================================*/
/*==================== cCode.c ===================*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <math.h>   /* for fabs() */
#include "cCode.h"

void dataInitializer(float *inputArray, int size) {
    /* Generating float-type random numbers 
     * between 0.0 and 1.0
     */
    time_t t;
    srand( (unsigned int) time(&t) );

    for (int i = 0; i < size; i++) {
        inputArray[i] = ( (float)rand() / (float)(RAND_MAX) ) * 1.0;
    }

    return;
}
/*-----------------------------------------------*/
void arraySumOnHost(float *A, float *B, float *C, const int size) {
    for (int i = 0; i < size; i++) {
        C[i] = A[i] + B[i];
    }
}
/*-----------------------------------------------*/
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size) {
    double tolerance = 1.0E-8;
    bool isEqual = true;

    for (int i = 0; i < size; i++) {
        if (fabs(hostPtr[i] - devicePtr[i]) > tolerance) {
            isEqual = false;
            printf("Arrays are NOT equal because:\n");
            printf("at %dth index: hostPtr[%d] = %5.2f \
            and devicePtr[%d] = %5.2f;\n", \
            i, i, hostPtr[i], i, devicePtr[i]);
            break;
        }
    }

    if (isEqual) {
        printf("Arrays are equal.\n\n");
    }

    return;
}

After moving all C-based functions into separate source and header files, we should include cCode.h in both the cCode.c and gpuVectorSum.cu files. The latter file should now contain the following code

/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* for memset() */
#include <cuda_runtime.h>
#include "cCode.h"

/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) { 
        C[idx] = A[idx] + B[idx];
    }
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n", \
    deviceProp.name, deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n", \
    grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Query (and clear) the last error from the kernel launch;
     * its return value is ignored here (see Error Handling below) */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on 
     * CPU and GPU yield the same results 
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}

Now let’s try to compile our multiple source files into an executable with nvcc and run it using the following commands

$ nvcc gpuVectorSum.cu cCode.c -o gpuVectorSum
$ ./gpuVectorSum

After running the aforementioned commands, it is likely that you will get some error messages like the following

/tmp/tmpxft_000050f6_00000000-11_test.o: In function `main':
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x16a): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x181): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x227): undefined reference to `arraySumOnHost(float*, float*, float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x477): undefined reference to `arrayEqualityCheck(float*, float*, int)'
collect2: error: ld returned 1 exit status

The error comes from the GNU linker, ld, which complains about undefined references to the functions we moved into the cCode.c source file. Can you guess what the problem is?

The main job of the ld linker is to combine object and archive files, rearrange their data, and resolve their symbol references. Calling ld is usually the last step of the compilation process, so you can guess that the problem is not a bug within your code but something else. The source of the problem might not be obvious: nvcc uses the host’s C++ compiler by default for compiling all non-GPU code. C++ compilers mangle function names to encode their argument types, whereas C compilers do not. Therefore, the host code in gpuVectorSum.cu references mangled symbol names that the C-compiled cCode.c object file simply does not define.
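
You can observe this mismatch directly by compiling the two files to object files and inspecting their symbols with the nm tool. On a typical Linux toolchain the output looks roughly like the following (the mangled name is illustrative of the Itanium C++ ABI used by GCC and Clang):

$ nvcc -c gpuVectorSum.cu cCode.c
$ nm cCode.o | grep dataInitializer
0000000000000000 T dataInitializer
$ nm gpuVectorSum.o | grep dataInitializer
                 U _Z15dataInitializerPfi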

The solution to this problem is familiar to programmers who have experience calling C functions from C++ programs: the extern "C" { ... } linkage specification. By wrapping this snippet around the #include "cCode.h" pre-processor directive, you should be able to successfully compile your code using the same command without any issues.
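
Concretely, the include directive in gpuVectorSum.cu becomes:

/* Give the declarations in cCode.h C linkage so that
 * the C++ compiler does not mangle their symbol names */
extern "C" {
    #include "cCode.h"
}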

We have now localized the C function definitions into their corresponding implementation source file and header. However, we still have a CUDA kernel definition left in the gpuVectorSum.cu file. Similar to what we did for the C function definitions, CUDA kernel definitions can also be moved to their pertinent source files. The first step is to create a header file, cudaCode.h, containing the declaration of the arraySumOnDevice() kernel function, as in the following code block

/*================================================*/
/*================== cudaCode.h ==================*/
/*================================================*/
#ifndef CUDACODE_H
#define CUDACODE_H

__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size);

#endif // CUDACODE_H

Next, create a source file for the kernel definition and save it as cudaCode.cu. Its contents should be the same as the following code block

/*================================================*/
/*================== cudaCode.cu =================*/
/*================================================*/
#include "cudaCode.h"
#include <stdio.h>

/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) { 
        C[idx] = A[idx] + B[idx];
    }
}

Do not forget to add the #include "cudaCode.h" pre-processor directive to both the cudaCode.cu and gpuVectorSum.cu files. The latter file no longer contains any host function or device kernel definitions and now has the following structure

/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>   /* for memset() */
#include <cuda_runtime.h>
#include "cudaCode.h"
extern "C" {
    #include "cCode.h"
}

/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n", \
    deviceProp.name, deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n", \
    grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Query (and clear) the last error from the kernel launch;
     * its return value is ignored here (see Error Handling below) */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on 
     * CPU and GPU yield the same results 
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}

Localizing each code domain in its own implementation source file imposes a logical hierarchy on our package structure, which makes debugging more convenient and efficient since the pertinent source files are now shorter as well.

In order to compile our code, which now includes an additional source file, we run the following commands

$ nvcc gpuVectorSum.cu cCode.c cudaCode.cu -o gpuVectorSum
$ ./gpuVectorSum

There are still opportunities in the main() function of the gpuVectorSum.cu file for further encapsulation of code into new functions, which can then be moved to the cCode.c or cudaCode.cu source files and their corresponding headers. The following exercise asks you to find these opportunities and use them to make the code even shorter and therefore more readable and easier to work with.

Exercise

Parts of our code can still be made shorter by encapsulating groups of expressions and statements into functions. See if you can find these parts within the gpuVectorSum.cu file and move them into the cCode.c or cudaCode.cu source files.

3. Error Handling#

Many of the CUDA function calls within a CUDA program are asynchronous: the execution flow returns to the host immediately after the function call. This asynchronous nature makes it difficult to identify the source of an error when several CUDA functions have been called consecutively. Fortunately, with the exception of kernel launches, CUDA runtime functions return error codes of the enumerated type cudaError_t. As such, we can define error-handling macros that wrap CUDA function calls and check them for possible errors.
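
Kernel launches themselves return no error code, so they must be checked differently. The following is a minimal sketch, reusing the kernel and variables from our example: cudaGetLastError() catches errors from the launch itself, while the value returned by cudaDeviceSynchronize() reports errors raised during kernel execution:

arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);

/* Errors from the launch itself (e.g., an invalid configuration) */
cudaError_t launchError = cudaGetLastError();
if (launchError != cudaSuccess) {
    printf("Launch error: %s\n", cudaGetErrorString(launchError));
}

/* Errors raised while the kernel was executing on the device */
cudaError_t syncError = cudaDeviceSynchronize();
if (syncError != cudaSuccess) {
    printf("Execution error: %s\n", cudaGetErrorString(syncError));
}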

To add such a macro to our [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) code, open the cCode.h header file and add the following macro definition to it

/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H

#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

#define ERRORHANDLER(funcCall) { \
    const cudaError_t error = funcCall; \
    const char *errorMessage = cudaGetErrorString(error); \
    if (error != cudaSuccess) { \
        printf("Error in file %s, line %d, code %d,  Message %s\n", \
        __FILE__, __LINE__, error, errorMessage); \
        exit(EXIT_FAILURE); \
    } \
}
/*************************************************/
/* Returns the wall-clock time in seconds; declared static inline
 * so that each translation unit gets its own local copy */
static inline double chronometer() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);

#endif // CCODE_H

Note that backslashes (\) are used as line-continuation characters, since a macro must be defined on a single logical line. Each backslash must be the last character on its line; otherwise, you will get a compilation error. We can now wrap our CUDA function calls in the ERRORHANDLER() macro in order to capture errors. If an error happens, the ERRORHANDLER() macro prints the error code in a human-readable format and terminates the program by calling exit(EXIT_FAILURE).
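
For example, the device memory allocation and transfer calls in gpuVectorSum.cu can be wrapped as follows:

ERRORHANDLER(cudaMalloc((float**)&d_A, vecSizeInBytes));
ERRORHANDLER(cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice));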

Exercise

Try to wrap every CUDA runtime function call in the gpuVectorSum.cu file with the ERRORHANDLER() macro.

Key Points

  • The NVCC compiler

  • Compilation phases

  • Compiling multiple CPU and GPU source code files simultaneously

  • Error handling in a CUDA program