CUDA GPU Compilation Model#
Overview
Questions:
What is the NVCC compiler and why do we need it?
Can multiple GPU and CPU source code files be simultaneously compiled with NVCC?
How does NVCC distinguish between the host and device code domains and handle the compilation process?
How can runtime errors be handled during the execution of a CUDA program?
Objectives:
Understanding the basic mechanism of NVCC compilation phases
Learning about the multiple-source-file compilation mode of the NVCC compiler
Mastering the basics of error handling in a CUDA program using C/C++ wrapper macros
1. NVIDIA’s CUDA Compiler#
NVIDIA’s CUDA compiler (NVCC) is distributed as part of the CUDA Toolkit and is based upon the popular LLVM open-source compiler infrastructure. Each CUDA program is a combination of host code, written in standard C/C++ with some CUDA API extensions, and device code in the form of GPU kernel functions. The nvcc compiler driver separates the host code from the device code. The host code is pre-processed and compiled with one of the host C++ compilers supported by nvcc. The device kernel functions are pre-processed and compiled by nvcc using NVIDIA’s proprietary assemblers and compilers. Then, nvcc embeds the compiled GPU kernels as fatbinary images into the host object files. Finally, during the linking stage, the CUDA runtime libraries are added to handle kernel launches as well as memory and data transfer management. The exact details of the compilation phases are beyond the scope of this tutorial; the interested reader is referred to the CUDA Toolkit documentation on the parallel thread execution (PTX) compiler API and instruction set architecture (ISA) for further details.
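If you are curious about the individual phases nvcc drives, its --dryrun option prints the sequence of sub-commands (host pre-processing and compilation, device compilation and fatbinary embedding, and the final link) without executing them; the exact output depends on your toolkit version and host compiler:
$ nvcc --dryrun gpuVectorSum.cu -o gpuVectorSum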
2. Compiling Separate Source Files using NVCC#
In the [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) example, we criticized the length of the source code. Therefore, we need to break it into smaller source files according to the logical structure of our code. Before CUDA 5.0, splitting CUDA source code into separate files for multiple-source-file compilation was not possible; whole-code compilation was the only available option. Since we are using CUDA 11.2 here, this is no longer a limitation. Let us start breaking the code into separate source files by copying the C function signatures and pasting them into an empty file. Name this file cCode.h and add the necessary #include pre-processor directives and header guards to it. The resulting header file’s content should be the same as the following code block
/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H
#include <time.h>
#include <sys/time.h>
/*************************************************/
static inline double chronometer() {
    struct timezone tzp;
    struct timeval tp;
    gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);
#endif // CCODE_H
Next, copy the C function definitions from the gpuVectorSum.cu file into a new file, add the necessary #include pre-processor directives, and save it as cCode.c. The resulting source file should look like the following
/*================================================*/
/*==================== cCode.c ===================*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <math.h> /* for fabs() */
#include "cCode.h"
void dataInitializer(float *inputArray, int size) {
    /* Generating float-type random numbers
     * between 0.0 and 1.0
     */
    time_t t;
    srand( (unsigned int) time(&t) );
    for (int i = 0; i < size; i++) {
        inputArray[i] = (float)rand() / (float)RAND_MAX;
    }
    return;
}
/*-----------------------------------------------*/
void arraySumOnHost(float *A, float *B, float *C, const int size) {
    for (int i = 0; i < size; i++) {
        C[i] = A[i] + B[i];
    }
}
/*-----------------------------------------------*/
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size) {
    double tolerance = 1.0E-8;
    bool isEqual = true;
    for (int i = 0; i < size; i++) {
        /* fabs() (not the integer abs()) must be used
         * for floating-point differences
         */
        if (fabs(hostPtr[i] - devicePtr[i]) > tolerance) {
            isEqual = false;
            printf("Arrays are NOT equal because:\n");
            printf("at %dth index: hostPtr[%d] = %5.2f "
                   "and devicePtr[%d] = %5.2f;\n",
                   i, i, hostPtr[i], i, devicePtr[i]);
            break;
        }
    }
    if (isEqual) {
        printf("Arrays are equal.\n\n");
    }
    return;
}
After moving all C-based functions into separate source and header files, we should include the cCode.h header, which carries the function declarations, in both the cCode.c and gpuVectorSum.cu files. The latter file should now contain the following code
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for memset() */
#include <cuda_runtime.h>
#include "cCode.h"
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);
    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);
    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
    /*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%zu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);
    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A = (float *)malloc(vecSizeInBytes);
    h_B = (float *)malloc(vecSizeInBytes);
    hostPtr = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);
    double tStart, tElapsed;
    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);
    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
    /*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);
    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);
    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block(numThreadsInBlocks);
    dim3 grid((vecSize + block.x - 1) / block.x);
    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
    /*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();
    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);
    /* Check to see if the array summations on
     * CPU and GPU yield the same results
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
    /*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);
    return(EXIT_SUCCESS);
}
Now let’s try to compile our multiple source files into an executable with nvcc and run it using the following commands
$ nvcc gpuVectorSum.cu cCode.c -o gpuVectorSum
$ ./gpuVectorSum
After running the aforementioned commands, you will most likely get error messages like the following
/tmp/tmpxft_000050f6_00000000-11_test.o: In function `main':
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x16a): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x181): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x227): undefined reference to `arraySumOnHost(float*, float*, float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x477): undefined reference to `arrayEqualityCheck(float*, float*, int)'
collect2: error: ld returned 1 exit status
The error comes from the GNU linker, ld, which complains about undefined references to the functions we placed in the cCode.c source file. Can you guess what the problem is?
The main job of the ld linker is to combine object and archive files, rearrange their data, and resolve their symbol references. Calling ld is usually the last step of the compilation process, so you can guess that the problem is not a bug within your code; it must be something else. The source of the problem may not be obvious to many readers: nvcc uses the host’s C++ compiler by default for all non-GPU code. The declarations included from cCode.h are therefore compiled as C++, and the references to them carry C++-mangled names, while the definitions in cCode.c are compiled as C and keep their plain, unmangled names. At link time, the mangled references never match the unmangled definitions.
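You can observe the mismatch directly by inspecting the symbol tables of the object files with nm (assuming a Linux toolchain; the output below is abridged and the exact addresses may differ on your system):
$ nvcc -c gpuVectorSum.cu && nm gpuVectorSum.o | grep dataInitializer
                 U _Z15dataInitializerPfi
$ gcc -c cCode.c && nm cCode.o | grep dataInitializer
0000000000000000 T dataInitializer
The C++-compiled object references the mangled symbol _Z15dataInitializerPfi (U, undefined), while cCode.o defines (T) the plain C symbol dataInitializer, so the linker never connects the two.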
The solution to this problem is familiar to programmers who have experience calling C functions from C++ programs using an extern "C" { ... } block. By wrapping this block around the #include "cCode.h" pre-processor directive, you should be able to compile your code successfully using the same commands.
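In the gpuVectorSum.cu file, the wrapper looks like this:
extern "C" {
#include "cCode.h"
}
The extern "C" linkage specification tells the C++ compiler that the declarations inside the block refer to plain C symbols, so their names are not mangled and the linker can match them against the definitions in cCode.o.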
We have now localized the C function definitions in their corresponding implementation source file and header. However, we still have a CUDA kernel definition left in the gpuVectorSum.cu file. Similar to what we did for the C functions, CUDA kernel definitions can also be moved into their pertinent source files. The first step is to create a header file, cudaCode.h, which should contain the declaration of the arraySumOnDevice() kernel function and look like the following code block
/*================================================*/
/*================== cudaCode.h ==================*/
/*================================================*/
#ifndef CUDACODE_H
#define CUDACODE_H
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size);
#endif // CUDACODE_H
Next, create a source file for the kernel definition and save it as cudaCode.cu. The contents of the cudaCode.cu file should be the same as the following code block
/*================================================*/
/*================== cudaCode.cu =================*/
/*================================================*/
#include "cudaCode.h"
#include <stdio.h>
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
Do not forget to add the #include "cudaCode.h" pre-processor directive to both the cudaCode.cu and gpuVectorSum.cu files. The latter file now contains no function definitions at all, for either the host- or the device-side code domain.
The gpuVectorSum.cu file now has the following structure
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for memset() */
#include <cuda_runtime.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);
    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);
    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
    /*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%zu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);
    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A = (float *)malloc(vecSizeInBytes);
    h_B = (float *)malloc(vecSizeInBytes);
    hostPtr = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);
    double tStart, tElapsed;
    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);
    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
    /*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);
    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);
    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block(numThreadsInBlocks);
    dim3 grid((vecSize + block.x - 1) / block.x);
    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
    /*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();
    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);
    /* Check to see if the array summations on
     * CPU and GPU yield the same results
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
    /*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);
    return(EXIT_SUCCESS);
}
Localizing each code domain in its own implementation source file imposes a logical hierarchy on our package structure. It also makes debugging more convenient and efficient, since the pertinent source files are now shorter.
In order to compile our code, which now includes an additional source file, we should run the following commands
$ nvcc gpuVectorSum.cu cCode.c cudaCode.cu -o gpuVectorSum
$ ./gpuVectorSum
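Since nvcc also accepts pre-compiled object files, the same build can be performed one translation unit at a time and then linked, which becomes convenient once a project grows or is driven by a Makefile (a sketch using the file names from above; only the files that changed need recompiling before relinking):
$ nvcc -c cudaCode.cu
$ nvcc -c gpuVectorSum.cu
$ nvcc -c cCode.c
$ nvcc gpuVectorSum.o cudaCode.o cCode.o -o gpuVectorSum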
There are still opportunities in the main() function of the gpuVectorSum.cu file for further encapsulation of code into new functions that can subsequently be moved to the cCode.c or cudaCode.cu source files and their corresponding headers. The following exercise asks you to find these opportunities and use them to make the code even shorter and, therefore, more readable and easier to work with.
Exercise
Parts of our code can still be made shorter by encapsulating groups of expressions and statements into functions. See if you can find such code parts within the gpuVectorSum.cu file and move them into the cCode.c or cudaCode.cu source files.
Solution
The three lines under the /* Device properties */ comment can be moved into a function, deviceProperties(deviceIdx), which takes the device index deviceIdx as its input argument. The new cudaCode.cu file should then take the following form
/*================================================*/
/*================== cudaCode.cu =================*/
/*================================================*/
#include "cudaCode.h"
#include <stdio.h>
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
/*-----------------------------------------------*/
__host__ void deviceProperties(int deviceIdx) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
}
and the corresponding header file, cudaCode.h, should be the same as the following code block
/*================================================*/
/*================== cudaCode.h ==================*/
/*================================================*/
#ifndef CUDACODE_H
#define CUDACODE_H
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size);
__host__ void deviceProperties(int deviceIdx);
#endif // CUDACODE_H
Note that we have used the __host__ execution-space qualifier for the deviceProperties() function: although it is defined in a .cu file, it is called from and runs on the host, so no device-side memory management is required.
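As a side note, __host__ is implicit for any undecorated function, and it can be combined with __device__ so that a single definition is compiled for both the host and the device (a generic illustration, not part of this tutorial's code):
__host__ __device__ float square(float x) {
    return x * x; /* callable from host code and from within kernels */
}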
With the aforementioned changes, the gpuVectorSum.cu file
takes the following form
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for memset() */
#include <cuda_runtime.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);
    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);
    /* Device properties */
    deviceProperties(deviceIdx);
    /*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%zu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);
    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A = (float *)malloc(vecSizeInBytes);
    h_B = (float *)malloc(vecSizeInBytes);
    hostPtr = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);
    double tStart, tElapsed;
    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);
    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
    /*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);
    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);
    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block(numThreadsInBlocks);
    dim3 grid((vecSize + block.x - 1) / block.x);
    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
    /*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();
    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);
    /* Check to see if the array summations on
     * CPU and GPU yield the same results
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
    /*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);
    return(EXIT_SUCCESS);
}
3. Error Handling#
Many of the function calls in a CUDA program are asynchronous: the execution flow returns to the host immediately after the call. This asynchronous behavior makes it difficult to identify and troubleshoot the source of an error when several CUDA functions have been called consecutively. Fortunately, with the exception of kernel launches, CUDA functions return error codes of the cudaError_t enumerated type. We can therefore define an error-handling macro to wrap around CUDA function calls and check them for possible errors.
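Written out by hand, such a check looks like the following sketch (this is what the ERRORHANDLER() macro defined below automates; the variable name error is our choice):
cudaError_t error = cudaSetDevice(deviceIdx);
if (error != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(error));
    exit(EXIT_FAILURE);
}
Repeating this pattern after every CUDA call clutters the code quickly, which is why we fold it into a single macro.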
To add such a macro to our [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) code, open the cCode.h header file and add the following macro definition to it
/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#define ERRORHANDLER(funcCall) { \
    const cudaError_t error = funcCall; \
    const char *errorMessage = cudaGetErrorString(error); \
    if (error != cudaSuccess) { \
        printf("Error in file %s, line %d, code %d, Message %s\n", \
               __FILE__, __LINE__, error, errorMessage); \
        exit(EXIT_FAILURE); \
    } \
}
/*************************************************/
static inline double chronometer() {
    struct timezone tzp;
    struct timeval tp;
    gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);
#endif // CCODE_H
Note that backslashes (\) are used as line-continuation characters because a macro must be defined on a single logical line. Each backslash must be the last character on its line; otherwise you will get a compilation error.
We can now wrap our CUDA function calls in the ERRORHANDLER() macro to capture errors. If an error occurs, the macro prints the error code in a human-readable format and terminates the program by calling exit(EXIT_FAILURE).
Exercise
Try to wrap every CUDA function call in the gpuVectorSum.cu file with the ERRORHANDLER() macro.
Solution
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* for memset() */
#include <cuda_runtime.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);
    /* Device setup */
    int deviceIdx = 0;
    ERRORHANDLER(cudaSetDevice(deviceIdx));
    /* Device properties */
    deviceProperties(deviceIdx);
    /*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%zu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);
    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A = (float *)malloc(vecSizeInBytes);
    h_B = (float *)malloc(vecSizeInBytes);
    hostPtr = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);
    double tStart, tElapsed;
    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);
    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);
    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
    /*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    ERRORHANDLER(cudaMalloc((float**)&d_A, vecSizeInBytes));
    ERRORHANDLER(cudaMalloc((float**)&d_B, vecSizeInBytes));
    ERRORHANDLER(cudaMalloc((float**)&d_C, vecSizeInBytes));
    /* Data transfer from host to device */
    ERRORHANDLER(cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice));
    ERRORHANDLER(cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice));
    ERRORHANDLER(cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice));
    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block(numThreadsInBlocks);
    dim3 grid((vecSize + block.x - 1) / block.x);
    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    ERRORHANDLER(cudaDeviceSynchronize());
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
    /*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    ERRORHANDLER(cudaGetLastError());
    /* Data transfer back from device to host */
    ERRORHANDLER(cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost));
    /* Check to see if the array summations on
     * CPU and GPU yield the same results
     */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
    /*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    ERRORHANDLER(cudaFree(d_A));
    ERRORHANDLER(cudaFree(d_B));
    ERRORHANDLER(cudaFree(d_C));
    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);
    return(EXIT_SUCCESS);
}
We should point out that our deviceProperties() function is not a CUDA API function. Since it encapsulates the cudaGetDeviceProperties() CUDA function within its implementation, we could wrap the ERRORHANDLER() macro directly around that call inside the deviceProperties() definition. However, this would pull a C-based header file into our device-side code, which contradicts our original intention of separating the device- and host-side code domains into different source files. This is one of those situations where we have to compromise: either add the cCode.h header to our cudaCode.cu file, or keep the cudaGetDeviceProperties() call un-encapsulated in the main() function, where the macro is already in scope.
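A minimal sketch of the first option, assuming we accept pulling cCode.h (which carries the ERRORHANDLER() macro) into cudaCode.cu, would look like this:
/*================== cudaCode.cu =================*/
#include "cudaCode.h"
#include <stdio.h>
extern "C" {
#include "cCode.h" /* needed only for the ERRORHANDLER() macro */
}
/*-----------------------------------------------*/
__host__ void deviceProperties(int deviceIdx) {
    cudaDeviceProp deviceProp;
    ERRORHANDLER(cudaGetDeviceProperties(&deviceProp, deviceIdx));
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
}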
Key Points
The NVCC compiler
Compilation phases
Compiling multiple CPU and GPU source code files simultaneously
Error handling in a CUDA program