# CUDA GPU Compilation Model

````{admonition} Overview
:class: overview

Questions:
- What is the NVCC compiler and why do we need it?
- Can multiple GPU and CPU source code files be compiled simultaneously with NVCC?
- How does NVCC distinguish between the host and device code domains and handle the compilation process?
- How can runtime errors be handled during the execution of a CUDA program?

Objectives:
- Understanding the basic mechanism of NVCC compilation phases
- Learning about the multiple-source-file compilation mode of the NVCC compiler
- Mastering the basics of error handling in a CUDA program using C/C++ wrapper macros
````

## 1. NVIDIA's CUDA Compiler

**NVIDIA's CUDA compiler (NVCC)** is distributed as part of the CUDA Toolkit and is based upon the popular [*LLVM*](https://llvm.org/) open-source compiler infrastructure. Each CUDA program is a combination of host code, written in standard C/C++ semantics with some CUDA API extensions, and GPU device kernel functions. The nvcc compiler driver separates the host code from that of the device. The host code is then pre-processed and compiled with one of the host [C++ compilers supported by nvcc](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#supported-host-compilers). The nvcc compiler also pre-processes and compiles the device kernel functions using proprietary NVIDIA assemblers and compilers. Then, nvcc embeds the GPU kernels as [*fatbinary*](https://en.wikipedia.org/wiki/Fat_binary) images into the host object files. Finally, during the linking stage, the CUDA runtime libraries are added to handle kernel procedure calls as well as memory and data transfer management.

A description of the exact details of the [compilation phases](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#supported-phases) is beyond the scope of this tutorial. The interested reader is referred to the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#introduction), the [*parallel thread execution (PTX)* compiler API](https://docs.nvidia.com/cuda/ptx-compiler-api/index.html) and the [*instruction set architecture (ISA)*](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) for further details.
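That said, you can get a quick feel for these phases by asking nvcc to print, without executing them, the sequence of commands it would run. The `--dryrun` option does exactly that; for example (the file name below is the one we build in the next section):

````{tab-set-code}

```{code-block} shell
$ nvcc --dryrun gpuVectorSum.cu -o gpuVectorSum
```
````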
## 2. Compiling Separate Source Files using NVCC

In the [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) example, we criticized the length of the source code. We therefore want to break it into smaller source files according to the logical structure of the code. Before CUDA 5.0, splitting CUDA source code into separate files for [multiple-source-file compilation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda) was not possible, and the only available option was the whole-code compilation approach. Since we are using CUDA 11.2 here, this is no longer a problem.

Let us start breaking the code into separate source files by copying the C *function signatures* and pasting them into an empty file. Name this file ***cCode.h*** and add the necessary `#include` pre-processor directives and header guards to it. The resulting header file's content should be the same as the following code block:

````{tab-set-code}

```{code-block} c
/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H

#include <stdlib.h>
#include <sys/time.h>

/*************************************************/
inline double chronometer() {
    struct timezone tzp;
    struct timeval tp;
    gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);

#endif // CCODE_H
```
````

Next, copy the C *function definitions* from the ***gpuVectorSum.cu*** file to a new file, add the `#include` pre-processor directives, and save it as ***cCode.c***. The resulting source file should look like the following:

````{tab-set-code}

```{code-block} c
/*================================================*/
/*==================== cCode.c ===================*/
/*================================================*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#include <time.h>
#include "cCode.h"

void dataInitializer(float *inputArray, int size) {
    /* Generating float-type random numbers
     * between 0.0 and 1.0 */
    time_t t;
    srand( (unsigned int) time(&t) );

    for (int i = 0; i < size; i++) {
        inputArray[i] = ( (float)rand() / (float)(RAND_MAX) ) * 1.0;
    }
    return;
}
/*-----------------------------------------------*/
void arraySumOnHost(float *A, float *B, float *C, const int size) {
    for (int i = 0; i < size; i++) {
        C[i] = A[i] + B[i];
    }
}
/*-----------------------------------------------*/
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size) {
    double tolerance = 1.0E-8;
    bool isEqual = true;

    for (int i = 0; i < size; i++) {
        /* fabs() (not the integer abs()) must be used here,
         * since the difference is a floating-point value */
        if (fabs(hostPtr[i] - devicePtr[i]) > tolerance) {
            isEqual = false;
            printf("Arrays are NOT equal because:\n");
            printf("at index %d: hostPtr[%d] = %5.2f "
                   "and devicePtr[%d] = %5.2f;\n",
                   i, i, hostPtr[i], i, devicePtr[i]);
            break;
        }
    }

    if (isEqual) {
        printf("Arrays are equal.\n\n");
    }
    return;
}
```
````
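Since ***cCode.c*** is plain C with no CUDA dependencies, you can sanity-check it in isolation by compiling it into an object file with a host C compiler (assuming, for instance, `gcc` is available):

````{tab-set-code}

```{code-block} shell
$ gcc -c cCode.c -o cCode.o
```
````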
After moving all C-based functions into their own source and header files, we include ***cCode.h*** in both the ***cCode.c*** and ***gpuVectorSum.cu*** files. The latter file should now contain the following code:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cCode.h"
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);

    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on
     * CPU and GPU yield the same results */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}
```
````
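Before compiling, note how the listing sizes the grid: the expression `(vecSize + block.x - 1) / block.x` is an integer *ceiling division* that launches enough blocks to cover every element. A small annotated sketch of the arithmetic:

````{tab-set-code}

```{code-block} cuda
/* Ceiling division for the grid size:
 * grid.x = (vecSize + block.x - 1) / block.x
 *        = (16777216 + 1024 - 1) / 1024
 *        = 16384 blocks of 1024 threads each.
 * If vecSize were not a multiple of block.x, the surplus threads
 * in the last block would be masked by the kernel's
 * `if (idx < size)` guard. */
int  vecSize = 1 << 24;
dim3 block(1024);
dim3 grid((vecSize + block.x - 1) / block.x);
```
````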
Now, let's try to compile our multiple source files into an executable using nvcc and run it with the following commands:

````{tab-set-code}

```{code-block} shell
$ nvcc gpuVectorSum.cu cCode.c -o gpuVectorSum
$ ./gpuVectorSum
```
````

After running the aforementioned commands, you will likely get error messages like the following:

````{tab-set-code}

```{code-block} error
/tmp/tmpxft_000050f6_00000000-11_test.o: In function `main':
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x16a): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x181): undefined reference to `dataInitializer(float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x227): undefined reference to `arraySumOnHost(float*, float*, float*, int)'
tmpxft_000050f6_00000000-6_test.cudafe1.cpp:(.text+0x477): undefined reference to `arrayEqualityCheck(float*, float*, int)'
collect2: error: ld returned 1 exit status
```
````

The error comes from the GNU linker, `ld`, which complains about undefined references to the functions we put in the ***cCode.c*** source file. Can you guess what the problem is? The main job of `ld` is to combine object and archive files, rearrange their data, and resolve their symbol references. Calling `ld` is usually the last step of the compilation process, so you can guess that there should not be any bugs within your code itself; the cause must be something else.

The source of the problem may not be obvious to many readers: nvcc uses the host's C++ compiler by default for all non-GPU code. The C++ compiler (and thus nvcc) therefore expects C++ *mangled* symbol names, and does not know that the definitions you are trying to link in ***cCode.c*** were compiled as C (not C++) functions with unmangled names. The solution is familiar to programmers with experience in calling C functions from C++ programs: the `extern "C" { ... }` linkage specification. By wrapping this snippet around the `#include "cCode.h"` pre-processor directive, you should be able to compile your code with the same command without any issues.
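Concretely, the include in ***gpuVectorSum.cu*** becomes the following (the complete revised file is shown a bit further below):

````{tab-set-code}

```{code-block} cuda
/* extern "C" tells the host C++ compiler that the declarations in
 * cCode.h have (unmangled) C linkage, so the linker can match them
 * against the definitions compiled from the C file cCode.c. */
extern "C" {
#include "cCode.h"
}
```
````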
We have now localized the C function definitions into their corresponding implementation source files and headers. However, we still have a CUDA kernel function definition left in the ***gpuVectorSum.cu*** file. Similar to what we did for the C function definitions, CUDA kernel definitions can also be moved to their own source files. The first step is to create a header file, ***cudaCode.h***, which should contain the declaration signature of the `arraySumOnDevice()` kernel function and look like the following code block:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================== cudaCode.h ==================*/
/*================================================*/
#ifndef CUDACODE_H
#define CUDACODE_H

__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size);

#endif // CUDACODE_H
```
````

Next, create a source file for the kernel definition and save it as ***cudaCode.cu***. Its contents should be the same as the following code block:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================== cudaCode.cu =================*/
/*================================================*/
#include "cudaCode.h"
#include <cuda_runtime.h>
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
```
````

Do not forget to add the ***cudaCode.h*** pre-processor `#include` directive to both the ***cudaCode.cu*** and ***gpuVectorSum.cu*** files. The latter file no longer contains any function definitions, for either the host or the device code domain, and now has the following structure:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);

    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on
     * CPU and GPU yield the same results */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}
```
````

Localizing each code domain in its own implementation source file imposes a logical hierarchy on our package structure, and it makes debugging more convenient and efficient since each source file is now shorter. To compile our code, which now includes an additional source file, we run the following commands:

````{tab-set-code}

```{code-block} shell
$ nvcc gpuVectorSum.cu cCode.c cudaCode.cu -o gpuVectorSum
$ ./gpuVectorSum
```
````
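As an aside: in our code, only *host* functions cross file boundaries, so the command above suffices. If device code in one *.cu* file had to call device functions defined in another *.cu* file, nvcc's [separate compilation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda) mode with relocatable device code would be needed; a sketch (assuming the same file names):

````{tab-set-code}

```{code-block} shell
$ nvcc -dc gpuVectorSum.cu cudaCode.cu                # compile with relocatable device code
$ nvcc gpuVectorSum.o cudaCode.o cCode.c -o gpuVectorSum   # device-link and host-link
```
````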
There are still opportunities in the `main()` function within the ***gpuVectorSum.cu*** file for further encapsulation of code into new functions, which can subsequently be transferred to the ***cCode.c*** or ***cudaCode.cu*** source files and their corresponding headers. The following exercise asks you to find these opportunities and use them to make the code even shorter and therefore more readable and easier to work with.

``````{admonition} Exercise
:class: exercise

Parts of our code can still be made shorter by encapsulating groups of expressions and statements into a function. See if you can find these code parts within the ***gpuVectorSum.cu*** file and transfer them into the ***cCode.c*** or ***cudaCode.cu*** source files.

`````{admonition} Solution
:class: solution dropdown

The three lines under the `/* Device properties */` comment can be transferred into a function called `deviceProperties(deviceIdx)`, which takes the device index `deviceIdx` as its input argument. So, the new ***cudaCode.cu*** file should take the following form:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================== cudaCode.cu =================*/
/*================================================*/
#include "cudaCode.h"
#include <cuda_runtime.h>
#include <stdio.h>
/*-----------------------------------------------*/
__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        C[idx] = A[idx] + B[idx];
    }
}
/*-----------------------------------------------*/
__host__ void deviceProperties(int deviceIdx) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, deviceIdx);
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
}
```
````

and the corresponding header file, ***cudaCode.h***, should be the same as the following code block:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================== cudaCode.h ==================*/
/*================================================*/
#ifndef CUDACODE_H
#define CUDACODE_H

__global__ void arraySumOnDevice(float *A, float *B, float *C, const int size);
__host__ void deviceProperties(int deviceIdx);

#endif // CUDACODE_H
```
````

Note that we have used the `__host__` declaration qualifier for `deviceProperties()`: it is called from, and runs on, the host, and no device-side memory management is required for it. With the aforementioned changes, the ***gpuVectorSum.cu*** file takes the following form:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    cudaSetDevice(deviceIdx);

    /* Device properties */
    deviceProperties(deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);

    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, vecSizeInBytes);
    cudaMalloc((float**)&d_B, vecSizeInBytes);
    cudaMalloc((float**)&d_C, vecSizeInBytes);

    /* Data transfer from host to device */
    cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice);

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    cudaDeviceSynchronize();
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    cudaGetLastError();

    /* Data transfer back from device to host */
    cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost);

    /* Check to see if the array summations on
     * CPU and GPU yield the same results */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}
```
````
`````
``````

## 3. Error Handling

Many CUDA operations, kernel launches in particular, are asynchronous: the execution flow returns to the host immediately after the call. This asynchronous behavior makes it difficult to identify and troubleshoot the source of an error when several CUDA functions have been called consecutively. Fortunately, with the exception of kernel launches, CUDA functions return error codes of the [`cudaError_t`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1gf599e5b8b829ce7db0f5216928f6ecb6) enumerated type. As such, we can define an error-handling macro to wrap around CUDA function calls and check them for possible errors.

To include a macro definition within our [Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus) code, open the ***cCode.h*** header file and add the following macro definition to it:

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*==================== cCode.h ===================*/
/*================================================*/
#ifndef CCODE_H
#define CCODE_H

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define ERRORHANDLER(funcCall) { \
    const cudaError_t error = funcCall; \
    const char *errorMessage = cudaGetErrorString(error); \
    if (error != cudaSuccess) { \
        printf("Error in file %s, line %d, code %d, message: %s\n", \
               __FILE__, __LINE__, error, errorMessage); \
        exit(EXIT_FAILURE); \
    } \
}
/*************************************************/
inline double chronometer() {
    struct timezone tzp;
    struct timeval tp;
    gettimeofday(&tp, &tzp);
    return ((double)tp.tv_sec + (double)tp.tv_usec * 1.e-6);
}
/*-----------------------------------------------*/
void dataInitializer(float *inputArray, int size);
void arraySumOnHost(float *A, float *B, float *C, const int size);
void arrayEqualityCheck(float *hostPtr, float *devicePtr, const int size);

#endif // CCODE_H
```
````

Note that backslashes (`\`) are used as *line continuation* escape characters, since a macro must be defined on a single logical line. Each backslash must be the last character on its line; otherwise you will get a compilation error. We can now wrap our CUDA function calls within the `ERRORHANDLER()` macro in order to capture errors. If an error happens, the `ERRORHANDLER()` macro prints the error code in a human-readable format and terminates the program by calling `exit(EXIT_FAILURE)`.
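Kernel launches are the exception: they return `void` and cannot be wrapped directly. A common pattern (a sketch based on the code already in our `main()`, shown here for illustration) is to check the launch and the execution separately:

````{tab-set-code}

```{code-block} cuda
/* A kernel launch returns void, so it cannot be passed to
 * ERRORHANDLER() directly. Instead, check the launch itself with
 * cudaGetLastError() and any execution-time failure with
 * cudaDeviceSynchronize(). */
arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
ERRORHANDLER(cudaGetLastError());       /* launch (configuration) errors */
ERRORHANDLER(cudaDeviceSynchronize());  /* asynchronous execution errors */
```
````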
``````{admonition} Exercise
:class: exercise

Try to wrap every CUDA function call in the ***gpuVectorSum.cu*** file with the `ERRORHANDLER()` macro.

`````{admonition} Solution
:class: solution dropdown

````{tab-set-code}

```{code-block} cuda
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
    printf("Kicking off %s\n\n", argv[0]);

    /* Device setup */
    int deviceIdx = 0;
    ERRORHANDLER(cudaSetDevice(deviceIdx));

    /* Device properties */
    deviceProperties(deviceIdx);
/*-----------------------------------------------*/
    /* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
    int vecSize = 1 << 24;
    size_t vecSizeInBytes = vecSize * sizeof(float);
    printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);

    /* Memory allocation on the host */
    float *h_A, *h_B, *hostPtr, *devicePtr;
    h_A       = (float *)malloc(vecSizeInBytes);
    h_B       = (float *)malloc(vecSizeInBytes);
    hostPtr   = (float *)malloc(vecSizeInBytes);
    devicePtr = (float *)malloc(vecSizeInBytes);

    double tStart, tElapsed;

    /* Vector initialization on the host */
    tStart = chronometer();
    dataInitializer(h_A, vecSize);
    dataInitializer(h_B, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for dataInitializer: %f second(s)\n", tElapsed);

    memset(hostPtr, 0, vecSizeInBytes);
    memset(devicePtr, 0, vecSizeInBytes);

    /* Vector summation on the host */
    tStart = chronometer();
    arraySumOnHost(h_A, h_B, hostPtr, vecSize);
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnHost: %f second(s)\n", tElapsed);
/*-----------------------------------------------*/
    /* (Global) memory allocation on the device */
    float *d_A, *d_B, *d_C;
    ERRORHANDLER(cudaMalloc((float**)&d_A, vecSizeInBytes));
    ERRORHANDLER(cudaMalloc((float**)&d_B, vecSizeInBytes));
    ERRORHANDLER(cudaMalloc((float**)&d_C, vecSizeInBytes));

    /* Data transfer from host to device */
    ERRORHANDLER(cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice));
    ERRORHANDLER(cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice));
    ERRORHANDLER(cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice));

    /* Organizing grids and blocks */
    int numThreadsInBlocks = 1024;
    dim3 block (numThreadsInBlocks);
    dim3 grid  ((vecSize + block.x - 1) / block.x);

    /* Execute the kernel from the host */
    tStart = chronometer();
    arraySumOnDevice<<<grid, block>>>(d_A, d_B, d_C, vecSize);
    ERRORHANDLER(cudaDeviceSynchronize());
    tElapsed = chronometer() - tStart;
    printf("Elapsed time for arraySumOnDevice <<< %d, %d >>>: %f second(s)\n\n",
           grid.x, block.x, tElapsed);
/*-----------------------------------------------*/
    /* Returning the last error from a runtime call */
    ERRORHANDLER(cudaGetLastError());

    /* Data transfer back from device to host */
    ERRORHANDLER(cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost));

    /* Check to see if the array summations on
     * CPU and GPU yield the same results */
    arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
    /* Free the allocated memory on the device */
    ERRORHANDLER(cudaFree(d_A));
    ERRORHANDLER(cudaFree(d_B));
    ERRORHANDLER(cudaFree(d_C));

    /* Free the allocated memory on the host */
    free(h_A);
    free(h_B);
    free(hostPtr);
    free(devicePtr);

    return(EXIT_SUCCESS);
}
```
````

We should point out that our `deviceProperties()` function is not a CUDA API function, so it is not wrapped. Since it encapsulates the `cudaGetDeviceProperties()` CUDA function within its implementation, we could instead wrap the `ERRORHANDLER()` macro directly around that call inside the `deviceProperties()` function definition. However, this would add a C-based header file to our device-side code, which contradicts our original intention of separating the device- and host-side code domains into different source files. This is one of those situations where we have to compromise: either add the ***cCode.h*** header to our ***cudaCode.cu*** file, or leave the `cudaGetDeviceProperties()` CUDA function un-encapsulated within the `main()` function, where it can be wrapped directly.
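For illustration, a sketch of the first compromise (our own variation, not part of the lesson code) would change ***cudaCode.cu*** as follows:

````{tab-set-code}

```{code-block} cuda
/* Sketch of the first compromise: pull cCode.h (and with it the
 * ERRORHANDLER() macro) into cudaCode.cu so that the runtime call
 * inside deviceProperties() is checked too. The cost is mixing a
 * host-side C header into the device-side source file, which is
 * the trade-off discussed above. */
#include "cudaCode.h"
#include <cuda_runtime.h>
#include <stdio.h>
extern "C" {
#include "cCode.h"
}

__host__ void deviceProperties(int deviceIdx) {
    cudaDeviceProp deviceProp;
    ERRORHANDLER(cudaGetDeviceProperties(&deviceProp, deviceIdx));
    printf("GPU device %s with index (%d) is set!\n\n",
           deviceProp.name, deviceIdx);
}
```
````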
`````
``````

````{admonition} Key Points
:class: key

- The NVCC compiler
- Compilation phases
- Compiling multiple CPU and GPU source code files simultaneously
- Error handling in a CUDA program
````