# CUDA Execution Model
````{admonition} Overview
:class: overview
Questions:
* What is CUDA execution model?
* How insights from GPU architecture helps CUDA programmers to write more efficient software?
* What are streaming multiprocessors and thread warps?
* What is profiling and why is it important to a programmer?
* How many profiling tools for CUDA programming are available and which one(s) should I choose?
Objectives:
* Understanding the fundamentals of the CUDA execution model
* Establishing the importance of knowledge from GPU architecture and its impacts on the efficiency of a CUDA program
* Learning about the building blocks of GPU architecture: streaming multiprocessors and thread warps
* Mastering the basics of profiling and becoming proficient in adopting profiling tools in CUDA programming
````
## 1. GPU Architecture
We previously mentioned that the sole knowledge of the semantics in CUDA
programming will not probably be enough for achieving the best possible
performance in practical applications. Today's smart compilers
can largely mitigate this issue by optimizing our codes. Nevertheless,
writing a CUDA program based on insights about the GPU architecture
and limitations of the device in use, offers significant advantages
over blind coding and allows for the best possible thread organization
and kernel configuration to maximize the performance.
As such, CUDA execution model provides a logical view of thread concurrency
within the SIMT framework and bridges between the **streaming multiprocessors (SMs)**,
as the central building block of GPU architecture, and the improvement of memory access
and instruction throughput.
The SMs partition the thread blocks into the units of 32 consecutive threads
called **warps** which can be scheduled for execution by *warp schedulers*.
Within a warp, the consecutive threads in blocks have unique
IDs and are indexed in ascending order starting from 0.
For a given thread block, the number of warps can be calculated using the
following formula
$$ \tag{1}\label{EQ:WARPPERBLOCK}
\text{WarpsPerBlock} = \bigg\lceil\frac{\text{threadsPerBlock}}{\text{warpSize}}\bigg\rceil,
$$
where $$ \lceil \cdot \rceil $$ in Eq. \eqref{EQ:WARPPERBLOCK} is the
[ceiling function](https://dlmf.nist.gov/front/introduction#common.p1.t1.r18)
and *threadsPerBlock* is defined as
$$ \tag{2}\label{EQ:THRDPERBLOCK}
\text{threadsPerBlock} = \sum_q \text{block.}q \qquad \quad \text{where} \qquad q \in \{x, y, z \}
$$
According to Eqs. \eqref{EQ:WARPPERBLOCK} and \eqref{EQ:THRDPERBLOCK},
a warp is never distributed between different thread blocks.
CUDA manages valuable and limited resources such as registers
and shared memory on SMs and distributes them among all threads.
The limitations of these resources affects the number of active warps
that can be created and the level of SIMT parallelism that
can be realized in a particular SM as a result. As such, in order to
maximize the GPU utilization and achieve the best performance through
increasing the number of active warps with respect to the limited
resources, a proper thread organization and execution configuration
would be of imperative. For example, if we organize a $4 \times 20$
array of threads in a 2-dimensional block (total of 80 threads/block)
using CUDA on the software side, the device will allocate 3 warps for
this block on the hardware side resulting in the allocation of resources
targeted for supporting 96 threads. As such, 16 threads in one of the
three warps remain inactive while still consuming (wasting) the allocated resources.
```{admonition} Note
:class: note
Although from the logical point of view, threads
can be organized within 1-, 2- and 3-dimensional blocks,
from hardware perspective, all threads can only exist in a one-dimensional
world.
```
All threads in a warp execute the same instruction on their own private data.
Therefore, maximum efficiency can be realized if all 32 threads are on
the same execution path. However, the execution of different instructions
(different execution paths) by threads *within a warp* causes
**warp divergence**. The diverged threads within a warp will execute
each execution path in the serial mode (loss of parallelism) and disable those
threads that took a different execution path. As such, warp divergence should be
avoided at all costs because it results in serious performance deteriorations.
Note that warp divergence can only happen among the threads of the same warp and
would be irrelevant if threads of different warps are considered.
This fact about the relation between GPU architecture and software performance
should provide a hint regarding the importance of having some knowledge about the
hardware aspects and CUDA execution model.
Let us look at the main parts of a Fermi SM from a logical perspective
to be able to gain some insights about the hardware micro-architectures
in NVDIA GPU devices.
```{image} ../fig/SM.png
:align: center
```
With each architecture release, NVIDIA has attempted to introduce new technological
breakthroughs or major improvements over predecessor models. Therefore, although
the essential parts of the Fermi architecture illustrated above remains somewhat
similar across different architectures, each architecture might offer a very different
hardware design for its SMs. For example, in
[Turing SM](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
series, 16 trillion concurrent floating-point operations and integer operations per second are
supported. Turing also features a unified architecture for enhanced shared memory, L1, and
texture memory cache. Tensor core processors in Turing architecture accelerate
deep learning training and inference up to 500 trillion tensor operations per second.
Finally, the evolutionary real-time ray tracing (RT) core performance in Turing architecture
was improved by about 8 times over its predecessor architecture, Pascal.
## 2. Profiling Tools
The profiling tools help CUDA programmers to better understand, analyze and
find opportunities to improve the performance and efficiency of a CUDA program
in a systematic way. NVIDIA provides three major tools for profiling a CUDA program:
1. **NVIDIA Visual Profiler (NVVP)**: A graphical software that provides a timeline for
analyzing the CPU/GPU execution activities and identifying the optimization opportunities
for improving the performance of the CUDA program
2. **NVIDIA Profiler (NVPROF)**: A command-line-based profiling tool for collecting the profiling
data and the CUDA-related activities on both CPU and GPU
3. **NVIDIA Nsight Tools**: An interactive next-generation kernel profiler tool for CUDA applications
Since both **nvvp** and **nvprof** profilers are going to be deprecated in future CUDA releases,
NVIDIA recommends [migrating](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#migrating-to-nsight-tools)
to [**NVIDIA Nsight System**](https://developer.nvidia.com/nsight-systems) and
[**NVIDIA Nsight Compute**](https://developer.nvidia.com/nsight-compute) for GPU/CPU sampling,
and tracing purposes, and GPU kernel profiling, respectively. As such,
in this tutorial we are going to present a brief overview the basics of
traditional profilers, nvvp and nvprof. We will adopt the Nsight Compute
tools in our future intermediate and advanced level tutorials which will
be based on profiling-driven optimization approaches toward CUDA programming.
### 2.1. Command-Line NVIDIA Profiler
Let us adopt nvprof to analyze the timings in our
[Summation of Arrays on GPUs]({{site.baseurl}}{% link _episodes/03-cuda-program-model.md %}#3-summation-of-arrays-on-gpus)
program, which we had broken it down into multiple source files
in the previous [lesson]({{site.baseurl}}{% link _episodes/04-gpu-compilation-model.md %}#2-compiling-separate-source-files-using-nvcc).
The nvprof profiler can be easily customized to provide different contents and ways
of collecting data on various GPU/CPU activities through adopting the
following [semantics](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview)
````{tab-set-code}
```{code-block} shell
$ nvprof [options] [application] [appOptions]
```
````
Since nvprof can give the time spent on each GPU kernel and CUDA API
function calls, we will not need to use the `chronometer()` C-based function.
Therefore, let us comment out or remove those function
calls and their corresponding print functions from our ***gpuVectorSum.cu***
source file which should look like the following
````{tab-set-code}
```{code-block} cuda
/*================================================*/
/*================ gpuVectorSum.cu ===============*/
/*================================================*/
#include
#include
#include
#include "cudaCode.h"
extern "C" {
#include "cCode.h"
}
/*************************************************/
int main(int argc, char **argv) {
printf("Kicking off %s\n\n", argv[0]);
/* Device setup */
int deviceIdx = 0;
ERRORHANDLER(cudaSetDevice(deviceIdx));
/* Device properties */
deviceProperties(deviceIdx);
/*-----------------------------------------------*/
/* Fixing the vector size to 1 * 2^24 = 16777216 (64 MB) */
int vecSize = 1 << 24;
size_t vecSizeInBytes = vecSize * sizeof(float);
printf("Vector size: %d floats (%lu MB)\n\n", vecSize, vecSizeInBytes/1024/1024);
/* Memory allocation on the host */
float *h_A, *h_B, *hostPtr, *devicePtr;
h_A = (float *)malloc(vecSizeInBytes);
h_B = (float *)malloc(vecSizeInBytes);
hostPtr = (float *)malloc(vecSizeInBytes);
devicePtr = (float *)malloc(vecSizeInBytes);
/* Vector initialization on the host */
dataInitializer(h_A, vecSize);
dataInitializer(h_B, vecSize);
memset(hostPtr, 0, vecSizeInBytes);
memset(devicePtr, 0, vecSizeInBytes);
/* Vector summation on the host */
arraySumOnHost(h_A, h_B, hostPtr, vecSize);
/*-----------------------------------------------*/
/* (Global) memory allocation on the device */
float *d_A, *d_B, *d_C;
ERRORHANDLER(cudaMalloc((float**)&d_A, vecSizeInBytes));
ERRORHANDLER(cudaMalloc((float**)&d_B, vecSizeInBytes));
ERRORHANDLER(cudaMalloc((float**)&d_C, vecSizeInBytes));
/* Data transfer from host to device */
ERRORHANDLER(cudaMemcpy(d_A, h_A, vecSizeInBytes, cudaMemcpyHostToDevice));
ERRORHANDLER(cudaMemcpy(d_B, h_B, vecSizeInBytes, cudaMemcpyHostToDevice));
ERRORHANDLER(cudaMemcpy(d_C, devicePtr, vecSizeInBytes, cudaMemcpyHostToDevice));
/* Organizing grids and blocks */
int numThreadsInBlocks = 1024;
dim3 block (numThreadsInBlocks);
dim3 grid ((vecSize + block.x - 1) / block.x);
/* Execute the kernel from the host*/
arraySumOnDevice<<>>(d_A, d_B, d_C, vecSize);
ERRORHANDLER(cudaDeviceSynchronize());
/*-----------------------------------------------*/
/* Returning the last error from a runtime call */
ERRORHANDLER(cudaGetLastError());
/* Data transfer back from device to host */
ERRORHANDLER(cudaMemcpy(devicePtr, d_C, vecSizeInBytes, cudaMemcpyDeviceToHost));
/* Check to see if the array summations on
* CPU and GPU yield the same results
*/
arrayEqualityCheck(hostPtr, devicePtr, vecSize);
/*-----------------------------------------------*/
/* Free the allocated memory on the device */
ERRORHANDLER(cudaFree(d_A));
ERRORHANDLER(cudaFree(d_B));
ERRORHANDLER(cudaFree(d_C));
/* Free the allocated memory on the host */
free(h_A);
free(h_B);
free(hostPtr);
free(devicePtr);
return(EXIT_SUCCESS);
}
```
````
After re-compiling our code, prepend the bash running command
with nvprof as follows
````{tab-set-code}
```{code-block} shell
$ nvcc gpuVectorSum.cu cCode.c cudaCode.cu -o gpuVectorSum
$ nvprof ./gpuVectorSum
```
````
The resulting output will be similar to the following
````{tab-set-code}
```{code-block} output
Kicking off ./test
==5906== NVPROF is profiling process 5906, command: ./test
GPU device GeForce GTX 1650 with index (0) is set!
Vector size: 16777216 floats (64 MB)
Arrays are equal.
==5906== Profiling application: ./test
==5906== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 71.57% 36.369ms 3 12.123ms 12.075ms 12.160ms [CUDA memcpy HtoD]
24.93% 12.669ms 1 12.669ms 12.669ms 12.669ms [CUDA memcpy DtoH]
3.50% 1.7770ms 1 1.7770ms 1.7770ms 1.7770ms arraySumOnDevice(float*, float*, float*, int)
API calls: 78.54% 196.20ms 3 65.401ms 190.15us 195.82ms cudaMalloc
19.84% 49.555ms 4 12.389ms 12.210ms 12.840ms cudaMemcpy
0.74% 1.8394ms 1 1.8394ms 1.8394ms 1.8394ms cudaDeviceSynchronize
0.24% 591.60us 1 591.60us 591.60us 591.60us cudaGetDeviceProperties
0.23% 570.08us 101 5.6440us 495ns 254.44us cuDeviceGetAttribute
0.20% 490.82us 3 163.61us 146.21us 195.84us cudaFree
0.16% 387.42us 1 387.42us 387.42us 387.42us cuDeviceTotalMem
0.04% 110.59us 1 110.59us 110.59us 110.59us cuDeviceGetName
0.01% 26.566us 1 26.566us 26.566us 26.566us cudaLaunchKernel
0.00% 9.2890us 1 9.2890us 9.2890us 9.2890us cuDeviceGetPCIBusId
0.00% 9.1680us 1 9.1680us 9.1680us 9.1680us cudaSetDevice
0.00% 5.6390us 3 1.8790us 686ns 4.2030us cuDeviceGetCount
0.00% 3.5500us 2 1.7750us 550ns 3.0000us cuDeviceGet
0.00% 1.5830us 1 1.5830us 1.5830us 1.5830us cuDeviceGetUuid
0.00% 401ns 1 401ns 401ns 401ns cudaGetLastError
```
````
The output is a mixture of C-based `printf()` function results and
what nvprof has printed. In this output, `==5906==`, shows the process
ID (PID) which is assigned by the operating system to the application's
execution. The `== ==` also signifies the starting point in every
nvprof message section. The timing units are given in seconds (s),
milliseconds (*m*s), microseconds (*u*s) or nanoseconds (*n*s).
The *GPU activities* part within the *Profiling results* section of the
nvprof output demonstrates that about 96% of GPU's activities were focused
on performing data transfer and non-computational tasks while around only
4% of the device's efforts were directed to the actual computation.
The columns *Calls*, *Avg*, *Min*, and *Max* show the number of calls,
average, minimum and maximum timings for these number of function calls,
respectively. Clearly, if there is only one call to a specific function,
Min, Max, and Avg columns show the same number.
This simple example is only a toy model, but enough to illustrate the typical
conclusions one can reach when analyzing a CUDA program using profiling
tools. For example, it shows that the task at hand is not
"data-intensive enough" to be suitable for heterogeneous GPU+CPU
parallelization due to the large ratio of non-computational tasks' overhead
such as data transfer and memory allocation over the target computational
instruction (*i.e.*, the array summation). The simplicity of this example
also rules out the possibility of making major changes to the algorithmic
aspects of the code which might lead to significant performance improvements.
The *API calls* part of the *Profiling results* section of the nvprof output
focuses on CUDA (runtime and driver) API calls which allows us to have a
fine-grid look at the events happening behind the scene.
```{admonition} Note
:class: note
The profiling results of nvprof are [directed](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#redirecting-output)
to `stderr` by default. You can use the `--log-file ` option
within the nvprof command line to save the output in a file named
``.
```
### 2.2. NVIDIA Visual Profiler
After the compilation step, instead of running the executable with
nvprof profiler, one can profile it using nvvp by creating an (executable)
[session](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#sessions).
Upon opening the executable file from nvvp, we can choose from a list of profiling
options and methods such as *guided analysis* to be able to collect data from
our CUDA program. Guided analysis gives us systematic suggestions which help us find
opportunities within our code for performance improvements.
The main graphical user interface for nvvp is illustrated in the following
figure where the main screen is split into separate panels and organized
as *views*
```{image} ../fig/nvvp.png
:align: center
```
The main section at the top of our screen is the *timeline view* detailing
the GPU and CPU activities in your program as a function of execution time.
Timelines consist of timeline rows, each of which shows the beginning and
the end of the lifetime of activities indicated by the corresponding row label.
Sub-rows might also be used for overlapping activities.
The *analysis view* panel manages the application analysis and
presents the results. There are two modes of analysis: (i) *guided*,
and (ii) *unguided*. In the guided analysis mode, nvvp walks you through
a list of analysis stages to clarify what factors limit the performance of your
application and where you can find opportunities in your program for improvement.
In the unguided mode, the profiling results are collected
and presented to you in such a way that you, instead of system, choose
the focus stages and decide which one takes the priority to be further
investigated. In the figure presented above, the bottom-left and
middle panels, *i.e.*, the analysis and results views represent the
profiling analysis stages and results, respectively in the unguided mode.
The *Data Movement and Concurrency* stage in this case, gives us four
possible places that need improvements for performance optimization.
The first result, for example, is what we had inferred from timing results of
nvprof: the ratio of actual computation on the device over the GPU time
spent on non-computational tasks is small.
The *Dependency Analysis* stage, which we is shown in the following figure,
provides profiling information on the timings as well as correlation or
dependency of various activities during the program lifetime. Please refer
to the [CUDA Toolkit documentation](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual)
for further details about other view types and different aspects of profiling
in nvvp.
```{image} ../fig/dependencyAnalysis.png
:align: center
```
````{admonition} Key Points
:class: key
- CUDA execution model
- Streaming multiprocessors and thread warps
- Profiling tools for CUDA programming
````