This lesson is still being designed and assembled (Pre-Alpha version)

Performance Guidelines and Optimization Strategies


Teaching: 60 min
Exercises: 0 min

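The NVIDIA Performance Guidelines offer the following basic strategies for optimizing the performance of an application: (i) maximizing device utilization through parallel execution, (ii) maximizing memory throughput, (iii) maximizing instruction throughput, and (iv) minimizing memory thrashing.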

The achievable performance gain usually depends on the program and the system. For example, optimizing the instruction usage of a kernel that is mostly limited by its memory accesses will not yield a significant improvement. Therefore, all optimization efforts should be guided by quantitative analysis tools such as the NVIDIA Nsight Systems and Nsight Compute profilers, which offer a wide variety of performance metrics for CUDA parallel programs. For instance, the Nsight Compute profiler provides a GPU Speed of Light section, whose metrics give a high-level overview of the GPU’s memory and compute throughput, expressed as the achieved percentage of the theoretical maximum of each metric. These metrics therefore indicate how much room for improvement a kernel still has.

In the following sections, let us briefly overview the performance optimization strategies mentioned above.

1.1. Maximization of the Device Utilization

To maximize the utilization of the device’s resources, the developer should expose as much of the program’s parallelism as possible at the different logical levels of the system: (i) the application, (ii) the device, and (iii) the multiprocessor.

The main goal at the application level is to maximize concurrent execution between the host and the device(s) by adopting asynchronous CUDA APIs and streams. As such, one attempts to assign as much parallel work to the device and as much serial work to the host as possible.


Sometimes parallel execution must be interrupted so that threads can synchronize and share data with one another. If the threads belong to the same thread block, they can synchronize via __syncthreads() and share data through shared memory within a single kernel execution. Threads from different blocks, however, must share data through the lower-bandwidth global memory across separate kernel executions. This second, less performant scenario should be minimized because of the kernel launch overheads and the slower global memory transfers.
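The two scenarios above can be illustrated with a sum reduction. The following is a minimal sketch, not a tuned implementation: the kernel name blockSum, the helper deviceSum, and the block size kThreads are hypothetical names chosen for this example, error checking is omitted for brevity, and the input is assumed to hold at most kThreads * kThreads elements so that a second launch with a single block can combine the per-block results.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block (assumed to be a power of two)

// Threads of one block cooperate through shared memory and __syncthreads();
// each block writes a single partial sum to global memory.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every load into shared memory is now visible to the block

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // finish this level before starting the next one
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical host helper; assumes h.size() <= kThreads * kThreads.
float deviceSum(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    int blocks = (n + kThreads - 1) / kThreads;
    float *dIn = nullptr, *dPartial = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // First launch: data is shared within each block via shared memory.
    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);
    // Second launch: results of different blocks are exchanged via global memory.
    blockSum<<<1, kThreads>>>(dPartial, dTotal, blocks);

    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return total;
}

int main() {
    std::vector<float> h(1 << 16, 1.0f);
    printf("sum = %.1f\n", deviceSum(h));
    return 0;
}
```

The second kernel launch is the price paid for sharing data between blocks, which is why keeping cooperation inside a block is preferable where the algorithm allows it.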

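The following operations can be performed independently of, and concurrently with, one another: computation on the host, computation on the device, memory transfers from the host to the device, memory transfers from the device to the host, memory transfers within the memory of a given device, and memory transfers among devices.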

The CUDA library’s asynchronous function calls allow users to dispatch multiple device operations, which are queued and executed as resources become available. Relieving the host of device management work in this way, and benefiting from the resulting concurrency, frees it to take part in other tasks at the same time, which might improve the overall performance.
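As a rough illustration of this idea, the sketch below launches a kernel, lets the host carry on with unrelated work, and only then waits for the device. The kernel scale and the function hostWork are hypothetical placeholders for real device and host tasks, and most error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device task: multiply every element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical stand-in for independent, serial host-side work.
long hostWork(int n) {
    long acc = 0;
    for (int i = 0; i < n; ++i) acc += i % 7;
    return acc;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // A kernel launch is asynchronous: control returns to the host at once.
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);

    // The host is free to do other work while the device is busy.
    long acc = hostWork(1000000);

    // Block the host until all previously issued device work has finished.
    cudaError_t err = cudaDeviceSynchronize();
    printf("host result = %ld, device status = %s\n", acc, cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```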

Some GPUs with compute capability 2.0 and higher can execute multiple kernels concurrently. Whether concurrent kernel execution is supported can be queried through the concurrentKernels field of the device’s cudaDeviceProp structure. The maximum number of concurrently executing kernels also depends on the device’s compute capability and can be found in the CUDA Toolkit documentation. In addition to multiple kernels executing concurrently, data transfers/memory copies between the host and the device, as well as intra-device operations, can run asynchronously with respect to one another or to kernel launches. The asyncEngineCount field can be queried to see whether overlapping kernel execution with data transfers is supported on the available device(s); a value greater than zero indicates support.
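Both properties can be read with cudaGetDeviceProperties(). The short program below is one possible way to print them for every visible device; the helper name reportConcurrencySupport is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the concurrency-related properties of every
// visible device and returns the number of devices found.
int reportConcurrencySupport() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        // Non-zero if the device can run multiple kernels concurrently.
        printf("  concurrentKernels = %d\n", prop.concurrentKernels);
        // Number of copy engines; > 0 means copies can overlap with kernels.
        printf("  asyncEngineCount  = %d\n", prop.asyncEngineCount);
    }
    return count;
}

int main() {
    if (reportConcurrencySupport() == 0) printf("No CUDA device found\n");
    return 0;
}
```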


Host memory that takes part in overlapped memory copy/data transfer operations must be page-locked.
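A minimal sketch of this requirement is given below: the host buffers are allocated with cudaMallocHost(), which returns page-locked memory, and are then used with cudaMemcpyAsync(). The helper name pinnedRoundTrip and the buffer names are hypothetical, and the example only checks that the data survives a round trip to the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats host -> device -> host through
// page-locked buffers and reports whether the data came back unchanged.
bool pinnedRoundTrip(int n) {
    const size_t bytes = n * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr, *d = nullptr;

    // cudaMallocHost returns page-locked (pinned) host memory.
    bool ok = cudaMallocHost(&hIn, bytes) == cudaSuccess &&
              cudaMallocHost(&hOut, bytes) == cudaSuccess &&
              cudaMalloc(&d, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) { hIn[i] = float(i); hOut[i] = -1.0f; }

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // Both calls return immediately; the stream keeps them in order.
        cudaMemcpyAsync(d, hIn, bytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(hOut, d, bytes, cudaMemcpyDeviceToHost, stream);
        // Wait until both copies have completed before touching hOut.
        ok = cudaStreamSynchronize(stream) == cudaSuccess;
        cudaStreamDestroy(stream);

        for (int i = 0; ok && i < n; ++i) ok = (hOut[i] == hIn[i]);
    }
    cudaFree(d);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}

int main() {
    printf("round trip %s\n", pinnedRoundTrip(1 << 20) ? "succeeded" : "failed");
    return 0;
}
```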

In CUDA applications, concurrent operations, including data transfers and kernel executions, are managed through streams. A stream is a sequence of commands that execute in order, whereas commands in different streams can run concurrently or out of order with respect to one another. Synchronization commands can be used to guarantee that independent commands launched in different streams have completed.
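One common way to use streams is to split the data into chunks and give each chunk its own stream, so that the copy of one chunk may overlap with the kernel of another. The sketch below assumes four streams and a trivial kernel; the names addOne, runInStreams, kStreams, and kChunk are hypothetical, and whether any overlap actually occurs depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kStreams = 4;        // number of streams (assumed)
constexpr int kChunk   = 1 << 18;  // elements handled by each stream (assumed)

// Hypothetical kernel: add one to every element of a chunk.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: processes the array chunk by chunk, one stream per
// chunk, and returns true if every element ends up equal to 1.
bool runInStreams() {
    const int n = kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(float);
    float *h = nullptr, *d = nullptr;
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Commands within one stream run in order; different streams may overlap.
    for (int s = 0; s < kStreams; ++s) {
        int off = s * kChunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d + off, kChunk);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Synchronization guarantees that the work in each stream has completed.
    bool ok = true;
    for (int s = 0; s < kStreams; ++s)
        ok = (cudaStreamSynchronize(streams[s]) == cudaSuccess) && ok;
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 1.0f);

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}

int main() {
    printf("streams %s\n", runInStreams() ? "produced the expected result" : "failed");
    return 0;
}
```

The host buffer is page-locked here because asynchronous copies can only overlap with other work when pinned host memory is used.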

1.2. Maximization of the Memory Throughput

1.3. Maximization of the Instruction Throughput

1.4. Minimization of the Memory Thrashing

Key Points