Performance Guidelines and Optimization Strategies
Overview
Teaching: 60 min
Exercises: 0 min
Questions
Objectives
1. Recommended Strategies for Performance Optimization
NVIDIA's Performance Guidelines offer the following basic strategies for performance optimization of an application:
- Maximization of parallel execution in order to achieve maximum utilization of resources on the device(s)
- Optimization of the device memory usage in order to maximize the memory throughput
- Improvement of instruction usage in order to gain maximum instruction throughput, and
- Minimization of memory thrashing
The achievable performance gains are usually program- and system-dependent. For example, improving the instruction throughput of a kernel that is mostly limited by its memory accesses will have little impact. All performance optimization efforts should therefore be guided by quantitative analysis tools such as the NVIDIA Nsight Systems and Nsight Compute profilers, which offer a wide variety of performance metrics for CUDA programs. For instance, the Nsight Compute profiler provides a GPU Speed of Light section whose metrics give a high-level overview of the GPU's memory and compute throughput, expressed as the achieved utilization percentage with respect to the theoretical maximum of each metric. These metrics therefore indicate how much room for improvement a kernel still has.
In the following sections, let us briefly overview the performance optimization strategies mentioned above.
1.1. Maximization of the Device Utilization
In order to maximize the utilization of resources on the device, the developer must expose as much parallelism as possible at the different logical levels of the system. These levels are: (i) the application, (ii) the device, and (iii) the multiprocessor.
The main goal at the application level is to maximize concurrent execution between the host and the device(s), for example by adopting asynchronous CUDA API functions and streams. As such, one should assign as much parallel work as possible to the device and keep serial work on the host.
Note
Sometimes parallelism must be broken so that threads can synchronize and share data among themselves. If the threads belong to the same thread block, synchronization can be performed via __syncthreads() and the data can be shared through shared memory within a single kernel execution. Threads from separate blocks, however, must share data across different kernel executions through the lower-bandwidth global memory. This second, less performant scenario should be minimized because of kernel launch overheads and slower global memory transfers.
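The intra-block case described in this note can be made concrete with a short kernel. The sketch below uses shared memory and __syncthreads() so that threads of the same block share data within one kernel execution; combining the per-block results would require another kernel launch through global memory. The kernel name, block size, and reduction pattern are illustrative choices, not part of the lesson.

```c
#include <cuda_runtime.h>
#include <cstdio>

#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements in shared memory and writes one partial sum.
__global__ void blockSumKernel(const float *in, float *blockSums, int n) {
    __shared__ float tile[BLOCK_SIZE];        // visible to all threads of this block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;   // each thread loads one element
    __syncthreads();                          // make all loads visible to the block

    // Tree reduction within the block; every step needs a barrier.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // Combining per-block results needs another kernel launch through global
    // memory: the slower, inter-block scenario described in the note.
    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20;
    const int nBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *d_in, *d_blockSums;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_blockSums, nBlocks * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    blockSumKernel<<<nBlocks, BLOCK_SIZE>>>(d_in, d_blockSums, n);
    cudaDeviceSynchronize();
    printf("launched %d blocks of %d threads\n", nBlocks, BLOCK_SIZE);

    cudaFree(d_in);
    cudaFree(d_blockSums);
    return 0;
}
```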
The following asynchronous CUDA operations can be performed independently and concurrently:
- host computations
- device computations
- HtoD memory transfer operations
- DtoH memory transfer operations
- memory transfer operations within the memory of an individual device
- memory transfer operations between two or more devices
The CUDA library's asynchronous function calls allow users to dispatch multiple device operations, which are queued and executed as resources become available. Relieving the host of this device-management workload through concurrency frees it to take part in other simultaneous tasks, which may improve overall performance, as sketched below.
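As a rough illustration of this host/device concurrency, the sketch below launches a kernel, which is asynchronous with respect to the host, and lets the host perform unrelated work before synchronizing. The kernel and variable names are placeholders assumed for the example, not part of the lesson.

```c
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(float *d_data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Asynchronous with respect to the host: the launch returns immediately.
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // Host-side work that does not depend on the kernel can run here.
    double hostSum = 0.0;
    for (int i = 0; i < 1000; ++i) hostSum += i;

    cudaDeviceSynchronize();   // wait only when the device results are needed
    printf("host work result: %f\n", hostSum);

    cudaFree(d_data);
    return 0;
}
```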
GPUs with compute capability 2.0 and higher can execute multiple kernels concurrently. Whether concurrent kernel execution is supported can be queried from the device property field concurrentKernels. The maximum number of concurrently executing kernels also depends on the device's compute capability and is listed in the CUDA Toolkit's documentation. In addition to concurrent execution of multiple kernels, data transfers/memory copies between the host and the device, as well as intra-device operations, can be executed asynchronously among themselves or overlapped with kernel launches. The device property field asyncEngineCount can be queried to see whether overlapping data transfers with kernel execution is supported on the available device(s), as sketched below.
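A minimal sketch of how these properties might be queried is shown below. Only the fields concurrentKernels and asyncEngineCount come from the text above; the rest of the program is an illustrative assumption.

```c
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
        printf("  concurrentKernels : %d  (1 = concurrent kernel execution supported)\n",
               prop.concurrentKernels);
        printf("  asyncEngineCount  : %d  (engines for overlapping copies with kernels)\n",
               prop.asyncEngineCount);
    }
    return 0;
}
```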
Note
Host memory must be page-locked (pinned) if it is involved in overlapped memory copy/data transfer operations.
In CUDA applications, concurrent operations, including data transfers and kernel executions, can be managed through streams. A stream is a sequence of commands that execute in order; commands issued to different streams are independent and may execute concurrently. The completion of work launched in different streams can be guaranteed via synchronization commands, as sketched below.
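A possible sketch of stream-based overlap follows: two streams each copy a chunk of pinned host memory to the device, run a kernel on it, and copy the result back, so copies in one stream can overlap with kernel execution in the other. The kernel, chunk size, and number of streams are illustrative assumptions.

```c
#include <cuda_runtime.h>

__global__ void addOneKernel(float *d_chunk, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_chunk[i] += 1.0f;
}

int main() {
    const int nStreams = 2;
    const int chunk = 1 << 20;                  // elements per stream
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, nStreams * chunkBytes);  // pinned host memory (see note above)
    cudaMalloc((void **)&d_data, nStreams * chunkBytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // Within each stream the copy-in, kernel, and copy-out run in order;
    // operations issued to different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        float *h_chunk = h_data + s * chunk;
        float *d_chunk = d_data + s * chunk;
        cudaMemcpyAsync(d_chunk, h_chunk, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        addOneKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_chunk, chunk);
        cudaMemcpyAsync(h_chunk, d_chunk, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);      // guarantee completion of each stream's work
        cudaStreamDestroy(streams[s]);
    }

    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```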
1.2. Maximization of the Memory Throughput
1.3. Maximization of the Instruction Throughput
1.4. Minimization of the Memory Thrashing
Key Points