Fundamentals of Heterogeneous Parallel Programming with CUDA C/C++

This course by The Molecular Sciences Software Institute (MolSSI) provides a beginner-level overview of the fundamentals of heterogeneous parallel programming with CUDA C/C++.


  • Previous knowledge of basic High-Performance Computing (HPC) concepts is helpful but not required for starting this course.

Nevertheless, we encourage students to take a glance at our Parallel Programming tutorial, specifically Chapters 1, 2, and 5, for a brief overview of some of the fundamental concepts in HPC.

  • Basic familiarity with the Bash, C, and C++ programming languages is required.

Software/Hardware Specifications

The following NVIDIA CUDA-enabled GPU devices have been used throughout this tutorial:

  • Device 0: GeForce GTX 1650 with Turing architecture (Compute Capability = 7.5)

  • Device 1: GeForce GT 740M with Kepler architecture (Compute Capability = 3.5)

Ubuntu 18.04 LTS (Bionic Beaver) is the target operating system for CUDA Toolkit v11.2.0 on the two host machines equipped with devices 0 and 1.
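To verify which CUDA-enabled devices are visible on your own machine, a short device-query program is often the first thing to compile and run. The following is a minimal sketch using only the CUDA runtime API (cudaGetDeviceCount and cudaGetDeviceProperties); the file name is our choice:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        int deviceCount = 0;
        cudaError_t err = cudaGetDeviceCount(&deviceCount);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }

        for (int i = 0; i < deviceCount; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            /* Compute capability is reported as major.minor,
               e.g., 7.5 for the Turing-based GTX 1650 */
            printf("Device %d: %s (Compute Capability = %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }

Compiling with nvcc device_query.cu -o device_query and running ./device_query should reproduce a device list like the one above.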

Setup

Questions:

  • How do I set up my computer for these lessons?

Introduction to Heterogeneous Parallel Programming

Questions:

  • What is heterogeneous parallel programming? Where did it come from and how did it evolve?

  • What are the main differences between CPU and GPU architectures, and how do they relate to parallel programming paradigms?

  • What is CUDA? Why do I need to know about it?

Objectives:

  • Understanding the fundamentals of heterogeneous parallel programming

  • Learning the basic aspects of GPU architectures and software models for heterogeneous parallel programming

  • Gaining an initial overview of CUDA as a programming platform and model

Basic Concepts in CUDA Programming

Questions:

  • How do I write, compile, and run a basic CUDA program? (A minimal example is sketched below.)

  • What is the structure of a CUDA program?

  • How do I write and launch a CUDA kernel function?

Objectives:

  • Understanding the basics of the CUDA programming model

  • The ability to write, compile, and run a basic CUDA program

  • Recognition of the similarities between the semantics of C and those of CUDA C
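A minimal sketch of such a program is shown here: a kernel declared with the __global__ qualifier runs on the device and is launched from the host through the <<<blocks, threads>>> execution-configuration syntax. The kernel body is arbitrary and chosen only for illustration:

    #include <cstdio>
    #include <cuda_runtime.h>

    /* Kernel: executes on the device; each thread prints its own index */
    __global__ void helloFromGPU(void) {
        printf("Hello from GPU thread %d!\n", threadIdx.x);
    }

    int main(void) {
        /* Launch the kernel on one block of four threads */
        helloFromGPU<<<1, 4>>>();

        /* Kernel launches are asynchronous: block until the device finishes */
        cudaDeviceSynchronize();

        printf("Hello from the CPU (host)!\n");
        return 0;
    }

Saved as, e.g., hello.cu, it compiles with nvcc hello.cu -o hello and runs like any ordinary executable.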

CUDA Programming Model

Questions:

  • What is thread hierarchy in CUDA?

  • How can threads be organized within blocks and grids?

  • How can data be transferred between host and device memory?

  • How can we measure the wall-time of an operation in a program? (See the sketch below.)

Objectives:

  • Learning about the basics of device memory management

  • Understanding the concept of thread hierarchy in the CUDA programming model

  • Familiarity with the logistics of a typical CUDA program
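The sketch below strings these pieces together under illustrative choices (the array size, block size, and trivial scaling kernel are ours): device memory is allocated with cudaMalloc, data moves between host and device with cudaMemcpy, the kernel is launched over a one-dimensional grid of blocks, and the kernel's wall-time is measured with CUDA events:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    /* Kernel: each thread handles one array element via its global index */
    __global__ void scaleArray(float *d_a, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_a[i] *= factor;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        /* Host allocation and initialization */
        float *h_a = (float *) malloc(bytes);
        for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

        /* Device allocation and host-to-device transfer */
        float *d_a = NULL;
        cudaMalloc((void **) &d_a, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

        /* Thread hierarchy: a 1D grid of 1D blocks covering all n elements */
        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

        /* Measure the kernel's wall-time with CUDA events */
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_a, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Kernel took %.3f ms\n", ms);

        /* Device-to-host transfer and cleanup */
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_a);
        free(h_a);
        return 0;
    }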

CUDA GPU Compilation Model

Questions:

  • What is the NVCC compiler and why do we need it?

  • Can multiple GPU and CPU source code files be compiled simultaneously with NVCC?

  • How does NVCC distinguish between the host and device code domains and handle the compilation process?

  • How can runtime errors be handled during a CUDA program's execution? (A sample wrapper macro is sketched below.)

Objectives:

  • Understanding the basic mechanism of NVCC compilation phases

  • Learning about the multiple-source-file compilation mode of the NVCC compiler

  • Mastering the basics of error handling in a CUDA program using C/C++ wrapper macros
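One common error-handling pattern wraps every CUDA runtime call in a checking macro that reports the failing file and line. The following is a minimal sketch; the macro name CUDA_CHECK is our own choice:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    /* Wrapper macro: checks the error code returned by a CUDA runtime call
       and aborts with the offending file and line number on failure */
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                        __FILE__, __LINE__, cudaGetErrorString(err));     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    int main(void) {
        float *d_ptr = NULL;

        /* A reasonable allocation passes through the macro silently */
        CUDA_CHECK(cudaMalloc((void **) &d_ptr, 1024 * sizeof(float)));
        CUDA_CHECK(cudaFree(d_ptr));

        /* An intentionally oversized allocation exercises the error path */
        CUDA_CHECK(cudaMalloc((void **) &d_ptr, (size_t) 1 << 50));
        return 0;
    }

The file compiles as usual with nvcc, which splits the source into host and device domains, forwards the host portion to the system C/C++ compiler, and compiles the device portion itself.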

CUDA Execution Model

Questions:

  • What is the CUDA execution model?

  • How do insights from the GPU architecture help CUDA programmers write more efficient software?

  • What are streaming multiprocessors and thread warps? (The sketch below shows how to inspect them.)

  • What is profiling and why is it important to a programmer?

  • Which profiling tools are available for CUDA programming, and which one(s) should I choose?

Objectives:

  • Understanding the fundamentals of the CUDA execution model

  • Establishing the importance of knowledge of the GPU architecture and its impact on the efficiency of a CUDA program

  • Learning about the building blocks of the GPU architecture: streaming multiprocessors and thread warps

  • Mastering the basics of profiling and becoming proficient in adopting profiling tools in CUDA programming
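The hardware quantities this lesson revolves around, the number of streaming multiprocessors (SMs) and the warp size, can be read directly from the runtime, as in the sketch below; profilers such as nvprof and the NVIDIA Nsight tools report their metrics in terms of these same units:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;

        /* Query the properties of device 0 */
        cudaGetDeviceProperties(&prop, 0);

        printf("Device name:               %s\n", prop.name);
        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Warp size (threads):       %d\n", prop.warpSize);
        printf("Max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        return 0;
    }

On all current NVIDIA GPUs the warp size is 32, which is why block sizes are usually chosen as multiples of 32.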