This lesson is still being designed and assembled (Pre-Alpha version)

Fundamentals of Heterogeneous Parallel Programming with CUDA C/C++

This tutorial by the Molecular Sciences Software Institute (MolSSI) gives a beginner-level overview of the fundamentals of heterogeneous parallel programming with CUDA C/C++.

The MolSSI’s full education mission statement can be found here.

Prerequisites


  • Previous knowledge of basic High-Performance Computing (HPC) concepts is helpful but not required for starting this course. Nevertheless, we encourage students to glance at our Parallel Programming tutorial, specifically Chapters 1, 2 and 5, for a brief overview of some of the fundamental concepts in HPC.
  • Basic familiarity with Bash and with the C and C++ programming languages is required.

Software/Hardware Specifications

The following NVIDIA CUDA-enabled GPU devices have been used throughout this tutorial:

  • Device 0: GeForce GTX 1650 with Turing architecture (Compute Capability = 7.5)
  • Device 1: GeForce GT 740M with Kepler architecture (Compute Capability = 3.5)

Both host machines, equipped with Devices 0 and 1 respectively, run Ubuntu Linux 18.04 (Bionic Beaver) with CUDA Toolkit v11.2.0.


Schedule

Setup: Download files required for the lesson

00:00  1. Introduction
  • What is heterogeneous parallel programming? Where did it come from and how did it evolve?
  • What are the main differences between CPU and GPU architectures and their relation to parallel programming paradigms?
  • What is CUDA? Why do I need to know about it?

00:30  2. Basic Concepts in CUDA Programming
  • How do I write, compile and run a basic CUDA program?
  • What is the structure of a CUDA program?
  • How do I write and launch a CUDA kernel function?

01:00  3. CUDA Programming Model
  • What is the thread hierarchy in CUDA?
  • How can threads be organized within blocks and grids?
  • How can data be transferred between host and device memory?
  • How can we measure the wall-time of an operation in a program?

02:00  4. CUDA GPU Compilation Model
  • What is the NVCC compiler and why do we need it?
  • Can multiple GPU and CPU source code files be simultaneously compiled with NVCC?
  • How does NVCC distinguish between the host and device code domains and handle the compilation process?
  • How can runtime errors be handled during the execution of a CUDA program?

02:47  5. CUDA Execution Model
  • What is the CUDA execution model?
  • How do insights from GPU architecture help CUDA programmers write more efficient software?
  • What are streaming multiprocessors and thread warps?
  • What is profiling and why is it important to a programmer?
  • How many profiling tools for CUDA programming are available and which one(s) should I choose?

03:22  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.