Let’s take a small tour through what actually happens when a CUDA program runs. I’ll use a simple example, computing the reciprocals of an array of numbers (the standard default sample), but the real goal here is to understand how CUDA turns a normal-looking function into massively parallel execution on a GPU. There is also a video at the end of the post explaining how it works. We’ll run the same math operation on both the CPU and the GPU and see what happens: take the reciprocal of every number stored in an array, something like the operation below.
data[idx] = 1.0f / data[idx];
Take every number in an array and compute 1/x for it. We do it twice, once on the CPU, once on the GPU, and then compare the results.
The code below might look a bit daunting at first, although it is simple code that does some arithmetic and array operations. I will break it down piece by piece, and there is also a nice visual explanation as you continue reading the post.
The Full Source Code - This is the code generated by the default CUDA Runtime example project in the Nsight Eclipse Edition that NVIDIA ships.
Let’s start by looking at the whole thing:
#include <iostream>
#include <numeric>
#include <stdlib.h>
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
/**
* CUDA kernel — runs on the GPU, one thread per element
*/
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vectorSize)
        data[idx] = 1.0 / data[idx];
}
/**
* Host function — sets up GPU memory and launches the kernel
*/
float *gpuReciprocal(float *data, unsigned size)
{
    float *rc = new float[size];
    float *gpuData;
    CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float)*size));
    CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice));
    static const int BLOCK_SIZE = 256;
    const int blockCount = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    reciprocalKernel<<<blockCount, BLOCK_SIZE>>>(gpuData, size);
    CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost));
    CUDA_CHECK_RETURN(cudaFree(gpuData));
    return rc;
}
float *cpuReciprocal(float *data, unsigned size)
{
    float *rc = new float[size];
    for (unsigned cnt = 0; cnt < size; ++cnt) rc[cnt] = 1.0 / data[cnt];
    return rc;
}
void initialize(float *data, unsigned size)
{
    for (unsigned i = 0; i < size; ++i)
        data[i] = .5 * (i + 1);
}
int main(void)
{
    static const int WORK_SIZE = 65530;
    float *data = new float[WORK_SIZE];
    initialize(data, WORK_SIZE);
    float *recCpu = cpuReciprocal(data, WORK_SIZE);
    float *recGpu = gpuReciprocal(data, WORK_SIZE);
    float cpuSum = std::accumulate(recCpu, recCpu + WORK_SIZE, 0.0);
    float gpuSum = std::accumulate(recGpu, recGpu + WORK_SIZE, 0.0);
    std::cout << "gpuSum = " << gpuSum << " cpuSum = " << cpuSum << std::endl;
    delete[] data;
    delete[] recCpu;
    delete[] recGpu;
    return 0;
}
static void CheckCudaErrorAux(const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess) return;
    std::cerr << statement << " returned " << cudaGetErrorString(err)
              << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}
The Big Picture
Before we get into the details, here’s what the program actually does at a high level:
┌─────────────────────────────────────────────────────────┐
│ CPU (Host) │
│ │
│ main() │
│ │ │
│ ▼ │
│ Allocate array → Initialize values │
│ │ │
│ ├──────────────────────┐ │
│ ▼ ▼ │
│ cpuReciprocal() gpuReciprocal() │
│ (serial loop) (parallel CUDA) │
│ │ │ │
│ └──────────┬───────────┘ │
│ ▼ │
│ Compare sums → Print result │
└─────────────────────────────────────────────────────────┘
│ PCIe Bus │
═══════════════════════╪════════════╪═════════════════════
┌──────────────────────▼────────────▼─────────────────────┐
│ GPU (Device) │
│ │
│ reciprocalKernel<<<blocks, threads>>>() │
│ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Block 0 │ │Block 1 │ │Block 2 │ │Block… │ │
│ │T0 T1… │ │T0 T1… │ │T0 T1… │ │T0 T1… │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
│ │
│ Every thread computes: data[idx] = 1.0f / data[idx] │
└──────────────────────────────────────────────────────────┘
Think of it this way: the CPU is the manager — it coordinates everything. The GPU is the factory floor — thousands of workers doing the actual math in parallel.
Step 1 — Initialize the Data
static const int WORK_SIZE = 65530;
float *data = new float[WORK_SIZE];
initialize(data, WORK_SIZE);
To learn what WORK_SIZE means and how it relates to the limits of the hardware, have a look at some NVIDIA GPU concepts here: NVIDIA GPU Concepts->.
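If you want to see what your own card can handle, a small standalone sketch like the one below (not part of the example project) queries a few of those limits through the CUDA runtime:

#include <iostream>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

    std::cout << "GPU:                   " << prop.name << std::endl;
    std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;
    std::cout << "Multiprocessors (SMs): " << prop.multiProcessorCount << std::endl;
    std::cout << "Global memory (bytes): " << prop.totalGlobalMem << std::endl;
    return 0;
}

The 256-thread block size used later in this post sits comfortably below maxThreadsPerBlock, which is typically 1024 on recent GPUs.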
The initialize() function fills the array like this:
Index │ Value
──────┼───────
0 │ 0.5
1 │ 1.0
2 │ 1.5
3 │ 2.0
4 │ 2.5
… │ …
Simple formula: data[i] = 0.5 * (i + 1).
In memory it looks like:
CPU RAM
┌─────┬─────┬─────┬─────┬─────┬─────┐
data ──▶│ 0.5 │ 1.0 │ 1.5 │ 2.0 │ 2.5 │ … │
└─────┴─────┴─────┴─────┴─────┴─────┘
Step 2 — CPU Version (Serial)
float *recCpu = cpuReciprocal(data, WORK_SIZE);
This is the classic approach — just a loop:
for (unsigned cnt = 0; cnt < size; ++cnt)
    rc[cnt] = 1.0 / data[cnt];
Visually, the CPU works through each element one at a time:
Time ──────────────────────────────────────────▶
[1/0.5] → [1/1.0] → [1/1.5] → [1/2.0] → [1/2.5] → …
2.0 1.0 0.66 0.50 0.40
Nothing wrong with this — it works perfectly. But it’s sequential. For 65,530 elements, that’s 65,530 operations done one after the other.
Step 3 — GPU Version (Parallel)
Now here’s where CUDA comes in. The same computation, but we send the data to the GPU and run thousands of operations simultaneously.
3a. Allocate GPU Memory
float *gpuData;
cudaMalloc((void **)&gpuData, sizeof(float) * size);
This is just like malloc() on the CPU, except the memory lives on the GPU’s VRAM — physically separate from system RAM. The CPU can’t directly read or write it.
GPU VRAM
┌─────┬─────┬─────┬─────┬─────┐
gpuData▶│ ? │ ? │ ? │ ? │ ? │ ← uninitialized
└─────┴─────┴─────┴─────┴─────┘
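Because the allocation lives in device memory, you can actually watch it come out of the GPU’s free VRAM. Here is a small sketch (not in the original example, meant to sit inside gpuReciprocal()) that checks free memory around the cudaMalloc() call:

size_t freeBefore, freeAfter, totalBytes;
CUDA_CHECK_RETURN(cudaMemGetInfo(&freeBefore, &totalBytes));   // free/total VRAM in bytes

float *gpuData;
CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float) * size));

CUDA_CHECK_RETURN(cudaMemGetInfo(&freeAfter, &totalBytes));
// freeBefore - freeAfter will be at least sizeof(float) * size; the driver may round up.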
3b. Copy Data: CPU → GPU
cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice);
CPU RAM GPU VRAM
┌─────┬─────┬─────┬─────┐ ┌─────┬─────┬─────┬─────┐
│ 0.5 │ 1.0 │ 1.5 │ 2.0 │ ──────▶ │ 0.5 │ 1.0 │ 1.5 │ 2.0 │
└─────┴─────┴─────┴─────┘ └─────┴─────┴─────┴─────┘
PCIe Bus transfer
This crosses the PCIe bus — the highway between your CPU and GPU. It’s not free, but for large workloads the GPU speedup more than makes up for it.
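If you want to measure that cost for yourself, CUDA events can time the copy. A minimal sketch (not part of the example; it assumes data, gpuData and size are set up as in gpuReciprocal()):

cudaEvent_t start, stop;
CUDA_CHECK_RETURN(cudaEventCreate(&start));
CUDA_CHECK_RETURN(cudaEventCreate(&stop));

CUDA_CHECK_RETURN(cudaEventRecord(start));
CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float) * size, cudaMemcpyHostToDevice));
CUDA_CHECK_RETURN(cudaEventRecord(stop));
CUDA_CHECK_RETURN(cudaEventSynchronize(stop));   // wait until the copy has really finished

float ms = 0.0f;
CUDA_CHECK_RETURN(cudaEventElapsedTime(&ms, start, stop));
std::cout << "Host -> Device copy took " << ms << " ms" << std::endl;

CUDA_CHECK_RETURN(cudaEventDestroy(start));
CUDA_CHECK_RETURN(cudaEventDestroy(stop));

For 65,530 floats (about 256 KB) the transfer is tiny; the PCIe cost only really matters for much larger buffers or very frequent copies.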
3c. Launch the Kernel
static const int BLOCK_SIZE = 256;
const int blockCount = (65530 + 256 - 1) / 256; // = 256 blocks
reciprocalKernel<<<blockCount, BLOCK_SIZE>>>(gpuData, size);
That <<<blockCount, BLOCK_SIZE>>> syntax is unique to CUDA. It’s how you tell the GPU:
- How many blocks to use
- How many threads per block
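That block-count formula is just a ceiling division, and it appears in almost every CUDA program. A tiny helper (my own naming, not part of the example) makes the intent explicit:

// Hypothetical helper: round n/block up so every element gets a thread.
constexpr unsigned divUp(unsigned n, unsigned block)
{
    return (n + block - 1) / block;
}

// Equivalent launch: divUp(65530, 256) == 256 blocks of 256 threads each.
// reciprocalKernel<<<divUp(size, BLOCK_SIZE), BLOCK_SIZE>>>(gpuData, size);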
Thread Hierarchy
CUDA has a neat layered structure:
GRID (the entire launch)
│
├── Block 0 ──── 256 threads (Thread 0 … Thread 255)
├── Block 1 ──── 256 threads (Thread 0 … Thread 255)
├── Block 2 ──── 256 threads (Thread 0 … Thread 255)
│ …
└── Block 255 ─── 256 threads
Total threads launched: 256 × 256 = 65,536
Actual work: 65,530 elements → 6 threads just idle (handled by the bounds check).
How Each Thread Knows Its Index
unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
| Variable | Meaning |
|---|---|
| blockIdx.x | Which block is this? (0, 1, 2…) |
| blockDim.x | Threads per block (256) |
| threadIdx.x | Thread ID within the block |
So the mapping looks like:
Block 0 → Thread 0: idx = 0 * 256 + 0 = 0
Block 0 → Thread 1: idx = 0 * 256 + 1 = 1
…
Block 0 → Thread 255: idx = 0 * 256 + 255 = 255
Block 1 → Thread 0: idx = 1 * 256 + 0 = 256
Block 1 → Thread 1: idx = 1 * 256 + 1 = 257
…
Every thread gets a unique global index — and that index maps directly to one element of the array.
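As an aside, when the array is larger than the grid you launch, the same index formula is usually extended into a grid-stride loop (a standard CUDA pattern, not used in this example):

__global__ void reciprocalKernelStrided(float *data, unsigned vectorSize)
{
    unsigned stride = blockDim.x * gridDim.x;   // total number of threads in the grid
    for (unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < vectorSize;
         idx += stride)
        data[idx] = 1.0f / data[idx];           // each thread handles several elements
}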
3d. The Kernel Itself
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vectorSize)
        data[idx] = 1.0 / data[idx];
}
The __global__ keyword means: “callable from CPU, runs on GPU.”
Every single thread runs this exact same code — but with a different idx. So:
Thread 0 → data[0] = 1/0.5 = 2.00
Thread 1 → data[1] = 1/1.0 = 1.00
Thread 2 → data[2] = 1/1.5 = 0.66
Thread 3 → data[3] = 1/2.0 = 0.50
… … …
And crucially — all of this happens at the same time:
CPU (serial) GPU (parallel)
───────────────── ─────────────────────────────────
Time ──▶ Time ──▶
[op0][op1][op2][op3][op4]… [op0]
[op1]
[op2] ← all at once!
[op3]
[op4]
This is the whole point of CUDA.
Step 4 — Copy Results Back
cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost);
GPU VRAM CPU RAM
┌─────┬─────┬─────┬─────┐ ┌─────┬─────┬─────┬─────┐
│ 2.0 │ 1.0 │0.66 │ 0.5 │ ──────▶ │ 2.0 │ 1.0 │0.66 │ 0.5 │
└─────┴─────┴─────┴─────┘ └─────┴─────┴─────┴─────┘
Step 5 — Free GPU Memory
cudaFree(gpuData);
Just like free() on the CPU. Always do this — GPU memory leaks are a real thing.
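One common C++ trick to make the free automatic (a sketch of my own, not part of the NVIDIA template) is a tiny RAII wrapper, so the memory is released even if the function returns early:

// Illustrative only: owns one device allocation and frees it in the destructor.
struct DeviceBuffer {
    float *ptr = nullptr;
    explicit DeviceBuffer(size_t count) {
        CUDA_CHECK_RETURN(cudaMalloc((void **)&ptr, sizeof(float) * count));
    }
    ~DeviceBuffer() { cudaFree(ptr); }                 // runs on every exit path
    DeviceBuffer(const DeviceBuffer &) = delete;       // prevent accidental double-free
    DeviceBuffer &operator=(const DeviceBuffer &) = delete;
};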
Step 6 — Verify the Results
float cpuSum = std::accumulate(recCpu, recCpu + WORK_SIZE, 0.0);
float gpuSum = std::accumulate(recGpu, recGpu + WORK_SIZE, 0.0);
std::cout << "gpuSum = " << gpuSum << " cpuSum = " << cpuSum << std::endl;
We sum both result arrays and compare. If gpuSum ≈ cpuSum, the GPU computed the right answers. (Small floating-point differences are normal and expected.)
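Summing is a quick sanity check, but two wrong values can cancel each other out. A slightly stricter sketch (my own helper, not in the example) compares the arrays element by element with a small tolerance:

#include <algorithm>   // std::max
#include <cmath>       // std::fabs

// Hypothetical helper: true if every pair of elements agrees within a relative tolerance.
bool resultsMatch(const float *a, const float *b, unsigned size, float tol = 1e-5f)
{
    for (unsigned i = 0; i < size; ++i) {
        float diff  = std::fabs(a[i] - b[i]);
        float scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        if (diff > tol * std::max(scale, 1.0f)) return false;
    }
    return true;
}

// Usage in main():
// std::cout << (resultsMatch(recCpu, recGpu, WORK_SIZE) ? "MATCH" : "MISMATCH") << std::endl;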
Full Execution Timeline
main() starts
│
├─▶ allocate data[]
│
├─▶ initialize() → [0.5, 1.0, 1.5, 2.0, …]
│
├─▶ cpuReciprocal() → serial loop → recCpu[]
│
└─▶ gpuReciprocal()
│
├─▶ cudaMalloc() — reserve GPU VRAM
│
├─▶ cudaMemcpy() — send data to GPU (Host → Device)
│
├─▶ kernel launch — 256 blocks × 256 threads = 65,536 threads
│ │
│ └─▶ each thread: data[idx] = 1.0 / data[idx]
│
├─▶ cudaMemcpy() — get results back (Device → Host)
│
└─▶ cudaFree() — release GPU VRAM
│
├─▶ accumulate CPU result
├─▶ accumulate GPU result
└─▶ print & compare
CUDA Runtime Functions — Quick Reference
| Function | What it does |
|---|---|
| cudaMalloc() | Allocate memory on the GPU |
| cudaMemcpy() | Copy data between CPU and GPU |
| cudaFree() | Release GPU memory |
| cudaGetErrorString() | Turn a CUDA error code into text |
| cudaDeviceSynchronize() | Wait for all GPU work to finish |
CUDA Keywords — Quick Reference
| Keyword | What it means |
|---|---|
| __global__ | Function runs on GPU, called from CPU |
| blockIdx.x | Current block’s ID |
| blockDim.x | Number of threads in each block |
| threadIdx.x | Current thread’s ID within its block |
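For contrast with __global__, CUDA also has __device__ and __host__ qualifiers. A short sketch (not from the example) showing how they differ:

// __device__: lives on the GPU, callable only from kernels or other __device__ functions.
__device__ float recipDevice(float x) { return 1.0f / x; }

// __host__ __device__: compiled for both sides, so the same helper works on CPU and GPU.
__host__ __device__ float recipEither(float x) { return 1.0f / x; }

// __global__: a kernel, launched from the CPU with <<<blocks, threads>>> and run on the GPU.
__global__ void reciprocalKernelAlt(float *data, unsigned n)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = recipDevice(data[idx]);
}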
Improvement
After the kernel launch, you should add:
CUDA_CHECK_RETURN(cudaGetLastError());
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Why? Because kernel launches are asynchronous — the CPU fires off the launch and immediately moves on. In this particular program the blocking cudaMemcpy back to the host will wait for the kernel on the default stream, but without cudaGetLastError() and cudaDeviceSynchronize() a launch that fails or is misconfigured goes completely unnoticed, and in code that does other work before copying results back you really can race ahead of the GPU. Those are nasty bugs to track down.
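In context, the launch section of gpuReciprocal() would then look roughly like this (the same code as above, with the two checks added after the launch):

reciprocalKernel<<<blockCount, BLOCK_SIZE>>>(gpuData, size);

// Catch launch-time problems (invalid configuration, no device, ...).
CUDA_CHECK_RETURN(cudaGetLastError());

// Wait for the kernel to actually finish and surface any execution errors.
CUDA_CHECK_RETURN(cudaDeviceSynchronize());

// Only now copy the results back.
CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float) * size, cudaMemcpyDeviceToHost));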
Think of it this way:
CPU = manager. Plans the work, moves data around, checks results.
GPU = a factory with thousands of workers. Each worker handles exactly one element. They all work at the same time.
CPU array: [0.5][1.0][1.5][2.0] …
│
cudaMemcpy (Host → Device)
│
▼
GPU array: [0.5][1.0][1.5][2.0] …
│
reciprocalKernel (parallel!)
│
▼
GPU result: [2.0][1.0][0.66][0.5] …
│
cudaMemcpy (Device → Host)
│
▼
CPU result: [2.0][1.0][0.66][0.5] …
🏁 Wrapping Up
In one sentence:
The CPU sends an array to the GPU, launches thousands of parallel threads that each compute one reciprocal simultaneously, copies the results back, and verifies them against a CPU implementation.
What makes this problem a great fit for the GPU is that computing 1/x for element i is completely independent from element j. No shared state, no dependencies. That’s the golden rule for GPU acceleration — if your work is independent per element, the GPU will love it.
Hope this helped demystify the CPU ↔ GPU workflow. CUDA felt intimidating at first but once you see the pattern — allocate, copy, launch, copy back, free — it starts to feel very mechanical and predictable.
Thanks for reading!