Let’s take a small tour through what actually happens when a CUDA program runs. I’ll use a simple example, computing the reciprocal of every number in an array (a standard starter example), but the real goal here is to understand how CUDA turns a normal-looking function into massively parallel execution on a GPU. At the end of the post there is also a video explaining how it works. We’ll run the same simple math operation on both the CPU and the GPU and see what happens. The operation is just this:

data[idx] = 1.0f / data[idx];

Take every number in an array and compute 1/x for it. We’ll do it twice: once on the CPU, once on the GPU, and then compare the results.

The code might look a bit daunting at first, even though it is simple code that does some arithmetic and array operations. I will break it down piece by piece, and there is also a nice visual explanation as you continue reading below.


The Full Source Code - the code generated by the default CUDA Runtime example project in Nsight Eclipse Edition, which NVIDIA ships.

Let’s start by looking at the whole thing:

#include <iostream>
#include <numeric>
#include <stdlib.h>

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * CUDA kernel — runs on the GPU, one thread per element
 */
__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vectorSize)
        data[idx] = 1.0f / data[idx];
}

/**
 * Host function — sets up GPU memory and launches the kernel
 */
float *gpuReciprocal(float *data, unsigned size)
{
    float *rc = new float[size];
    float *gpuData;

    CUDA_CHECK_RETURN(cudaMalloc((void **)&gpuData, sizeof(float)*size));
    CUDA_CHECK_RETURN(cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice));

    static const int BLOCK_SIZE = 256;
    const int blockCount = (size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    reciprocalKernel<<<blockCount, BLOCK_SIZE>>>(gpuData, size);

    CUDA_CHECK_RETURN(cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost));
    CUDA_CHECK_RETURN(cudaFree(gpuData));
    return rc;
}

float *cpuReciprocal(float *data, unsigned size)
{
    float *rc = new float[size];
    for (unsigned cnt = 0; cnt < size; ++cnt) rc[cnt] = 1.0 / data[cnt];
    return rc;
}

void initialize(float *data, unsigned size)
{
    for (unsigned i = 0; i < size; ++i)
        data[i] = .5 * (i + 1);
}

int main(void)
{
    static const int WORK_SIZE = 65530;
    float *data = new float[WORK_SIZE];

    initialize(data, WORK_SIZE);

    float *recCpu = cpuReciprocal(data, WORK_SIZE);
    float *recGpu = gpuReciprocal(data, WORK_SIZE);

    float cpuSum = std::accumulate(recCpu, recCpu + WORK_SIZE, 0.0);
    float gpuSum = std::accumulate(recGpu, recGpu + WORK_SIZE, 0.0);

    std::cout << "gpuSum = " << gpuSum << "  cpuSum = " << cpuSum << std::endl;

    delete[] data;
    delete[] recCpu;
    delete[] recGpu;

    return 0;
}

static void CheckCudaErrorAux(const char *file, unsigned line, const char *statement, cudaError_t err)
{
    if (err == cudaSuccess) return;
    std::cerr << statement << " returned " << cudaGetErrorString(err)
              << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}

The Big Picture

Before we get into the details, here’s what the program actually does at a high level:

┌─────────────────────────────────────────────────────────┐
│                      CPU (Host)                         │
│                                                         │
│   main()                                                │
│     │                                                   │
│     ▼                                                   │
│   Allocate array  →  Initialize values                  │
│     │                                                   │
│     ├──────────────────────┐                            │
│     ▼                      ▼                            │
│  cpuReciprocal()      gpuReciprocal()                   │
│  (serial loop)        (parallel CUDA)                   │
│     │                      │                            │
│     └──────────┬───────────┘                            │
│                ▼                                        │
│          Compare sums → Print result                    │
└─────────────────────────────────────────────────────────┘
                       │  PCIe Bus  │
═══════════════════════╪════════════╪═════════════════════
┌──────────────────────▼────────────▼─────────────────────┐
│                      GPU (Device)                        │
│                                                          │
│   reciprocalKernel<<<blocks, threads>>>()                │
│                                                          │
│   ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │
│   │Block 0 │  │Block 1 │  │Block 2 │  │Block…  │        │
│   │T0 T1…  │  │T0 T1…  │  │T0 T1…  │  │T0 T1…  │        │
│   └────────┘  └────────┘  └────────┘  └────────┘        │
│                                                          │
│   Every thread computes:  data[idx] = 1.0f / data[idx]  │
└──────────────────────────────────────────────────────────┘

Think of the CPU as the manager: it coordinates everything. The GPU is the factory floor: thousands of workers doing the actual math in parallel.


Step 1 — Initialize the Data

static const int WORK_SIZE = 65530;
float *data = new float[WORK_SIZE];
initialize(data, WORK_SIZE);

To learn what WORK_SIZE is and how it relates to the GPU hardware, refer to NVIDIA GPU Concepts->.

The initialize() function fills the array like this:

Index │  Value
──────┼───────
  0   │  0.5
  1   │  1.0
  2   │  1.5
  3   │  2.0
  4   │  2.5
  …   │  …

Simple formula: data[i] = 0.5 * (i + 1).

In memory it looks like:

CPU RAM
        ┌─────┬─────┬─────┬─────┬─────┬─────┐
data ──▶│ 0.5 │ 1.0 │ 1.5 │ 2.0 │ 2.5 │ … │
        └─────┴─────┴─────┴─────┴─────┴─────┘

Step 2 — CPU Version (Serial)

float *recCpu = cpuReciprocal(data, WORK_SIZE);

This is the classic approach — just a loop:

for (unsigned cnt = 0; cnt < size; ++cnt)
    rc[cnt] = 1.0 / data[cnt];

Visually, the CPU works through each element one at a time:

Time ──────────────────────────────────────────▶

[1/0.5] → [1/1.0] → [1/1.5] → [1/2.0] → [1/2.5] → …
  2.0       1.0       0.66      0.50       0.40

Nothing wrong with this — it works perfectly. But it’s sequential. For 65,530 elements, that’s 65,530 operations done one after the other.


Step 3 — GPU Version (Parallel)

Now here’s where CUDA comes in. The same computation, but we send the data to the GPU and run thousands of operations simultaneously.

3a. Allocate GPU Memory

float *gpuData;
cudaMalloc((void **)&gpuData, sizeof(float) * size);

This is just like malloc() on the CPU, except the memory lives on the GPU’s VRAM — physically separate from system RAM. The CPU can’t directly read or write it.

GPU VRAM
        ┌─────┬─────┬─────┬─────┬─────┐
gpuData▶│  ?  │  ?  │  ?  │  ?  │  ?  │  ← uninitialized
        └─────┴─────┴─────┴─────┴─────┘

3b. Copy Data: CPU → GPU

cudaMemcpy(gpuData, data, sizeof(float)*size, cudaMemcpyHostToDevice);
  CPU RAM                              GPU VRAM

  ┌─────┬─────┬─────┬─────┐           ┌─────┬─────┬─────┬─────┐
  │ 0.5 │ 1.0 │ 1.5 │ 2.0 │  ──────▶  │ 0.5 │ 1.0 │ 1.5 │ 2.0 │
  └─────┴─────┴─────┴─────┘           └─────┴─────┴─────┴─────┘

                          PCIe Bus transfer

This crosses the PCIe bus — the highway between your CPU and GPU. It’s not free, but for large workloads the GPU speedup more than makes up for it.

3c. Launch the Kernel

static const int BLOCK_SIZE = 256;
const int blockCount = (65530 + 256 - 1) / 256;  // = 256 blocks

reciprocalKernel<<<blockCount, BLOCK_SIZE>>>(gpuData, size);

That <<<blockCount, BLOCK_SIZE>>> syntax is unique to CUDA. It’s how you tell the GPU:

  • How many blocks to use
  • How many threads per block

Thread Hierarchy

CUDA has a neat layered structure:

GRID  (the entire launch)
│
├── Block 0  ──── 256 threads (Thread 0 … Thread 255)
├── Block 1  ──── 256 threads (Thread 0 … Thread 255)
├── Block 2  ──── 256 threads (Thread 0 … Thread 255)
│    …
└── Block 255 ─── 256 threads

Total threads launched: 256 × 256 = 65,536
Actual work:            65,530 elements
→ the last 6 threads simply idle (handled by the bounds check).

How Each Thread Knows Its Index

unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;

Variable     │ Meaning
─────────────┼─────────────────────────────────
blockIdx.x   │ Which block is this? (0, 1, 2…)
blockDim.x   │ Threads per block (256)
threadIdx.x  │ Thread ID within the block

So the mapping looks like:

Block 0 → Thread 0:  idx = 0 * 256 + 0  =   0
Block 0 → Thread 1:  idx = 0 * 256 + 1  =   1
   …
Block 0 → Thread 255: idx = 0 * 256 + 255 = 255
Block 1 → Thread 0:  idx = 1 * 256 + 0  = 256
Block 1 → Thread 1:  idx = 1 * 256 + 1  = 257
   …

Every thread gets a unique global index — and that index maps directly to one element of the array.

3d. The Kernel Itself

__global__ void reciprocalKernel(float *data, unsigned vectorSize) {
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vectorSize)
        data[idx] = 1.0f / data[idx];
}

The __global__ keyword means: “callable from CPU, runs on GPU.”

Every single thread runs this exact same code — but with a different idx. So:

Thread 0  →  data[0] = 1/0.5  = 2.00
Thread 1  →  data[1] = 1/1.0  = 1.00
Thread 2  →  data[2] = 1/1.5  = 0.66
Thread 3  →  data[3] = 1/2.0  = 0.50
  …            …                  …

And crucially — all of this happens at the same time:

CPU (serial)                       GPU (parallel)
─────────────────                  ─────────────────────────────────
Time ──▶                           Time ──▶

[op0][op1][op2][op3][op4]…         [op0]
                                   [op1]
                                   [op2]   ← all at once!
                                   [op3]
                                   [op4]

This is the whole point of CUDA.


Step 4 — Copy Results Back

cudaMemcpy(rc, gpuData, sizeof(float)*size, cudaMemcpyDeviceToHost);
  GPU VRAM                             CPU RAM

  ┌─────┬─────┬─────┬─────┐           ┌─────┬─────┬─────┬─────┐
  │ 2.0 │ 1.0 │0.66 │ 0.5 │  ──────▶  │ 2.0 │ 1.0 │0.66 │ 0.5 │
  └─────┴─────┴─────┴─────┘           └─────┴─────┴─────┴─────┘

Step 5 — Free GPU Memory

cudaFree(gpuData);

Just like free() on the CPU. Always do this — GPU memory leaks are a real thing.


Step 6 — Verify the Results

float cpuSum = std::accumulate(recCpu, recCpu + WORK_SIZE, 0.0);
float gpuSum = std::accumulate(recGpu, recGpu + WORK_SIZE, 0.0);

std::cout << "gpuSum = " << gpuSum << "  cpuSum = " << cpuSum << std::endl;

We sum both result arrays and compare. If gpuSum ≈ cpuSum, the GPU computed the right answers. (Small floating-point differences are normal and expected.)


Full Execution Timeline

main() starts
    │
    ├─▶ allocate data[]
    │
    ├─▶ initialize()  →  [0.5, 1.0, 1.5, 2.0, …]
    │
    ├─▶ cpuReciprocal()  →  serial loop  →  recCpu[]
    │
    ├─▶ gpuReciprocal()
             │
             ├─▶ cudaMalloc()       — reserve GPU VRAM
             │
             ├─▶ cudaMemcpy()       — send data to GPU   (Host → Device)
             │
             ├─▶ kernel launch      — 256 blocks × 256 threads = 65,536 threads
             │        │
             │        └─▶ each thread: data[idx] = 1.0 / data[idx]
             │
             ├─▶ cudaMemcpy()       — get results back   (Device → Host)
             │
             └─▶ cudaFree()         — release GPU VRAM

    │
    ├─▶ accumulate CPU result
    ├─▶ accumulate GPU result
    └─▶ print & compare

CUDA Runtime Functions — Quick Reference

Function                 │ What it does
─────────────────────────┼──────────────────────────────────
cudaMalloc()             │ Allocate memory on the GPU
cudaMemcpy()             │ Copy data between CPU and GPU
cudaFree()               │ Release GPU memory
cudaGetErrorString()     │ Turn a CUDA error code into text
cudaDeviceSynchronize()  │ Wait for all GPU work to finish

CUDA Keywords — Quick Reference

Keyword      │ What it means
─────────────┼───────────────────────────────────────
__global__   │ Function runs on GPU, called from CPU
blockIdx.x   │ Current block’s ID
blockDim.x   │ Number of threads in each block
threadIdx.x  │ Current thread’s ID within its block

Improvement

After the kernel launch, you should add:

CUDA_CHECK_RETURN(cudaGetLastError());
CUDA_CHECK_RETURN(cudaDeviceSynchronize());

Why? Kernel launches are asynchronous: the CPU fires off the launch and immediately moves on. cudaGetLastError() catches launch errors (for example, an invalid grid or block configuration) right away, and cudaDeviceSynchronize() waits for the kernel to finish and surfaces any error that occurred while it ran. In this particular program the cudaMemcpy back to the host happens to synchronize with the kernel implicitly, but relying on that hides kernel failures, and those are nasty bugs to track down.


Think of it this way:

CPU = manager. Plans the work, moves data around, checks results.
GPU = a factory with thousands of workers. Each worker handles exactly one element. They all work at the same time.

CPU array:     [0.5][1.0][1.5][2.0] …
                         │
                  cudaMemcpy (Host → Device)
                         │
                         ▼
GPU array:     [0.5][1.0][1.5][2.0] …
                         │
               reciprocalKernel (parallel!)
                         │
                         ▼
GPU result:    [2.0][1.0][0.66][0.5] …
                         │
                  cudaMemcpy (Device → Host)
                         │
                         ▼
CPU result:    [2.0][1.0][0.66][0.5] …

🏁 Wrapping Up

In one sentence:

The CPU sends an array to the GPU, launches thousands of parallel threads that each compute one reciprocal simultaneously, copies the results back, and verifies them against a CPU implementation.

What makes this problem a great fit for the GPU is that computing 1/x for element i is completely independent from element j. No shared state, no dependencies. That’s the golden rule for GPU acceleration — if your work is independent per element, the GPU will love it.

Hope this helped demystify the CPU ↔ GPU workflow. CUDA felt intimidating at first but once you see the pattern — allocate, copy, launch, copy back, free — it starts to feel very mechanical and predictable.


Thanks for reading!