A 2D D2Q9 solver for flow past a cylinder, taken from a serial reference to MPI (distributed CPUs) and a hand-written CUDA kernel on an NVIDIA V100 — and every result traced back to one thing: memory bandwidth.

Vorticity field at Re = 100 — the Kármán vortex street the solver reproduces and is validated against (St = 0.1600).
The D2Q9 BGK update is purely local — collide, bounce-back, stream, boundaries — and does only ~0.9 FLOP per byte moved. That sits far to the left of the machine's roofline ridge, so the kernel is starved of bytes, not flops. Everything downstream follows from this.
Arithmetic intensity ≈ 0.9 FLOP/B vs the 8.3 FLOP/B ridge.
Two local loops dominate; there is essentially no serial fraction (α ≈ 0.999).
Fixed grid, more ranks (single node, median of 3). Three problem sizes span the cache-to-DRAM regime. The dashed line is the Amdahl bound at the measured α = 0.999.
Fixed 1024×256 cells per rank; the global grid grows with P. Per-rank work and halo are both constant, so any efficiency loss is not communication.
One thread per cell, SoA layout. HBM2 ≈ 900 GB/s sets the roofline: 3125 MLUPS for the two-pass kernels (288 B/update), 6250 for the fused kernel (144 B/update).
Small grids can't fill the 80 SMs; saturates near 77% from 1024².
The single most important decision — coalescing is worth 3.8×.
A second-order knob — flat for any tpb ≥ 64.
Strong evidence the implementation is bandwidth-bound; fusion's ~2× is halved traffic.
Both bonus studies say the same thing a third and fourth way: a change helps exactly in proportion to how bandwidth-bound the run is.
float halves the bytes. GPU gets ~2×; the CPU gain grows with core count.
1 core is latency-bound; only a saturated socket makes bytes binding.
One y-slab seam crosses InfiniBand; scales to 385 MLUPS at P = 40.
Splitting 10+10 across two nodes beats packing 20 — a second node doubles bandwidth.
9·nx-double halo messages per step, regardless of P.__constant__ memory.Each parallel version is self-contained with its own Makefile + README.
make mpi · make mpi_single | make cuda · make cuda_aos · make cuda_single
srun -n 16 ./lbm_mpi nx=1024 ny=512 steps=2000
srun --gres=gpu:1 ./lbm_cuda nx=2048 ny=2048 fused=1 tpb=256
Run on EPFL's Izar cluster — 2× Xeon Gold 6230 + NVIDIA V100, Slurm.
warmup argument and the single-precision / AoS build targets, and
structuring the write-up. Every performance number here is from my own Slurm jobs on Izar;
physics was checked independently with strouhal.py (St = 0.1600) and a bit-for-bit
serial/MPI comparison, and all kernel code was reviewed line by line.