EPFL · MATH-454 Parallel & High-Performance Computing

Lattice Boltzmann,
parallelised & memory-bound.

A 2D D2Q9 solver for flow past a cylinder, taken from a serial reference to MPI (distributed CPUs) and a hand-written CUDA kernel on an NVIDIA V100 — and every result traced back to one thing: memory bandwidth.

Vorticity field — Kármán vortex street at Re=100

Vorticity field at Re = 100 — the Kármán vortex street the solver reproduces and is validated against (St = 0.1600).

77%
of the V100 roofline, both kernels
3.8×
SoA vs AoS coalescing speedup
8791
MLUPS — fused, single precision
0.1600
Strouhal number, every variant
The thesis

A memory-bound stencil

The D2Q9 BGK update is purely local — collide, bounce-back, stream, boundaries — and does only ~0.9 FLOP per byte moved. That sits far to the left of the machine's roofline ridge, so the kernel is starved of bytes, not flops. Everything downstream follows from this.

Roofline — single Cascade-Lake core

Arithmetic intensity ≈ 0.9 FLOP/B vs the 8.3 FLOP/B ridge.

Where the runtime goes (gprof, -O3)

Two local loops dominate; there is essentially no serial fraction (α ≈ 0.999).

Consequence: with α ≈ 0.999, Amdahl predicts near-ideal speedup — so wherever scaling falls short, the cause is the memory system (shared socket bandwidth), not serial code.
MPI · slide 1

01Strong scaling vs Amdahl

Fixed grid, more ranks (single node, median of 3). Three problem sizes span the cache-to-DRAM regime. The dashed line is the Amdahl bound at the measured α = 0.999.

Speedup S(P)

Efficiency E = S/P

Near-ideal to P = 4 (E ≥ 0.93), then it bends as the shared 76.8 GB/s socket saturates. The largest grid keeps the best efficiency — the halo cost scales as P/nᵧ, so more rows per rank dilutes it.
MPI · slide 2

02Weak scaling vs Gustafson

Fixed 1024×256 cells per rank; the global grid grows with P. Per-rank work and halo are both constant, so any efficiency loss is not communication.

Throughput & efficiency

MLUPS rises 22 → 199 (~9×) but only sub-linearly; efficiency falls to 0.57 at P = 16. The aggregate demand hits the STREAM ceiling — Gustafson holds only while bandwidth lasts.
CUDA · slide 3

03CUDA on the V100

One thread per cell, SoA layout. HBM2 ≈ 900 GB/s sets the roofline: 3125 MLUPS for the two-pass kernels (288 B/update), 6250 for the fused kernel (144 B/update).

Problem-size scaling vs roofline

Small grids can't fill the 80 SMs; saturates near 77% from 1024².

Memory layout: SoA vs AoS

The single most important decision — coalescing is worth 3.8×.

Threads-per-block sweep (2048²)

A second-order knob — flat for any tpb ≥ 64.

Both kernels sit at the same 77%

Strong evidence the implementation is bandwidth-bound; fusion's ~2× is halved traffic.

Going further · slide 4

04Two more bandwidth stories

Both bonus studies say the same thing a third and fourth way: a change helps exactly in proportion to how bandwidth-bound the run is.

Single precision — GPU vs CPU

float halves the bytes. GPU gets ~2×; the CPU gain grows with core count.

CPU single/double speedup vs ranks

1 core is latency-bound; only a saturated socket makes bytes binding.

Inter-node strong scaling (to 40 ranks)

One y-slab seam crosses InfiniBand; scales to 385 MLUPS at P = 40.

P = 20: packed vs split

Splitting 10+10 across two nodes beats packing 20 — a second node doubles bandwidth.

The surprise: the split placement is ~24% faster despite being the only one that pays the InfiniBand cost. For a memory-bound kernel, more sockets beats keeping the halo local.
Under the hood

Implementation

MPI — 1D row decomposition

  • Horizontal row chunks, one band per rank, single ghost row above/below.
  • Two fixed-size 9·nx-double halo messages per step, regardless of P.
  • The y-split keeps each chunk contiguous in SoA and parallelises the boundary kernels too.
  • Bit-for-bit identical to the serial reference.

CUDA — one thread per cell

  • Structure-of-arrays → coalesced 128-byte transactions.
  • Lattice constants in __constant__ memory.
  • Fused stream+collide kernel (register-resident) halves global traffic.
  • Device-side probe → zero per-step PCIe traffic.

Build & run

Each parallel version is self-contained with its own Makefile + README.

make mpi · make mpi_single  |  make cuda · make cuda_aos · make cuda_single
srun -n 16 ./lbm_mpi nx=1024 ny=512 steps=2000
srun --gres=gpu:1 ./lbm_cuda nx=2048 ny=2048 fused=1 tpb=256

Run on EPFL's Izar cluster — 2× Xeon Gold 6230 + NVIDIA V100, Slurm.

AI usage

Claude / Claude Code assisted with the benchmark Slurm scripts, the figure-plotting code, wiring the CUDA warmup argument and the single-precision / AoS build targets, and structuring the write-up. Every performance number here is from my own Slurm jobs on Izar; physics was checked independently with strouhal.py (St = 0.1600) and a bit-for-bit serial/MPI comparison, and all kernel code was reviewed line by line.