This chapter demonstrates how to solve a simple 2D rotation problem (Section 1) using both CPU and GPU implementations. The problem is an excellent example of GPU acceleration with CUDA: the same algorithm runs on both devices, which makes for a clean performance comparison between CPU and GPU implementations.

We need FFTW for the spectral transforms, LinearAlgebra for `norm`, and ProgressMeter for `@showprogress`:

using FFTW
using LinearAlgebra
using ProgressMeter
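The core spectral idea, in one dimension: for the advection equation ∂f/∂t + u·∂f/∂x = 0 with periodic boundary conditions, each Fourier mode picks up a pure phase over one time step, f̂_k(t+Δt) = exp(-i k u Δt) f̂_k(t). A minimal runnable sketch of that single step (grid size and profile chosen for illustration only):

```julia
using FFTW

nx = 256
L  = 2π
x  = range(-π, π - L / nx; length = nx)   # periodic grid, right endpoint excluded
k  = 2π / L .* fftfreq(nx, nx)            # wavenumbers in FFT ordering
f  = exp.(-4 .* x .^ 2)                   # smooth profile, ≈ 0 at the boundaries

u, dt = 1.0, 0.5                          # advect by a distance u*dt
fh = fft(f)
fh .*= exp.(-1im .* k .* u .* dt)         # the spectral phase factor
f_new = real(ifft(fh))

# f_new matches the initial profile shifted by u*dt, to spectral accuracy
err = maximum(abs.(f_new .- exp.(-4 .* (x .- u * dt) .^ 2)))
```

Because the step is exact in Fourier space, the only error for a smooth periodic profile is truncation at the grid resolution, which is negligible here.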
"""
Solve the 2D rotation problem on CPU using Fourier spectral method
with Strang splitting
Algorithm:
The rotation problem couples advection in x and y directions.
We use Strang splitting to decompose into simpler steps:
1. Half-step in y-direction (Fourier method)
2. Full-step in x-direction (Fourier method)
3. Half-step in y-direction (Fourier method)
For the advection equation ∂f/∂t + u·∂f/∂x = 0 with periodic BCs,
the Fourier spectral solution in one time step is:
f̂_k(t+Δt) = exp(-i k u Δt) f̂_k(t)
where f̂_k is the Fourier transform and k is the wavenumber.
Parameters:
- mesh: 2D mesh structure
- nt: number of time steps
- tf: final time
Returns:
- f: final distribution function (real-valued)
"""
function rotation_on_cpu(mesh::Mesh, nt::Int64, tf::Float64)
    # Calculate time step
    dt = tf / nt

    # Allocate distribution function (use complex for FFT compatibility)
    f = zeros(ComplexF64, (mesh.nx, mesh.ny))

    # Initialize with exact solution at t=0
    f .= exact(0.0, mesh)

    # Pre-compute exponential factors for each direction.
    # These encode the spectral solution of each advection step.
    # y-direction: u_y = -x, so phase factor is exp(i·tan(dt/2)·x·k_y)
    exky = exp.(1im * tan(dt / 2) .* mesh.x .* mesh.ky')
    # x-direction: u_x = y, so phase factor is exp(-i·sin(dt)·y·k_x)
    ekxy = exp.(-1im * sin(dt) .* mesh.y' .* mesh.kx)

    # Time stepping loop
    @showprogress for n = 1:nt
        # === STEP 1: Half-step in y-direction ===
        fft!(f, 2)        # FFT in y-direction (dimension 2)
        f .= exky .* f    # multiply by phase factor (spectral multiplication)
        ifft!(f, 2)       # inverse FFT back to physical space

        # === STEP 2: Full-step in x-direction ===
        fft!(f, 1)        # FFT in x-direction (dimension 1)
        f .= ekxy .* f    # multiply by phase factor
        ifft!(f, 1)       # inverse FFT back to physical space

        # === STEP 3: Half-step in y-direction ===
        fft!(f, 2)
        f .= exky .* f
        ifft!(f, 2)
    end

    # Return real part (imaginary parts are numerical errors)
    real(f)
end
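One way to see why the phase factors use tan(dt/2) and sin(dt) rather than dt/2 and dt: each sub-step is a shear map of the plane, and this particular choice of amounts is the classic three-shear decomposition of a rotation, so the composed map rotates by exactly dt per step. A quick 2×2 matrix check of the identity (the angle θ is arbitrary):

```julia
θ  = 0.3
Sx = [1.0 -tan(θ / 2); 0.0 1.0]          # shear along x by -tan(θ/2)·y (half step)
Sy = [1.0 0.0; sin(θ) 1.0]               # shear along y by  sin(θ)·x   (full step)
R  = [cos(θ) -sin(θ); sin(θ) cos(θ)]     # rotation by θ

err = maximum(abs.(Sx * Sy * Sx .- R))   # the three shears compose to R exactly
```

Each shear moves points along one axis only, which is precisely what a one-dimensional Fourier advection step can do; that is why the splitting has no angular error.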
# Set up the mesh: 1024×1024 grid in [-π, π]²
mesh = Mesh(-π, π, 1024, -π, π, 1024)
# Parameters
nt, tf = 100, 20. # 100 time steps to t=20
# Quick test with 1 step
rotation_on_cpu(mesh, 1, 0.1)
# Time the CPU computation and compare with exact solution
@time norm(rotation_on_cpu(mesh, nt, tf) .- exact(tf, mesh))
Output interpretation:
- The `@time` macro reports computation time and memory allocation.
- `norm(solution - exact)` gives the L² error between the numerical and exact solutions.

The same algorithm can be implemented on the GPU for a large speedup: modern GPUs can achieve 10-100× speedups for FFT-heavy operations.
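What `norm` measures here, in a tiny sketch (the two matrices are stand-ins for the numerical and exact solution arrays):

```julia
using LinearAlgebra

numerical = [1.0 2.0; 3.0 4.0]
reference = [1.0 2.1; 2.9 4.0]

# For matrices, norm computes the Frobenius norm of the difference:
# the square root of the sum of squared entry-wise errors (a discrete L² error)
err = norm(numerical .- reference)
```

Here `err` equals sqrt(0.1² + 0.1²), since only two entries differ.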
using CUDA
using CUDA.CUFFT

- CUDA.jl provides the GPU array type `CuArray`, which works like a regular Julia array but lives in GPU memory. (In older package versions this lived in the separate CuArrays package, since merged into CUDA.jl.)
- CUDA.CUFFT wraps the CUDA-accelerated FFT library, which is much faster than CPU FFTW for large arrays.

"""
Solve the 2D rotation problem on GPU using CUDA acceleration
The algorithm is identical to the CPU version, but:
1. Arrays are allocated in GPU memory (CuArray)
2. FFT plans are created for GPU execution
3. All operations happen on the GPU
4. Only the final result is transferred back to CPU
This is MUCH faster for large arrays because:
- GPU FFTs are highly optimized
- GPU memory bandwidth is much higher
- No CPU-GPU communication during the time loop
Parameters:
- mesh: 2D mesh structure (used for dimensions and wavenumbers)
- nt: number of time steps
- tf: final time
Returns:
- f: final distribution function on CPU (real-valued)
"""
function rotation_on_gpu(mesh::Mesh, nt::Int64, tf::Float64)
    # Calculate time step
    dt = tf / nt

    # Initialize on CPU
    f = zeros(ComplexF64, (mesh.nx, mesh.ny))
    f .= exact(0.0, mesh)

    # === Transfer to GPU ===
    d_f = CuArray(f)    # allocate f in GPU memory

    # === Create FFT plans ===
    # Creating each plan once is more efficient than planning on every FFT call;
    # the plans are reused on every iteration.
    p_x    = plan_fft!(d_f, [1])     # forward FFT along dimension 1 (x-direction)
    pinv_x = plan_ifft!(d_f, [1])    # inverse FFT along dimension 1
    p_y    = plan_fft!(d_f, [2])     # forward FFT along dimension 2 (y-direction)
    pinv_y = plan_ifft!(d_f, [2])    # inverse FFT along dimension 2

    # === Transfer phase factors to GPU ===
    # Pre-compute the exponential factors and move them to GPU memory once,
    # so nothing is transferred from the CPU during the time loop.
    d_exky = CuArray(exp.(1im * tan(dt / 2) .* mesh.x .* mesh.ky'))
    d_ekxy = CuArray(exp.(-1im * sin(dt) .* mesh.y' .* mesh.kx))

    # === Time stepping loop (all on GPU) ===
    @showprogress for n = 1:nt
        # === STEP 1: Half-step in y-direction ===
        p_y * d_f          # forward FFT in y (in place, GPU plan)
        d_f .*= d_exky     # multiply by phase factor on GPU
        pinv_y * d_f       # inverse FFT in y

        # === STEP 2: Full-step in x-direction ===
        p_x * d_f          # forward FFT in x
        d_f .*= d_ekxy     # multiply by phase factor on GPU
        pinv_x * d_f       # inverse FFT in x

        # === STEP 3: Half-step in y-direction ===
        p_y * d_f
        d_f .*= d_exky
        pinv_y * d_f
    end

    # === Transfer result back to CPU ===
    f .= collect(d_f)    # collect() copies the GPU array back to CPU memory

    # Return real part
    real(f)
end
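The plan-then-apply pattern above is identical on the CPU with FFTW, which makes it easy to exercise the mechanics without a GPU. A sketch (array sizes are arbitrary):

```julia
using FFTW

a = rand(ComplexF64, 64, 64)
b = copy(a)                   # keep the original for comparison

p    = plan_fft!(a, 1)        # in-place forward FFT plan along dimension 1
pinv = plan_ifft!(a, 1)       # in-place inverse FFT plan along dimension 1

p * a                         # applies the transform in place (returns a)
pinv * a                      # transforms back in place

err = maximum(abs.(a .- b))   # round trip recovers the data up to rounding
```

Applying a plan with `*` avoids re-planning on every call, which is exactly why the GPU version creates its four plans once, outside the time loop.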
GPU vs CPU Key Differences:

| Aspect | CPU | GPU |
|---|---|---|
| Array type | Array{ComplexF64} | CuArray{ComplexF64} |
| FFT library | FFTW | CUFFT |
| Memory | CPU RAM | GPU VRAM |
| FFT planning | Done before loops | Critical for performance |
| Data transfer | Implicit | Explicit with CuArray / collect |
| Speedup | Baseline | 10-100× for large arrays |
Optimization Principles Applied:
- `fft!` and the in-place FFT plans modify arrays in place, saving memory.
- Phase factors and FFT plans are computed once, before the time loop.
- Data stays on the GPU for the entire time loop; only the final result is copied back.

# Same mesh and parameters
nt, tf = 100, 20.
# Quick test with 1 step
rotation_on_gpu(mesh, 1, 0.1)
# Time the GPU computation and compare with exact solution
@time norm(rotation_on_gpu(mesh, nt, tf) .- exact(tf, mesh))
Expected Results:
- The GPU version should be significantly faster than the CPU version (roughly 10-100×, depending on the GPU).
- The numerical error should be similar to the CPU version (same algorithm).
- The GPU shines with larger arrays and more time steps.
To properly compare CPU and GPU:
# CPU timing
time_cpu = @elapsed rotation_on_cpu(mesh, nt, tf)
# GPU timing (excluding first run which includes compilation)
rotation_on_gpu(mesh, 1, 0.1) # Warmup
time_gpu = @elapsed rotation_on_gpu(mesh, nt, tf)
# Speedup factor
speedup = time_cpu / time_gpu
println("CPU time: $(time_cpu) seconds")
println("GPU time: $(time_gpu) seconds")
println("Speedup: $(speedup)×")
Where GPU Wins:
- Large arrays (such as the 1024×1024 grid here), where the highly optimized CUFFT transforms dominate the cost
- Many time steps, since data stays in GPU memory for the whole loop and transfer costs are amortized

Where CPU is Competitive:
- Small arrays, where kernel-launch and CPU-GPU transfer overhead dominates
- Short runs with few time steps, where the one-time GPU setup and compilation cost is not amortized