This chapter demonstrates how to solve a simple 2D rotation problem (Section 1) using both CPU and GPU implementations. The problem is an excellent example of GPU acceleration with CUDA: the same algorithm runs on both devices, which makes for a clean performance comparison between CPU and GPU implementations.

We need FFTW for the spectral transforms, LinearAlgebra for `norm`, and ProgressMeter for `@showprogress`:

using FFTW
using LinearAlgebra
using ProgressMeter
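The core spectral idea, in one dimension: for the advection equation ∂f/∂t + u·∂f/∂x = 0 with periodic boundary conditions, each Fourier mode picks up a pure phase over one time step, f̂_k(t+Δt) = exp(-i k u Δt) f̂_k(t). A minimal runnable sketch of that single step (grid size and profile chosen for illustration only):

```julia
using FFTW

nx = 256
L  = 2π
x  = range(-π, π - L / nx; length = nx)   # periodic grid, right endpoint excluded
k  = 2π / L .* fftfreq(nx, nx)            # wavenumbers in FFT ordering
f  = exp.(-4 .* x .^ 2)                   # smooth profile, ≈ 0 at the boundaries

u, dt = 1.0, 0.5                          # advect by a distance u*dt
fh = fft(f)
fh .*= exp.(-1im .* k .* u .* dt)         # the spectral phase factor
f_new = real(ifft(fh))

# f_new matches the initial profile shifted by u*dt, to spectral accuracy
err = maximum(abs.(f_new .- exp.(-4 .* (x .- u * dt) .^ 2)))
```

Because the step is exact in Fourier space, the only error for a smooth periodic profile is truncation at the grid resolution, which is negligible here.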
"""
Solve the 2D rotation problem on CPU using Fourier spectral method
with Strang splitting
Algorithm:
The rotation problem couples advection in x and y directions.
We use Strang splitting to decompose into simpler steps:
1. Half-step in y-direction (Fourier method)
2. Full-step in x-direction (Fourier method)
3. Half-step in y-direction (Fourier method)
For the advection equation ∂f/∂t + u·∂f/∂x = 0 with periodic BCs,
the Fourier spectral solution in one time step is:
f̂_k(t+Δt) = exp(-i k u Δt) f̂_k(t)
where f̂_k is the Fourier transform and k is the wavenumber.
Parameters:
- mesh: 2D mesh structure
- nt: number of time steps
- tf: final time
Returns:
- f: final distribution function (real-valued)
"""
function rotation_on_cpu(mesh::Mesh, nt::Int64, tf::Float64)
    # Calculate time step
    dt = tf / nt

    # Allocate distribution function (use complex for FFT compatibility)
    f = zeros(ComplexF64, (mesh.nx, mesh.ny))

    # Initialize with exact solution at t=0
    f .= exact(0.0, mesh)

    # Pre-compute exponential factors for each direction.
    # These encode the spectral solution of each advection step.
    # y-direction: u_y = -x, so phase factor is exp(i·tan(dt/2)·x·k_y)
    exky = exp.(1im * tan(dt / 2) .* mesh.x .* mesh.ky')
    # x-direction: u_x = y, so phase factor is exp(-i·sin(dt)·y·k_x)
    ekxy = exp.(-1im * sin(dt) .* mesh.y' .* mesh.kx)

    # Time stepping loop
    @showprogress for n = 1:nt
        # === STEP 1: Half-step in y-direction ===
        fft!(f, 2)        # FFT in y-direction (dimension 2)
        f .= exky .* f    # multiply by phase factor (spectral multiplication)
        ifft!(f, 2)       # inverse FFT back to physical space

        # === STEP 2: Full-step in x-direction ===
        fft!(f, 1)        # FFT in x-direction (dimension 1)
        f .= ekxy .* f    # multiply by phase factor
        ifft!(f, 1)       # inverse FFT back to physical space

        # === STEP 3: Half-step in y-direction ===
        fft!(f, 2)
        f .= exky .* f
        ifft!(f, 2)
    end

    # Return real part (imaginary parts are numerical errors)
    real(f)
end
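One way to see why the phase factors use tan(dt/2) and sin(dt) rather than dt/2 and dt: each sub-step is a shear map of the plane, and this particular choice of amounts is the classic three-shear decomposition of a rotation, so the composed map rotates by exactly dt per step. A quick 2×2 matrix check of the identity (the angle θ is arbitrary):

```julia
θ  = 0.3
Sx = [1.0 -tan(θ / 2); 0.0 1.0]          # shear along x by -tan(θ/2)·y (half step)
Sy = [1.0 0.0; sin(θ) 1.0]               # shear along y by  sin(θ)·x   (full step)
R  = [cos(θ) -sin(θ); sin(θ) cos(θ)]     # rotation by θ

err = maximum(abs.(Sx * Sy * Sx .- R))   # the three shears compose to R exactly
```

Each shear moves points along one axis only, which is precisely what a one-dimensional Fourier advection step can do; that is why the splitting has no angular error.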
# Set up the mesh: 1024×1024 grid in [-π, π]²
mesh = Mesh(-π, π, 1024, -π, π, 1024)
# Parameters
nt, tf = 100, 20. # 100 time steps to t=20
# Quick test with 1 step
rotation_on_cpu(mesh, 1, 0.1)
# Time the CPU computation and compare with exact solution
@time norm(rotation_on_cpu(mesh, nt, tf) .- exact(tf, mesh))
Output interpretation:
- The `@time` macro reports computation time and memory allocation.
- `norm(solution - exact)` gives the L² error between the numerical and exact solutions.

The same algorithm can be implemented on the GPU for a large speedup: modern GPUs can achieve 10-100× speedups for FFT-heavy operations.
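What `norm` measures here, in a tiny sketch (the two matrices are stand-ins for the numerical and exact solution arrays):

```julia
using LinearAlgebra

numerical = [1.0 2.0; 3.0 4.0]
reference = [1.0 2.1; 2.9 4.0]

# For matrices, norm computes the Frobenius norm of the difference:
# the square root of the sum of squared entry-wise errors (a discrete L² error)
err = norm(numerical .- reference)
```

Here `err` equals sqrt(0.1² + 0.1²), since only two entries differ.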
using CUDA
using CUDA.CUFFT

- CUDA.jl provides the GPU array type `CuArray`, which works like a regular Julia array but lives in GPU memory. (In older package versions this lived in the separate CuArrays package, since merged into CUDA.jl.)
- CUDA.CUFFT wraps the CUDA-accelerated FFT library, which is much faster than CPU FFTW for large arrays.

"""
Solve the 2D rotation problem on GPU using CUDA acceleration
The algorithm is identical to the CPU version, but:
1. Arrays are allocated in GPU memory (CuArray)
2. FFT plans are created for GPU execution
3. All operations happen on the GPU
4. Only the final result is transferred back to CPU
This is MUCH faster for large arrays because:
- GPU FFTs are highly optimized
- GPU memory bandwidth is much higher
- No CPU-GPU communication during the time loop
Parameters:
- mesh: 2D mesh structure (used for dimensions and wavenumbers)
- nt: number of time steps
- tf: final time
Returns:
- f: final distribution function on CPU (real-valued)
"""
function rotation_on_gpu(mesh::Mesh, nt::Int64, tf::Float64)
    # Calculate time step
    dt = tf / nt

    # Initialize on CPU
    f = zeros(ComplexF64, (mesh.nx, mesh.ny))
    f .= exact(0.0, mesh)

    # === Transfer to GPU ===
    d_f = CuArray(f)    # allocate f in GPU memory

    # === Create FFT plans ===
    # Creating each plan once is more efficient than planning on every FFT call;
    # the plans are reused on every iteration.
    p_x    = plan_fft!(d_f, [1])     # forward FFT along dimension 1 (x-direction)
    pinv_x = plan_ifft!(d_f, [1])    # inverse FFT along dimension 1
    p_y    = plan_fft!(d_f, [2])     # forward FFT along dimension 2 (y-direction)
    pinv_y = plan_ifft!(d_f, [2])    # inverse FFT along dimension 2

    # === Transfer phase factors to GPU ===
    # Pre-compute the exponential factors and move them to GPU memory once,
    # so nothing is transferred from the CPU during the time loop.
    d_exky = CuArray(exp.(1im * tan(dt / 2) .* mesh.x .* mesh.ky'))
    d_ekxy = CuArray(exp.(-1im * sin(dt) .* mesh.y' .* mesh.kx))

    # === Time stepping loop (all on GPU) ===
    @showprogress for n = 1:nt
        # === STEP 1: Half-step in y-direction ===
        p_y * d_f          # forward FFT in y (in place, GPU plan)
        d_f .*= d_exky     # multiply by phase factor on GPU
        pinv_y * d_f       # inverse FFT in y

        # === STEP 2: Full-step in x-direction ===
        p_x * d_f          # forward FFT in x
        d_f .*= d_ekxy     # multiply by phase factor on GPU
        pinv_x * d_f       # inverse FFT in x

        # === STEP 3: Half-step in y-direction ===
        p_y * d_f
        d_f .*= d_exky
        pinv_y * d_f
    end

    # === Transfer result back to CPU ===
    f .= collect(d_f)    # collect() copies the GPU array back to CPU memory

    # Return real part
    real(f)
end
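The plan-then-apply pattern above is identical on the CPU with FFTW, which makes it easy to exercise the mechanics without a GPU. A sketch (array sizes are arbitrary):

```julia
using FFTW

a = rand(ComplexF64, 64, 64)
b = copy(a)                   # keep the original for comparison

p    = plan_fft!(a, 1)        # in-place forward FFT plan along dimension 1
pinv = plan_ifft!(a, 1)       # in-place inverse FFT plan along dimension 1

p * a                         # applies the transform in place (returns a)
pinv * a                      # transforms back in place

err = maximum(abs.(a .- b))   # round trip recovers the data up to rounding
```

Applying a plan with `*` avoids re-planning on every call, which is exactly why the GPU version creates its four plans once, outside the time loop.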
GPU vs CPU Key Differences:

| Aspect | CPU | GPU |
|---|---|---|
| Array type | Array{ComplexF64} | CuArray{ComplexF64} |
| FFT library | FFTW | CUFFT |
| Memory | CPU RAM | GPU VRAM |
| FFT planning | Done before loops | Critical for performance |
| Data transfer | Implicit | Explicit with CuArray / collect |
| Speedup | Baseline | 10-100× for large arrays |
Optimization Principles Applied:
- `fft!` and the in-place FFT plans modify arrays in place, saving memory.
- Phase factors and FFT plans are computed once, before the time loop.
- Data stays on the GPU for the entire time loop; only the final result is copied back.

# Same mesh and parameters
nt, tf = 100, 20.
# Quick test with 1 step
rotation_on_gpu(mesh, 1, 0.1)
# Time the GPU computation and compare with exact solution
@time norm(rotation_on_gpu(mesh, nt, tf) .- exact(tf, mesh))
Expected Results:
- The GPU version should be significantly faster than the CPU version (roughly 10-100×, depending on the GPU).
- The numerical error should be similar to the CPU version (same algorithm).
- The GPU shines with larger arrays and more time steps.
To properly compare CPU and GPU:
# CPU timing
time_cpu = @elapsed rotation_on_cpu(mesh, nt, tf)
# GPU timing (excluding first run which includes compilation)
rotation_on_gpu(mesh, 1, 0.1) # Warmup
time_gpu = @elapsed rotation_on_gpu(mesh, nt, tf)
# Speedup factor
speedup = time_cpu / time_gpu
println("CPU time: $(time_cpu) seconds")
println("GPU time: $(time_gpu) seconds")
println("Speedup: $(speedup)×")
Where GPU Wins:
- Large arrays (such as the 1024×1024 grid here), where the highly optimized CUFFT transforms dominate the cost
- Many time steps, since data stays in GPU memory for the whole loop and transfer costs are amortized

Where CPU is Competitive:
- Small arrays, where kernel-launch and CPU-GPU transfer overhead dominates
- Short runs with few time steps, where the one-time GPU setup and compilation cost is not amortized