## Benchmarking
### Small Grid

```julia
@time main_gpu(2^10; animation=true)
```
Expected output:
- Fast execution (under 1 second for the 2^10 = 1024-point grid)
- Modest GPU speedup at this size; kernel-launch and transfer overhead limits the benefit
- Animation creation on the CPU takes most of the time for small grids
**Performance note:**
- `@time` includes compilation overhead on the first run
- Subsequent runs are faster (JIT-compiled)
- Run `@time` multiple times to see the true steady-state performance
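To separate compilation cost from steady-state cost, time the same call twice, or use BenchmarkTools.jl, which runs the expression repeatedly and reports statistics. A minimal sketch, assuming BenchmarkTools.jl is installed and `main_gpu` is this tutorial's driver function:

```julia
using BenchmarkTools

# First call pays the JIT compilation cost; the second shows steady-state timing:
@time main_gpu(2^10; animation=false)   # includes compilation
@time main_gpu(2^10; animation=false)   # compiled, representative timing

# @btime runs the call many times and reports the minimum, excluding compilation:
@btime main_gpu(2^10; animation=false)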
### Large Grid

```julia
@time main_gpu(2^14; animation=false)
```
Expected output:
- Significant computation time (seconds to tens of seconds, depending on the GPU)
- This is where GPU acceleration really shows its benefit
- `animation=false` skips the expensive visualization step
### Comparison with CPU

```julia
@time main_cpu(2^14; animation=false)
```
**Performance comparison:**

| Grid Size | CPU Time | GPU Time | Speedup |
|-----------|----------|----------|---------|
| 2^10      | ~0.1 s   | ~0.05 s  | 2×      |
| 2^12      | ~1-2 s   | ~0.2 s   | 5-10×   |
| 2^14      | ~20-50 s | ~1-5 s   | 10-50×  |
Factors affecting speedup:
- FFT efficiency: GPU FFTs become more efficient with larger arrays
- Overhead: GPU overhead dominates for small arrays
- Memory bandwidth: GPU memory bandwidth advantage grows with problem size
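The FFT-efficiency point can be checked directly by timing a planned FFT on both devices at increasing sizes. A sketch, assuming CUDA.jl and FFTW.jl are installed and a CUDA-capable GPU is present:

```julia
using FFTW, CUDA

for n in (2^10, 2^12, 2^14)
    x_cpu = rand(ComplexF32, n)
    x_gpu = CuArray(x_cpu)

    p_cpu = plan_fft(x_cpu)       # plan once; reuse for every transform
    p_gpu = plan_fft(x_gpu)

    p_cpu * x_cpu; p_gpu * x_gpu  # warm-up: pay compilation cost up front

    t_cpu = @elapsed p_cpu * x_cpu
    t_gpu = CUDA.@elapsed p_gpu * x_gpu   # times with CUDA events, synchronizing first
    println("n = $n: CPU $t_cpu s, GPU $t_gpu s")
end
```

`CUDA.@elapsed` is used instead of `@elapsed` on the GPU side because GPU kernels launch asynchronously; timing without synchronization would measure only the launch.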
### Where GPU Wins
- Large arrays: the FFT dominates total runtime
- Many iterations: Overhead amortized over many steps
- Memory-bound operations: GPU has higher memory bandwidth
### CPU Still Competitive For
- Small arrays: Overhead of GPU transfer dominates
- Few time steps: GPU startup not worth it
- Interactive development: CPU simpler to debug
### GPU Implementation Advantages
- Computation speed: 10-100× faster for large problems
- Memory efficiency: GPU memory is not a bottleneck for this problem
- Scalability: Can run 2^16 or larger grids on modern GPUs
## Physical Interpretation
Both CPU and GPU simulations solve the identical equations with identical initial conditions:
- Wave dynamics: Bell curve evolves under nonlinear deep water equations
- Dispersive spreading: Different frequencies travel at different speeds
- Nonlinear interaction: Wave steepening and energy transfer to higher wavenumbers
- Spectral cascade: Energy moves from initial mode to higher frequencies
The GPU version simply computes this much faster than the CPU version.
## Best Practices for GPU Spectral Simulations
- Pre-allocate everything: Avoid allocations in time loop
- Plan FFTs before loop: FFT planning is expensive, do it once
- Use work arrays: Reuse arrays instead of creating new ones
- Fuse operations: use the `@.` macro to combine element-wise operations
- Minimize transfers: Keep solution on GPU as much as possible
- Batch snapshots: Save multiple time steps at once if possible
- Use 32-bit floats if possible: Float32 is 2-4× faster than Float64
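A minimal sketch combining these practices in a generic pseudo-spectral loop. The field `u`, the phase-rotation "update", and the snapshot cadence are placeholders, not this tutorial's actual solver; assumes CUDA.jl, FFTW.jl, and a CUDA-capable GPU:

```julia
using CUDA, FFTW

N = 2^12
u = CUDA.rand(ComplexF32, N)   # solution stays on the GPU (Float32 precision)
û = similar(u)                 # pre-allocated work array, reused every step

p  = plan_fft!(û)              # plan in-place transforms once, before the loop
ip = plan_ifft!(û)

nsteps = 10_000
for step in 1:nsteps
    copyto!(û, u)
    p * û                      # forward FFT in place
    @. û *= cis(-0.001f0)      # fused element-wise update (placeholder RHS)
    ip * û                     # inverse FFT in place
    copyto!(u, û)
    step % 1000 == 0 && (snapshot = Array(u))  # CPU transfer only for snapshots
end
```

No allocations occur inside the loop: both FFT plans and both arrays are created once, and `@.` fuses the element-wise update into a single GPU kernel.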
## Troubleshooting GPU Issues
If you get errors:
- "Not allowed to access non-isbits type": the array's element type is not a plain bits type; use concrete isbits element types such as `Float32` or `ComplexF32`
- "Scalar indexing not allowed": you indexed a GPU array one element at a time (e.g. `x[i]` on a `CuArray`). Use `collect()` (or `Array()`) to transfer to the CPU first
- "Out of memory": GPU VRAM exceeded. Reduce the grid size `N` or use `Float32`
- Slow performance: usually means unnecessary CPU-GPU transfers. Profile with Nsight Systems (or the legacy `nvprof` on older GPUs)
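The two most common of these can be demonstrated and guarded against in a few lines. A sketch, assuming CUDA.jl and a CUDA-capable GPU:

```julia
using CUDA

x = CUDA.rand(Float32, 1024)

# "Scalar indexing not allowed": x[1] would copy one element per access.
# Disallow it globally so accidental scalar access fails fast:
CUDA.allowscalar(false)

# Transfer the whole array to the CPU instead, then index normally:
x_cpu = collect(x)        # or Array(x)
first_val = x_cpu[1]      # fine: ordinary CPU indexing

# "Out of memory": drop references to unused GPU arrays, then
# return cached GPU memory to the driver:
x = nothing
CUDA.reclaim()
```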