5.2 Benchmarking

5.2.1 Small Grid

@time main_gpu(2^10; animation=true)

Expected output:

Performance Note:

5.2.2 Large Grid

@time main_gpu(2^14; animation=false)

Expected output:

  - Significant computation time (seconds to tens of seconds, depending on the GPU)
  - This is where GPU acceleration really shows its benefit
  - animation=false skips the expensive creation of the visualization

5.2.3 Comparison with CPU

@time main_cpu(2^14; animation=false)

Performance Comparison:

Grid Size   CPU Time    GPU Time   Speedup
2^10        ~0.1 s      ~0.05 s    ~2×
2^12        ~1-2 s      ~0.2 s     5-10×
2^14        ~20-50 s    ~1-5 s     10-50×
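
Timings like those in the table are sensitive to Julia's just-in-time compilation, since the first call to a function includes compile time. The sketch below shows one way to time a call after a warm-up run and to turn two timings into a speedup ratio. The two solver functions are hypothetical stand-ins for main_cpu and main_gpu, and the helper names timed and speedup are illustrative, not part of the tutorial's code.

```julia
# Hypothetical stand-in workloads; in the tutorial these would be
# main_cpu(N; animation=false) and main_gpu(N; animation=false).
slow_solver(n) = sum(i -> sqrt(abs(sin(i))), 1:n)
fast_solver(n) = sum(i -> sqrt(abs(sin(i))), 1:4:n)   # does a quarter of the work

# Time f(args...) after one warm-up call; return the best of `repeats` runs in seconds.
function timed(f, args...; repeats=3)
    f(args...)                      # warm-up: triggers JIT compilation
    best = Inf
    for _ in 1:repeats
        t = @elapsed f(args...)
        best = min(best, t)
    end
    return best
end

# Ratio of a reference time to a new time (both in seconds).
speedup(t_ref, t_new) = t_ref / t_new

t_slow = timed(slow_solver, 10^6)
t_fast = timed(fast_solver, 10^6)
println("speedup ≈ ", round(speedup(t_slow, t_fast); digits=1), "×")
```

GPU calls in CUDA.jl are asynchronous, so a real GPU timing should also wait for the device to finish, for example with CUDA.@sync or CUDA.@time, unless the timed function already copies its result back to the CPU.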

Factors affecting speedup:


5.2.4 Performance Analysis

Where GPU Wins

  1. Large arrays: FFT cost dominates overall cost
  2. Many iterations: Overhead amortized over many steps
  3. Memory-bound operations: GPU has higher memory bandwidth

CPU Still Competitive For

  1. Small arrays: Overhead of GPU transfer dominates
  2. Few time steps: GPU startup not worth it
  3. Interactive development: CPU simpler to debug
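
The trade-off between the two lists above can be made concrete with a toy cost model: the GPU pays a fixed overhead (startup and transfers) and then a smaller cost per time step. The sketch below is purely illustrative; the function names and every number in it are assumptions, not measurements from this tutorial.

```julia
# Toy cost model; every number below is an illustrative assumption, not a measurement.
cpu_time(steps; step_cost) = steps * step_cost
gpu_time(steps; step_cost, overhead) = overhead + steps * step_cost

# Number of time steps after which the GPU run becomes faster than the CPU run.
function break_even_steps(; cpu_step, gpu_step, overhead)
    cpu_step > gpu_step || return Inf      # GPU never catches up
    return overhead / (cpu_step - gpu_step)
end

n = break_even_steps(cpu_step=2e-3, gpu_step=2e-4, overhead=0.5)
println("GPU pays off after about ", ceil(Int, n), " steps")
```

With few time steps the fixed overhead dominates and the CPU wins; past the break-even point the overhead is amortized and the per-step advantage takes over.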

GPU Implementation Advantages

  1. Computation speed: 10-50× faster at 2^14 (see the comparison in 5.2.3), potentially more for larger grids
  2. Memory efficiency: GPU memory is not a bottleneck for this problem
  3. Scalability: Can run 2^16 or larger grids on modern GPUs

Physical Interpretation

Both CPU and GPU simulations solve the identical equations with identical initial conditions:

  1. Wave dynamics: Bell curve evolves under nonlinear deep water equations
  2. Dispersive spreading: Different frequencies travel at different speeds
  3. Nonlinear interaction: Wave steepening and energy transfer to higher wavenumbers
  4. Spectral cascade: Energy moves from initial mode to higher frequencies

The GPU version simply computes this much faster than the CPU version.
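
Since both backends should produce the same solution, it is worth checking that they agree to within floating-point tolerance before trusting a speedup figure. The sketch below shows one possible check using a relative maximum-norm difference. The arrays are hypothetical stand-ins (a Gaussian and a copy of it rounded through Float32), and rel_error is an illustrative helper name.

```julia
# Relative maximum-norm difference between two solution arrays.
rel_error(a, b) = maximum(abs, a .- b) / maximum(abs, b)

# Hypothetical stand-ins for the final surface elevation from each backend.
x = range(-10, 10; length=1024)
η_cpu = exp.(-x .^ 2)                    # Float64 reference
η_gpu = Float64.(Float32.(η_cpu))        # same data after a Float32 round trip

println("relative difference: ", rel_error(η_gpu, η_cpu))
@assert rel_error(η_gpu, η_cpu) < 1e-6   # agreement to single precision
```

In a real comparison the GPU result would first be copied to the host, for example with Array(...), and the tolerance chosen to match the precision used on the device.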

Best Practices for GPU Spectral Simulations

  1. Pre-allocate everything: Avoid allocations in time loop
  2. Plan FFTs before loop: FFT planning is expensive, do it once
  3. Use work arrays: Reuse arrays instead of creating new ones
  4. Fuse operations: Use @. macro to combine element-wise ops
  5. Minimize transfers: Keep solution on GPU as much as possible
  6. Batch snapshots: Save multiple time steps at once if possible
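  7. Use 32-bit floats if possible: Float32 halves memory traffic and is typically 2-4× faster than Float64; the gap can be larger on consumer GPUs with limited double-precision throughput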
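
Several of these practices can be illustrated without a GPU. The sketch below runs on the CPU and shows pre-allocation, a reused work array, fused broadcasts with @., Float32 storage, and a pre-allocated snapshot buffer. The update rule, array names, and sizes are hypothetical and are not the tutorial's solver; FFT planning (practice 2) is left out because it needs an FFT package.

```julia
# CPU-only sketch of practices 1, 3, 4, 6 and 7; names and update rule are hypothetical.
N, nsteps, save_every = 256, 100, 25
T = Float32

k    = T.(0:N-1)                                   # stand-in wavenumbers
u    = exp.(-T.(range(-5, 5; length=N)) .^ 2)      # bell-curve initial data
work = similar(u)                                  # work array, allocated once
snapshots = Matrix{T}(undef, N, nsteps ÷ save_every)   # output buffer, allocated once

function step!(u, work, k, dt)
    @. work = exp(-dt * k^2) * u     # one fused loop, no temporaries
    @. u = work - dt * work^2        # result written back in place
    return u
end

function run!(u, work, k, snapshots, dt, nsteps, save_every)
    for n in 1:nsteps
        step!(u, work, k, dt)
        if n % save_every == 0
            snapshots[:, n ÷ save_every] .= u   # copy into a preallocated column
        end
    end
    return snapshots
end

run!(u, work, k, snapshots, T(1e-3), nsteps, save_every)
println("saved ", size(snapshots, 2), " snapshots of length ", size(snapshots, 1))
```

Broadcast expressions of this kind are written the same way for GPU arrays, so a loop structured like this should carry over if the arrays are CuArrays; that is an assumption about the setup rather than something verified here.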

Troubleshooting GPU Issues

If you get errors:

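“Not allowed to access non-isbits type”: A GPU kernel or broadcast captured a value that is not a plain bits type, such as a CPU Array, a struct with abstractly typed fields, or a non-constant global. Float64 is itself an isbits type, so switching precision does not fix this; move the data into a CuArray or give the captured value a concrete bits type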

“Scalar indexing not allowed”: You indexed a CuArray one element at a time, e.g. x = a[i]. Rewrite the operation as a broadcast, or copy the array to the CPU first with Array(a) or collect(a)

“Out of memory”: GPU VRAM exceeded. Reduce grid size N or use Float32

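Slow performance: Usually means unnecessary CPU-GPU transfers. Profile with NVIDIA Nsight Systems or the CUDA.@profile macro from CUDA.jl; nvprof is deprecated and does not support recent GPUs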
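
For the out-of-memory case, a quick estimate of the working-set size shows how much is gained by reducing N or switching precision. The sketch below assumes a one-dimensional grid of N complex values and a hypothetical count of eight work arrays; work_bytes is an illustrative helper, and the real solver may hold a different number of arrays.

```julia
# Rough memory estimate; the number of work arrays is a hypothetical parameter.
work_bytes(N, T; narrays=8) = N * sizeof(T) * narrays

for T in (ComplexF64, ComplexF32)
    mib = work_bytes(2^16, T) / 2^20
    println(rpad(string(T), 12), mib, " MiB")
end
```

Under these assumptions, moving from ComplexF64 to ComplexF32 halves the footprint, and so does halving N.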



CC BY-NC-SA 4.0 Pierre Navaro