## Benchmarking
### Small Grid

```julia
@time main_gpu(2^10; animation=true)
```
Expected output:
- Fast execution (under 1 second for the 2^10 = 1024-point grid)
- Modest GPU speedup at this size; kernel-launch and transfer overhead limits the benefit
- Animation creation on the CPU takes most of the time for small grids
**Performance note:**
- `@time` includes compilation overhead on the first run
- Subsequent runs are faster (JIT-compiled)
- Run `@time` multiple times to see the true steady-state performance
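To separate compilation cost from steady-state cost, time the same call twice, or use BenchmarkTools.jl, which runs the expression repeatedly and reports statistics. A minimal sketch, assuming BenchmarkTools.jl is installed and `main_gpu` is this tutorial's driver function:

```julia
using BenchmarkTools

# First call pays the JIT compilation cost; the second shows steady-state timing:
@time main_gpu(2^10; animation=false)   # includes compilation
@time main_gpu(2^10; animation=false)   # compiled, representative timing

# @btime runs the call many times and reports the minimum, excluding compilation:
@btime main_gpu(2^10; animation=false)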
### Large Grid

```julia
@time main_gpu(2^14; animation=false)
```
Expected output:
- Significant computation time (seconds to tens of seconds, depending on the GPU)
- This is where GPU acceleration really shows its benefit
- `animation=false` skips the expensive visualization step
### Comparison with CPU

```julia
@time main_cpu(2^14; animation=false)
```
**Performance comparison:**

| Grid Size | CPU Time | GPU Time | Speedup |
|-----------|----------|----------|---------|
| 2^10      | ~0.1 s   | ~0.05 s  | 2×      |
| 2^12      | ~1-2 s   | ~0.2 s   | 5-10×   |
| 2^14      | ~20-50 s | ~1-5 s   | 10-50×  |
Factors affecting speedup:
- FFT efficiency: GPU FFTs become more efficient with larger arrays
- Overhead: GPU overhead dominates for small arrays
- Memory bandwidth: GPU memory bandwidth advantage grows with problem size
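The FFT-efficiency point can be checked directly by timing a planned FFT on both devices at increasing sizes. A sketch, assuming CUDA.jl and FFTW.jl are installed and a CUDA-capable GPU is present:

```julia
using FFTW, CUDA

for n in (2^10, 2^12, 2^14)
    x_cpu = rand(ComplexF32, n)
    x_gpu = CuArray(x_cpu)

    p_cpu = plan_fft(x_cpu)       # plan once; reuse for every transform
    p_gpu = plan_fft(x_gpu)

    p_cpu * x_cpu; p_gpu * x_gpu  # warm-up: pay compilation cost up front

    t_cpu = @elapsed p_cpu * x_cpu
    t_gpu = CUDA.@elapsed p_gpu * x_gpu   # times with CUDA events, synchronizing first
    println("n = $n: CPU $t_cpu s, GPU $t_gpu s")
end
```

`CUDA.@elapsed` is used instead of `@elapsed` on the GPU side because GPU kernels launch asynchronously; timing without synchronization would measure only the launch.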
### Where GPU Wins
- Large arrays: the FFT dominates total runtime
- Many iterations: Overhead amortized over many steps
- Memory-bound operations: GPU has higher memory bandwidth
### CPU Still Competitive For
- Small arrays: Overhead of GPU transfer dominates
- Few time steps: GPU startup not worth it
- Interactive development: CPU simpler to debug
### GPU Implementation Advantages
- Computation speed: 10-100× faster for large problems
- Memory efficiency: GPU memory is not a bottleneck for this problem
- Scalability: Can run 2^16 or larger grids on modern GPUs
## Physical Interpretation
Both CPU and GPU simulations solve the identical equations with identical initial conditions:
- Wave dynamics: Bell curve evolves under nonlinear deep water equations
- Dispersive spreading: Different frequencies travel at different speeds
- Nonlinear interaction: Wave steepening and energy transfer to higher wavenumbers
- Spectral cascade: Energy moves from initial mode to higher frequencies
The GPU version simply computes this much faster than the CPU version.
## Best Practices for GPU Spectral Simulations
- Pre-allocate everything: Avoid allocations in time loop
- Plan FFTs before loop: FFT planning is expensive, do it once
- Use work arrays: Reuse arrays instead of creating new ones
- Fuse operations: use the `@.` macro to combine element-wise operations
- Minimize transfers: Keep solution on GPU as much as possible
- Batch snapshots: Save multiple time steps at once if possible
- Use 32-bit floats if possible: Float32 is 2-4× faster than Float64
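A minimal sketch combining these practices in a generic pseudo-spectral loop. The field `u`, the phase-rotation "update", and the snapshot cadence are placeholders, not this tutorial's actual solver; assumes CUDA.jl, FFTW.jl, and a CUDA-capable GPU:

```julia
using CUDA, FFTW

N = 2^12
u = CUDA.rand(ComplexF32, N)   # solution stays on the GPU (Float32 precision)
û = similar(u)                 # pre-allocated work array, reused every step

p  = plan_fft!(û)              # plan in-place transforms once, before the loop
ip = plan_ifft!(û)

nsteps = 10_000
for step in 1:nsteps
    copyto!(û, u)
    p * û                      # forward FFT in place
    @. û *= cis(-0.001f0)      # fused element-wise update (placeholder RHS)
    ip * û                     # inverse FFT in place
    copyto!(u, û)
    step % 1000 == 0 && (snapshot = Array(u))  # CPU transfer only for snapshots
end
```

No allocations occur inside the loop: both FFT plans and both arrays are created once, and `@.` fuses the element-wise update into a single GPU kernel.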
## Troubleshooting GPU Issues
If you get errors:
- "Not allowed to access non-isbits type": the array's element type is not a plain bits type; use concrete isbits element types such as `Float32` or `ComplexF32`
- "Scalar indexing not allowed": you indexed a GPU array one element at a time (e.g. `x[i]` on a `CuArray`). Use `collect()` (or `Array()`) to transfer to the CPU first
- "Out of memory": GPU VRAM exceeded. Reduce the grid size `N` or use `Float32`
- Slow performance: usually means unnecessary CPU-GPU transfers. Profile with Nsight Systems (or the legacy `nvprof` on older GPUs)
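The two most common of these can be demonstrated and guarded against in a few lines. A sketch, assuming CUDA.jl and a CUDA-capable GPU:

```julia
using CUDA

x = CUDA.rand(Float32, 1024)

# "Scalar indexing not allowed": x[1] would copy one element per access.
# Disallow it globally so accidental scalar access fails fast:
CUDA.allowscalar(false)

# Transfer the whole array to the CPU instead, then index normally:
x_cpu = collect(x)        # or Array(x)
first_val = x_cpu[1]      # fine: ordinary CPU indexing

# "Out of memory": drop references to unused GPU arrays, then
# return cached GPU memory to the driver:
x = nothing
CUDA.reclaim()
```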