Performance

Performance#

Julia avec la compilation Just-In-Time est un langage naturellement performant. Il n’est pas allergique aux boucles comme le sont les langages Python et R. Les opérations vectorisées fonctionnent également très bien à condition d’être attentifs aux allocations mémoire et aux vues explicites.

Allocations#

using Random, LinearAlgebra, BenchmarkTools

function test(A, B, C)
    C = C - A * B
    return C
end

A = rand(1024, 256); B = rand(256, 1024); C = rand(1024, 1024)

@btime test($A, $B, $C);

  5.443 ms (6 allocations: 16.00 MiB)

Dans l’appel de la macro @benchmark on interpole les arguments avec le signe $ pour être sur que les fonctions rand aient déjà été evaluées avant l’appel de la fonction test. La matrice C est modifiée dans la fonction suivante donc par convention on ajoute un ! au nom de la fonction. Par convention également, l’argument modifié se placera en premier. Comme dans la fonction push! par exemple.

function test!(C, A, B)
    C .= C .- A * B
    return C
end

@btime test!( $C, $A, $B);

  4.884 ms (3 allocations: 8.00 MiB)

En effectuant une opération “en place”, on supprime une allocation mais celle pour effectuer l’opération A * B est toujours nécessaire. On peut supprimer cette allocation en utilisant la bibliothèque BLAS, cependant le code perd en lisibilité ce qu’il a gagné en performance.

function test_opt!(C, A, B)
    BLAS.gemm!('N','N', -1., A, B, 1., C)
    return C
end

@btime test_opt!($C, $A, $B);

  3.276 ms (0 allocations: 0 bytes)

function test_opt_mul!(C, A, B)
    mul!(C, A, B, -1, 1) # mul!(C, A, B, α, β) -> ABα + Cβ
    return C
end

@btime test_opt_mul!($C, $A, $B);

  3.353 ms (0 allocations: 0 bytes)

Alignement de la mémoire#

Les opérations le long des premiers indices seront plus rapides.

using FFTW

T = randn(1024, 1024)

@btime fft($T, 1);

  8.142 ms (11 allocations: 32.00 MiB)

@btime fft($T, 2);

  27.778 ms (11 allocations: 32.00 MiB)

Voici un autre exemple ou on calcule la dérivée de la quantité $f$ suivant la coordonnée $y$ en passant par l’espace de Fourier

using FFTW

xmin, xmax, nx = 0, 4π, 1024
ymin, ymax, ny = 0, 4π, 1024
x = LinRange(xmin, xmax, nx+1)[1:end-1]
y = LinRange(ymin, ymax, ny+1)[1:end-1]
ky  = 2π ./ (ymax-ymin) .* [0:ny÷2-1;ny÷2-ny:-1]
exky = exp.( 1im .* ky' .* x)
function df_dy( f, exky )
    ifft(exky .* fft(f, 2), 2)
end
f = sin.(x) .* cos.(y') # f is a 2d array created by broadcasting
@btime df_dy($f, $exky);

  57.281 ms (22 allocations: 64.00 MiB)

En utilisant les “plans” de FFTW qui permettent de pré-allouer la mémoire nécessaire et le calcul “en place”. On peut améliorer les performances. On réutilise le même tableau pour la valeur de $f$ et sa transformée de Fourier. On prend soin également de respecter l’alignement de la mémoire en transposant le tableau contenant $f$ pour calculer la FFT. On utilise plus de mémoire, on fait plus de calcul en ajoutant les transpositions mais finalement le calcul va 3 fois plus vite car on évite les allocations et on limite les accès mémoire.

f  = zeros(ComplexF64, (nx,ny))
fᵗ = zeros(ComplexF64, reverse(size(f)))
f̂ᵗ = zeros(ComplexF64, reverse(size(f)))
f .= sin.(x) .* cos.(y')
fft_plan = plan_fft(fᵗ, 1, flags=FFTW.PATIENT)
function df_dy!( f, fᵗ, f̂ᵗ, exky )
    transpose!(fᵗ,f)
    mul!(f̂ᵗ,  fft_plan, fᵗ)
    f̂ᵗ .= f̂ᵗ .* exky
    ldiv!(fᵗ, fft_plan, f̂ᵗ)
    transpose!(f, fᵗ)
end
@btime df_dy!($f, $fᵗ, $f̂ᵗ, $exky );

  17.955 ms (2 allocations: 64 bytes)

Vues explicites#

@btime sum(T[:,1]) # Somme de la première colonne

  963.840 ns (5 allocations: 8.09 KiB)

-42.08322017066688

@btime sum(view(T,:,1))  

  274.426 ns (3 allocations: 80 bytes)

-42.08322017066688

Eviter les calculs dans l’environnement global.#

v = rand(1000)

function somme()
    acc = 0
    for i in eachindex(v) 
        acc += v[i]
    end
    acc
end

@btime somme()

  97.145 μs (3978 allocations: 77.77 KiB)

490.3821622748302

@code_lowered somme()

CodeInfo(
1 ─       acc = 0
│   %2  = Main.eachindex
│   %3  = (%2)(Main.v)
│         @_2 = Base.iterate(%3)
│   %5  = @_2
│   %6  = %5 === nothing
│   %7  = Base.not_int(%6)
└──       goto #4 if not %7
2 ┄ %9  = @_2
│         i = Core.getfield(%9, 1)
│   %11 = Core.getfield(%9, 2)
│   %12 = Main.:+
│   %13 = acc
│   %14 = i
│   %15 = Base.getindex(Main.v, %14)
│         acc = (%12)(%13, %15)
│         @_2 = Base.iterate(%3, %11)
│   %18 = @_2
│   %19 = %18 === nothing
│   %20 = Base.not_int(%19)
└──       goto #4 if not %20
3 ─       goto #2
4 ┄ %23 = acc
└──       return %23
)

Il faut écrire des fonctions avec les variables utilisées en argument

function somme( x )
    acc = 0
    for i in eachindex(x) 
        acc += x[i]
    end
    acc
end

@btime somme( v )
    

  1.163 μs (1 allocation: 16 bytes)

490.3821622748302

typeof(v)

Vector{Float64} (alias for Array{Float64, 1})

@code_lowered somme(v)

CodeInfo(
1 ─       acc = 0
│   %2  = Main.eachindex(x)
│         @_3 = Base.iterate(%2)
│   %4  = @_3
│   %5  = %4 === nothing
│   %6  = Base.not_int(%5)
└──       goto #4 if not %6
2 ┄ %8  = @_3
│         i = Core.getfield(%8, 1)
│   %10 = Core.getfield(%8, 2)
│   %11 = Main.:+
│   %12 = acc
│   %13 = i
│   %14 = Base.getindex(x, %13)
│         acc = (%11)(%12, %14)
│         @_3 = Base.iterate(%2, %10)
│   %17 = @_3
│   %18 = %17 === nothing
│   %19 = Base.not_int(%18)
└──       goto #4 if not %19
3 ─       goto #2
4 ┄ %22 = acc
└──       return %22
)

Pour comprendre pourquoi l’utilisation de variable global influence les performances, prenons un exemple simple d’une fonction additionnant deux nombres:

variable = 10

function addition_variable_globale(x)
    x + variable
end

@btime addition_variable_globale(10)

  21.508 ns (0 allocations: 0 bytes)

Comparons la performance avec cette fonction qui retourne la somme de ses deux arguments

function addition_deux_arguments(x, y)
    x + y
end

@btime addition_deux_arguments(10, 10)

  1.661 ns (0 allocations: 0 bytes)

On remarque que la deuxième fonction est 300 fois plus rapide que la première. Pour comprendre pourquoi elle est plus rapide, on peut regarder le code généré avant la compilation. On s’appercoit que le code est relativement simple avec une utilisation unique de l’instruction add.

@code_llvm addition_deux_arguments(10, 10)

; Function Signature: addition_deux_arguments(Int64, Int64)
;  @ In[22]:1 within `addition_deux_arguments`
define i64 @julia_addition_deux_arguments_12926(i64 signext %"x::Int64", i64 signext %"y::Int64") #0 {
top:
;  @ In[22]:2 within `addition_deux_arguments`
; ┌ @ int.jl:87 within `+`
   %0 = add i64 %"y::Int64", %"x::Int64"
   ret i64 %0
; └
}

Si on regarde le code généré utilisant la variable globale, on comprend rapidement pourquoi c’est plus long. Pourquoi le code est-il si compliqué ? Ici le langage ne connait pas le type de variable, il doit donc prendre en compte le fait que ce type puisse être modifié à tout moment. Comme tous les cas sont envisagés, cela provoque un surcoût important.

@code_llvm addition_variable_globale(10)

; Function Signature: addition_variable_globale(Int64)
;  @ In[21]:3 within `addition_variable_globale`
define nonnull ptr @julia_addition_variable_globale_13008(i64 signext %"x::Int64") #0 {
top:
  %jlcallframe1 = alloca [2 x ptr], align 8
  %gcframe2 = alloca [4 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 32, i1 true)
  %pgcstack = call ptr inttoptr (i64 140703173070805 to ptr)(i64 261) #9
  store i64 8, ptr %gcframe2, align 16
  %task.gcstack = load ptr, ptr %pgcstack, align 8
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %pgcstack, align 8
;  @ In[21]:4 within `addition_variable_globale`
  %variable.checked = load atomic ptr, ptr @"*Main.variable#13010.jit" unordered, align 64
  %.not = icmp eq ptr %variable.checked, null
  br i1 %.not, label %err, label %ok

err:                                              ; preds = %top
  call void @ijl_undefined_var_error(ptr nonnull @"jl_sym#variable#13011.jit", ptr nonnull @"jl_global#13012.jit")
  unreachable

ok:                                               ; preds = %top
  %gc_slot_addr_1 = getelementptr inbounds ptr, ptr %gcframe2, i64 3
  store ptr %variable.checked, ptr %gc_slot_addr_1, align 8
  %box_Int64 = call nonnull align 8 dereferenceable(8) ptr @ijl_box_int64(i64 signext %"x::Int64") #3
  %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
  store ptr %box_Int64, ptr %gc_slot_addr_0, align 16
  store ptr %box_Int64, ptr %jlcallframe1, align 8
  %0 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
  store ptr %variable.checked, ptr %0, align 8
  %1 = call nonnull ptr @ijl_apply_generic(ptr nonnull @"jl_global#13014.jit", ptr nonnull %jlcallframe1, i32 2)
  %frame.prev6 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev6, ptr %pgcstack, align 8
  ret ptr %1
}

Il est donc possible d’améliorer la performance en fixant la valeur de la variable globale avec l’instruction const.

const constante = 10

function addition_variable_constante(x)
    x + constante
end

@btime addition_variable_constante(10)

  1.610 ns (0 allocations: 0 bytes)

On peut également fixer le type de cette variable. C’est mieux mais cela reste éloigné, en terme de performance, du résultat précedent.

function addition_variable_typee(x)
    x + variable::Int
end

@btime addition_variable_typee(10)

  3.195 ns (0 allocations: 0 bytes)

Pour régler notre problème de performance avec une variable globale, il faut la passer en argument dans la fonction.

function addition_variable_globale_en_argument(x, v)
    x + v
end

addition_variable_globale_en_argument (generic function with 1 method)

@btime addition_variable_globale_en_argument(10, $variable)

  2.901 ns (0 allocations: 0 bytes)

Instabilité de type#

Une fonction est de type stable lorsque vous pouvez déduire ce que doit être la sortie de la fonction. L’exemple ci-dessous rendra les choses plus claires. En règle générale, les fonctions de type stable sont plus rapides.

function carre_plus_un(v::T) where T <:Number
    g = v * v
    return g+1
end

carre_plus_un (generic function with 1 method)

v = rand()

0.3933831280400072

@code_warntype carre_plus_un(v)

MethodInstance for carre_plus_un(::Float64)
 @ Main In[29]:1Number
Static Parameters
  T = Float64
Arguments
  #self#::Core.Const(Main.carre_plus_un)
  v::Float64
Locals
  g::Float64
Body::Float64
      (g = v * v)
│   %2 = g::Float64
│   %3 = (%2 + 1)::Float64
└──      return %3

w = 5

@code_warntype carre_plus_un(w)

MethodInstance for carre_plus_un(::Int64)
  from carre_plus_un(v::T) where T<:Number @ Main In[29]:1
Static Parameters
  T = Int64
Arguments
  #self#::Core.Const(Main.carre_plus_un)
  v::Int64
Locals
  g::Int64
Body::Int64
1 ─      (g = v * v)
│   %2 = g::Int64
│   %3 = (%2 + 1)::Int64
└──      return %3

Sur les deux exemples précedents on peut déduire le type de sortie de la fonction.

function carre_plus_un(v::T) where T <:Number
    g = v*v         # Type(T * T) ==> T
    return g+1      # Type(T + Int)) ==> "max" (T,Int)
end

Le type de la valeur de retour peut être différent: Float64 ou Int64. Mais la fonction est toujours stable.

Créons maintenant un nouveau type:

mutable struct Cube
    length
    width
    height
end

volume(c::Cube) = c.length*c.width*c.height

volume (generic function with 1 method)

mutable struct Cube_typed
    length::Float64
    width::Float64
    height::Float64
end
volume(c::Cube_typed) = c.length*c.width*c.height

volume (generic function with 2 methods)

mutable struct Cube_parametric_typed{T <: Real}
    length :: T
    width :: T
    height :: T
end
volume(c::Cube_parametric_typed) = c.length*c.width*c.height

volume (generic function with 3 methods)

c1 = Cube(1.1,1.2,1.3)
c2 = Cube_typed(1.1,1.2,1.3)
c3 = Cube_parametric_typed(1.1,1.2,1.3)
@show volume(c1) == volume(c2) == volume(c3)

volume(c1) == volume(c2) == volume(c3) = true

true

using BenchmarkTools
@btime volume($c1) # not typed
@btime volume($c2) # typed float
@btime volume($c3) # typed parametric

  24.748 ns (1 allocation: 16 bytes)
 (0 allocations: 0 bytes)
 (0 allocations: 0 bytes)

1.7160000000000002

c4 = Cube_parametric_typed{Float64}(1.1,1.2,1.3)
@btime volume($c4) 

  2.617 ns (0 allocations: 0 bytes)

1.7160000000000002

The second and the third function calls are faster! Let’s call @code_warntype and check type stability

@code_warntype volume(c1)

MethodInstance for volume(::Cube)
  from volume(c::Cube) @ Main In[44]:1
Arguments
  #self#::Core.Const(volume)
  c::Cube
Body::Any
1 ─ %1 = Base.getproperty(c, :length)::Any
│   %2 = Base.getproperty(c, :width)::Any
│   %3 = Base.getproperty(c, :height)::Any
│   %4 = (%1 * %2 * %3)::Any
└──      return %4

@code_warntype volume(c2)

MethodInstance for volume(::Cube_typed)
  from volume(c::Cube_typed) @ Main In[45]:6
Arguments
  #self#::Core.Const(volume)
  c::Cube_typed
Body::Float64
1 ─ %1 = Base.getproperty(c, :length)::Float64
│   %2 = Base.getproperty(c, :width)::Float64
│   %3 = Base.getproperty(c, :height)::Float64
│   %4 = (%1 * %2 * %3)::Float64
└──      return %4

@code_warntype volume(c3)

MethodInstance for volume(::Cube_parametric_typed{Float64})
  from volume(c::Cube_parametric_typed) @ Main In[46]:6
Arguments
  #self#::Core.Const(volume)
  c::Cube_parametric_typed{Float64}
Body::Float64
1 ─ %1 = Base.getproperty(c, :length)::Float64
│   %2 = Base.getproperty(c, :width)::Float64
│   %3 = Base.getproperty(c, :height)::Float64
│   %4 = (%1 * %2 * %3)::Float64
└──      return %4

Conclusion: Les types en Julia sont importants donc si vous les connaissez, ajoutez-les, cela peut améliorer les performances.

function zero_or_val(x::Real)
    if x >= 0
        return x
    else
        return 0
    end
end
@code_warntype zero_or_val(0.2)

MethodInstance for zero_or_val(::Float64)
  from zero_or_val(x::Real) @ Main In[62]:1
Arguments
  #self#::Core.Const(zero_or_val)
  x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = (x >= 0)::Bool
└──      goto #3 if not %1
2 ─      return x
3 ─      return 0

function zero_or_val_stable(x::Real)
    T = typeof(x)
    if x >= 0
        y = x
    else
        y = zero(T)
    end
    
    return y
end
@code_warntype zero_or_val_stable(0.2)

MethodInstance for zero_or_val_stable(::Float64)
  from zero_or_val_stable(x::Real) @ Main In[65]:1
Arguments
  #self#::Core.Const(zero_or_val_stable)
  x::Float64
Locals
  y::Float64
  T::Type{Float64}
Body::Float64
1 ─      Core.NewvarNode(:(y))
│        (T = Main.typeof(x))
│   %3 = (x >= 0)::Bool
└──      goto #3 if not %3
2 ─      (y = x)
└──      goto #4
3 ─      (y = Main.zero(T::Core.Const(Float64)))
4 ┄      return y

Conclusion: promote_type peut vous permettre de supprimer une instabilité de type en utilisant la réprésentation la plus haute dans l’abre des types.

Je vous propose le jeu suivant: Soit un vecteur de nombres. Calculons la somme comme suit. Pour chaque nombre du vecteur, on lance une pièce de monnaie (rand()), si c’est face (>=0.5), vous ajoutez 1. Sinon, vous ajoutez le nombre lui-même.

function flipcoin_then_add(v::Vector{T}) where T <: Real
    s = 0
    for vi in v
        r = rand()
        if r >=0.5
            s += 1
        else
            s += vi
        end
    end
end

function flipcoin_then_add_typed(v::Vector{T}) where T <: Real
    s = zero(T)
    for vi in v
        r = rand()
        if r >=0.5
            s += one(T)
        else
            s += vi
        end
    end
end
myvec = rand(1000)
@show flipcoin_then_add(myvec) == flipcoin_then_add_typed(myvec)

flipcoin_then_add(myvec) == flipcoin_then_add_typed(myvec) = true

true

x = rand(1000)
@btime flipcoin_then_add($x)
@btime flipcoin_then_add_typed($x)

  5.391 μs (0 allocations: 0 bytes)
  764.425 ns (0 allocations: 0 bytes)

Conclusion: Think about the variables you are declaring. Do you know their types? If so, specify the type somehow.

@code_XXX#

Nous avons vu durant ce chapitre que regarder le code généré peut nous aider à améliorer les performances. Voici toutes les macros à votre disposition:

# @code_llvm 
# @code_lowered 
# @code_native 
# @code_typed 
# @code_warntype

function flipcoin(randval::Float64)
    if randval<0.5
        return "H"
    else
        return "T"
    end
end

flipcoin (generic function with 1 method)

@code_lowered flipcoin(rand()) # syntax tree

CodeInfo(
1 ─ %1 = randval < 0.5
└──      goto #3 if not %1
2 ─      return "H"
3 ─      return "T"
)

@code_warntype flipcoin(rand()) # try @code_typed

MethodInstance for flipcoin(::Float64)
  from flipcoin(randval::Float64) @ Main In[69]:7
Arguments
  #self#::Core.Const(flipcoin)
  randval::Float64
Body::String
1 ─ %1 = (randval < 0.5)::Bool
└──      goto #3 if not %1
2 ─      return "H"
3 ─      return "T"

@code_llvm flipcoin(rand()) # this and code_warntype are probably the most relevant

;  @ In[69]:7 within `flipcoin`
define nonnull {}* @julia_flipcoin_2741(double %0) #0 {
top:
;  @ In[69]:8 within `flipcoin`
; ┌ @ float.jl:536 within `<`
   %1 = fcmp uge double %0, 5.000000e-01
   %. = select i1 %1, {}* inttoptr (i64 5478482832 to {}*), {}* inttoptr (i64 5478482776 to {}*)
; └
;  @ In[69] within `flipcoin`
  ret {}* %.
}

@code_native flipcoin(rand())

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.section	__TEXT,__literal8,8byte_literals
	.p2align	3                               ## -- Begin function julia_flipcoin_2757
LCPI0_0:
	.quad	0x3fe0000000000000              ## double 0.5
	.section	__TEXT,__text,regular,pure_instructions
	.globl	_julia_flipcoin_2757
	.p2align	4, 0x90
_julia_flipcoin_2757:                   ## @julia_flipcoin_2757
; ┌ @ In[69]:7 within `flipcoin`
## %bb.0:                               ## %top
	movabs	rax, offset LCPI0_0
	vmovsd	xmm1, qword ptr [rax]           ## xmm1 = mem[0],zero
; │ @ In[69]:8 within `flipcoin`
; │┌ @ float.jl:536 within `<`
	vucomisd	xmm1, xmm0
	movabs	rcx, 5478482832
	movabs	rax, 5478482776
	cmovbe	rax, rcx
; │└
; │ @ In[69] within `flipcoin`
	ret
; └
                                        ## -- End function
.subsections_via_symbols