# Performance

*Julia*, with its *just-in-time* compilation, is a naturally fast language. Unlike Python and R, it is not allergic to loops. Vectorized operations also work very well, provided you pay attention to memory allocations and use explicit views.

## Allocations

In [1]:
using Random, LinearAlgebra, BenchmarkTools

function test(A, B, C)
    C = C - A * B
    return C
end

A = rand(1024, 256); B = rand(256, 1024); C = rand(1024, 1024)

@btime test($A, $B, $C);

  2.903 ms (4 allocations: 16.00 MiB)


In the `@btime` call, the arguments are interpolated with the `$` sign so that the arrays produced by `rand` have already been evaluated before `test` is called and are not treated as global variables. The next function modifies the matrix `C`, so by convention we append a `!` to its name. Also by convention, the modified argument comes first, as in `push!` for example.

In [2]:
function test!(C, A, B)
    C .= C .- A * B
    return C
end

@btime test!( $C, $A, $B);

  2.119 ms (2 allocations: 8.00 MiB)


By performing the operation in place, we remove one allocation, but the one needed to compute `A * B` remains. It can be removed by calling the `BLAS` library directly; however, the code loses in readability what it gains in performance.

In [3]:
function test_opt!(C, A, B)
    BLAS.gemm!('N','N', -1., A, B, 1., C)
    return C
end

@btime test_opt!($C, $A, $B);

  954.786 μs (0 allocations: 0 bytes)


In [4]:
function test_opt_mul!(C, A, B)
    mul!(C, A, B, -1, 1) # mul!(C, A, B, α, β) -> ABα + Cβ
    return C
end

@btime test_opt_mul!($C, $A, $B);

  955.358 μs (0 allocations: 0 bytes)
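As a quick sanity check, the five-argument `mul!` can be compared with the allocating expression `C - A * B` on small matrices. This is only a sketch; the sizes and variable names are chosen for illustration.

```julia
using LinearAlgebra

A = rand(4, 3); B = rand(3, 5); C = rand(4, 5)

expected = C - A * B        # allocating reference result

C_inplace = copy(C)
mul!(C_inplace, A, B, -1, 1)  # C_inplace = (-1) * A * B + 1 * C_inplace, no temporary for A * B

@assert C_inplace ≈ expected
```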


## Memory layout

Julia arrays are stored in column-major order, so operations along the first indices are faster.
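The same effect shows up in plain loops. The sketch below (function names are hypothetical) sums a matrix with two loop orders: in the first, the innermost loop runs over the first index and walks memory contiguously; in the second, it jumps from column to column. Both return the same value, and only the traversal order differs.

```julia
# Inner loop over the first index: contiguous memory access.
function sum_column_order(M)
    s = zero(eltype(M))
    for j in axes(M, 2), i in axes(M, 1)   # the last iterator (i) varies fastest
        s += M[i, j]
    end
    return s
end

# Inner loop over the second index: strided memory access.
function sum_row_order(M)
    s = zero(eltype(M))
    for i in axes(M, 1), j in axes(M, 2)   # the last iterator (j) varies fastest
        s += M[i, j]
    end
    return s
end

M = rand(200, 300)
@assert sum_column_order(M) ≈ sum(M)
@assert sum_row_order(M) ≈ sum(M)
```

Timing both functions with `@btime` on a large matrix is a simple way to observe the difference on a given machine.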

In [5]:
using FFTW

T = randn(1024, 1024)

@btime fft($T, 1);

  10.439 ms (7 allocations: 32.00 MiB)


In [6]:
@btime fft($T, 2);

  28.566 ms (7 allocations: 32.00 MiB)


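Here is another example, in which we compute the derivative of the quantity $f$ along the $y$ coordinate by going through Fourier space. Note that for an actual derivative the spectral multiplier would be `1im .* ky'`; the factor `exp.(1im .* ky' .* x)` used below applies a phase shift instead, which has the same computational cost.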

In [7]:
using FFTW

xmin, xmax, nx = 0, 4π, 1024
ymin, ymax, ny = 0, 4π, 1024
x = LinRange(xmin, xmax, nx+1)[1:end-1]
y = LinRange(ymin, ymax, ny+1)[1:end-1]
ky  = 2π ./ (ymax-ymin) .* [0:ny÷2-1;ny÷2-ny:-1]
exky = exp.( 1im .* ky' .* x)
function df_dy( f, exky )
    ifft(exky .* fft(f, 2), 2)
end
f = sin.(x) .* cos.(y') # f is a 2d array created by broadcasting
@btime df_dy($f, $exky);

  59.388 ms (14 allocations: 64.00 MiB)


FFTW "plans" let us preallocate the required memory and compute in place, which improves performance. We reuse the same array for $f$ and its Fourier transform. We also respect the memory layout by transposing the array containing $f$ before computing the FFT. This uses more memory and adds the cost of the transpositions, but the computation ends up nearly 3 times faster because it avoids allocations and limits memory accesses.

In [8]:
f  = zeros(ComplexF64, (nx,ny))
fᵗ = zeros(ComplexF64, reverse(size(f)))
f̂ᵗ = zeros(ComplexF64, reverse(size(f)))
f .= sin.(x) .* cos.(y')
fft_plan = plan_fft(fᵗ, 1, flags=FFTW.PATIENT)
function df_dy!( f, fᵗ, f̂ᵗ, exky )
    transpose!(fᵗ,f)
    mul!(f̂ᵗ,  fft_plan, fᵗ)
    f̂ᵗ .= f̂ᵗ .* exky
    ldiv!(fᵗ, fft_plan, f̂ᵗ)
    transpose!(f, fᵗ)
end
@btime df_dy!($f, $fᵗ, $f̂ᵗ, $exky );

  21.431 ms (2 allocations: 112 bytes)


## Explicit views

In [9]:
@btime sum(T[:,1]) # Sum of the first column

  1.879 μs (3 allocations: 8.16 KiB)


19.351104431816317

In [10]:
@btime sum(view(T,:,1))  

  242.526 ns (3 allocations: 80 bytes)


19.351104431816317
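The difference comes from the fact that slicing with `T[:, 1]` copies the column, whereas `view` refers to the original memory. A small sketch (with a hypothetical matrix `M`) makes this visible, and also shows the `@views` macro, which turns slices in an expression into views:

```julia
M = reshape(collect(1.0:12.0), 3, 4)   # 3x4 matrix, columns are 1:3, 4:6, ...

col_copy = M[:, 1]         # slicing allocates a new vector
col_view = view(M, :, 1)   # a view shares memory with M

col_view[1] = 100.0        # writes through to M
@assert M[1, 1] == 100.0
@assert col_copy[1] == 1.0 # the copy is unaffected

@views s = sum(M[:, 2])    # M[:, 2] is a view here, no copy
@assert s == 15.0
```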

## Avoid computations in global scope

In [11]:
v = rand(1000)

function somme()
    acc = 0
    for i in eachindex(v) 
        acc += v[i]
    end
    acc
end

@btime somme()


  77.261 μs (3978 allocations: 77.77 KiB)


514.0748698860377

In [12]:
@code_lowered somme()

CodeInfo(
1 ─       acc = 0
│   %2  = Main.eachindex(Main.v)
│         @_2 = Base.iterate(%2)
│   %4  = @_2 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_2
│         i = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│   %10 = acc
│   %11 = Base.getindex(Main.v, i)
│         acc = %10 + %11
│         @_2 = Base.iterate(%2, %9)
│   %14 = @_2 === nothing
│   %15 = Base.not_int(%14)
└──       goto #4 if not %15
3 ─       goto #2
4 ┄       return acc
)

Write functions that take the variables they use as arguments:

In [13]:
function somme( x )
    acc = 0
    for i in eachindex(x) 
        acc += x[i]
    end
    acc
end

@btime somme( v )
    

  3.461 μs (1 allocation: 16 bytes)


514.0748698860377
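A further refinement, shown here only as a sketch with a hypothetical function name: `acc = 0` starts as an `Int` and becomes a `Float64` after the first addition. Initializing the accumulator with `zero(eltype(x))` keeps it at a single type throughout the loop.

```julia
function somme_stable(x)
    acc = zero(eltype(x))      # same type as the elements of x
    for i in eachindex(x)
        acc += x[i]
    end
    return acc
end

@assert somme_stable([1.0, 2.0, 3.5]) == 6.5
@assert somme_stable([1, 2, 3]) === 6   # stays an Int for integer input
```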

In [14]:
@code_lowered somme(v)

CodeInfo(
1 ─       acc = 0
│   %2  = Main.eachindex(x)
│         @_3 = Base.iterate(%2)
│   %4  = @_3 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_3
│         i = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│   %10 = acc
│   %11 = Base.getindex(x, i)
│         acc = %10 + %11
│         @_3 = Base.iterate(%2, %9)
│   %14 = @_3 === nothing
│   %15 = Base.not_int(%14)
└──       goto #4 if not %15
3 ─       goto #2
4 ┄       return acc
)

To understand why using a global variable affects performance, consider a simple function that adds two numbers:

In [15]:
variable = 10

function addition_variable_globale(x)
    x + variable
end

@btime addition_variable_globale(10)

  16.569 ns (0 allocations: 0 bytes)


20

Let's compare its performance with a function that returns the sum of its two arguments:

In [16]:
function addition_deux_arguments(x, y)
    x + y
end

@btime addition_deux_arguments(10, 10)

  1.286 ns (0 allocations: 0 bytes)


20

The second function is about 13 times faster than the first (1.286 ns versus 16.569 ns). To understand why, we can look at the LLVM code generated for it. The code is quite simple, with a single `add` instruction.

In [17]:
@code_llvm addition_deux_arguments(10, 10)

;  @ In[16]:1 within `addition_deux_arguments`
define i64 @julia_addition_deux_arguments_1863(i64 signext %0, i64 signext %1) #0 {
top:
;  @ In[16]:2 within `addition_deux_arguments`
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
; └
  ret i64 %2
}


Looking at the code generated for the version that uses the global variable, it quickly becomes clear why it is slower. Why is the code so complicated? Here the compiler does not know the type of `variable`, so it must account for the fact that this type may change at any time. Since every case has to be handled, this incurs significant overhead.

In [18]:
@code_llvm addition_variable_globale(10)

;  @ In[15]:3 within `addition_variable_globale`
define nonnull {}* @julia_addition_variable_globale_1886(i64 signext %0) #0 {
top:
  %1 = alloca [2 x {}*], align 8
  %gcframe2 = alloca [4 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe2, i64 0, i64 0
  %.sub = getelementptr inbounds [2 x {}*], 

We can therefore improve performance by fixing the value of the global variable with the `const` keyword.

In [19]:
const constante = 10

function addition_variable_constante(x)
    x + constante
end

@btime addition_variable_constante(10)

  1.271 ns (0 allocations: 0 bytes)


20

We can also annotate the type of the variable where it is used. This is better, but still slower than the previous result.

In [20]:
function addition_variable_typee(x)
    x + variable::Int
end

@btime addition_variable_typee(10)

  2.277 ns (0 allocations: 0 bytes)


20
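Since Julia 1.8, the type can alternatively be attached to the global itself when it is declared, so every function that reads it knows the type without a local annotation. The following is a minimal sketch assuming Julia 1.8 or later; the names `typed_global` and `add_typed_global` are hypothetical.

```julia
# Requires Julia >= 1.8: a non-constant global with a fixed type.
typed_global::Int = 10

add_typed_global(x) = x + typed_global

@assert add_typed_global(10) == 20

typed_global = 32              # the value may change, the type may not
@assert add_typed_global(10) == 42
```

Unlike `const`, this keeps the variable reassignable while still giving the compiler a known type.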

To solve the performance problem with a global variable, pass it to the function as an argument.

In [21]:
function addition_variable_globale_en_argument(x, v)
    x + v
end

addition_variable_globale_en_argument (generic function with 1 method)

In [22]:
@btime addition_variable_globale_en_argument(10, $variable)

  2.781 ns (0 allocations: 0 bytes)


20

## Type instability

A function is type-stable when the type of its output can be inferred from the types of its inputs. The example below makes this clearer. As a rule, type-stable functions are faster.


In [23]:
function carre_plus_un(v::T) where T <:Number
    g = v * v
    return g+1
end

carre_plus_un (generic function with 1 method)

In [24]:
v = rand()

0.2692504830309297

In [25]:
@code_warntype carre_plus_un(v)

MethodInstance for carre_plus_un(::Float64)
  from carre_plus_un(v::T) where T<:Number @ Main In[23]:1
Static Parameters
  T = Float64
Arguments
  #self#::Core.Const(carre_plus_un)
  v::Float64
Locals
  g::Float64
Body::Float64
1 ─      (g = v * v)
│   %2 = (g + 1)::Float64
└──      return %2



In [26]:
w = 5

5

In [27]:
@code_warntype carre_plus_un(w)

MethodInstance for carre_plus_un(::Int64)
  from carre_plus_un(v::T) where T<:Number @ Main In[23]:1
Static Parameters
  T = Int64
Arguments
  #self#::Core.Const(carre_plus_un)
  v::Int64
Locals
  g::Int64
Body::Int64
1 ─      (g = v * v)
│   %2 = (g + 1)::Int64
└──      return %2



In both examples above, the return type of the function can be inferred.
```
function carre_plus_un(v::T) where T <:Number
    g = v*v         # Type(T * T) ==> T
    return g+1      # Type(T + Int) ==> "max" (T,Int)
end

```
The return type can differ from one call to another, `Float64` or `Int64`, but the function is still type-stable.
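The '"max" (T, Int)' rule in the comment corresponds to Julia's promotion mechanism. As a small sketch (the function name `square_plus_one` is a stand-in for the one above), the result types can be checked directly:

```julia
square_plus_one(v::Number) = v * v + 1

# The return type depends only on the input type.
@assert typeof(square_plus_one(0.5)) == Float64
@assert typeof(square_plus_one(5)) == Int

# promote_type gives the common type used when mixing Float64 and Int.
@assert promote_type(Float64, Int) == Float64
@assert promote_type(Int, Int) == Int
```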

Now let's create a new type:

In [28]:
mutable struct Cube
    length
    width
    height
end

In [29]:
volume(c::Cube) = c.length*c.width*c.height

volume (generic function with 1 method)

In [30]:
mutable struct Cube_typed
    length::Float64
    width::Float64
    height::Float64
end
volume(c::Cube_typed) = c.length*c.width*c.height

volume (generic function with 2 methods)

In [31]:
mutable struct Cube_parametric_typed{T <: Real}
    length :: T
    width :: T
    height :: T
end
volume(c::Cube_parametric_typed) = c.length*c.width*c.height

volume (generic function with 3 methods)

In [32]:
c1 = Cube(1.1,1.2,1.3)
c2 = Cube_typed(1.1,1.2,1.3)
c3 = Cube_parametric_typed(1.1,1.2,1.3)
@show volume(c1) == volume(c2) == volume(c3)

volume(c1) == volume(c2) == volume(c3) = true


true

In [33]:
using BenchmarkTools
@btime volume(c1) # not typed
@btime volume(c2) # typed float
@btime volume(c3) # typed parametric

  23.127 ns (1 allocation: 16 bytes)
  6.572 ns (1 allocation: 16 bytes)
  17.361 ns (1 allocation: 16 bytes)


1.7160000000000002

In [34]:
c4 = Cube_parametric_typed{Float64}(1.1,1.2,1.3)
@btime volume(c4) 

  17.300 ns (1 allocation: 16 bytes)


1.7160000000000002

The second and third calls are faster than the first. Let's call `@code_warntype` to check type stability:

In [35]:
@code_warntype volume(c1)

MethodInstance for volume(::Cube)
  from volume(c::Cube) @ Main In[29]:1
Arguments
  #self#::Core.Const(volume)
  c::Cube
Body::Any
1 ─ %1 = Base.getproperty(c, :length)::Any
│   %2 = Base.getproperty(c, :width)::Any
│   %3 = Base.getproperty(c, :height)::Any
│   %4 = (%1 * %2 * %3)::Any
└──      return %4



In [36]:
@code_warntype volume(c2)

MethodInstance for volume(::Cube_typed)
  from volume(c::Cube_typed) @ Main In[30]:6
Arguments
  #self#::Core.Const(volume)
  c::Cube_typed
Body::Float64
1 ─ %1 = Base.getproperty(c, :length)::Float64
│   %2 = Base.getproperty(c, :width)::Float64
│   %3 = Base.getproperty(c, :height)::Float64
│   %4 = (%1 * %2 * %3)::Float64
└──      return %4



In [37]:
@code_warntype volume(c3)

MethodInstance for volume(::Cube_parametric_typed{Float64})
  from volume(c::Cube_parametric_typed) @ Main In[31]:6
Arguments
  #self#::Core.Const(volume)
  c::Cube_parametric_typed{Float64}
Body::Float64
1 ─ %1 = Base.getproperty(c, :length)::Float64
│   %2 = Base.getproperty(c, :width)::Float64
│   %3 = Base.getproperty(c, :height)::Float64
│   %4 = (%1 * %2 * %3)::Float64
└──      return %4



**Conclusion**: Types matter in Julia. If you know the types of your struct fields, declare them; it can improve performance.
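One way to see what the compiler knows about a struct is to query its field types. The sketch below uses hypothetical single-field structs (`BoxAny`, `BoxTyped`, `BoxParam`) that mirror the three `Cube` variants above:

```julia
struct BoxAny
    length                 # no annotation: the field type is Any
end

struct BoxTyped
    length::Float64        # concrete field type
end

struct BoxParam{T<:Real}
    length::T              # concrete once T is known
end

@assert fieldtype(BoxAny, :length) == Any
@assert fieldtype(BoxTyped, :length) == Float64
@assert fieldtype(BoxParam{Float64}, :length) == Float64

# The parametric type is concrete only with its parameter filled in.
@assert isconcretetype(BoxParam{Float64})
@assert !isconcretetype(BoxParam)
@assert typeof(BoxParam(1.5)) == BoxParam{Float64}
```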

In [38]:
function zero_or_val(x::Real)
    if x >= 0
        return x
    else
        return 0
    end
end
@code_warntype zero_or_val(0.2)

MethodInstance for zero_or_val(::Float64)
  from zero_or_val(x::Real) @ Main In[38]:1
Arguments
  #self#::Core.Const(zero_or_val)
  x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = (x >= 0)::Bool
└──      goto #3 if not %1
2 ─      return x
3 ─      return 0



In [39]:
function zero_or_val_stable(x::Real)
    T = promote_type(typeof(x),Int)
    if x >= 0
        y = x
    else
        y = zero(T)
    end
    
    return y
end
@code_warntype zero_or_val_stable(0.2)

MethodInstance for zero_or_val_stable(::Float64)
  from zero_or_val_stable(x::Real) @ Main In[39]:1
Arguments
  #self#::Core.Const(zero_or_val_stable)
  x::Float64
Locals
  y::Float64
  T::Type{Float64}
Body::Float64
1 ─      Core.NewvarNode(:(y))
│   %2 = Main.typeof(x)::Core.Const(Float64)
│        (T = Main.promote_type(%2, Main.Int))
│   %4 = (x >= 0)::Bool
└──      goto #3 if not %4
2 ─      (y = x)
└──      goto #4
3 ─      (y = Main.zero(T::Core.Const(Float64)))
4 ┄      return y



**Conclusion**: `promote_type` can remove a type instability by making every branch return the common type to which the values promote, that is, the highest representation in the type hierarchy.
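Type stability can also be checked programmatically with `@inferred` from the `Test` standard library, which throws an error when the actual return type differs from the inferred one. The sketch below uses two hypothetical functions that mirror `zero_or_val` and its stable variant; it uses `zero(x)` as a simpler alternative to `promote_type`, which assumes the result should have the type of `x`.

```julia
using Test

clamp_unstable(x::Real) = x >= 0 ? x : 0         # a Float64 input may return an Int
clamp_stable(x::Real)   = x >= 0 ? x : zero(x)   # always returns typeof(x)

# Passes: the inferred return type is exactly Float64.
@inferred clamp_stable(0.2)

# Throws: the inferred type is Union{Float64, Int64}, the actual one is Float64.
@test_throws ErrorException @inferred clamp_unstable(0.2)

@assert clamp_stable(-3.5) === 0.0
@assert clamp_unstable(-3.5) === 0
```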

Consider the following game: given a vector of numbers, compute a sum as follows.
For each number in the vector, flip a coin (`rand()`); if it comes up heads (`>=0.5`), add `1`. Otherwise, add the number itself.


In [40]:
function flipcoin_then_add(v::Vector{T}) where T <: Real
    s = 0
    for vi in v
        r = rand()
        if r >=0.5
            s += 1
        else
            s += vi
        end
    end
end

function flipcoin_then_add_typed(v::Vector{T}) where T <: Real
    s = zero(T)
    for vi in v
        r = rand()
        if r >=0.5
            s += one(T)
        else
            s += vi
        end
    end
end
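myvec = rand(1000)
# As written, both functions end on the `for` loop and therefore return `nothing`,
# so this comparison is trivially true.
@show flipcoin_then_add(myvec) == flipcoin_then_add_typed(myvec)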

flipcoin_then_add(myvec) == flipcoin_then_add_typed(myvec) = true


true

In [41]:
@btime flipcoin_then_add(rand(1000))
@btime flipcoin_then_add_typed(rand(1000))

  7.114 μs (1 allocation: 7.94 KiB)
  1.247 μs (1 allocation: 7.94 KiB)


**Conclusion**: Think about the variables you are declaring. Do you know their types? If so, specify the type somehow.
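For a comparison that is actually meaningful, the sum has to be returned and the coin flips have to be reproducible. The following is a sketch under those assumptions: the function name `flipcoin_sum` and the explicit random number generator argument are not part of the original code.

```julia
using Random

# Typed accumulator, explicit RNG, and an explicit return value.
function flipcoin_sum(rng::AbstractRNG, v::Vector{T}) where T <: Real
    s = zero(T)
    for vi in v
        s += rand(rng) >= 0.5 ? one(T) : vi
    end
    return s
end

v = rand(MersenneTwister(1), 1000)

# The same seed gives the same sequence of coin flips, hence the same sum.
a = flipcoin_sum(MersenneTwister(42), v)
b = flipcoin_sum(MersenneTwister(42), v)

@assert a == b
@assert a isa Float64
```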

### @code_XXX

In this chapter we have seen that inspecting the generated code can help improve performance. Here are the macros at your disposal:

In [42]:
# @code_llvm 
# @code_lowered 
# @code_native 
# @code_typed 
# @code_warntype

function flipcoin(randval::Float64)
    if randval<0.5
        return "H"
    else
        return "T"
    end
end

flipcoin (generic function with 1 method)
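Outside the REPL or a notebook, these macros come from the `InteractiveUtils` standard library. The lowered and typed code are also available as ordinary functions that take the function and a tuple of argument types, which is convenient in scripts. A minimal sketch, using a hypothetical function `coin` equivalent to `flipcoin`:

```julia
using InteractiveUtils   # provides the @code_... macros in scripts

coin(x::Float64) = x < 0.5 ? "H" : "T"

lowered = code_lowered(coin, (Float64,))   # one CodeInfo per matching method
typed   = code_typed(coin, (Float64,))     # pairs of CodeInfo => return type

@assert length(lowered) == 1
@assert last(only(typed)) == String        # inferred return type
```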

In [43]:
@code_lowered flipcoin(rand()) # lowered IR (desugared code, before type inference)

CodeInfo(
1 ─ %1 = randval < 0.5
└──      goto #3 if not %1
2 ─      return "H"
3 ─      return "T"
)

In [44]:
@code_warntype flipcoin(rand()) # try @code_typed

MethodInstance for flipcoin(::Float64)
  from flipcoin(randval::Float64) @ Main In[42]:7
Arguments
  #self#::Core.Const(flipcoin)
  randval::Float64
Body::String
1 ─ %1 = (randval < 0.5)::Bool
└──      goto #3 if not %1
2 ─      return "H"
3 ─      return "T"



In [45]:
@code_llvm flipcoin(rand()) # this and code_warntype are probably the most relevant

;  @ In[42]:7 within `flipcoin`
define nonnull {}* @julia_flipcoin_2470(double %0) #0 {
top:
;  @ In[42]:8 within `flipcoin`
; ┌ @ float.jl:536 within `<`
   %1 = fcmp uge double %0, 5.000000e-01
   %. = select i1 %1, {}* inttoptr (i64 139910097979208 to {}*), {}* inttoptr (i64 139910097979152 to {}*)
; └
;  @ In[42] within `flipcoin`
  ret {}* %.
}


In [46]:
@code_native flipcoin(rand())

	.text
	.file	"flipcoin"
	.section	.rodata.cst8,"aM",@progbits,8
	.p2align	3                               # -- Begin function julia_flipcoin_2486
.LCPI0_0:
	.quad	0x3fe0000000000000              # double 0.5
	.text
	.globl	julia_flipcoin_2486
	.p2align	4, 0x90
	.type	julia_flipcoin_2486,@function
julia_flipcoin_2486:                    # @julia_flipcoin_2486
; ┌ @ In[42]:7 within `flipcoin`
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	movabs	rax, offset .LCPI0_0
	vmovsd	xmm1, qword ptr [rax]           # xmm1 = mem[0],zero
; │ @ In[42]:8 within `flipcoin`
; │┌ @ float.jl: