A small project to do a pure port of ARPACK Symmetric solver to Julia - Part 5 - Understanding Ref
Here, we pick up with debugging the call to dsconv
within the ARPACK library
itself from the last post. This is when the repo was at
this commit
When we run the tests against the Arpack wrapped function directly, it errors.
using Pkg; Pkg.test("ArpackInJulia"; test_args=["arpackjll"])
In particular, the following code always returns “-1”
import Arpack_jll, LinearAlgebra
function arpack_dsconv(n::Int, ritz::Vector{Float64}, bounds::Vector{Float64},
tol::Float64)
nconv::LinearAlgebra.BlasInt = -1
ccall((:dsconv_, Arpack_jll.libarpack), Cvoid,
(Ref{LinearAlgebra.BlasInt},
Ptr{Float64},
Ptr{Float64},
Ref{Float64},
Ref{LinearAlgebra.BlasInt}),
n, ritz, bounds, tol, nconv)
return nconv
end
Ugh. This is tough to debug. It seems like nconv always isn’t getting updated. One of my common refrains is that problems occur at boundaries between code. Interfaces are very hard to get right because slight differences in models can manifest as bugs or just outright incorrect answers. This is exactly what we are seeing here.
Now, I could just jump to the solution, which is blindingly obvious in retrospect. On the other hand, there’s value in walking through the process I used. (Or so I tell myself.)
Given that the code above doesn’t work, what were my working hypotheses about the causes?
- So maybe I don’t understand how the fortran calling stuff works and I shouldn’t have expected this to get updated.
- Maybe Julia is doing something I don’t understand?
- There is some odd interaction for this particular piece of code?
- The Fortran / Arpack code doesn’t do what I think it does.
Well, the last one is the the most important as everything follows from my assumption the fortran code works. So that seems like a good place to start. (In truth, I did futz around with some of the others, but then quickly settled on this one.)
Step 1. Do something you know or strongly suspect will work.
Arpack is designed to work with C and C++. There are just fewer places for bugs to hide there. So I wanted to make sure I could call this from C and get the response I expected.
This gave me the following C program.
#include <inttypes.h>
typedef int64_t blasint;
void dsconv_(blasint* n, double* ritz, double* bounds, double* tol, blasint *nconv);
#include <printf.h>
int main(int argc, char** argv) {
double ritz[] = {1.0, 1.0, 1.0};
double bounds[] = {0.0, 0.0, 0.0};
double tol = 1e-8;
blasint n = 3;
blasint nconv = 4;
dsconv_(&n, ritz, bounds, &tol, &nconv);
printf("nconv = %i\n", (int)nconv);
return 0;
}
Which I saved into a file: test_dsconv.c
. This should print 3 if I’ve understood
this correctly. (And the other Julia code is correct.)
I wanted to compile this against the Julia libraries. Here’s how I got it working. Get the path to libarpack from Julia
@show Arpack_jll.libarpack_path
/Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib
(Ugh, security breach… yes, my local username is dgleich. Entirely surprising!) After much fussing with library paths, etc. here’s how to get this all compiled.
gcc test_dsconv.c \
/Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib \
-Wl,-rpath /Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/ \
-Wl,-rpath /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia
Let’s pick this apart.
gcc test_dsconv.c
# above is the call to compile this piece of code
/Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib
# above is telling it we want to use this library.
-Wl,-rpath /Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/
# above is telling gcc to past an option to the linker command. Specifically
# rpath means where to get runtime libraries ("runtime path")
# this can also be set with LD_LIBRARY_PATH, but that's... controversial
Let’s try without the last link to Julia’s libraries to see what happens!
dgleich@circulant test_dsconv % gcc test_dsconv.c \
/Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib \
-Wl,-rpath /Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/ \
dgleich@circulant test_dsconv % ./a.out
dyld: Library not loaded: @rpath/libopenblas64_.dylib
Referenced from: /Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib
Reason: image not found
zsh: abort ./a.out
Which is because it can’t find libopenblas, referenced by libarpack. So we need it tell it about all the other common Julia libraries. That’s what the final
-Wl,-rpath /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia
# tell it about Julia's libraries.
And now it works great.
dgleich@circulant test_dsconv % gcc test_dsconv.c \
/Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/libarpack.2.0.0.dylib \
-Wl,-rpath /Users/dgleich/.julia/artifacts/096eefeb84e5b1a463aefac5a561d6109f6e9467/lib/ \
-Wl,-rpath /Applications/Julia-1.6.app/Contents/Resources/julia/lib/julia
dgleich@circulant test_dsconv % ./a.out
nconv = 3
Okay, so my understanding of the Arpack function was correct. Ugh. That means I need to learn something about Julia.
Step 2: Better understanding Julia.
Okay, let’s go back to Julia because what we are doing ought to work. Here’s where @code_native and some hints about assembly code come in handy.
julia> @code_native arpack_dsconv(3, ones(3), zeros(3), 1e-8)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ none:1 within `arpack_dsconv'
subq $40, %rsp
; │ @ none:4 within `arpack_dsconv'
; │┌ @ essentials.jl:396 within `cconvert'
; ││┌ @ refpointer.jl:104 within `convert'
; │││┌ @ refvalue.jl:8 within `RefValue'
movq %rdi, 32(%rsp)
vmovsd %xmm0, 16(%rsp)
movq $-1, (%rsp)
; │└└└
; │┌ @ pointer.jl:65 within `unsafe_convert'
movq (%rsi), %rsi
movq (%rdx), %rdx
leaq 32(%rsp), %rdi
leaq 16(%rsp), %rcx
movq %rsp, %r8
movabsq $dsconv_, %rax
; │└
callq *%rax
; │ @ none:11 within `arpack_dsconv'
movq $-1, %rax
addq $40, %rsp
retq
nopw %cs:(%rax,%rax)
; └
If you look at this code long enough and google around about how amd64
assembly
works (or x86_64
if you prefer, although it’s really amd64…) then the return
value of the function is in register rax
. This code
movq $-1, %rax
sets the return register to -1 right before returning (retq
)
If you want to learn more about calling conventions and register conventions
on amd64
, see this powerpoint from cs217 at Princeton
Okay, so what’s happening is that we are just always returning -1
independently
of the function call. This means that the Julia compiler is treating the value
of nconv
as a constant and optimizing the code.
Another debugging tenant is try things you think will likely work. Here’s something I strongly suspect will work. And just for fun, I tried the following code.
import Arpack_jll, LinearAlgebra
function arpack_dsconv(n::Int, ritz::Vector{Float64}, bounds::Vector{Float64},
tol::Float64)
nconv::Vector{LinearAlgebra.BlasInt} = [-1]
ccall((:dsconv_, Arpack_jll.libarpack), Cvoid,
(Ref{LinearAlgebra.BlasInt},
Ptr{Float64},
Ptr{Float64},
Ref{Float64},
Ref{LinearAlgebra.BlasInt}),
n, ritz, bounds, tol, nconv)
return nconv[1]
end
Why do I think this will work? Because I’m passing an array and then explicitly looking at an entry. This is what I want to do, just without the array allocation that creating the vector causes. (Julia’s memory allocater is good, but not that good.)
julia> arpack_dsconv(3, ones(3), zeros(3), 1e-8)
1-element Vector{Int64}:
3
Hmm… so I need to learn more about Ref to understand why the other call didn’t work.
This line stuck out to me.
In Julia, Ref objects are dereferenced (loaded or stored) with [].
Anyway, at this point, I realized my mental model was wrong and got to …
The fix
Now, here’s the fix. The issue was the ccall
will auto wrap
an intrinsic type (i.e. int, float, etc.) with a Ref if you tell
it the ref type. But this doesn’t refer to the original value, it
makes a new copy of it and refers to that. So here’s working code.
import Arpack_jll, LinearAlgebra
function arpack_dsconv(n::Int, ritz::Vector{Float64}, bounds::Vector{Float64},
tol::Float64)
nconv = Ref{LinearAlgebra.BlasInt}(-1)
ccall((:dsconv_, Arpack_jll.libarpack), Cvoid,
(Ref{LinearAlgebra.BlasInt},
Ptr{Float64},
Ptr{Float64},
Ref{Float64},
Ref{LinearAlgebra.BlasInt}),
n, ritz, bounds, tol, nconv)
return nconv[]
end
The error in my mental model
As I said above, bugs exist on boundaries. Here’s the origin of this bug and my incorrect mental model of the code.
My mental model based on C/C++
nconv::Int = -1
ccall((:func, libfunc), Cvoid, (Ref{Int}), nconv) # <-> func(&nconv) in C
But that’s not right! Here’s what the code actually does.
nconv::Int = -1
ccall((:func, libfunc), Cvoid, (Ref{Int}), nconv) # <-> int newint = nconv; func(&int(newint)) in C
Which of course explains why nconv isn’t updated. It wasn’t even passed in!
Suggestion
- Add a note to
Ref
documentation on this - Add a note to
ccall
documentation on this
Is it worth it for one person with an incorrect mental model? I think yes
because the ccall
area is likely to be a source of similar bugs
and it’s needed to document the assumptions! This is already partially done
in this paragraph
When passed as a
ccall
argument (either as aPtr
orRef
type), aRef
object will be converted to a native pointer to the data it references. For mostT
, or when converted to aPtr{Cvoid}
, this is a pointer to the object data. WhenT
is anisbits
type, this value may be safely mutated, otherwise mutation is strictly undefined behavior.