Okay, this is more a case of “confusing debugging output” than an actual bug, but it’s still a fun story.
So the other day at work, I had to debug a linking issue. (Don’t worry, it wasn’t as bad as it sounds.) I eventually figured it out, but one thing that bugged[1] me was the somewhat unexpected debugging output that I got while trying to resolve the issue.
Here’s a simplified version of the problem: suppose we write a pretty basic library implementing, say, the factorial function.
/* fact.c */
unsigned factorial(unsigned n) {
    if (n == 0)
        return 1;
    return n * factorial(n - 1);
}
Now generally speaking, there are three ways that we can use this library code in another file: static linking, dynamic linking, and dynamic loading.
When it comes to dynamic linking and dynamic loading, how does the program find its libraries at runtime? This depends on your system (on Linux, reading the man pages for ld-linux is a good start), but typically it’s some combination of:
- RPATH or RUNPATH header.
- LD_LIBRARY_PATH environment variable.
- /lib and /usr/lib, as well as directories listed in /etc/ld.so.conf.
Ordinarily, when you have a linker issue, the quickest fix is to set LD_LIBRARY_PATH to include the directory where the missing library resides, but this is generally seen as a quick-and-dirty hack that should be discouraged. Better is to provide the right RPATH to the executable, either by patching it (with tools like chrpath or patchelf) or by just compiling it right in the first place.
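If you’d rather ask the loader directly which file it ended up picking, here’s a minimal sketch. It assumes glibc (dlinfo and RTLD_DI_LINKMAP are GNU extensions), and the helper name resolved_path is made up for illustration. It loads a library by name using the normal search rules and copies out the path where it was found.

```c
#define _GNU_SOURCE   /* dlinfo() and RTLD_DI_LINKMAP are GNU extensions */
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>

/* Hypothetical helper: load `name` using the normal search rules and copy
 * the path the loader resolved it to into `buf`.
 * Returns 0 on success, -1 on failure. */
int resolved_path(const char *name, char *buf, size_t len) {
    struct link_map *map = NULL;
    int status = -1;
    void *handle = dlopen(name, RTLD_NOW);

    if (handle == NULL) {
        /* The library was not found anywhere on the search path. */
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    /* The link map entry records the pathname the object was loaded from. */
    if (dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0 && map != NULL) {
        snprintf(buf, len, "%s", map->l_name);
        status = 0;
    }

    dlclose(handle);
    return status;
}
```

Calling it with something like resolved_path("libm.so.6", buf, sizeof buf) should fill buf with wherever your system keeps that library. On older glibc versions you may need to link with -ldl.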
But returning to our wonderful little factorial library, let’s be a little sneaky and introduce another level of indirection here. We’re going to dynamically load our library, but instead of doing it directly, we’re going to write another library that does the loading for us by wrapping the system dlfcn library:
/* load.c */
#include <dlfcn.h>

/* Thin wrapper around dlopen */
void *load(const char *filename) {
    return dlopen(filename, RTLD_NOW);
}

/* Thin wrapper around dlsym */
void *symbol(void *handle, const char *name) {
    return dlsym(handle, name);
}

/* Thin wrapper around dlclose */
int close(void *handle) {
    return dlclose(handle);
}
The main program will be dynamically linked against the load library. This library will have the usual header file:
/* load.h */
void *load(const char *filename);
void *symbol(void *handle, const char *name);
int close(void *handle);
Don’t worry about why we’re doing this; I’m just trying to make a point here.[2] Finally, we have our main program, which will attempt to use the load library to load the fact library to compute a factorial:
/* main.c */
#include <stdlib.h>
#include <stdio.h>
#include "load.h"

int main() {
    void *handle;
    unsigned (*factorial)(unsigned);
    unsigned result;

    /* Load libfact.so through the wrapper library */
    handle = load("libfact.so");
    factorial = symbol(handle, "factorial");
    result = factorial(5);
    printf("Result: %u\n", result);

    close(handle);
    return EXIT_SUCCESS;
}
To make things clearer, we’re going to use the following directory structure:
$ tree
.
├── fact
│   └── fact.c
├── load
│   ├── load.c
│   └── load.h
└── main.c

2 directories, 4 files
Let’s start by compiling the factorial library:
$ gcc -c -o fact/fact.o fact/fact.c
$ gcc -shared -o fact/libfact.so fact/fact.o
Next, we’ll compile the loading library. For now, we’ll just compile the library without telling it where we put libfact.so:
$ gcc -c -o load/load.o load/load.c
$ gcc -shared -ldl -o load/libload.so load/load.o
Finally, we’ll compile the main program. There’s a whole bevy of flags that we’ll need to set to tell it where we put libload.so: -I for the directory with the header file, -L for the directory with the library, -l for the library name itself, and -rpath (passed to the linker with -Wl) for the absolute path to the directory containing libload.so at runtime. Together, this takes the form:
$ gcc -Iload -Lload -lload -Wl,-rpath,$PWD/load main.c
At no stage did we tell anyone where we put libfact.so, so we should expect that at runtime, the program won’t be able to find it. This is indeed what happens:
$ ./a.out
Segmentation fault
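The crash itself comes from main.c never checking its return values: when the library can’t be found, load returns NULL, the symbol lookup presumably fails as well, and we end up calling a null function pointer. As a rough sketch of what a checked version could look like, here’s a helper that uses dlopen and dlsym directly instead of going through our wrapper. The name checked_factorial is made up for illustration.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical checked variant of the calls in main.c.
 * Returns 0 and stores the result in *out on success, -1 on failure. */
int checked_factorial(const char *lib, unsigned n, unsigned *out) {
    unsigned (*factorial)(unsigned);
    void *handle = dlopen(lib, RTLD_NOW);

    if (handle == NULL) {
        /* This is the case the unchecked program falls straight through. */
        fprintf(stderr, "load failed: %s\n", dlerror());
        return -1;
    }

    dlerror(); /* clear any stale error state before calling dlsym */

    /* Assign through a void ** so that we don't cast an object pointer
     * to a function pointer directly. */
    *(void **)(&factorial) = dlsym(handle, "factorial");
    if (factorial == NULL) {
        fprintf(stderr, "lookup failed: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    *out = factorial(n);
    dlclose(handle);
    return 0;
}
```

With a check like this, a missing libfact.so produces an error message from dlerror instead of a segmentation fault. It doesn’t fix the underlying search path problem, of course.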
To get more information, we can set the LD_DEBUG environment variable:
$ LD_DEBUG=files,libs ./a.out
This prints out a load[3] of stuff, but the interesting part is:
file=libfact.so [0]; dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
search cache=/etc/ld.so.cache
search path=<very long path> (system search path)
/path/to/load/libload.so: error: symbol lookup error: undefined symbol: factorial (fatal)
This makes sense; since libload.so doesn’t have an RPATH or RUNPATH header set (and the LD_LIBRARY_PATH environment variable is not set), the runtime loader is looking for the factorial symbol from libfact.so in the usual system library locations, where it doesn’t exist.
We can amend this by recompiling libload.so and providing the -rpath flag to the linker with the location of libfact.so:
$ gcc -shared -ldl -Wl,-rpath,$PWD/fact -o load/libload.so load/load.o
And now it works!
$ ./a.out
Result: 120
Up to now, this has been some fairly standard linker stuff. Let’s take a closer look at what’s happening here:
$ LD_DEBUG=files,libs ./a.out
This time, we see that the library was found, on the RUNPATH from libload.so:
file=libfact.so [0]; dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
search path=/path/to/fact (RUNPATH from file /path/to/load/libload.so)
trying file=/path/to/fact/libfact.so
The key thing to note is that we’re searching the RUNPATH from libload.so, not from a.out. We can verify this by inspecting the ELF headers:
$ objdump -x load/libload.so | grep RUNPATH
  RUNPATH              /path/to/fact
$ objdump -x a.out | grep RUNPATH
  RUNPATH              /path/to/load
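Besides reading the ELF headers, you can also ask the loader at runtime which directories it associates with a given loaded object. The following is a minimal sketch, assuming glibc (dlinfo with RTLD_DI_SERINFO is a GNU extension) and following the three-step pattern from the dlinfo man page. The helper name print_search_path is made up for illustration, and I haven’t checked how its output lines up with what LD_DEBUG prints.

```c
#define _GNU_SOURCE   /* dlinfo() and Dl_serinfo are GNU extensions */
#include <dlfcn.h>
#include <link.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: print the library search directories that the
 * loader reports for the object behind `handle`.
 * Returns the number of entries, or -1 on failure. */
int print_search_path(void *handle) {
    Dl_serinfo size_info;
    Dl_serinfo *info;
    int count;

    /* First call: find out how big the result buffer must be. */
    if (dlinfo(handle, RTLD_DI_SERINFOSIZE, &size_info) == -1)
        return -1;

    info = malloc(size_info.dls_size);
    if (info == NULL)
        return -1;

    /* Second call initializes the size and count fields in the new buffer;
     * third call fills in the actual search paths. */
    if (dlinfo(handle, RTLD_DI_SERINFOSIZE, info) == -1 ||
        dlinfo(handle, RTLD_DI_SERINFO, info) == -1) {
        free(info);
        return -1;
    }

    for (unsigned int i = 0; i < info->dls_cnt; i++)
        printf("%s\n", info->dls_serpath[i].dls_name);

    count = (int)info->dls_cnt;
    free(info);
    return count;
}
```

Passing the handle from dlopen(NULL, RTLD_NOW) inspects the main program, while passing a handle for libload.so inspects that library instead, which gives a second opinion on whose search path is whose.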
This is the correct behavior, since from the system’s point of view, it’s libload.so that’s dynamically loading libfact.so, not a.out.
Now here’s the funny bit: let’s recompile a.out so that it has libfact.so on its RUNPATH. This is unnecessary, but it won’t harm anyone:
$ gcc -Iload -Lload -lload -Wl,-rpath,$PWD/load,-rpath,$PWD/fact main.c
Of course, the program still works. But take a look at the debugging output:
$ LD_DEBUG=files,libs ./a.out
The relevant section:
file=libfact.so [0]; dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
search path=/path/to/fact (RUNPATH from file ./a.out)
trying file=/path/to/fact/libfact.so
Huh, it’s claiming that it’s searching the path /path/to/fact (which is correct), but also that this is the RUNPATH from a.out (which is incorrect: this is in fact the RUNPATH from libload.so). We can confirm this with:
$ objdump -x a.out | grep RUNPATH
  RUNPATH              /path/to/load:/path/to/fact
We can make this even more explicit by adding a random entry to the RUNPATH in libload.so, say:
$ gcc -shared -ldl -Wl,-rpath,$PWD/fact,-rpath,$HOME -o load/libload.so load/load.o
Now the debugging output is certainly wrong, or at the very least quite misleading:
file=libfact.so [0]; dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
search path=/path/to/fact:/home/eric (RUNPATH from file ./a.out)
trying file=/path/to/fact/libfact.so
And of course, if we remove the RUNPATH from libload.so entirely, the program crashes again.
$ gcc -shared -ldl -o load/libload.so load/load.o
$ ./a.out
Segmentation fault
Running it with LD_DEBUG=files,libs confirms that no RUNPATH is being searched. To summarize[4] the issue:
When a program is dynamically linked against library A, which in turn dynamically loads library B, the RUNPATH of library A is searched to figure out where library B is located. However, when LD_DEBUG is set, the debugging message falsely claims that the RUNPATH of the main program is being searched.
I’m really not an expert on any of this, so I’d welcome any feedback. (You can find my contact details on my website.)
If you’re curious, this problem originally arose when trying to package Chromium for my employer’s cloud platform. Chromium tries to dynamically load the NSS certificate database from libnssckbi.so, but our system had it installed in a nonstandard place. This was slightly tricky because Chromium uses another library, NSPR, to do the loading, but due to some design constraints we didn’t want to modify NSPR to tell it where NSS was installed.
[1] I sincerely apologize for the pun. I’m not taking it down, though.
[2] Although one reason might be to provide a cross-platform wrapper that abstracts over any system-specific details for library loading.
[3] Sorry for the second pun. I’m also not deleting this one.
[4] Yes, this is an abuse of the HTML <blockquote> element. Sue me.