A Linux Dynamic Linker Bug?

Okay, this is more a case of “confusing debugging output” than an actual bug, but it’s still a fun story.

So the other day at work, I had to debug a linking issue. (Don’t worry, it wasn’t as bad as it sounds.) I eventually figured it out, but one thing that bugged[1] me was the somewhat unexpected debugging output I got while trying to track the problem down.

The Setup

Here’s a simplified version of the problem: suppose we write a pretty basic library implementing, say, the factorial function.

/* fact.c */

unsigned factorial(unsigned n) {
  if (n == 0)
    return 1;
  return n * factorial(n - 1);
}

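Now generally speaking, there are three ways that we can use this library code in another file: it can be statically linked into the executable at build time, dynamically linked so that the loader resolves it when the program starts, or dynamically loaded on demand at runtime (with dlopen).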

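When it comes to dynamic linking and dynamic loading, how does the program find its libraries at runtime? This depends on your system (on Linux, reading the man pages for ld-linux is a good start), but typically it’s some combination of the LD_LIBRARY_PATH environment variable, the RPATH or RUNPATH entries embedded in the ELF file doing the loading, the cache in /etc/ld.so.cache, and the default system library directories.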

Ordinarily, when you have a linker issue, the quickest fix is to set LD_LIBRARY_PATH to include the directory where the missing library resides, but this is generally seen as a quick-and-dirty hack that should be discouraged. A better approach is to give the executable the right RPATH, either by patching it (with tools like chrpath or patchelf) or by just compiling it right in the first place.
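To make the idea of a search path concrete, here is a minimal sketch of walking a colon-separated directory list (the format used by LD_LIBRARY_PATH and RUNPATH) and trying each directory in order. The function name find_in_search_path is made up for illustration and is not a glibc API, and the real loader does considerably more (for example $ORIGIN expansion and checking that the file is a compatible ELF object).

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of glibc: walk a colon-separated list of
   directories and write the first existing "<dir>/<name>" into out.
   Returns 1 if a file was found, 0 otherwise. The contents of out are only
   meaningful when the function returns 1. Simplifications: empty entries
   and candidates that do not fit in out are skipped. */
int find_in_search_path(const char *search_path, const char *name,
                        char *out, size_t out_size) {
  const char *start = search_path;
  while (*start != '\0') {
    /* Each entry runs up to the next ':' or the end of the string. */
    size_t len = strcspn(start, ":");
    if (len > 0) {
      int n = snprintf(out, out_size, "%.*s/%s", (int)len, start, name);
      if (n > 0 && (size_t)n < out_size && access(out, F_OK) == 0)
        return 1;
    }
    start += len;
    if (*start == ':')
      start++;
  }
  return 0;
}
```

Calling find_in_search_path("/path/to/load:/path/to/fact", "libfact.so", buf, sizeof buf) would try /path/to/load/libfact.so first and then /path/to/fact/libfact.so, which is roughly the order of the "trying file=" lines in the loader's debugging output.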

But returning to our wonderful little factorial library, let’s be a little sneaky and introduce another level of indirection here. We’re going to dynamically load our library, but instead of doing it directly, we’re going to write another library that does the loading for us by wrapping the system dlfcn library:

/* load.c */

#include <dlfcn.h>

void *load(const char *filename) {
  return dlopen(filename, RTLD_NOW);
}

void *symbol(void *handle, const char *name) {
  return dlsym(handle, name);
}

int close(void *handle) {
  return dlclose(handle);
}

The main program will be dynamically linked against the load library. This library will have the usual header file:

/* load.h */

void *load(const char *filename);
void *symbol(void *handle, const char *name);
int close(void *handle);

Don’t worry about why we’re doing this; I’m just trying to make a point here.[2] Finally, we have our main program, which will attempt to use the load library to load the fact library to compute a factorial:

/* main.c */

#include <stdlib.h>
#include <stdio.h>
#include "load.h"

int main() {
  void *handle;
  unsigned (*factorial)(unsigned);
  unsigned result;

  handle = load("libfact.so");
  factorial = symbol(handle, "factorial");
  result = factorial(5);

  printf("Result: %u\n", result);
  close(handle);
  return EXIT_SUCCESS;
}

To make things clearer, we’re going to use the following directory structure:

$ tree
.
├── fact
│   └── fact.c
├── load
│   ├── load.c
│   └── load.h
└── main.c

2 directories, 4 files

Let’s start by compiling the factorial library:

$ gcc -c -o fact/fact.o fact/fact.c
$ gcc -shared -o fact/libfact.so fact/fact.o

Next, we’ll compile the loading library. For now, we’ll just compile the library without telling it where we put libfact.so:

$ gcc -c -o load/load.o load/load.c
$ gcc -shared -ldl -o load/libload.so load/load.o

Finally, we’ll compile the main program. There’s a whole bevy of flags that we’ll need to set to tell it where we put libload.so: -I for the directory with the header file, -L for the directory with the library, -l for the library name itself, and -rpath (passed to the linker with -Wl) for the absolute path of the directory that will contain libload.so at runtime. Together, this takes the form:

$ gcc -Iload -Lload -lload -Wl,-rpath,$PWD/load main.c

At no stage did we tell anyone where we put libfact.so, so we should expect that at runtime, the program won’t be able to find it. This is indeed what happens:

$ ./a.out
Segmentation fault
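The crash itself happens because main.c never checks any return values: dlopen returns NULL when the library can't be found, and the program eventually calls through a null function pointer. (As far as I can tell, glibc defines RTLD_DEFAULT as a null pointer, so passing the failed handle to dlsym turns into a global symbol lookup, which also fails and returns NULL.) A more defensive version might look like the sketch below. The name load_symbol_checked is hypothetical and not part of the load library above.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: open filename and look up name, reporting failures
   through dlerror() instead of handing back a pointer that the caller might
   blindly call. On success, returns the symbol's address and stores the
   library handle in *handle_out so the caller can dlclose() it later.
   On failure, returns NULL and sets *handle_out to NULL. */
void *load_symbol_checked(const char *filename, const char *name,
                          void **handle_out) {
  *handle_out = NULL;

  void *handle = dlopen(filename, RTLD_NOW);
  if (handle == NULL) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return NULL;
  }

  dlerror(); /* clear any stale error before calling dlsym */
  void *sym = dlsym(handle, name);
  if (sym == NULL) {
    fprintf(stderr, "dlsym failed: %s\n", dlerror());
    dlclose(handle);
    return NULL;
  }

  *handle_out = handle;
  return sym;
}
```

With a check like this, the failing run would print an error naming the missing library and let the program exit cleanly instead of segfaulting. On older versions of glibc this needs to be linked with -ldl.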

To get more information, we can set the LD_DEBUG environment variable:

$ LD_DEBUG=files,libs ./a.out

This prints out a load[3] of stuff, but the interesting part is:

file=libfact.so [0];  dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
 search cache=/etc/ld.so.cache
 search path=<very long path>              (system search path)

/path/to/load/libload.so: error: symbol lookup error: undefined symbol: factorial (fatal)

This makes sense. Since libload.so doesn’t have an RPATH or RUNPATH header set (and the LD_LIBRARY_PATH environment variable is not set), the runtime loader looks for libfact.so only in the usual system library locations, where it doesn’t exist, so the subsequent lookup of the factorial symbol fails.

We can amend this by recompiling libload.so and providing the -rpath flag to the linker with the location of libfact.so:

$ gcc -shared -ldl -Wl,-rpath,$PWD/fact -o load/libload.so load/load.o

And now it works!

$ ./a.out
Result: 120

The “Bug”

Up to now, this has been some fairly standard linker stuff. Let’s take a closer look at what’s happening here:

$ LD_DEBUG=files,libs ./a.out

This time, we see that the library was found, on the RUNPATH from libload.so:

file=libfact.so [0];  dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
 search path=/path/to/fact             (RUNPATH from file /path/to/load/libload.so)
  trying file=/path/to/fact/libfact.so

The key thing to note is that we’re searching the RUNPATH from libload.so, not from a.out. We can verify this by inspecting the ELF headers:

$ objdump -x load/libload.so | grep RUNPATH
  RUNPATH              /path/to/fact
$ objdump -x a.out | grep RUNPATH
  RUNPATH              /path/to/load

This is the correct behavior, since from the system’s point of view, it’s libload.so that’s dynamically loading libfact.so, not a.out.

Now here’s the funny bit: let’s recompile a.out so that it has libfact.so on its RUNPATH. This is unnecessary, but it won’t harm anyone:

$ gcc -Iload -Lload -lload -Wl,-rpath,$PWD/load,-rpath,$PWD/fact main.c

Of course, the program still works. But take a look at the debugging output:

$ LD_DEBUG=files,libs ./a.out

The relevant section:

file=libfact.so [0];  dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
 search path=/path/to/fact             (RUNPATH from file ./a.out)
  trying file=/path/to/fact/libfact.so

Huh, it’s claiming that it’s searching the path /path/to/fact (which is correct), but also that this is the RUNPATH from a.out (which is incorrect—this is in fact the RUNPATH from libload.so). We can confirm this with:

$ objdump -x a.out | grep RUNPATH
  RUNPATH              /path/to/load:/path/to/fact

We can make this even more explicit by adding a random entry to the RUNPATH in libload.so, say:

$ gcc -shared -ldl -Wl,-rpath,$PWD/fact,-rpath,$HOME -o load/libload.so load/load.o

Now the debugging output is certainly wrong, or at the very least quite misleading:

file=libfact.so [0];  dynamically loaded by /path/to/load/libload.so [0]
find library=libfact.so [0]; searching
 search path=/path/to/fact:/home/eric             (RUNPATH from file ./a.out)
  trying file=/path/to/fact/libfact.so

And of course, if we remove the RUNPATH from libload.so entirely, the program crashes again.

$ gcc -shared -ldl -o load/libload.so load/load.o
$ ./a.out
Segmentation fault

Running it with LD_DEBUG=files,libs confirms that no RUNPATH is being searched. To summarize[4] the issue:

When a program is dynamically linked against library A, which in turn dynamically loads library B, the RUNPATH of library A is searched to figure out where library B is located. However, when LD_DEBUG is set, the debugging message falsely claims that the RUNPATH of the main program is being searched.

I’m really not an expert on any of this, so I’d welcome any feedback. (You can find my contact details on my website.)


If you’re curious, this problem originally arose when trying to package Chromium for my employer’s cloud platform. Chromium tries to dynamically load the NSS certificate database from libnssckbi.so, but our system had it installed in a nonstandard place. This was slightly tricky because Chromium uses another library—NSPR—to do the loading, but due to some design constraints we didn’t want to modify NSPR to tell it where NSS was installed.


  1. I sincerely apologize for the pun. I’m not taking it down, though.

  2. Although one reason might be to provide a cross-platform wrapper that abstracts over any system-specific details for library loading.

  3. Sorry for the second pun. I’m also not deleting this one.

  4. Yes, this is an abuse of the HTML <blockquote> element. Sue me.

