How I Wrote a Simple C++ LSDA Parser

Subtitle 1: The Worst Debug of my Life

Subtitle 2: One Definition Rule Strikes Again

Subtitle 3: Why You Should Never Use Fedora for Anything Serious

Subtitle 4 (updated 2019-07-16): Some Fedora package maintainers really do not have the slightest clue about the most basic C++ principles, or about what “creating a package for distribution” actually means. See the transcript of the conversation.

TLDR

I spent almost two and a half weeks debugging a horrible crash in one C++ module. At first it looked like a compiler or standard library bug, but in the end it was plain and simple undefined behavior caused by code running during the initialization of static variables. The root cause was that the librados2 package in Fedora 27 is faulty: it was evidently built against boost 1.66, while the rest of Fedora 27 uses boost 1.64. So if you link something that uses boost (the native version from the distribution) against librados2 in Fedora 27, you accidentally break the one definition rule for the whole program. Curse you, Fedora package maintainers! :-)

How it all Began

It all began with a SIGSEGV in one of our C++ modules at work. After narrowing down the possibilities, we were able to reproduce it deterministically. Well, at least for some builds. It seemed that builds made on some machines always worked, while builds made on other machines did not. It is worth noting that we use a dockerized build environment, so the only difference we know of between build machines is the path of the source code inside the docker container. That should not make any difference to whether the software crashes or not (as you will see later, it does matter, because it causes ELF sections to be laid out differently). Running the software under Valgrind did not show anything suspicious before the crash, and when it was built with LibSanitizer, the crash did not occur at all. What was even stranger was the callstack, which led through some internals of libstdc++ and libgcc into the constructor code of some static instances inside boost. After installing the relevant debug symbols and sources, which you can do like this:


    $ dnf install 'dnf-command(debuginfo-install)'

    $ dnf debuginfo-install libstdc++

    $ dnf debuginfo-install libgcc

... inside the Fedora distribution, which we used as the runtime, the callstack from GDB looked like this:


    #0  0x00007ffff6d64290 in boost::asio::error::get_netdb_category()::instance () from /lib64/OUR_LIBRARY

    #1  0x00007ffff58b0055 in get_adjusted_ptr (catch_type=0x7ffff6d64290 <boost::asio::error::get_netdb_category()::instance>, throw_type=throw_type@entry=0x7ffff273a7d0 <typeinfo for OUR_EXCEPTION_TYPE>, thrown_ptr_p=thrown_ptr_p@entry=0x7fffbdff9210) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:228

    #2  0x00007ffff58b09cf in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=1, exception_class=5138137972254386944, ue_header=0x7fffb4031270, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:595

    #3  0x00007ffff52cf2db in _Unwind_RaiseException (exc=exc@entry=0x7fffb4031270) at ../../../libgcc/unwind.inc:113

    #4  0x00007ffff58b1117 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fffb4031290, tinfo=tinfo@entry=0x7ffff273a7d0 <typeinfo for OUR_EXCEPTION_TYPE>, dest=dest@entry=0x7ffff252a2d0 <OUR_EXCEPTION_TYPE::~OUR_EXCEPTION_TYPE()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:88

    #5  0x00007ffff252f6b3 in OUR_CODE which throws exception

    ...

I left out our source code internals and replaced them with OUR_XXX labels.

GCC Bug?

Now what the hell is happening here? The code crashes when we try to throw an exception? After ruling out all the possible obvious problems, such as an error in the code of that exception we are throwing, we were able to narrow down the problem to code similar to the following:

    try
    {
      throw std::runtime_error("Blah");
    }
    catch (const OUR_EXCEPTION& ex)
    {
      ...
    }

Simply having this construct in the code caused the program to SIGSEGV when throwing the std::runtime_error. The problem was not the throwing, but the catching. Changing the catch block from OUR_EXCEPTION to OUR_DIFFERENT_EXCEPTION, or to an exception from the C++ standard library, made the crash go away. Seriously? We did several more experiments, but everything pointed to either a very serious memory corruption not detected by LibSanitizer and Valgrind, or a GCC bug. Briefly going through the source code from the callstack above pointed me in directions where I never ever wanted to go...

Here Comes the LSDA Parser

After going through the GCC source code from the callstack above and googling a bit, it was obvious that the program was crashing while throwing an exception, at the point where it tries to find the stack frame able to catch that exception. To simplify a lot: when you throw an exception in C++, the GCC runtime uses the stack unwinding mechanism (information stored in the .eh_frame ELF section) combined with LSDA information (Language Specific Data Area, stored in .gcc_except_table) to locate the frame which should do the catching. While unwinding, it looks into the LSDA for exception-related actions associated with particular instruction pointer ranges. These actions may be something like "call destructors" or "catch exception". The latter uses a very simple mechanism: it asks the typeinfo object of the exception type named in the catch block whether it can catch the exception which "is currently flying" (also represented by its typeinfo).
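The handler search described above can be observed from plain C++ without touching any runtime internals. The following is a minimal sketch, assuming a hypothetical OurException type that stands in for an application-specific exception. It only demonstrates the visible effect: the inner catch clause is consulted and rejected, and the search continues to the outer frame.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical exception type standing in for an application-specific one.
struct OurException : std::exception {
    const char* what() const noexcept override { return "OurException"; }
};

// Throws std::runtime_error and reports which handler ended up catching it.
std::string which_handler_catches_runtime_error() {
    try {
        try {
            throw std::runtime_error("Blah");
        } catch (const OurException&) {
            // During the handler search, the runtime compares the thrown type
            // against the type named here. std::runtime_error does not derive
            // from OurException, so this handler is skipped.
            return "inner: OurException";
        }
    } catch (const std::exception&) {
        // std::exception is a public base of std::runtime_error, so this
        // handler matches and unwinding stops here.
        return "outer: std::exception";
    }
    return "not reached";
}
```

Even though the inner handler never runs, its type information still has to be read during the search, which is why a damaged catch-type entry can crash a throw that the handler would not have caught anyway.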

And that is exactly what was failing in our case. The LSDA in the callstack above was not pointing to the typeinfo object for OUR_EXCEPTION, but to an instance of boost::asio::error::detail::netdb_category. So I wanted to check whether the relevant LSDA information was correct in the executable file on disk (which would indicate that we corrupt something at runtime) or not (which would indicate a GCC error; remember, we had some builds which consistently seemed to work and some which did not). Unfortunately, information about these areas of GCC and the C++ ABI is very scarce. And even when you find some, it is usually incomplete and / or contradictory. Nevertheless, it was better to start with some than with none. Here is my list of references, which you may find useful:

https://monoinfinito.wordpress.com/series/exception-handling-in-c/ - Interesting introductory article, but it unfortunately simplifies things to the point where they are no longer true. Useful only for getting a very rough overview.
http://www.hexblog.com/wp-content/uploads/2012/06/Recon-2012-Skochinsky-Compiler-Internals.pdf - Very good overview of all the relevant data structures, becomes perfect if you read it in the context of the following links.
https://www.airs.com/blog/archives/460 and https://www.airs.com/blog/archives/464 - Nice description which puts things into context.
https://github.com/gcc-mirror/gcc/blob/gcc-7_3_0-release/libstdc++-v3/libsupc++/eh_personality.cc and related files - The Ultimate Source of Truth - the actual implementation which parses LSDA tables at runtime.

After reading the articles above, and especially after a detailed reading of the GCC sources, I was able to write an LSDA parser which dumps LSDA information in human-readable form. At that time, I was not able to find any other tool, except one commercial disassembler, that could do this. This is why I am publishing it here. The abilities of the parser are very limited and specifically targeted to my use case, but I believe you may find it useful anyway. Here is the full source code. My employer was kind enough to let me publish it under the GPL license: LSDADecoder.tar.gz

I am warning you again, the code is very experimental, incomplete and not at all production ready, but it did the job for me. All I needed to start was the address of the potentially corrupted LSDA chunk, which I got using GDB from the stack above:


    (gdb) frame 2

    (gdb) info locals

    ...

    language_specific_data = 0x7ffff6b5c848

    ...

    (gdb) # now we need the offset inside the relevant section; the following two commands give the mappings whose base I can subtract from the pointer above

    (gdb) info files

    ...

    (gdb) info proc mappings

    ...

          0x7ffff6a33000     0x7ffff6b5e000   0x12b000        0x0 /usr/lib64/OUR_LIBRARY

    ...

    (gdb) # so we are looking for offset 0x7ffff6b5c848 - 0x7ffff6a33000 = 0x129848 in our library file, which we feed to the LSDADecoder
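The subtraction from the GDB session above can be written down as a tiny helper. This is just a sketch of the arithmetic. The function name file_offset is hypothetical, and the parameters correspond to the "Start Addr" and "Offset" columns printed by `info proc mappings`.

```cpp
#include <cstdint>

// Translate a runtime address inside a mapping into an offset in the mapped
// file. map_start is the start address of the mapping and map_file_offset is
// the file offset at which that mapping begins (0x0 in the session above).
constexpr std::uint64_t file_offset(std::uint64_t runtime_addr,
                                    std::uint64_t map_start,
                                    std::uint64_t map_file_offset) {
    return runtime_addr - map_start + map_file_offset;
}

// The values from the GDB session: the LSDA pointer and the mapping base.
static_assert(file_offset(0x7ffff6b5c848ULL, 0x7ffff6a33000ULL, 0x0ULL) == 0x129848ULL,
              "offset of the LSDA chunk inside the library file");
```

The helper assumes the address really lies inside the given mapping. For a mapping that does not start at file offset 0x0, the third argument matters.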

The output of the LSDA parser published above showed that the value inside the ELF file is correct, which meant runtime corruption. But thanks to the LSDA parser, I now knew which address was actually getting corrupted.
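To give a feel for what such a parser has to do first, here is a minimal sketch of decoding the LSDA header, modelled on parse_lsda_header in eh_personality.cc. It is not the code from LSDADecoder. The names LsdaHeader, read_uleb128 and parse_lsda_header are used here purely for illustration, offsets are relative to the start of the byte buffer, and the sketch assumes the @LPStart field is omitted (encoding 0xff), rejecting anything else.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

constexpr std::uint8_t DW_EH_PE_omit = 0xff;

// Decode one unsigned LEB128 value starting at data[pos] and advance pos.
std::uint64_t read_uleb128(const std::vector<std::uint8_t>& data, std::size_t& pos) {
    std::uint64_t result = 0;
    unsigned shift = 0;
    while (true) {
        if (pos >= data.size()) throw std::out_of_range("truncated ULEB128");
        if (shift >= 64) throw std::overflow_error("ULEB128 too large");
        const std::uint8_t byte = data[pos++];
        result |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) return result;  // high bit clear: last byte
        shift += 7;
    }
}

// Illustrative view of the header; all offsets index into the input buffer.
struct LsdaHeader {
    std::uint8_t ttype_encoding = DW_EH_PE_omit;
    bool has_ttype = false;
    std::size_t ttype_base = 0;       // base of the type table, if present
    std::uint8_t call_site_encoding = 0;
    std::size_t call_site_table = 0;  // first byte of the call-site table
    std::size_t action_table = 0;     // first byte after the call-site table
};

LsdaHeader parse_lsda_header(const std::vector<std::uint8_t>& data) {
    std::size_t pos = 0;
    auto next_byte = [&]() -> std::uint8_t {
        if (pos >= data.size()) throw std::out_of_range("truncated LSDA header");
        return data[pos++];
    };

    // Assumption of this sketch: @LPStart is omitted, so landing pads are
    // relative to the function start. An explicit value is not decoded here.
    if (next_byte() != DW_EH_PE_omit)
        throw std::runtime_error("explicit @LPStart is not handled in this sketch");

    LsdaHeader h;
    h.ttype_encoding = next_byte();
    if (h.ttype_encoding != DW_EH_PE_omit) {
        // The offset is relative to the byte following the ULEB128 itself.
        const std::uint64_t off = read_uleb128(data, pos);
        h.has_ttype = true;
        h.ttype_base = pos + static_cast<std::size_t>(off);
    }

    h.call_site_encoding = next_byte();
    const std::uint64_t len = read_uleb128(data, pos);
    h.call_site_table = pos;
    h.action_table = pos + static_cast<std::size_t>(len);
    return h;
}
```

The catch-type entries that got corrupted in this story live in the type table, which is indexed backwards from ttype_base. Decoding them needs the pointer encoding in ttype_encoding, which this sketch only records.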

And Here Comes the Culprit

Discovering who was corrupting the memory was now just a matter of setting up a single watchpoint in GDB:


    (gdb) # address below comes from the output of the LSDA parser (type 0x00000000003312A0) and the base offset 0x7ffff6a33000 printed above

    (gdb) watch *0x7ffff6d642a0

    (gdb) start

The hardware watchpoint got hit twice: the first time it was the dynamic linker loading the correct data there, the second time it was /usr/lib64/ceph/libceph-common.so.0 corrupting the data. After installing the debug symbols and sources for glibc and librados2, it was immediately obvious what was happening. The Fedora package maintainers had compiled a different version of the boost library into the Ceph package:


    $ md5sum /usr/src/debug/ceph-12.2.8-1.fc27.x86_64/build/boost/include/boost/system/error_code.hpp

    f465639145c000ca2198325d1072ba92  /usr/src/debug/ceph-12.2.8-1.fc27.x86_64/build/boost/include/boost/system/error_code.hpp


    $ md5sum /usr/include/boost/system/error_code.hpp

    8d407caebd0f2a989ac1720038c596c6  /usr/include/boost/system/error_code.hpp

Inside the package they use 1.66, but the rest of the distribution uses 1.64. And that was it. In 1.64, the error_category class has no data members, but in 1.66 it has a member called pc_ pointing to this. So the compiler allocated space based on the 1.64 version, but the constructor from 1.66 got executed by the dynamic linker; it tried to initialize a pc_ which was not there, and some bytes of .gcc_except_table got overwritten instead. Sorry, my dear Fedora people, but this is really unfortunate...
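The size mismatch can be illustrated safely in isolation. The sketch below uses two hypothetical stand-in classes in separate namespaces (they are not the real Boost definitions) and constructs the larger one in a buffer that is big enough for it, then counts how many bytes beyond the smaller layout were modified. In the real bug there was no such spare room, so those bytes belonged to whatever happened to be placed next to the object.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

// Hypothetical stand-ins for the two library versions, NOT the real Boost classes.
namespace v1_64 {
struct error_category {
    virtual ~error_category() = default;
    virtual const char* name() const noexcept { return "category"; }
    // No data members: the object holds only the vtable pointer.
};
}  // namespace v1_64

namespace v1_66 {
struct error_category {
    error_category() noexcept : pc_(this) {}
    virtual ~error_category() = default;
    virtual const char* name() const noexcept { return "category"; }
    const error_category* pc_;  // extra data member makes the object larger
};
}  // namespace v1_66

static_assert(sizeof(v1_66::error_category) > sizeof(v1_64::error_category),
              "the newer layout needs more space than the older one");

// Constructs the newer layout in sufficiently large storage and returns how
// many bytes past the end of the older layout the constructor modified.
std::size_t bytes_written_past_old_layout() {
    constexpr std::size_t old_size = sizeof(v1_64::error_category);
    constexpr std::size_t new_size = sizeof(v1_66::error_category);

    alignas(v1_66::error_category) unsigned char storage[new_size];
    std::memset(storage, 0xAB, new_size);  // recognizable fill pattern

    auto* obj = new (storage) v1_66::error_category();

    std::size_t changed = 0;
    for (std::size_t i = old_size; i < new_size; ++i) {
        if (storage[i] != 0xAB) ++changed;
    }
    obj->~error_category();
    return changed;
}
```

A caller that reserved only sizeof(v1_64::error_category) bytes would see those extra writes land in neighbouring memory, which is the effect described above for .gcc_except_table.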

Conclusion and Lessons Learned

Memory corruption during the initialization of static globals at dynamic linking time may not get caught by tools such as Valgrind and LibSanitizer. I didn’t know that.
C++ exception handling is actually pretty interesting, but the only real source of information is the compiler source code.
I thought that one of the purposes of a Linux distribution is to select software which works well together. Fedora states on its web pages that it is "committed to innovation". I never thought that this actually means "we push the newest versions of everything no matter how unstable it gets". Some Fedora package maintainers do not care about the stability of software running on their distribution, otherwise they wouldn’t break something as fundamental and basic as the one definition rule.