In 2008, I wrote a proof-of-concept to test whether it is possible to continue the execution of a program after a segmentation fault: stack.c. The operating system sends a synchronous SIGSEGV signal to the process, which can handle it. Using a long jump, it is possible to continue the execution after an invalid memory read or write. With an alternate stack (see the sigaltstack() function), it is even possible to continue after a stack overflow. The alternate stack is required to be able to execute the signal handler, since the overflowed stack has no room left for it.
Because my proof-of-concept worked as expected, I proposed a patch for Python (issue3999) in September 2008 to convert an evil segmentation fault (usually called a “crash” or “fatal error”) into a classic and safe Python exception. I proposed it to improve the security (availability) of Python programs: it becomes possible to save all documents and display a nice error message before exiting, or even better, to just log the error and continue the execution.
But the patch was rejected, because in some cases the long jump may leave some objects in an inconsistent state. In this case, doing anything more than exiting is dangerous and can make things worse. But someone proposed to just display the Python backtrace before exiting, which was already possible indirectly with my patch if the exception was not caught by the program. I was angry because I had spent a lot of time on this patch and I was still convinced that it was safe. I didn’t write the patch displaying the backtrace because I was unable to dump the Python backtrace in C, especially in a signal handler.
In 2009, I was driven crazy by a very annoying bug in Xorg: after a period of between 2 and 10 days, I lost my keyboard (“my keyboard is blo…”). I hit the bug for 8 months without being able to get any useful information to isolate or understand it. I used two USB keyboards: I disconnected the second keyboard, and nothing changed. I tried to keep a list of active applications: I didn’t find any useful information. But I had another problem: I didn’t know whether Xorg logged something or not. Xorg doesn’t log messages with a timestamp, and because I never read the Xorg logs, I was unable to see whether there were new messages. So I wrote a simple patch to log messages with a timestamp. After posting it to the freedesktop bug tracker, I found another, older patch. I used it to improve mine.
But my Xorg patch was rejected (like the older patch) because it used some functions which are not “signal-safe”. I learnt that a signal handler should only use “signal-safe” functions, which are reentrant functions guaranteed to be safe in a signal handler. And the list of these functions is short! The main problem was the localtime_r() and strftime() functions. In the GNU libc, these functions are clearly not signal-safe: they temporarily change the timezone and use a lock for this.
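To give an idea of what staying within the signal-safe list looks like, here is a rough sketch of a timestamped log line that avoids localtime_r() and strftime() entirely. It is not the Xorg patch: format_decimal() and log_with_timestamp() are hypothetical helpers, and the timestamp is printed as raw seconds since the epoch because the timezone conversion is exactly the unsafe part. It relies only on time() and write(), which are both on the POSIX list of async-signal-safe functions.

```c
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Write value in decimal into buf, without a terminating NUL.
   buf must have room for 20 bytes. Returns the number of bytes written. */
size_t format_decimal(unsigned long long value, char *buf)
{
    char tmp[20];
    size_t n = 0, i;

    do {                                  /* digits come out in reverse order */
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    return n;
}

/* Write "[<seconds since epoch>] <msg>\n" to fd.
   Only calls time() and write(); no heap allocation, no stdio, no locks.
   Partial writes are not retried in this sketch. Returns 0 or -1. */
int log_with_timestamp(int fd, const char *msg, size_t msg_len)
{
    char prefix[32];
    size_t n = 0;
    time_t now = time(NULL);

    prefix[n++] = '[';
    n += format_decimal(now < 0 ? 0ULL : (unsigned long long)now, prefix + n);
    prefix[n++] = ']';
    prefix[n++] = ' ';

    if (write(fd, prefix, n) < 0)
        return -1;
    if (write(fd, msg, msg_len) < 0)
        return -1;
    if (write(fd, "\n", 1) < 0)
        return -1;
    return 0;
}
```

A call such as log_with_timestamp(2, "device lost", 11) would write one line to file descriptor 2. Converting the seconds to a human-readable date can then be done later, outside the signal handler, when reading the log.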
Thanks to this experience in the signal handler world and my experience in CPython internals, I was able to write a signal handler displaying the Python backtrace: issue8863 (created in May 2010). The first version was naïve, buggy and unsafe:
- if there was a loop in the frame linked list, the signal handler filled stderr and never finished
- it used Python high-level functions such as _PyUnicode_AsString() (encode a unicode string to UTF-8) which may allocate memory on the heap using the Python memory allocator (pymalloc)
- it used buffered functions to write into stderr (e.g. fputs)
- it displayed the wrong backtrace if the thread causing the fault didn’t hold the GIL (global interpreter lock)
- it didn’t call the previous signal handler (e.g. Apport on Ubuntu)
- it only caught the SIGSEGV fault
It took me 11 versions to write a safe handler:
- Limit the backtrace to 100 frames (to avoid an infinite loop)
- Only use signal-safe functions (e.g. write())
- Only allocate a small amount of memory, on the stack and not on the heap
- Use PyGILState_GetThisThreadState() to get the backtrace of the thread that caused the fault, instead of getting the “current” thread
- Call the previous signal handler. The fault handler restores the previous signal handler and gives the control flow back to the program. The program raises the same fault again on the same instruction, and so the previous signal handler is called too.
- Catch SIGSEGV, SIGFPE, SIGBUS and SIGILL faults
But the patch was rejected because it was not considered safe. The fault handler writes into file descriptor 2, which is supposed to be stderr, but it may be a network socket in a server. The fault handler may also cause trouble if Python is embedded in an application. The API to disable the fault handler had not been decided. It was also too late for Python 3.2: beta 2 was already released, and new features are not permitted after this stage (as explained in the Python 3.2 schedule). For all these reasons, my patch could not be included in Python 3.2. I tried my last chance for Python 3.2 by proposing a new patch with the fault handler disabled by default. But it was rejected too. Again, I was angry because I had spent a lot of time on this new patch.
So I converted the patch to a third-party module. I posted it to the Python Package Index (PyPI) and announced version 1.0 directly on the python-announce mailing list for Christmas.
A dedicated module is more practical than a signal handler alone: I easily added functions to enable and disable the fault handler, and then functions to dump the current backtrace (of the current thread or of all threads). The project is still under development and can be found on github.com: faulthandler.