While working on an eBPF / BCC script that fetches the stacktraces of processes that receive deadly signals, I saw dozens of segfaulting Java processes. Even though core dumping was enabled, no core was generated.
This is what some of the stacktraces looked like:
    PID     TID     COMM         FUNC             -
    1450691 1450692 java         complete_signal  11
            complete_signal+0x1 [kernel]
            force_sig_info+0xbd [kernel]
            force_sig_info_fault+0x8c [kernel]
            __bad_area_nosemaphore+0xef [kernel]
            bad_area+0x46 [kernel]
            __do_page_fault+0x366 [kernel]
            do_page_fault+0xc [kernel]
            page_fault+0x22 [kernel]
            [unknown]
            [unknown]

    451905  451976  java         complete_signal  11
            complete_signal+0x1 [kernel]
            force_sig_info+0xbd [kernel]
            force_sig_info_fault+0x8c [kernel]
            __bad_area_nosemaphore+0xef [kernel]
            bad_area_access_error+0xad [kernel]
            __do_page_fault+0x16b [kernel]
            do_page_fault+0xc [kernel]
            page_fault+0x22 [kernel]
            java.util.logging.Handler.getFilter()+0x38 [perf-451905.map]
            [unknown] [perf-451905.map]
My first reaction was to think that my eBPF code was buggy. I used BCC’s trace.py to double check, but got the exact same results.
Coming from a Ruby-centric (CRuby / MRI) world full of native extensions, I thought that some Java native extensions could be crashing and that it was somehow quietly handled.
However, it was not just a couple of segfaults; it was more like hundreds when invoking a Java process like buck build <...>!! Even calling buck help generated 2 or 3!!
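To put a number on "hundreds", the trace output can be tallied per process. This is a minimal sketch, assuming the output has the same shape as the stacktraces above (an event line with PID, TID, COMM, FUNC and the signal number, followed by one frame per line). The `tally_segfaults` helper and the shortened `SAMPLE` text are made up for illustration; real trace.py output may carry extra columns depending on the flags used.

```python
from collections import Counter

# Shortened sample in the same shape as the trace output (frames trimmed).
SAMPLE = """\
PID     TID     COMM FUNC             -
1450691 1450692 java complete_signal  11
        complete_signal+0x1 [kernel]
        bad_area+0x46 [kernel]
        [unknown]

451905  451976  java complete_signal  11
        complete_signal+0x1 [kernel]
        bad_area_access_error+0xad [kernel]
        java.util.logging.Handler.getFilter()+0x38 [perf-451905.map]
"""


def tally_segfaults(text):
    """Count SIGSEGV (signal 11) events per (pid, comm) pair."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Event lines have five fields and start with two integers (PID, TID).
        # The header and the stack frames do not, so they are skipped.
        if len(fields) == 5 and fields[0].isdigit() and fields[1].isdigit():
            pid, _tid, comm, _func, sig = fields
            if sig == "11":
                counts[(int(pid), comm)] += 1
    return counts


counts = tally_segfaults(SAMPLE)
for (pid, comm), n in counts.most_common():
    print(f"{comm} (pid {pid}): {n} SIGSEGV")
```

Piping a full capture through something like this makes it easy to see which invocations are responsible for most of the signals.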
Chris Down helped me double check that the trace.py one-liner made sense. We checked the kernel stacktraces and many of them were pointing to
I gave debugging this another go in the evening with Javier Maestro, who thought this could be an implementation detail of the Java Virtual Machine. I thought that was completely improbable, but after some more digging into HotSpot’s code, running a Hello World in Java under GDB, and seeing that it received a SIGSEGV, we learnt that this is the way it’s supposed to behave (!!!).
Some of the possible cases are:
- The JVM eliminates NULL checks from the generated code and, when a SIGSEGV shows that a reference really was NULL, replaces that code with a version that has the checks
- The safepoint execution mechanism
- The stack overflowed, so the JVM will grow it
The documentation explains why it works this way and which signals are used for what.
This function has some interesting logic to check whether it’s a SIGSEGV with the traditional semantics or not. The logic for generating a core is called here. We also found an interesting blog post: http://jcdav.is/2015/10/06/SIGSEGV-as-control-flow/
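The kind of decision such a handler makes can be pictured with a toy model. This is only a conceptual sketch of the cases listed above, not HotSpot's actual code: every name in it (`Fault`, `ThreadLayout`, `classify_sigsegv`, the addresses, the 4096-byte page size) is hypothetical, and the real handler looks at far more state.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class Fault:
    addr: int           # faulting address reported with the signal
    in_java_code: bool  # True if the faulting instruction was JVM-generated code


@dataclass
class ThreadLayout:
    polling_page: int   # start of the page the JVM protects to request a safepoint
    guard_start: int    # the stack guard zone is [guard_start, guard_end)
    guard_end: int


def classify_sigsegv(fault, layout):
    """Return a label for what a JVM-like handler would make of this fault."""
    if layout.polling_page <= fault.addr < layout.polling_page + PAGE_SIZE:
        return "safepoint poll"        # expected: the thread is being asked to stop
    if fault.in_java_code and fault.addr < PAGE_SIZE:
        return "implicit null check"   # expected: turn it into a Java exception
    if layout.guard_start <= fault.addr < layout.guard_end:
        return "stack overflow"        # expected: handled inside the JVM
    return "real crash"                # traditional semantics: report and dump core


# Made-up addresses, just to exercise each branch.
layout = ThreadLayout(
    polling_page=0x7F0000000000,
    guard_start=0x7FFD00000000,
    guard_end=0x7FFD00003000,
)
for fault in (
    Fault(0x7F0000000010, True),
    Fault(0x10, True),
    Fault(0x7FFD00001000, True),
    Fault(0x10, False),
):
    print(hex(fault.addr), "->", classify_sigsegv(fault, layout))
```

Only the last branch corresponds to a segfault in the traditional sense, which would explain why none of the signals we traced produced a core.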
- eBPF is amazing
- Debugging stuff and finding unexpected behaviours with coworkers is fun, thank you both 💞
- Javier Maestro was right 😜
For reference, this is the trace.py one-liner:
# trace.py -U 'p::complete_signal(int sig, struct task_struct *p, int group) (sig==11) "%d", sig'