diff options
Diffstat (limited to 'Documentation/admin-guide/bug-hunting.rst')
-rw-r--r-- | Documentation/admin-guide/bug-hunting.rst | 248 |
1 files changed, 248 insertions, 0 deletions
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst new file mode 100644 index 000000000000..a8ef794aadae --- /dev/null +++ b/Documentation/admin-guide/bug-hunting.rst @@ -0,0 +1,248 @@ +Bug hunting ++++++++++++ + +Last updated: 20 December 2005 + +Introduction +============ + +Always try the latest kernel from kernel.org and build from source. If you are +not confident in doing that please report the bug to your distribution vendor +instead of to a kernel developer. + +Finding bugs is not always easy. Have a go though. If you can't find it don't +give up. Report as much as you have found to the relevant maintainer. See +MAINTAINERS for who that is for the subsystem you have worked on. + +Before you submit a bug report read +:ref:`Documentation/REPORTING-BUGS <reportingbugs>`. + +Devices not appearing +===================== + +Often this is caused by udev. Check that first before blaming it on the +kernel. + +Finding patch that caused a bug +=============================== + + + +Finding using ``git-bisect`` +---------------------------- + +Using the provided tools with ``git`` makes finding bugs easy provided the bug +is reproducible. + +Steps to do it: + +- start using git for the kernel source +- read the man page for ``git-bisect`` +- have fun + +Finding it the old way +---------------------- + +[Sat Mar 2 10:32:33 PST 1996 KERNEL_BUG-HOWTO lm@sgi.com (Larry McVoy)] + +This is how to track down a bug if you know nothing about kernel hacking. +It's a brute force approach but it works pretty well. + +You need: + + - A reproducible bug - it has to happen predictably (sorry) + - All the kernel tar files from a revision that worked to the + revision that doesn't + +You will then do: + + - Rebuild a revision that you believe works, install, and verify that. + - Do a binary search over the kernels to figure out which one + introduced the bug. I.e., suppose 1.3.28 didn't have the bug, but + you know that 1.3.69 does. Pick a kernel in the middle and build + that, like 1.3.50. Build & test; if it works, pick the mid point + between .50 and .69, else the mid point between .28 and .50. + - You'll narrow it down to the kernel that introduced the bug. You + can probably do better than this but it gets tricky. + + - Narrow it down to a subdirectory + + - Copy kernel that works into "test". Let's say that 3.62 works, + but 3.63 doesn't. So you diff -r those two kernels and come + up with a list of directories that changed. For each of those + directories: + + Copy the non-working directory next to the working directory + as "dir.63". + One directory at time, try moving the working directory to + "dir.62" and mv dir.63 dir"time, try:: + + mv dir dir.62 + mv dir.63 dir + find dir -name '*.[oa]' -print | xargs rm -f + + And then rebuild and retest. Assuming that all related + changes were contained in the sub directory, this should + isolate the change to a directory. + + Problems: changes in header files may have occurred; I've + found in my case that they were self explanatory - you may + or may not want to give up when that happens. + + - Narrow it down to a file + + - You can apply the same technique to each file in the directory, + hoping that the changes in that file are self contained. + + - Narrow it down to a routine + + - You can take the old file and the new file and manually create + a merged file that has:: + + #ifdef VER62 + routine() + { + ... + } + #else + routine() + { + ... + } + #endif + + And then walk through that file, one routine at a time and + prefix it with:: + + #define VER62 + /* both routines here */ + #undef VER62 + + Then recompile, retest, move the ifdefs until you find the one + that makes the difference. + +Finally, you take all the info that you have, kernel revisions, bug +description, the extent to which you have narrowed it down, and pass +that off to whomever you believe is the maintainer of that section. +A post to linux.dev.kernel isn't such a bad idea if you've done some +work to narrow it down. + +If you get it down to a routine, you'll probably get a fix in 24 hours. + +My apologies to Linus and the other kernel hackers for describing this +brute force approach, it's hardly what a kernel hacker would do. However, +it does work and it lets non-hackers help fix bugs. And it is cool +because Linux snapshots will let you do this - something that you can't +do with vendor supplied releases. + +Fixing the bug +============== + +Nobody is going to tell you how to fix bugs. Seriously. You need to work it +out. But below are some hints on how to use the tools. + +To debug a kernel, use objdump and look for the hex offset from the crash +output to find the valid line of code/assembler. Without debug symbols, you +will see the assembler code for the routine shown, but if your kernel has +debug symbols the C code will also be available. (Debug symbols can be enabled +in the kernel hacking menu of the menu configuration.) For example:: + + objdump -r -S -l --disassemble net/dccp/ipv4.o + +.. note:: + + You need to be at the top level of the kernel tree for this to pick up + your C files. + +If you don't have access to the code you can also debug on some crash dumps +e.g. crash dump output as shown by Dave Miller:: + + EIP is at ip_queue_xmit+0x14/0x4c0 + ... + Code: 44 24 04 e8 6f 05 00 00 e9 e8 fe ff ff 8d 76 00 8d bc 27 00 00 + 00 00 55 57 56 53 81 ec bc 00 00 00 8b ac 24 d0 00 00 00 8b 5d 08 + <8b> 83 3c 01 00 00 89 44 24 14 8b 45 28 85 c0 89 44 24 18 0f 85 + + Put the bytes into a "foo.s" file like this: + + .text + .globl foo + foo: + .byte .... /* bytes from Code: part of OOPS dump */ + + Compile it with "gcc -c -o foo.o foo.s" then look at the output of + "objdump --disassemble foo.o". + + Output: + + ip_queue_xmit: + push %ebp + push %edi + push %esi + push %ebx + sub $0xbc, %esp + mov 0xd0(%esp), %ebp ! %ebp = arg0 (skb) + mov 0x8(%ebp), %ebx ! %ebx = skb->sk + mov 0x13c(%ebx), %eax ! %eax = inet_sk(sk)->opt + +In addition, you can use GDB to figure out the exact file and line +number of the OOPS from the ``vmlinux`` file. If you have +``CONFIG_DEBUG_INFO`` enabled, you can simply copy the EIP value from the +OOPS:: + + EIP: 0060:[<c021e50e>] Not tainted VLI + +And use GDB to translate that to human-readable form:: + + gdb vmlinux + (gdb) l *0xc021e50e + +If you don't have ``CONFIG_DEBUG_INFO`` enabled, you use the function +offset from the OOPS:: + + EIP is at vt_ioctl+0xda8/0x1482 + +And recompile the kernel with ``CONFIG_DEBUG_INFO`` enabled:: + + make vmlinux + gdb vmlinux + (gdb) p vt_ioctl + (gdb) l *(0x<address of vt_ioctl> + 0xda8) + +or, as one command:: + + (gdb) l *(vt_ioctl + 0xda8) + +If you have a call trace, such as:: + + Call Trace: + [<ffffffff8802c8e9>] :jbd:log_wait_commit+0xa3/0xf5 + [<ffffffff810482d9>] autoremove_wake_function+0x0/0x2e + [<ffffffff8802770b>] :jbd:journal_stop+0x1be/0x1ee + ... + +this shows the problem in the :jbd: module. You can load that module in gdb +and list the relevant code:: + + gdb fs/jbd/jbd.ko + (gdb) p log_wait_commit + (gdb) l *(0x<address> + 0xa3) + +or:: + + (gdb) l *(log_wait_commit + 0xa3) + + +Another very useful option of the Kernel Hacking section in menuconfig is +Debug memory allocations. This will help you see whether data has been +initialised and not set before use etc. To see the values that get assigned +with this look at ``mm/slab.c`` and search for ``POISON_INUSE``. When using +this an Oops will often show the poisoned data instead of zero which is the +default. + +Once you have worked out a fix please submit it upstream. After all open +source is about sharing what you do and don't you want to be recognised for +your genius? + +Please do read :ref:`Documentation/SubmittingPatches <submittingpatches>` +though to help your code get accepted. |