DescriptionIt appears that breakpad has trouble walking certain stacks on the x86 platform where the function on the top of the stack (the context frame) represents a Windows system call (stub) and it is compiled with FPO (frame-pointer-omission) optimization. The main reason that only certain CPU models are affected is that the OS call dispatching mechanism differs from CPU to CPU. For more details on why and how it differs take a look at the following blog post: http://www.dumpanalysis.org/blog/index.php/2008/01/10/what-is-kifastsystemcal... Now, it is not surprising that stacks which contain FPO frames are not walkable in the general case. However we are dealing with a very specific (and quite common) case where the function on the top of the stack represents a Windows system call stub (the thread is in kernel mode) where the stack configuration is well understood. First, let’s look at the general case. The following blog post makes a very good case on why such stacks (__stdcall and FPO involved) are not walkable in the general case even in the presence of full debugging information: http://www.virtualdub.org/blog/pivot/entry.php?id=148 To summarize, the ESP-to-return-address offset normally changes while executing a function which touches ESP (contains push, pop, call, ... instructions). And unfortunately, on the x86 platform, it is not possible to know this offset for each instruction because such information is not present in the debugging information. So the hope is basically that you won’t crash while preparing a function call (ESP is modified). It is indeed relatively rare for a function to crash while preparing a call (after a push instruction and before the call is executed). And once a call is executed a new frame is established so we don’t need to worry about it. We only have this issue when crashing before the call (and after at least one push). 00000000: 8B 01 mov eax,dword ptr [ecx] ← crashing here is OK (offset is known) 00000002: 6A 02 push 2 ← changes ESP 00000004: 6A 01 push 1 ← changes ESP 00000006: FF D0 call eax ← crashing here is problematic because ESP is changed 00000008: FF D0 call eax 0000000A: C3 ret ← offset is known again Now let’s focus on our issue. In our case, stack walking is affected for threads which are suspended or running in kernel mode - not for the crashing threads (although if an FPO optimized function crashes while preparing a function call we’ll most likely fail to walk the stack as well). The reason stack walking is affected is because ESP is changed due to a call instruction, but no new frame is established (at least not on the user mode stack) because this is a system call and for certain CPU models the system call dispatch mechanism doesn’t cvreate a new frame. Since our function on the top of the stack (ntdll!NtWaitForSingleObject) has no locals, ESP is expected to point exactly at the return address (should point to KERNELBASE!WaitForSingleObjectEx). However the call instruction has modified ESP (stored a new address address on the top of the stack) without actually establishing a new frame (the context record makes the fake impression that the function call has just returned) and ESP points to a return address which equals the current instruction pointer (EIP) stored in the user mode thread context: ntdll!NtWaitForSingleObject: 771ef8ac b801000000 mov eax,1 771ef8b1 b90d000000 mov ecx,0Dh 771ef8b6 8d542404 lea edx,[esp+4] 771ef8ba 64ff15c0000000 call dword ptr fs:[0C0h] ← ESP is decremented by 4 771ef8c1 83c404 add esp,4 ← EIP in the context points here This is the instruction that will get executed when the thread is resumed 771ef8c4 c20c00 ret 0Ch 0:000> r eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=000017b0 edi=002ff2d8 eip=771ef8c1 esp=002ff290 ebp=002ff2fc iopl=0 nv up ei pl zr na pe nc cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00200246 ntdll!NtWaitForSingleObject+0x15: 771ef8c1 83c404 add esp 0:000> dps @esp 002ff290 771ef8c1 ntdll!NtWaitForSingleObject+0x15 002ff294 75fa0a91 KERNELBASE!WaitForSingleObjectEx+0x98 ← return value for the top frame 002ff298 000017b0 002ff29c 00000000 002ff2a0 002ff2d8 ← second scan. Looking for the return value for frame 1 002ff2a4 88014a2e ← 002ff2a8 002ff364 ... 002ff2ac 000017b0 002ff2b0 00000000 002ff2b4 00000024 002ff2b8 00000001 002ff2bc 00000000 002ff2c0 00000000 002ff2c4 00000000 002ff2c8 00000000 002ff2cc 00000000 002ff2d0 00000000 002ff2d4 00000000 002ff2d8 9e3b9800 002ff2dc fffffff7 002ff2e0 00000000 002ff2e4 002ff2a4 002ff2e8 64a07ff1 chrome_647f0000!base::internal::Invoker1<0,base::internal::InvokerStorage1<void (__thiscall TransportSecurityPersister::Loader::*)(void),base::internal::UnretainedWrapper<TransportSecurityPersister::Loader> >,void (__thiscall TransportSecurityPersister::Loader::*)(void)>::DoInvoke [d:\b\build\slave\chrome-official\build\src\base\bind_internal.h @ 594] ← garbage 002ff2ec 002ff8dc 002ff2f0 75fc6590 KERNELBASE!_except_handler4 002ff2f4 fdd2c6ea 002ff2f8 00000000 002ff2fc 002ff314 ← EBP chain 002ff300 75741194 kernel32!WaitForSingleObjectExImplementation+0x75 ← next return value 002ff304 000017b0 002ff308 0036ee80 002ff30c 00000000 As a result, breakpad doesn’t see the real return value and it is confused that the return value is in the same function (a recursive call). Then, since it has to deal with the same function again (in frame 1) which we already mentioned it is compiled with FPO, breakpad fails to find the return value (high trust option) and fails back to scanning the stack for a suitable return value (low trust option) and detects random garbage. Once a non-FPO frame is detected (being it garbage or not) the EBP frame chain is followed, so the rest of the stack looks reasonable (with the exception of the second frame which is always lost). Here is an example: 0x771ef8c1 [ntdll.dll] + 0x0001f8c1] ZwWaitForSingleObject 0x771ef8c0 [ntdll.dll] + 0x0001f8c0] ZwWaitForSingleObject 0x64a07ff0 [chrome.dll] - ...] TransportSecurityPersister::SerializeData(std::basic_string<char,std::char_traits<char>,std::allocator<char> > *) 0x75741193 [kernel32.dll] + 0x00011193] WaitForSingleObjectExImplementation 0x75741147 [kernel32.dll] + 0x00011147] WaitForSingleObject 0x6491ecdf [chrome.dll] - ...] base::WaitableEvent::TimedWait(base::TimeDelta const &) [..snip..] Expected output if the following (from WinDBG): ChildEBP RetAddr 002ff290 75fa0a91 ntdll!NtWaitForSingleObject+0x15 002ff2fc 75741194 KERNELBASE!WaitForSingleObjectEx+0x98 002ff314 75741148 kernel32!WaitForSingleObjectExImplementation+0x75 002ff328 6491ece0 kernel32!WaitForSingleObject+0x12 002ff33c 64a2d039 chrome_647f0000!base::WaitableEvent::TimedWait+0x2d [..snip..] It appears that this is a very common pattern in our current crash reports. The pattern can be recognized as follows: - The top two frames look like a recursive call done by a system function (very weird indeed) - The real return address from the system function (second frame) is missing - The third frame is very often random crap (execution residue from the stack) - In most cases, the call stack below the fourth frame looks good. In some cases, we can’t recover and the stack is truncated Here is how we can fix it: 1. We need to recognize frames created by the system call dispatch mechanism. 2. We need to correct ESP and get a proper return address. I isolated a simple repro of the issue, created a new unit test for it, and validated that my fix works. Thanks, -Ivan Patch Set 1 #
Total comments: 2
Patch Set 2 : I'm breaking this change into three different changes #
Total comments: 15
Patch Set 3 : Incorporating changes from the code review #
Total comments: 4
Patch Set 4 : Getting rid of the changes to the serialization code #
Total comments: 12
Patch Set 5 : Incorporating changes from code review #
Total comments: 14
Patch Set 6 : Incorporating code review changes #
Total comments: 1
Patch Set 7 : Incorporating changes from code review. Encapsulating 'type_' inside WindowsFrameInfo #
Total comments: 1
Patch Set 8 : Incorporating code review changes - fixing indent #
MessagesTotal messages: 31
|