I'm addressing a couple of issues:
1. There is a race condition in the handling of dump requests and the handling
of client process termination. It appears to have been introduced by the
following change: https://breakpad.appspot.com/365001 around the end of March
2012. The race condition is the following:
- A crash client requests an out of process dump to be generated by the crash
server. The HandleDumpRequest callback in the crash server is executed.
- At the same time (or shortly after that), the crash client's process
terminates (user kills the crashed/hung process maybe). The OnClientEnd
callback is then executed in the crash server.
- The OnClientEnd callback unregisters the dump request wait operation, but
it doesn't wait for any dump requests that are being currently processed to
finish and proceeds with the destruction of the ClientInfo. What is interesting
to notice is that the ClientInfo destructor actually attempts to wait for any
callbacks that might be running to finish, but this attempt is unsuccessful
because the wait handles were already set to NULL by UnregisterWaits (the recent
change mentioned above).
- HandleDumpRequest attempts to read the ClientInfo and crashes (because it
was already deleted by the OnClientEnd callback)
This can easily be avoided by having the OnClientEnd callback wait for any
running dump requests to complete before proceeding with the ClientInfo
destruction.
2. The critical section that protects the access to the clients_ list is used
for too many things (not just protection of the clients_ list), which leads to
excessive locking and complicates further the design of the crash generation
server. Moreover it is an obstacle for the simplification and proper fix of
some of the issues described bellow
3. The client process exit handler (OnClientExit) implementation is overly
complicated and it is currently broken. The complicated part is that the actual
removal of the ClientInfo from the list and its destruction are done on another
thread. I'm assuming that this was done in order to avoid entering the critical
section from the callback and so avoid certain class of deadlocks (trying to
unregister waits and wait for all callbacks that might still be running to
finish while inside the critical section). Now in order to keep track of these
cleanup tasks that are spawned, an internal counter was apparently introduced
and the shutdown thread waits for this counter to go down to 0 before deleting
the clients_ list. And this is the broken part. Two issues. First, the
shutdown thread is only allowed to wait for 10 seconds (I'm presuming this is
some kind of protection against deadlocks). If the wait times out (after 10
seconds) the shutdown thread proceeds with the cleanup regardless of any pending
operations - this will most likely cause a crash. Second, and the bigger
problem actually, is that the counter value can be wrong. It can be wrong
because the shutdown thread simply unregisters the waits and then proceeds to
check the counter value without waiting for any running callbacks to finish.
This is flawed because you may still have callbacks that are running and are
about to update the counter and unless you wait for all of them to finish,
counter value of 0 doesn't mean anything. This is all gone now. We don't
spawns new threads from the callback anymore and the infamous counter is cleaned
up.
4. Another possible deadlock that might have forced the team to move part of the
cleanup code to a different thread has nothing to do with the critical section.
The problem is that the ClientInfo destructor itself unregisters all wait
operations and as part of that it waits for all running callbacks to finish -
now if you call the desctructor from such a callback the thread will deadlock on
itself. This can easily be avoided by having the OnClientExit callback
unregister itself (without waiting) prior to calling the destructor. Since, at
most one OnClientExit callback for a given ClientInfo can be running at a time,
if you are the one running then you are guaranteed that there are no others, so
no need to wait. In fact if you wait for yourself to finish you'll deadlock.
To avoid this deadlock, I can see that an UnregisterWaits call (non-blocking
unregistration) was added to the OnClientExit callback - well maybe it was added
for other reasons - it is not clear from the description
(https://breakpad.appspot.com/365001), but this opened the race condition that I
already described (in the very first bullet) which is the source of recent
CrashGenerationServer crashes.
5. I see improper closing of remote handles (object handles owned by other
processes). CloseHandle cannot do this. DuplicateHandle must be used for this.
Closing a handle that belongs to a different process (the handle ID is invalid
in the context of the current process) using CloseHandle can lead to
unpredictable results.
6. Beyond that, I saw a compilation issue in the crash generation server
unittests (specific for for Windows only). It has to do with incompatibility
between gtest and the Microsoft compiler. I was able to work around this issue,
however we may need to upgrade to the latest gtest version. I'll open a bug for
it and hopefully someone with more knowledge around gtest upgrades will follow
up.
Thanks,
-Ivan
This seems reasonable. http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/crash_generation_server.cc File client/windows/crash_generation/crash_generation_server.cc (right): http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/crash_generation_server.cc#newcode721 client/windows/crash_generation/crash_generation_server.cc:721: // If the server is shutting ...
I think you must have added me because of some changes I made here quite a while
ago. I don't really have any recollection of this codebase, so unless you have
some specific questions about my CLs I'll remove myself from the thread.
On 2012/08/13 13:39:50, Erik Wright wrote:
> I think you must have added me because of some changes I made here quite a
while
> ago. I don't really have any recollection of this codebase, so unless you have
> some specific questions about my CLs I'll remove myself from the thread.
I browsed the file history and compiled a list of possible file owners which I
added to the code review. Feel free to remove yourself if you are not familiar
with this part of the code.
Thanks,
-Ivan
The long comment is now fixed. http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/crash_generation_server.cc File client/windows/crash_generation/crash_generation_server.cc (right): http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/crash_generation_server.cc#newcode721 client/windows/crash_generation/crash_generation_server.cc:721: // If the ...
LGTM http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/client_info.cc File client/windows/crash_generation/client_info.cc (right): http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/client_info.cc#newcode88 client/windows/crash_generation/client_info.cc:88: void ClientInfo::UnregisterDumpRequestWaitAndBlockUntilNoPending() { Can we come up with ...
Publishing draft comments ...
http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/cli...
File client/windows/crash_generation/client_info.cc (right):
http://breakpad.appspot.com/430002/diff/1/client/windows/crash_generation/cli...
client/windows/crash_generation/client_info.cc:88: void
ClientInfo::UnregisterDumpRequestWaitAndBlockUntilNoPending() {
On 2012/08/13 19:17:03, doshimun wrote:
> Can we come up with some shorter name?
I changed this function name a couple of times before settling on the current
name. The shortest I was able to come up with was
UnregisterDumpRequestWaitBlocking, but the current name makes better sense, I
think.
Thanks,
-Ivan
Issue 430002: Fixing a race condition in CrashGenerationServer among other issues
(Closed)
Created 12 years, 3 months ago by Ivan Penkov
Modified 12 years, 3 months ago
Reviewers: Mark Mentovai, Cris Neckar, Ted Mielczarek, doshimun
Base URL: http://google-breakpad.googlecode.com/svn/trunk/src/
Comments: 4