Friday, April 5, 2013

Waiting for a child process with a timeout on Linux

Recently at work we were developing a backend server for a Web app. The server process creates a child process for each request that arrives, then waits for the child to terminate. Since we couldn't wait indefinitely, we needed a wait() function with a timeout. Linux does not have such a function in its wait() family of system calls, so we created a wrapper around the existing system call waitpid(). The wrapper takes an additional boolean output parameter, which is set to true if the wrapper returned because of a timeout and to false otherwise.

It looks something like this:

pid_t waitpid_with_timeout(pid_t pid, int *status, int options, int timeout_period, boolean* timed_out);

The body of the function essentially does this:

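1. Set up a signal handler for SIGALRM which doesn't do anything (we just need the alarm to interrupt the wait) and mask all other signals while the handler runs.
2. Install the handler with sigaction(), without the SA_RESTART flag, so that an interrupted waitpid() is not restarted automatically.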
3. Set the alarm clock by calling the alarm() system call.
4. Call the Linux system call waitpid().
5. If waitpid() returns -1 with errno set to EINTR, our alarm went off, so we set timed_out to true. Otherwise, if waitpid() succeeded, we did not time out: the child process terminated before the period given in timeout_period elapsed.
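
The steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions, not our production code: it uses bool from <stdbool.h> in place of the boolean type in the prototype, the handler name on_alarm is made up for illustration, and timeout_period is assumed to be at least 1 second (alarm(0) cancels an alarm instead of arming one). It also cancels the alarm and restores the previous SIGALRM disposition before returning, which the steps do not mention.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler name. It does nothing: its only job is to make
 * the blocking waitpid() return early with EINTR. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Sketch only. Assumes timeout_period >= 1 (seconds). */
pid_t waitpid_with_timeout(pid_t pid, int *status, int options,
                           int timeout_period, bool *timed_out)
{
    struct sigaction sa, old_sa;
    pid_t result;
    int saved_errno;

    assert(timed_out != NULL);
    *timed_out = false;

    /* Steps 1 and 2: empty handler, all other signals blocked while it
     * runs, and no SA_RESTART so that waitpid() fails with EINTR. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = 0;
    sigfillset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    /* Step 3: arm the alarm. */
    alarm((unsigned int)timeout_period);

    /* Step 4: wait for the child. */
    result = waitpid(pid, status, options);
    saved_errno = errno;

    /* Cancel any pending alarm and put the old SIGALRM action back. */
    alarm(0);
    sigaction(SIGALRM, &old_sa, NULL);

    /* Step 5: EINTR is taken to mean that the alarm went off. */
    if (result == -1 && saved_errno == EINTR)
        *timed_out = true;

    errno = saved_errno;
    return result;
}
```

Like the steps it follows, this sketch treats any EINTR as a timeout, so another handled signal arriving during the wait would look the same. There is also a small window between alarm() and waitpid() in which the alarm could fire before the wait has started.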

After waitpid_with_timeout() returns, we check the timed_out parameter. If timed_out is true, we kill the child process explicitly:

kill(pid, SIGKILL);

Now, everything was all good and dandy with this implementation, until during testing we found out that even though we called waitpid() in the function waitpid_with_timeout(), we did not collect the exit status of the child in the case of a timeout (when we explicitly killed the child with kill()). In that case waitpid() had been interrupted before the child terminated, so nothing ever reaped it. This was the backend of a Web application, so uncollected children were piling up with each request from the browser and they were all becoming zombie processes!

We realized that the solution to this problem was simply another call to waitpid() when the child was explicitly killed with kill(). So when waitpid_with_timeout() returned with timed_out == true, we added another call to waitpid() right after the call to kill():

kill(pid, SIGKILL);
waitpid(pid, &status, 0);
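
For illustration, the two calls could be wrapped in a small helper that also checks for errors. This is a hedged sketch rather than what we actually shipped: the name kill_and_reap and its return convention are made up, and retrying waitpid() on EINTR is an extra precaution that the two lines above do not take.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: kill a timed-out child and collect its exit
 * status so that it does not linger as a zombie.
 * Returns 0 once the child has been reaped, -1 on error. */
int kill_and_reap(pid_t pid, int *status)
{
    pid_t r;

    /* A pid of 0 or below would signal a whole process group (or
     * every process we are allowed to signal), so insist on one child. */
    assert(pid > 0);

    if (kill(pid, SIGKILL) == -1)
        return -1;

    /* Blocking wait: SIGKILL cannot be caught or ignored, so the child
     * terminates. Retry if some other signal interrupts the wait. */
    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);

    return (r == pid) ? 0 : -1;
}
```

After a successful call, WIFSIGNALED(status) is true and WTERMSIG(status) is SIGKILL, unless the child happened to exit on its own just before the signal was delivered.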

This solved our zombie process problem!

There is some interesting discussion of this topic on Stack Overflow if you are interested: http://stackoverflow.com/questions/282176/waitpid-equivalent-with-timeout
