A bug, how processes work, a history of FDs and C, forking gotchas, race conditions that aren't, and general insanity.

Author: thothonegan
Tags: debugging development wolfrez wolf osx

So I just had a fun bug to track down. Was going to have an early day and just finish up builds for a release, when I hit a bug doing my OSX build. I've had this bug for probably a month, but it was easy to work around so I've been ignoring it. So I decided to fix it, not realizing it'd take 5 hours and the rest of my sanity. This is the true story. No names were changed.

I have a tool called WolfRez, which is a resource compiler. It's kind of like a specialized make: you hand it a directory, and it uses a bunch of rules in the filesystem to run commands on the resources. A lot of the rules just copy files to where they should be, but it will also do things like compress YAML and Wexpr files, crush PNGs, encode WAV files as Opus (including adding song tags), compile lua files, etc.

So for example, this rule tells it to crush all .png files it finds, and the one below compiles .lua files into .luac files:

- inputGlob: "**/Resources/Endless.tek/**.png"
  rule: pngCrush
  # no outputExt: dont change the filename

- inputGlob: "**.lua"
  rule: luaCompile
  notOSTargets: [ android, ios, emscripten ]
  outputRegexpCapture: ".*(?'capture'\\.lua)$"
  outputRegexpReplace: ".luac"

A few months ago, I was tired of its performance not being the best, so I made it multithread its jobs. With the 2500 files currently in Endless's resource directory, and Demon (my computer) having 24 cores, you can easily see it can get a massive speedup. And resources are embarrassingly parallel - no real interdependencies at the moment or anything. So I added it, using WolfTask (which is a nice easy task system), and with a few mutexes/semaphores added, it worked perfectly!

Well it did on Linux and Windows at least.

OSX had this odd thing where occasionally it would hang. The log would say it was finished (or close to finished), but it would never finish. In debug mode or manual builds it'd be fine for some reason, but when I did my release builds (which are pretty much automated because I'm lazy) it would almost always hang. So of course my first thought was a race condition. Some task isn't reporting its completion properly, or two are stomping on each other somewhere.

And if I ran it again (with 95% of the resources already compiled from the failed run), it'd finish the last few perfectly fine. Weird.

So first, let's pull out our trusty Helgrind! Helgrind is a tool for Valgrind which lets it detect race conditions. Helgrind is really slow, but I have my laptop, so let's run it on Linux while we're looking on OSX. Maybe it'll point out the race condition and it'll be easy!

Hint: Nope. Took about an hour, pointed out some races between console statements (which aren't a problem in practice since they're in error handling), but no easy race problems under Linux. Damn. Must be platform specific.

So, time for debugging! With my trusty console statements, I confirmed that the TaskGroup all the tasks were a part of was getting all of them added, yet it was never being completed. Welp, let's start throwing statements inside TaskGroup. It's old code, it probably fell down, right?

An hour later: nope. It's old code, but that's because it's good code.

Instead it seems like the handler isn't ever finishing. Errr. what?

auto task = taskSource->createTask([iter, &outputs, &failedDuringBuilding](WolfTask::Task* /*task*/)
{
    auto output = iter.value;

    BuildProgress progress;
    progress.currentOutput = iter.index;
    progress.numberOfOutputs = outputs.count();

    // code made it here... but never left buildIfNeeded()
    if (!output->buildIfNeeded(progress))
    {
        console.error() << "Error when building" << WolfConsole::EndLine;
        failedDuringBuilding = true;
    }
});

Okkkkk, so it's a bug inside buildIfNeeded() that only happens on OSX RelWithDebInfo, sometimes. And causes hangs. And was never an issue before. Weird, but ok - start tracing through that. At this point I had like 600 console statements outputting task pointers and stuff, with a lot of it interleaving (because console is not thread safe, on purpose). I finally narrowed it down to basically two kinds of tasks that seem to not complete: luac tasks (compiles lua files into binary form), and pngcrush tasks (tries to make png files as small as possible). No pattern at all of which specific job though.

So let's do some tracing within the pngcrush rule!

WolfType::Bool PNGCrushRule::runWithAction (Action* action, const WolfYAML::Node& customProperties)
{
    // ... elided : set up params, etc ...

    // code makes it fine here
    auto res = RuleHelpers::runProgramWithArguments(
        // ... elided ...
    );
    // and this never returns.

    return res;
}

Ok, so it's blaming runProgramWithArguments(). That's just a wrapper around WolfSystem::Process. And WolfSystem::Process is code that's been used practically everywhere for a long time, so it's gotta be good, right? Well, that's where the bug says it is, so down another rabbit hole! Can't make any assumptions about what's good code or not.

So, Process is a wrapper around running an external program. 99% of programs (including WolfRez) just use the nice wrapper function around it, which looks like this:

WolfType::Vector<WolfType::String> stdoutVec;
WolfType::Vector<WolfType::String> stderrVec;
WolfSystem::ExitStatusCode returnValue = WolfSystem::ExitStatusCodeSuccess;

// Run the given program, with the given arguments, and store its output
// in our vectors.
auto res = WolfSystem::Process::runTillFinished(
    // ... program and arguments elided ...
    &stdoutVec, &stderrVec, &returnValue
);

Pretty simple, right? You even get the output back easily. There's a lot more control if you need it to run async or whatever, but this will just run till the program finishes, then hand back what it got.

Since I think it's a multithreaded issue, let's add a mutex right here! Make it so only one thread at a time can run a process. Slow as hell, but it'd show if it's a race issue, right? Welp, added it, it slowed everything by 20x, and while it didn't happen as often, the hang still happened. Sigh, time to go deeper.

Let's talk about how Process works! Unfortunately, most OS level things for running processes don't work the nice simple way the above API does. Most programmers will know about system(), which lets you run a program, but you can't get the output from it. And that's about it for the C standard's way of running programs! Amazing, isn't it. C++ has a technical group, I think, working on a better interface, but for now it's OS specific. So you need to go even lower, to POSIX/WinAPI. Here's how processes work on UNIX (Windows is similar, but has some core differences: if you want me to explain it, send me a msg).

Let's say you want to start a new process with its file descriptors (FDs) (such as stdout/stdin) redirected so your program can read them. Here is the list of things you need to do:

  • First, you need to create the new process. In POSIX, you use fork() to do this. Fork clones the current process, and tells you if you're the parent or the child. Congratulations on your new process!
  • Second, you need to redirect your FDs. You use something like dup2() to make copies. In Wolf's Process, we use pipes to talk between the parent [application] and child [what will become the program]. Importantly, dup2 needs file numbers, so you can use something like fileno() to get them.
  • Lastly, you need to run one of the exec functions (such as execve()). This replaces the current process with the new process image at the given location. If you know of Julie-Su (my IRC bot) that could restart herself, this is how she did it.

So in other terms: its like making a clone of yourself, making sure it knows how to talk, then wiping its memory with whatever you want it to do. Programming is weird.

So now you have a child process running the program you want, along with a pipe to your main process to communicate. In the case of runTillFinished, you then just sit on that pipe select()ing, recording any data from stdout/stderr, and waiting for the pipes to close. Once they're closed, you know the program finished and can return all that juicy information.

Back to the bug: I started adding more console statements within WolfSystem::Process. You can't put many debug things within the child process after the fork, since you get rid of your FDs, but the parent process basically ends up just waiting on select() for the child process to say something. Cue spending 30 minutes playing with arguments to select's timeout, which seemed to affect whether it hung or not, but that was ultimately a red herring. The problem is the child never sends messages to its parent, but it also never closed the pipes even though it finished. Err.. it did finish, right?

While I was in the middle of recompiling for the 30 billionth time, I decided to set up a small script running pstree on the process. pstree is just a simple version of ps which shows a process and its children as a nice tree. On Linux, there's no distinction between processes and threads (threads are somewhat known as lightweight processes), so you can easily watch your program to see how it's threading or spawning processes. So open another terminal, and let's do:

while true; do clear; pstree PID; sleep 1; done

Now on OSX, threads aren't processes, so they don't show up. But as soon as you fork() (which creates a new process), it shows up in pstree. And even nicer, when you exec, it shows the new commandline in pstree! Neat! In retrospect this makes sense, since fork()/exec() is how shells work too.

Within pstree though there were two interesting things that showed up which made me curious.

  • First, zombie processes show up like (pngcrush). A zombie process is a process which has exited, but hasn't been cleaned up by its parent yet. You'd mostly see them for a split second before they get reaped (yes, those are technical terms). However, 1 or 2 of the stuck tasks were showing up as zombies.
  • Second, I had quite a few child processes which were still named 'WolfRez'. This should only happen for the few microseconds between the fork() (creating the process) and execve() (replacing it with the new program) calls. It definitely should not be hanging in that state; there's literally 5 lines of code there, and no loops or anything for it to get stuck in.

What the heck is going on? So let's pull out our debugger (and not in the normal mode). See, if you're debugging a normal process, it's easy. You start up your program, and can access all its threads and so on. But a debugger generally doesn't let you get to child processes (or more likely, I don't know how to do it in LLDB). Luckily though, like all good debuggers, we can attach to arbitrary processes! So let's try attaching to things.

Attaching to pngcrush doesn't work at all. Mostly because it's a zombie [yes, that took me a while to figure out]. Attaching to the WolfRez child processes works fine though, and you can see the entire call stack, even pre-fork! (Since fork() copies everything, it even has its old callstack.) However, the top is a bit odd - it's waiting inside a mutex, called from the dup2() lines.

if (dup2(m_stdoutPipes[1], fileno(stdout)) == InvalidPipe)
    std::cout << "[Process::execAndPipe] Unable to dup stdout" << std::endl;

Now, dup2() is used to duplicate file descriptors. If you want to get the output from the process (say STDOUT, also known as FD1), you need some way to feed it back to your parent program. The easiest way is to create a pipe. The parent program has one end (call it FD4) and the child has the other (call it FD5). Output from one end moves to the other end, and bam: easy communication! But how do we trick the program into writing its output to FD5 instead of FD1? That's where dup2() comes in. dup2() copies a file descriptor into another FD, making both of them act the same. So you can duplicate FD5 into FD1, and now anything written to FD1 is the same as writing to FD5! The program writes to FD1, it ends up in FD5, goes to FD4, and we're good. Goes over our pipe to the parent, allowing storing in the vector, yadda, yadda.

So let's talk a bit more about forking [I'm sure you're excited about Yet Another Tangent]. See, forking and multithreading don't get along. At all. Forking makes a copy of the process's data space, but ONLY the current thread's callstack. All other threads just disappear in the new space. It's kind of like a camera that takes a normal picture, except it will only show one person (the active one). Everyone else was never there: if they dug a hole, the hole is still there, but that's it. So what's bad about this? Consider this: you're a good programmer (cause of course you are!), and you make things thread safe. Therefore, you have mutexes, critical sections, semaphores, and other constructs to make sure multiple threads don't mess with stuff at the same time. So: a thread enters a mutex, locks it, then BAM: another thread forks. Data is copied, but only the thread that forked exists in the new process. What happens to that mutex? Well, it's still locked! And the thread that locked it can't unlock it, because it never existed [in the new process's view at least]. So if your new process now tries to use anything that needs that mutex, welp, you're screwed: the lock is frozen shut. Meet Mr. Deadlock.

Sounds like fun, eh? So there are some pretty big warnings about not calling things between fork() and execve() if you're multithreaded. You can't guarantee the state of anything at all, because so many things are trying to be thread-safe. Even simple things like malloc() or printf() might use a global lock somewhere, and it could get locked. So maybe it's dup2()'s fault! After all, it doesn't say it's async-signal safe (and the man pages say async-signal-safe functions are the only ones guaranteed to work there). But that dup2() pattern is used by a lot of better programmers than I, so either:

  • they have the same issue and never ran into it
  • its something else.

So after another half hour of messing with things, I finally looked at the LLDB trace a bit closer for the child process.

The mutex being locked is within fileno(). .....What?

fileno() should be a pretty brain dead function. It gets the FD number for a FILE* pointer. See, UNIX has always used integers for FDs, such as FD1 being stdout. However, when it came time to make C portable (especially to Windows), it needed to hide details like that. So C uses FILE* (the same thing functions like fopen() return) as its main file handle. stdout, for example, is a FILE* to remain portable. Luckily, POSIX has fileno() to convert from the C form to the actual int, which other POSIX functions like dup2() want. So fileno(stdout) is an easy way to get the number 1 (since stdout is always file descriptor 1).

Here's glibc's version of the function for an example:

// this code block is GPL, from glibc-2.25.
// not that there's any reason to copy it.

#define _IO_fileno(FP) ((FP)->_fileno)

int
__fileno (_IO_FILE *fp)
{
  CHECK_FILE (fp, EOF);

  if (!(fp->_flags & _IO_IS_FILEBUF) || _IO_fileno (fp) < 0)
    {
      __set_errno (EBADF);
      return -1;
    }

  return _IO_fileno (fp);
}

Simple! It just accesses the struct backing 'fp', and grabs the fileno from it (with error handling and safety things around it).

Well, until you find out that on OSX, fileno() appears to take a mutex. I'm not sure if it's a global mutex or one tied to the FD's backing data, but either way it locks a mutex. The same mutex that output uses. A mutex which quite possibly is locked because, before the fork, another thread was outputting progress information or rules to the screen.

So the sequence of events causing this was basically:

  • WolfRez runs, starts up multiple tasks [24 on this computer].
  • One thread outputs to the screen its normal messages. This takes an internal mutex by the OS/standard library.
  • Another thread starts running a task requiring it run an external program (say luac or pngcrush).
  • That causes WolfSystem::Process to run, causing a fork(), freezing the mutex closed.
  • Post fork, the process tries to duplicate stdout. To do that, it needs the number so it calls fileno().
  • fileno() tries to take the lock that was taken prior to the fork.
  • And since the fork happened, the lock is frozen and will never be unlocked. Deadlock.


So the fix? POSIX also defines the simple constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO, which are the explicit numbers (0, 1, and 2) for the standard streams. Which is all that was needed in the first place. So it's an easy change: replace fileno(stdout) with STDOUT_FILENO. Rerun everything, and it's all perfectly happy now.

And that, my friends, is why you're not supposed to do anything complex at all between a fork() and its execve(). You get to spend 5 hours with Xcode, lldb, pstree, and man pages. You get to question your abilities and what the hell is happening, and try to remind yourself that computers are supposedly logical. And you get to listen to all of She's albums over and over and over. On the plus side though, I learned a TON about how lldb works, and way too much about how OSX works. Now to forget it all, because after all: programming is about solving problems, then having them nicely wrapped up so you can forget it all.

Programming in a nutshell : 5 hours of work for 5 seconds of typing to fix it. And the fix was something you should have done in the first place.

Now for some 酒 and some massively needed sleep.