UFTrace - A Linux Profiler
Tags: linux uftrace profiling c++
Found a new profiler for Linux recently that's been pretty nice. Under Linux I generally use callgrind + kcachegrind, which does a good job but takes a massive performance hit due to how valgrind works. The new tool, uftrace ( https://github.com/namhyung/uftrace ), instead uses hooks injected by the compiler.
First, you have to compile your code with either `-pg` or `-finstrument-functions`. Both of them essentially inject calls into your code when functions start and end, allowing other tools to reconstruct your program's flow. In my case, I'm using `-finstrument-functions`, since `-pg` caused a lot of weird issues as soon as inlining or optimizations were turned on. (In Catalyst, this is now enabled by `--buildProfile DebugProfile` or …)
To record a profile, you run `uftrace record [program]` and run through the program as normal. If you didn't install uftrace into the system, use `LD_LIBRARY_PATH` to point it to `libmcount-fast.so`. Now for the neat part: viewing the information.
UFTrace has several ways to display information about a trace. As an example, I'm using `WolfConsoleExample`, which is a simple program that tests outputting things using …
The first method is `uftrace report`, which shows which functions spent the longest time including children (total time). Its output is a very nice table:
```
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================================
   21.483 ms    0.510 us           1  main
   21.482 ms    0.706 us           1  WolfSystem::wolfMain
   21.221 ms    1.615 us           1  WolfSystem::wolfMainInit
   19.971 ms    6.833 us           1  WolfConsoleExample::Application::init
   15.369 ms  151.195 us         308  WolfType::UTF8String::operator +=
   15.042 ms   76.614 us         313  WolfType::UTF8String::reserveIfNeeded
  ...
```
In this case, we can ignore `main` and the `wolfMain` functions since they're container functions, but we can see that appending `UTF8String`s is costing quite a bit of real time.
The second method is `uftrace graph [function]`, which does two things. First, it shows a backtrace of where the function was called from. For example, `uftrace graph 'WolfType::UTF8String::operator +='` gives:
```
backtrace #0: hit 4, time 183.368 us
  main (0x440690)
  WolfSystem::wolfMain (0x4577d0)
  WolfSystem::wolfMainInit (0x457970)
  WolfConsoleExample::Application::init (0x440d70)
  WolfConsole::operator << (0x442210)
  WolfConsole::Tree::toString (0x44c280)
  _GLOBAL__N_1::s_simpleNodeAppendToString (0x44c370)
  _GLOBAL__N_1::s_simpleNodeAppendToString (0x44c370)
  _GLOBAL__N_1::s_simpleNodeAppendToString (0x44c370)
  WolfType::UTF8String::operator += (0x443470)
```
The second thing it does is show a call graph of the function itself, which can help you narrow down where the time inside the function is going.
```
calling functions
=====================================
  15.369 ms : (308) WolfType::UTF8String::operator +=
 227.879 us :  +-(1232) WolfType::StringInterface::byteSize
  49.017 us :  | (1232) WolfType::UTF8StringRef::r_constVoidDataByteSize
            :  |
  14.932 ms :  +-(308) WolfType::UTF8String::reserveIfNeeded
   8.040 ms :  |  +-(308) WolfMemory::Allocator::allocate
   8.003 ms :  |  |  (308) WolfMemory::HeapAllocator::v_allocateMemory
...
```
In this case, most of the cost is coming from `reserveIfNeeded`, which is essentially a wrapper for the allocator (`WolfMemory::Allocator::allocate` in the graph above).
Now lastly, the coolest method (and the reason I was looking at uftrace in the first place). All of these are great, but they're still very command-line; sometimes you need a nice way to browse the data. `uftrace dump` gives you all the raw data so you can manipulate it yourself. But if you do `uftrace dump --chrome > fire.trace` and load it into Google Chrome's `chrome://tracing`, you get:
A nice little chart you can easily navigate around! uftrace also supports a few other output formats, and the command-line tools give you lots of flexibility for extracting information.
So, a few minor issues. The first is having to compile your code in a different mode; neither valgrind nor sampling-based profilers have this problem. The second is the cost of a function call in every single function. In my chart, you can see a long tail of tiny functions at the bottom. A lot of them are things like `r_pointer()` and other zero-cost inline functions. A lot of my engine involves zero-cost safety wrappers around concepts (such as `WolfMemory::Pointer`, which provides null-check safety in debug mode yet costs nothing in release mode). With the forced function call approach that `-finstrument-functions` uses, all of these functions explode in cost due to the callback. I'm sure the callback is optimized to be as fast as possible since it's called for every function call, but when a function is designed to cost literally nothing, anything is a huge cost. While the slowdown doesn't seem to be as big as valgrind's, it skews the profile compared to a sampling approach. Of course, any function that does real work will easily outweigh the profiling cost.
Neither of these are big issues, but they're something to watch out for when using uftrace. It's been pretty useful so far, and it's a nice middle ground between callgrind and Instruments.