Scientia potentia est

Performance Tools Comparison Table

I am evaluating performance tools for my own use. If you spot a wrong entry, please let me know; I’ll be happy to correct it!

Tool	Number of pthreads?	Number of forks?	Heap usage?	Stack usage?	Performance effect	Extra notes	MPI compatibility	Needs recompiling?	Portable on Linux?	Cache misses?	CPU time?
strace	Yes	Yes	Yes, through `brk`	Yes, through `brk`	Significant. iallreduce jumped from 79.99 µs to 238.795 µs (about 3x)		Yes	No	Yes	No	No
perf	Indirectly, by following the threads (similar to the custom assembly parser)	Indirectly, by following the forks (similar to the custom assembly parser)	Yes	Yes, but the program needs to be compiled with a specific flag. See Stack Traces		Link. It can also measure the number of context switches. Check this link for all capabilities.	Yes	No	No, depends on the CPU registers.	Yes	Yes
valgrind	In theory yes, but I couldn’t achieve it in practice	Yes, through the results (number of output headers)	Yes	It reports results, but they were wrong. I tested 110 times, and it couldn’t measure the stack size. Also, the measurements are based on snapshots, so maybe it just skips the part where the stack is used.	Significant.	The official documentation says: “However, the simulations are basic and unlikely to reflect the behaviour of a modern machine”	Yes	No	Yes	Yes	Yes
gdb	Yes, with a script.	Yes, with a script.	Yes	Yes			Yes	Yes	No	No	No
gperftools	No (couldn’t find)	Indirectly yes: it creates files for every fork.	Yes.	No (couldn’t find)	It runs a stop-the-world sampler: it periodically stops the program being profiled to collect information.	libtcmalloc raises an error with large allocations	Yes	Yes	?	?	?
pmap	See extra notes	See extra notes	See extra notes	See extra notes	See extra notes	I thought this might be useful. However, the problem is that we cannot use this tool from within the running program, only externally. Therefore, it becomes impractical.		?	?	?	?
kokkos-tools	No	No info in the repo, issues, or ChatGPT	Didn’t measure malloc.	Didn’t measure local variables.	Very low, according to the repository documentation.	I could only find an example that uses instrumented code. I don’t know if we can make this work with manual code instrumentation. Also, in my experiments, I could only measure the memory of the Kokkos calls, not others. For example, heap memory is not included in the results.		?	?	?	?
Custom Assembly Parser	Yes	Yes	Yes	Yes	None	We can write an assembly parser that virtually executes all instructions and reports the performance bottleneck. However, the results will differ from real-world experiments due to the complex nature of CPUs. (But could we use this if we had a CPU simulator? Or maybe some machine learning? E.g. we generate lots of assembly and train a model that estimates the performance from it?)		?	?	?	?
ftrace					Sometimes up to 5x, as stated in this link		?	?	?	?
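
The strace row (threads, forks, heap through `brk`) can be turned into numbers with a small script. Below is a minimal sketch in Python, assuming the trace was saved as plain text, for example with `strace -f -o trace.log ./program` (the file and program names are placeholders). The function name `summarize_strace` and the keys of the returned dictionary are my own choices for illustration, not part of strace. It counts successful `clone`/`clone3` calls carrying `CLONE_THREAD` as threads, other successful `clone`/`fork`/`vfork` calls as forks, and tracks how far each process moved its program break. Calls that strace splits into `<unfinished ...>` and `resumed` lines are skipped, so the counts are a lower bound, and memory obtained through `mmap` is not included.

```python
import re

# One complete strace line: optional "[pid N] " or "N " prefix (added by -f),
# then name(args) = return_value. Split (unfinished/resumed) calls do not match.
LINE_RE = re.compile(
    r"^(?:\[pid\s+(?P<pid>\d+)\]\s+|(?P<pid2>\d+)\s+)?"
    r"(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+)"
)


def summarize_strace(lines):
    """Count thread/process creations and brk growth in strace output lines."""
    threads = 0
    forks = 0
    brk_seen = {}  # pid -> (first observed break, highest observed break)

    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # signals, exit markers, split calls, blank lines
        pid = m.group("pid") or m.group("pid2") or "main"
        name, args, ret = m.group("name"), m.group("args"), m.group("ret")
        if ret.startswith("-"):
            continue  # the syscall failed (e.g. "-1 EAGAIN")

        if name in ("clone", "clone3"):
            # Assumption: CLONE_THREAD marks a thread; anything else is
            # treated as a new process.
            if "CLONE_THREAD" in args:
                threads += 1
            else:
                forks += 1
        elif name in ("fork", "vfork"):
            forks += 1
        elif name == "brk" and ret.startswith("0x"):
            addr = int(ret, 16)
            first, highest = brk_seen.get(pid, (addr, addr))
            brk_seen[pid] = (first, max(highest, addr))

    # Growth is relative to the first break value seen for each pid.
    growth = {pid: high - first for pid, (first, high) in brk_seen.items()}
    return {"threads": threads, "forks": forks, "brk_growth_bytes": growth}
```

To use it on a saved trace, something like `summarize_strace(open("trace.log"))` should work, since the function only needs an iterable of lines. The per-pid dictionary keeps the heap of a forked child separate from that of the parent.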
