I am evaluating performance tools for my own use. If you see a wrong entry, please let me know and I’ll be happy to correct it!
Tool | Counts pthreads? | Counts forks? | Measures heap usage? | Measures stack usage? | Performance effect | Extra notes | MPI compatible? | Needs recompiling? | Portable on Linux? | Cache misses? | CPU time? |
---|---|---|---|---|---|---|---|---|---|---|---|
strace | Yes | Yes | Yes, through `brk` | Yes, through `brk` | Significant: iallreduce jumped from 79.99 µs to 238.795 µs (about 3x) | Yes | No | Yes | No | No | |
perf | Indirectly, by following the threads (similar to custom assembly) | Indirectly, by following the forks (similar to custom assembly) | Yes | Yes, but the program must be compiled with a specific flag. See Stack Traces | Link. It can also measure the number of context switches. Check this link for all capabilities. | Yes | No | No, depends on the CPU registers. | Yes | Yes | |
valgrind | In theory yes, but I couldn’t get it to work in practice | Yes, through the results (number of output headers) | Yes | It reports results, but they were wrong. I tested 110 times, and it couldn’t measure the stack size. Also, the measurements are based on snapshots, so it may simply skip the part where the stack is used. | Significant. | The official documentation says: “However, the simulations are basic and unlikely to reflect the behaviour of a modern machine” | Yes | No | Yes | Yes | Yes |
gdb | Yes, with a script. | Yes, with a script. | Yes | Yes | Yes | Yes | No | No | No | | |
gperftools | No (couldn’t find a way) | Indirectly yes, it creates a file for every fork. | Yes. | No (couldn’t find a way) | It runs a stop-the-world sampler. In other words, it periodically stops the program being profiled to collect information. | libtcmalloc raises an error on large allocations | Yes | Yes | ? | ? | ? |
pmap | See extra notes | See extra notes | See extra notes | See extra notes | See extra notes | I thought this might be useful. However, the problem is that this tool cannot be used from inside the running program, only externally, which makes it impractical. | ? | ? | ? | ? | |
kokkos-tools | No | No info in repo/issues/chatgpt | Didn’t measure malloc. | Didn’t measure local variables. | Very low, according to the repository documentation. | I could only find an example that uses instrumented code. I don’t know if we can make this work with manual code instrumentation. Also, in my experiments, I could only measure the memory of the Kokkos calls, not others. For example, heap memory is not included in the results. | ? | ? | ? | ? | |
Custom Assembly Parser | Yes | Yes | Yes | Yes | None | We can write an assembly parser that virtually executes all instructions and reports the performance bottleneck. However, the results will differ from real-world experiments due to the complex nature of CPUs. (But could we use this if we had a CPU simulator? Or maybe some machine learning? E.g. we generate lots of assembly and train a model that estimates performance from it.) | ? | ? | ? | ? | |
ftrace | Sometimes up to 5x, as stated in this link | ? | ? | ? | ? |
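
On the pmap row: pmap gets its data from `/proc`, so in principle a program can read the same kind of information about itself while it runs. Below is a minimal sketch of that idea, assuming Linux with `/proc` mounted. The function names `parse_status` and `self_status` are made up for illustration. I have not compared these numbers against the other tools in the table, and `VmStk` only covers the stack segment of the main thread, so treat it as a starting point rather than a verified measurement.

```python
from pathlib import Path

# Fields of interest in /proc/<pid>/status (see proc(5)):
#   Threads: number of threads in the process
#   VmData:  size of the data segment, in kB
#   VmStk:   size of the stack segment, in kB
FIELDS = ("Threads", "VmData", "VmStk")


def parse_status(text):
    """Extract the fields listed in FIELDS from status-file text.

    Hypothetical helper: returns a dict mapping field name to an int
    (a plain count for Threads, kB for the Vm* fields).
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            # Values look like "4" or "132 kB"; keep the leading integer.
            result[key] = int(value.split()[0])
    return result


def self_status():
    """Read the status of the current process. Linux-only assumption."""
    return parse_status(Path("/proc/self/status").read_text())


if __name__ == "__main__":
    print(self_status())
```

Calling `self_status()` at a few points inside the program gives snapshots from the inside, without attaching an external tool. Like any snapshot-based approach, it can miss short-lived peaks between calls.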