# Investigating GraalPy Performance

First, make sure to build GraalPy with debug symbols.
Running `export CFLAGS=-g` before a fresh `mx build` adds the debug symbol flag to all of our C extension libraries.
When you build a native image, use `find` to locate the `.debug` file somewhere in the `mxbuild` directory tree; it is called something like `libpythonvm.so.debug`.
Put that file next to the `libpythonvm.so` in the Python standalone so that tools can pick it up.
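
A minimal sketch of that workflow, assuming a Linux build (the paths are illustrative and depend on your build configuration):

```bash
# build with debug symbols in the C extension libraries
export CFLAGS=-g
mx build

# after a native-image build, locate the separate debug info file
find . -name 'libpythonvm.so.debug'

# put it next to libpythonvm.so in the Python standalone (illustrative paths)
cp path/to/libpythonvm.so.debug path/to/python-standalone/lib/
```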

## Peak Performance

The [Truffle docs](https://www.graalvm.org/graalvm-as-a-platform/language-implementation-framework/Optimizing/) under graal/truffle/docs/Optimizing.md are a good starting point.
They describe how to get started with the profiler; especially useful is the [flamegraph](https://www.graalvm.org/graalvm-as-a-platform/language-implementation-framework/Profiling/#creating-a-flame-graph-from-cpu-sampler).
This gives you a high-level idea of where time is spent.
Note that currently (GR-58204) profiles of executions that use native extensions may be less accurate.
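
For example, a flamegraph can be produced directly with the Truffle CPU sampler, assuming the `graalpy` launcher and a workload script `foo.py`:

```bash
graalpy --cpusampler --cpusampler.OutputFormat=flamegraph --cpusampler.OutputFile=flamegraph.svg foo.py
```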

In GraalPy's case, the flamegraph is also useful for comparing performance to CPython.
[Py-spy](https://pypi.org/project/py-spy/) is pretty good for that, since it generates a flamegraph that is sufficiently comparable.
Note that `py-spy` is a sampling profiler that accesses CPython internals, so it often does not work on the latest CPython; use a slightly older release.

```
py-spy record -n -r 100 -o pyspy.svg -- python foo.py
```

Once you have identified something that takes way too long on GraalPy compared to CPython, follow the Truffle guide.

When debugging deoptimizations with [IGV](https://www.graalvm.org/tools/igv/), trace deopts as described in the Truffle document linked above and search the output for "JVMCI: installed code name=".
If the name ends with "#2", it is a second-tier compilation.
You might notice a `debugId` or `debug_id` in that output.
That id can be searched for via `id=NUMBER`, `idx=NUMBER`, or `debugId=NUMBER` in IGV's `Search in Nodes` search box, then selecting `Open Search for node NUMBER in Node Searches window`, and then clicking the `Search in following phases` button.
Another useful thing to know is that the `compile_id` matches the `compilationId` in IGV's "Properties" view of the dumped graph.
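
As a rough sketch, compilation graphs can be dumped to a running IGV instance like this (option spellings vary between Graal versions; newer releases use the `jdk.graal.` prefix instead of `graal.`, and `foo.py` is a placeholder):

```bash
# dump Truffle compilation graphs to an IGV instance listening on the default port
graalpy --jvm --vm.Dgraal.Dump=Truffle:1 --vm.Dgraal.PrintGraph=Network foo.py
```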

[Proftool](https://github.com/graalvm/mx/blob/master/README-proftool.md) can also be helpful.
Note that it is not really prepared for language launchers; if it does not work, get the command line from the launcher and build the arguments manually.

## Interpreter Performance

For interpreter performance, async-profiler is a good tool and also offers some visualizations.
The backtrace and flat views are particularly useful.
It only works for JVM executions (not native images).
Download async-profiler and make sure you also have debug symbols in your C extensions.
Use these options:

```
--vm.agentpath:/path/to/async-profiler/lib/libasyncProfiler.so=start,event=cpu,file=profile.html --vm.XX:+UnlockDiagnosticVMOptions --vm.XX:+DebugNonSafepoints
```
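
Putting it together, a JVM-mode run with async-profiler attached might look like this (the async-profiler path and `foo.py` are placeholders):

```bash
graalpy --jvm \
  --vm.agentpath:/path/to/async-profiler/lib/libasyncProfiler.so=start,event=cpu,file=profile.html \
  --vm.XX:+UnlockDiagnosticVMOptions --vm.XX:+DebugNonSafepoints \
  foo.py
```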

Another very useful tool is [gprofng](https://blogs.oracle.com/linux/post/gprofng-the-next-generation-gnu-profiling-tool), which is part of binutils these days.
If you have debug symbols, it works quite well with JVM launchers, since it understands HotSpot frames, but it also works fine with native images.
You might run into a bug with our language launchers: https://sourceware.org/bugzilla/show_bug.cgi?id=32110. The patch attached to that bug report by me (Tim) -- while not entirely correct and not passing their test suite -- lets you view recorded profiles (the bug only manifests when viewing a recorded profile).
What's nice about gprofng is that it can attribute time spent to Java bytecodes, so you can even profile huge methods like the bytecode loops that, for example, the DSL has generated.
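
A sketch of a typical gprofng session; the experiment directory name (here `test.1.er`) is chosen by gprofng, and `foo.py` is a placeholder:

```bash
# record an experiment (creates e.g. test.1.er)
gprofng collect app graalpy foo.py

# show the hottest functions from the recorded experiment
gprofng display text -functions test.1.er
```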

For SVM builds it is very useful to look at Truffle's [HostInlining](https://www.graalvm.org/graalvm-as-a-platform/language-implementation-framework/HostOptimization/) docs and check the debugging section there.
This helps to verify that code is (or is not) inlined as expected.
When I identify something that takes a long time using gprofng, for example, I find it useful to check whether that code is inlined as expected on SVM during the HostInliningPhase.
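
As described in those docs, host inlining decisions can be logged during the native image build, roughly like this; the flag spelling is as I recall it from the HostInlining docs, and the method filter is a hypothetical example:

```bash
# log host inlining decisions for a specific method during the image build
native-image ... \
  -H:Log=HostInliningPhase,~CanonicalizerPhase,~GraphBuilderPhase \
  -H:MethodFilter=SomeBuiltinNode.execute
```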

Supposedly Intel VTune and Oracle Developer Studio work well, but I haven't tried them.

## Memory Usage

Memory usage on the Java heap is best tracked with VisualVM.
For best performance we keep references to long-lived user objects (mostly functions, classes, and modules) directly in the AST nodes when using the default configuration of a single Python context (as is used when running the launcher).
To share warm-up across contexts, and where absolute peak performance is not needed, contexts can be configured with a shared engine; the ASTs are then shared across contexts.
However, that implies we *must* not store any user objects strongly in the ASTs.
Our JUnit tests include checks that no PythonObjects remain alive after a Context is closed.
These checks can be run by themselves, for example, like so:

```bash
mx python-leak-test --lang python \
    --shared-engine \
    --code 'import site, json' \
    --forbidden-class com.oracle.graal.python.builtins.objects.object.PythonObject \
    --keep-dump
```

The `--keep-dump` option will print the heap dump location and leave the file there rather than deleting it.
The dump can then be opened, for example with VisualVM, to check the paths to any leaked objects, if there are any.
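
Outside of the leak test, a heap dump of a running JVM-mode process can also be taken manually and then opened in VisualVM; `<pid>` and the file name are placeholders:

```bash
# take a heap dump of the running GraalPy JVM process
jmap -dump:live,format=b,file=graalpy-heap.hprof <pid>
```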

For native code, use native memory profiling tools.
I have used [`massif`](https://valgrind.org/docs/manual/ms-manual.html) in the past to find allocations and memory issues in native extensions, but be aware of its large overhead.
Once you do find something interesting with `massif`, [`rr`](https://rr-project.org/) is a good option to dive deeper: you can break around the places where massif found allocations and use memory breakpoints together with reverse and forward execution to find where the memory is allocated and released.
This can be useful to identify memory leaks in our C API emulation.
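
A sketch of that workflow; the launcher and script names are placeholders, and both tools add significant overhead:

```bash
# profile native allocations with massif, then inspect the snapshots
valgrind --tool=massif graalpy foo.py
ms_print massif.out.<pid>

# record the same run with rr, then replay it and set (memory) breakpoints
# around the allocation sites that massif pointed at
rr record graalpy foo.py
rr replay
```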