Skip to content

Reduce default parallelism of tox test runs #1352

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

wshanks
Copy link
Collaborator

@wshanks wshanks commented Dec 21, 2023

Measurements of the total test time for different parallelism settings showed that total test time increased when QISKIT_PARALLEL was set to TRUE or OMP_NUM_THREADS was set higher than 1 when the stestr concurrency was higher than the number of physical cores. Additionally, no improvement was seen when increasing the stestr concurrency beyond the number of physical cores (for example up to two times that number on a system with 2x multithreading), but also no degradation in performance was observed, so the stestr concurrency was left at the default value (the number of logical CPUs rather than the number of physical CPUs) for simplicity.

Measurements of the total test time for different parallelism settings
showed that total test time increased when `QISKIT_PARALLEL` was set to
`TRUE` or `OMP_NUM_THREADS` was set higher than 1 when the `stestr`
concurrency was higher than the number of physical cores. Additionally,
no improvement was seen when increasing the `stestr` concurrency beyond
the number of physical cores (for example up to two times that number on
a system with 2x multithreading), but also no degradation in performance
was observed, so the `stestr` concurrency was left at the default value
(the number of logical CPUs rather than the number of physical CPUs) for
simplicity.
@wshanks
Copy link
Collaborator Author

wshanks commented Dec 22, 2023

Here are the results from some tests that I did locally. These tests were done on a laptop with an Intel i5-1145G7 processor that has 4 physical cores with 2x hyperthreading running Linux. The O label marks the setting for OMP_NUM_THREADS, the QP marks QISKIT_PARALLEL status, and # marks the concurrency level set for stestr. We see that concurrency cuts down the total test time as we go from 1->2->4 workers, but beyond 4 it levels off. I guess the hyperthreading is pretty good at switching between additional tasks so there is no performance hit but there is also no gain. Allowing parallelism either through OMP_NUM_THREADS(used by numpy) orQISKIT_PARALLELresults in a net loss of performance --stestr` is already using the compute resources efficiently and adding more parallelism just hurts performance.

image

The default on Linux is for QISKIT_PARALLEL to be true (it is false on Windows and Mac) and the default for OMP_NUM_THREADS is to be the number of processors on the system. I think we generally want these turned off so I made that the default in the tox configuration. It looks like this change has reduced the Linux test time from 6 minutes to 5 minutes. Also, the Windows time might have dropped from 7 minutes to 6 minutes but it has more variability so it is hard to say. Since we were already running with OMP_NUM_THREADS=1 on Mac (QISKIT_PARALLEL defaults to false there), we don't expect Mac to change. That is unfortunate since the Mac times are usually the slowest.

I didn't try to reduce the stestr concurrency to the physical core number. Maybe that would be worth doing. I am not sure what the CPUs in the GitHub VMs are actually like or if they actually do hyperthreading. It might be worth playing around with the number though, especially for Mac. While the Mac test times can be 7 minutes, sometimes they are 12 minutes, and I don't know if that variability is something we can influence or is just part of how the Mac systems are hosted/shared between jobs.

@mtreinish
Copy link
Contributor

The other environment variable to consider is RAYON_NUM_THREADS which controls the threads in the rayon threadpool that rustwork and qiskit's rust code uses internally. By default it will use all available CPUs, although for qiskit's usage by default threading and multiprocessing are mutually exclusive by default, you can use QISKIT_FORCE_THREADS to force rust multithreading with python multiprocessing. (These options are documented here https://docs.quantum.ibm.com/start/configure-qiskit-local#environment-variables )

@wshanks
Copy link
Collaborator Author

wshanks commented Dec 26, 2023

Thanks, @mtreinish. Actually in this PR, I did also set RAYON_NUM_THREADS=1 in tox.ini. I knew about it because it was already there in passenv. I just assumed that setting it to 1 would be best for similar reasons to setting OMP_NUM_THREADS, but I didn't actually test it (so it is not in the plot above). I haven't tried to test this, but I suspect that turning off QISKIT_PARALLEL speeds up the experiments tests mostly because it avoids the overhead of spawning a process pool and serializing and deserializing the circuits rather than because it avoids resource contention. I could be wrong about that though, but a lot of the circuits in the experiments tests are very simple -- just a few gates on a single qubit or sometimes two qubits -- so I don't think there is too much processing happening in the transpile step (so I don't think the Rayon threading should help much any way). There are a lot of tests that turn circuit results into a numpy array and run a curve fit on it (that's how most experiments work). I think limiting OMP_NUM_THREADS does help avoid resource contention in that part.

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for testing these parameters! I'm good with merging this and leaving tweaking parameters for Macs as a follow up.

@wshanks wshanks added this pull request to the merge queue Jan 3, 2024
Merged via the queue into qiskit-community:main with commit 441e549 Jan 3, 2024
nkanazawa1989 pushed a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 17, 2024
Measurements of the total test time for different parallelism settings
showed that total test time increased when `QISKIT_PARALLEL` was set to
`TRUE` or `OMP_NUM_THREADS` was set higher than 1 when the `stestr`
concurrency was higher than the number of physical cores. Additionally,
no improvement was seen when increasing the `stestr` concurrency beyond
the number of physical cores (for example up to two times that number on
a system with 2x multithreading), but also no degradation in performance
was observed, so the `stestr` concurrency was left at the default value
(the number of logical CPUs rather than the number of physical CPUs) for
simplicity.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants