Frequent Segmentation Fault in Distributed setting #610
Unanswered
RefatIsmail96 asked this question in Q&A
Replies: 0 comments
Affects: PythonCall
Describe the bug
Hi, I am running a parallel computation on a cluster that uses Slurm. I use the SlurmClusterManager package to launch the Julia processes, and each process calls Python libraries (mainly Stim and mwpf). I frequently get a segmentation fault with a very long error message. I cannot reproduce the issue locally with the Distributed package; it only happens when I run the parallel computation on the cluster.
Here's part of the error message:
"srun: error: nid004513: task 236: Segmentation fault
srun: Terminating StepId=38285639.0
slurmstepd: error: *** STEP 38285639.0 ON nid004513 CANCELLED AT 2025-05-02T17:03:53 ***
srun: error: nid004513: tasks 10,26,46,78,142,184,199,225,244: Terminated
srun: error: nid004513: tasks 32,213: Terminated
srun: error: nid004513: tasks 9,11,14-15,25,31,40,143: Terminated
[954040] signal 11 (1): Segmentation fault
in expression starting at none:1
pymalloc_alloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:1544 [inlined]
_PyObject_Malloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:1564 [inlined]
PyObject_Malloc at /usr/local/src/conda/python-3.12.10/Objects/obmalloc.c:801 [inlined]
PyLong_FromMedium at /usr/local/src/conda/python-3.12.10/Objects/longobject.c:210 [inlined]
PyLong_FromLong at /usr/local/src/conda/python-3.12.10/Objects/longobject.c:306
PyLong_FromLongLong at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/C/pointers.jl:303 [inlined]
pyint at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:719
Py at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:144 [inlined]
pytuple_setitem at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:897
unknown function (ip: 0x7f9cfef57dd3)
unknown function (ip: 0x7f9cfef57999)
unknown function (ip: 0x7f9cfef578d4)
macro expansion at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:0 [inlined]
pytuple_fromiter at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:923 [inlined]
#pycall#21 at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:242 [inlined]
pycall at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/builtins.jl:233 [inlined]
##11 at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:357 [inlined]
Py at /global/homes/r/rismail/.julia/packages/PythonCall/WMWY0/src/Core/Py.jl:357 [inlined]
#98 at ./none:0
unknown function (ip: 0x7f9cfef577b2)
iterate at ./generator.jl:48 [inlined]
collect_to! at ./array.jl:849
collect_to_with_first! at ./array.jl:827 [inlined]
collect at ./array.jl:801
compile at /global/homes/r/rismail/.julia/dev/QuantumErrorCorrection/lib/QECDecoders/src/decoders/mwpf_decoder.jl:35
unknown function (ip: 0x7f9cfef571ba)
compile_decoders_on_all_workers at /global/u2/r/rismail/EarlyFT/src/decode/worker_fns.jl:344
unknown function (ip: 0x7f9cfef531d2)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:666
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/interpreter.c:824
jl_toplevel_eval_flex at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/toplevel.c:994
eval at ./boot.jl:430
jfptr_eval_28294.1 at /global/common/software/nersc9/julia/1.11.4/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:875
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
#invokelatest#2 at ./essentials.jl:1055
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
invokelatest at ./essentials.jl:1052
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
do_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/builtins.c:831
#114 at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:303
run_work_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:70
unknown function (ip: 0x7f9cfef3badb)
run_work_thunk at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:79
#100 at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/usr/share/julia/stdlib/v1.11/Distributed/src/process_messages.jl:88
unknown function (ip: 0x7f9cfef3b61f)
jl_apply at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/julia.h:2157 [inlined]
start_task at /cache/build/builder-amdci5-5/julialang/julia-release-1-dot-11/src/task.c:1202
Allocations: 26958566 (Pool: 26957709; Big: 857); GC: 20"
I looked through the existing issues about segmentation faults and could not find a similar one. I would appreciate any help narrowing down the potential sources of the error. I have not been able to create an MWE yet, since the error is non-deterministic.
Your system
Please provide detailed information about your system:
- CondaPkg: 0.2.24