Segmentation fault with threaded Distributed #572

Open
@jarbus

Description

Affects: PythonCall

Describe the bug

This is a quirky, intermittent bug. I'm getting a segmentation fault when using Python's gymnasium package from multiple Distributed worker processes while a Flux model is loaded on the GPU.

Setup:

]add CondaPkg
]add PythonCall
]add Flux
]add CUDA

using CondaPkg
CondaPkg.add("gymnasium")
CondaPkg.add("swig")
CondaPkg.add("gymnasium-box2d")
CondaPkg.add("gymnasium-other")

Run (the crash is non-deterministic; try running a few times on a machine with an NVIDIA GPU):

using Distributed
addprocs(12; env=["CUDA_HARD_MEMORY_LIMIT" => "5%", "CUDA_MEMORY_POOL"=>"none"])
@everywhere begin
    using CUDA
    using Flux
    using CondaPkg
    using PythonCall

    function initialize_car_racing_env(_)
        gym = pyimport("gymnasium")
        x = Flux.Dense(512=>512) |> gpu
        env = gym.make("CarRacing-v3")
        obs, info = env.reset()
        env.close()
        return 1
    end
end

for generation in 1:10_000
    if generation % 100 == 0
        println("Generation: $generation")
    end
    pmap(initialize_car_racing_env, 1:12)
end
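
Since the crash is non-deterministic, it may help to record how many generations pass before a worker dies. The following is only a sketch, not part of the original reproducer: `run_until_failure` is a hypothetical helper name, and it assumes the failure surfaces on the master process as an exception thrown by `pmap` (as the `ProcessExitedException` in the trace below suggests).

```julia
using Distributed

# Hypothetical helper, not part of the original reproducer.
# Calls `pmap(f, 1:ntasks)` repeatedly and returns the generation at which
# `pmap` first throws, or `nothing` if every generation completes.
function run_until_failure(f, ntasks; generations = 10_000)
    for generation in 1:generations
        try
            pmap(f, 1:ntasks)
        catch err
            # Log the generation and the exception, then stop: a worker that
            # has exited is no longer available for later calls.
            @warn "pmap failed" generation exception = err
            return generation
        end
    end
    return nothing
end
```

With the definitions above, `run_until_failure(initialize_car_racing_env, 12)` would stand in for the `for` loop, and the returned generation can be compared across runs.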

Stack trace:

      From worker 5:
      From worker 5:    [35654] signal 11: Segmentation fault
      From worker 5:    in expression starting at none:0
      From worker 5:    jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:334 [inlined]
      From worker 5:    jl_gc_state_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:329 [inlined]
      From worker 5:    jl_gc_state_save_and_set at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia_threads.h:340
      From worker 5:    throw_internal_altstack at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:755 [inlined]
      From worker 5:    ijl_sig_throw at /cache/build/builder-demeter6-6/julialang/julia-master/src/task.c:800
      From worker 5:    Allocations: 21901595 (Pool: 21900914; Big: 681); GC: 219
ERROR: Worker 5 terminated.LoadError: 
ProcessExitedException(Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#832")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:970
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:978
 [3] unsafe_read
   @ ./io.jl:891 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:890
 [5] read!
   @ ./io.jl:895 [inlined]
 [6] deserialize_hdr_raw
   @ ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/share/julia/stdlib/v1.11/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed ~/.julia/juliaup/julia-1.11.1+0.x64.linux.gnu/sh

Your system
Please provide detailed information about your system:

  • The operating system

5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

  • The version of Julia, Python, PythonCall, JuliaCall and any other affected packages

[052768ef] CUDA v5.5.2
[992eb4ea] CondaPkg v0.2.24
[587475ba] Flux v0.14.25
[6099a3de] PythonCall v0.9.23 `https://github.com/JuliaPy/PythonCall.jl.git#main`
[02a925ec] cuDNN v1.4.0
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 5975WX 32-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Environment:
  LD_LIBRARY_PATH = 
CondaPkg Status /home/garbus/.julia/environments/v1.11/CondaPkg.toml
Environment
  /home/garbus/.julia/environments/v1.11/.CondaPkg/env
Packages
  gymnasium v1.0.0
  gymnasium-box2d v1.0.0
  gymnasium-other v1.0.0
  swig v4.2.1

Additional context
I'm researching embodied AI and trying to use Julia's distributed capabilities for it while still evaluating on Python environments.

Labels: bug
