
Error in torch_compile.ipynb #1988

Open

Description

@KumoLiu
The notebook CI run for ./modules/torch_compile.ipynb fails with an InductorError while compiling the model under PyTorch 2.7.0+cu126; full log below.

```
[2025-05-06T16:47:27.159Z] Running ./modules/torch_compile.ipynb
[2025-05-06T16:47:27.159Z] Checking PEP8 compliance...
[2025-05-06T16:47:27.719Z] Running notebook...
[2025-05-06T16:47:34.247Z] Unable to import quantization op. Please install modelopt library (https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=readme-ov-file#installation) to add support for compiling quantized models
[2025-05-06T16:47:37.507Z] MONAI version: 1.4.1rc1+46.gb58e883c
[2025-05-06T16:47:37.507Z] Numpy version: 1.26.4
[2025-05-06T16:47:37.507Z] Pytorch version: 2.7.0+cu126
[2025-05-06T16:47:37.507Z] MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
[2025-05-06T16:47:37.507Z] MONAI rev id: b58e883c887e0f99d382807550654c44d94f47bd
[2025-05-06T16:47:37.507Z] MONAI __file__: /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/__init__.py
[2025-05-06T16:47:37.507Z] 
[2025-05-06T16:47:37.507Z] Optional dependencies:
[2025-05-06T16:47:38.067Z] Pytorch Ignite version: 0.4.11
[2025-05-06T16:47:38.067Z] ITK version: 5.4.3
[2025-05-06T16:47:38.067Z] Nibabel version: 5.3.2
[2025-05-06T16:47:38.067Z] scikit-image version: 0.19.3
[2025-05-06T16:47:38.067Z] scipy version: 1.14.0
[2025-05-06T16:47:38.067Z] Pillow version: 7.0.0
[2025-05-06T16:47:38.067Z] Tensorboard version: 2.16.2
[2025-05-06T16:47:38.067Z] gdown version: 5.2.0
[2025-05-06T16:47:38.067Z] TorchVision version: 0.22.0+cu126
[2025-05-06T16:47:38.067Z] tqdm version: 4.66.5
[2025-05-06T16:47:38.067Z] lmdb version: 1.6.2
[2025-05-06T16:47:38.067Z] psutil version: 6.0.0
[2025-05-06T16:47:38.067Z] pandas version: 2.2.2
[2025-05-06T16:47:38.067Z] einops version: 0.8.0
[2025-05-06T16:47:38.067Z] transformers version: 4.40.2
[2025-05-06T16:47:38.067Z] mlflow version: 2.22.0
[2025-05-06T16:47:38.067Z] pynrrd version: 1.1.3
[2025-05-06T16:47:38.067Z] clearml version: 2.0.0rc0
[2025-05-06T16:47:38.067Z] 
[2025-05-06T16:47:38.067Z] For details about installing the optional dependencies, please visit:
[2025-05-06T16:47:38.067Z]     https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
[2025-05-06T16:47:38.067Z] 
[2025-05-06T16:47:41.332Z] papermill  --progress-bar --log-output -k python3
[2025-05-06T16:47:41.332Z] /usr/local/lib/python3.10/dist-packages/papermill/iorw.py:149: UserWarning: the file is not specified with any extension : -
[2025-05-06T16:47:41.332Z]   warnings.warn(f"the file is not specified with any extension : {os.path.basename(path)}")
[2025-05-06T16:55:02.641Z] 
Executing:   0%|          | 0/32 [00:00<?, ?cell/s]
Executing:   3%|▎         | 1/32 [00:00<00:29,  1.04cell/s]
Executing:  12%|█▎        | 4/32 [00:12<01:33,  3.33s/cell]
Executing:  19%|█▉        | 6/32 [00:22<01:44,  4.03s/cell]
Executing:  31%|███▏      | 10/32 [00:26<00:52,  2.38s/cell]
Executing:  50%|█████     | 16/32 [00:28<00:19,  1.23s/cell]
Executing:  56%|█████▋    | 18/32 [00:35<00:23,  1.71s/cell]
Executing:  62%|██████▎   | 20/32 [07:07<09:10, 45.85s/cell]
Executing:  69%|██████▉   | 22/32 [07:08<05:46, 34.69s/cell]
Executing:  72%|███████▏  | 23/32 [07:18<04:38, 30.98s/cell]
Executing:  72%|███████▏  | 23/32 [07:20<02:52, 19.17s/cell]
[2025-05-06T16:55:02.641Z] /usr/local/lib/python3.10/dist-packages/papermill/iorw.py:149: UserWarning: the file is not specified with any extension : -
[2025-05-06T16:55:02.641Z]   warnings.warn(f"the file is not specified with any extension : {os.path.basename(path)}")
[2025-05-06T16:55:02.641Z] Traceback (most recent call last):
[2025-05-06T16:55:02.641Z]   File "/usr/local/bin/papermill", line 8, in <module>
[2025-05-06T16:55:02.641Z]     sys.exit(papermill())
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
[2025-05-06T16:55:02.641Z]     return self.main(*args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
[2025-05-06T16:55:02.641Z]     rv = self.invoke(ctx)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
[2025-05-06T16:55:02.641Z]     return ctx.invoke(self.callback, **ctx.params)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
[2025-05-06T16:55:02.641Z]     return __callback(*args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
[2025-05-06T16:55:02.641Z]     return f(get_current_context(), *args, **kwargs)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/cli.py", line 235, in papermill
[2025-05-06T16:55:02.641Z]     execute_notebook(
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/execute.py", line 131, in execute_notebook
[2025-05-06T16:55:02.641Z]     raise_for_execution_errors(nb, output_path)
[2025-05-06T16:55:02.641Z]   File "/usr/local/lib/python3.10/dist-packages/papermill/execute.py", line 251, in raise_for_execution_errors
[2025-05-06T16:55:02.641Z]     raise error
[2025-05-06T16:55:02.641Z] papermill.exceptions.PapermillExecutionError: 
[2025-05-06T16:55:02.641Z] ---------------------------------------------------------------------------
[2025-05-06T16:55:02.641Z] Exception encountered at "In [11]":
[2025-05-06T16:55:02.641Z] ---------------------------------------------------------------------------
[2025-05-06T16:55:02.641Z] InductorError                             Traceback (most recent call last)
[2025-05-06T16:55:02.641Z] Cell In[11], line 14
[2025-05-06T16:55:02.641Z]      12 inputs, labels = batch_data["image"].to(device), batch_data["label"].to(device)
[2025-05-06T16:55:02.641Z]      13 optimizer.zero_grad()
[2025-05-06T16:55:02.641Z] ---> 14 loss, train_time = timed(lambda: train(model_opt, inputs, labels))  # noqa: B023
[2025-05-06T16:55:02.641Z]      15 optimizer.step()
[2025-05-06T16:55:02.641Z]      16 epoch_loss += loss.item()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[6], line 5, in timed(fn)
[2025-05-06T16:55:02.641Z]       3 end = torch.cuda.Event(enable_timing=True)
[2025-05-06T16:55:02.641Z]       4 start.record()
[2025-05-06T16:55:02.641Z] ----> 5 result = fn()
[2025-05-06T16:55:02.641Z]       6 end.record()
[2025-05-06T16:55:02.641Z]       7 torch.cuda.synchronize()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[11], line 14, in <lambda>()
[2025-05-06T16:55:02.641Z]      12 inputs, labels = batch_data["image"].to(device), batch_data["label"].to(device)
[2025-05-06T16:55:02.641Z]      13 optimizer.zero_grad()
[2025-05-06T16:55:02.641Z] ---> 14 loss, train_time = timed(lambda: train(model_opt, inputs, labels))  # noqa: B023
[2025-05-06T16:55:02.641Z]      15 optimizer.step()
[2025-05-06T16:55:02.641Z]      16 epoch_loss += loss.item()
[2025-05-06T16:55:02.641Z] 
[2025-05-06T16:55:02.641Z] Cell In[6], line 12, in train(model, inputs, labels)
[2025-05-06T16:55:02.641Z]      11 def train(model, inputs, labels):
[2025-05-06T16:55:02.641Z] ---> 12     outputs = model(inputs)
[2025-05-06T16:55:02.642Z]      13     loss_function = monai.losses.DiceCELoss(to_onehot_y=True, softmax=True)
[2025-05-06T16:55:02.642Z]      14     loss = loss_function(outputs, labels)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
[2025-05-06T16:55:02.642Z]    1749     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
[2025-05-06T16:55:02.642Z]    1750 else:
[2025-05-06T16:55:02.642Z] -> 1751     return self._call_impl(*args, **kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1762, in Module._call_impl(self, *args, **kwargs)
[2025-05-06T16:55:02.642Z]    1757 # If we don't have any hooks, we want to skip the rest of the logic in
[2025-05-06T16:55:02.642Z]    1758 # this function, and just call forward.
[2025-05-06T16:55:02.642Z]    1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
[2025-05-06T16:55:02.642Z]    1760         or _global_backward_pre_hooks or _global_backward_hooks
[2025-05-06T16:55:02.642Z]    1761         or _global_forward_hooks or _global_forward_pre_hooks):
[2025-05-06T16:55:02.642Z] -> 1762     return forward_call(*args, **kwargs)
[2025-05-06T16:55:02.642Z]    1764 result = None
[2025-05-06T16:55:02.642Z]    1765 called_always_called_hooks = set()
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:663, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
[2025-05-06T16:55:02.642Z]     659     raise e.with_traceback(None) from None
[2025-05-06T16:55:02.642Z]     660 except ShortenTraceback as e:
[2025-05-06T16:55:02.642Z]     661     # Failures in the backend likely don't have useful
[2025-05-06T16:55:02.642Z]     662     # data in the TorchDynamo frames, so we strip them out.
[2025-05-06T16:55:02.642Z] --> 663     raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
[2025-05-06T16:55:02.642Z]     664 finally:
[2025-05-06T16:55:02.642Z]     665     # Restore the dynamic layer stack depth if necessary.
[2025-05-06T16:55:02.642Z]     666     set_eval_frame(None)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:760, in _compile_fx_inner(gm, example_inputs, **graph_kwargs)
[2025-05-06T16:55:02.642Z]     758     raise
[2025-05-06T16:55:02.642Z]     759 except Exception as e:
[2025-05-06T16:55:02.642Z] --> 760     raise InductorError(e, currentframe()).with_traceback(
[2025-05-06T16:55:02.642Z]     761         e.__traceback__
[2025-05-06T16:55:02.642Z]     762     ) from None
[2025-05-06T16:55:02.642Z]     763 finally:
[2025-05-06T16:55:02.642Z]     764     TritonBundler.end_compile()
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:745, in _compile_fx_inner(gm, example_inputs, **graph_kwargs)
[2025-05-06T16:55:02.642Z]     743 TritonBundler.begin_compile()
[2025-05-06T16:55:02.642Z]     744 try:
[2025-05-06T16:55:02.642Z] --> 745     mb_compiled_graph = fx_codegen_and_compile(
[2025-05-06T16:55:02.642Z]     746         gm, example_inputs, inputs_to_check, **graph_kwargs
[2025-05-06T16:55:02.642Z]     747     )
[2025-05-06T16:55:02.642Z]     748     assert mb_compiled_graph is not None
[2025-05-06T16:55:02.642Z]     749     mb_compiled_graph._time_taken_ns = time.time_ns() - start_time
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:1295, in fx_codegen_and_compile(gm, example_inputs, inputs_to_check, **graph_kwargs)
[2025-05-06T16:55:02.642Z]    1291     from .compile_fx_subproc import _SubprocessFxCompile
[2025-05-06T16:55:02.642Z]    1293     scheme = _SubprocessFxCompile()
[2025-05-06T16:55:02.642Z] -> 1295 return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:1119, in _InProcessFxCompile.codegen_and_compile(self, gm, example_inputs, inputs_to_check, graph_kwargs)
[2025-05-06T16:55:02.642Z]    1117 metrics_helper = metrics.CachedMetricsHelper()
[2025-05-06T16:55:02.642Z]    1118 with V.set_graph_handler(graph):
[2025-05-06T16:55:02.642Z] -> 1119     graph.run(*example_inputs)
[2025-05-06T16:55:02.642Z]    1120     output_strides: list[Optional[tuple[_StrideExprStr, ...]]] = []
[2025-05-06T16:55:02.642Z]    1121     if graph.graph_outputs is not None:
[2025-05-06T16:55:02.642Z]    1122         # We'll put the output strides in the compiled graph so we
[2025-05-06T16:55:02.642Z]    1123         # can later return them to the caller via TracingContext
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:877, in GraphLowering.run(self, *args)
[2025-05-06T16:55:02.642Z]     875 def run(self, *args: Any) -> Any:  # type: ignore[override]
[2025-05-06T16:55:02.642Z]     876     with dynamo_timed("GraphLowering.run"):
[2025-05-06T16:55:02.642Z] --> 877         return super().run(*args)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/fx/interpreter.py:171, in Interpreter.run(self, initial_env, enable_io_processing, *args)
[2025-05-06T16:55:02.642Z]     168     continue
[2025-05-06T16:55:02.642Z]     170 try:
[2025-05-06T16:55:02.642Z] --> 171     self.env[node] = self.run_node(node)
[2025-05-06T16:55:02.642Z]     172 except Exception as e:
[2025-05-06T16:55:02.642Z]     173     if self.extra_traceback:
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:1527, in GraphLowering.run_node(self, n)
[2025-05-06T16:55:02.642Z]    1525 else:
[2025-05-06T16:55:02.642Z]    1526     debug("")
[2025-05-06T16:55:02.642Z] -> 1527     result = super().run_node(n)
[2025-05-06T16:55:02.642Z]    1529 # require the same stride order for dense outputs,
[2025-05-06T16:55:02.642Z]    1530 # 1. user-land view() will not throw because inductor
[2025-05-06T16:55:02.642Z]    1531 # output different strides than eager
[2025-05-06T16:55:02.642Z]    (...)
[2025-05-06T16:55:02.642Z]    1534 # 2: as_strided ops, we need make sure its input has same size/stride with
[2025-05-06T16:55:02.642Z]    1535 # eager model to align with eager behavior.
[2025-05-06T16:55:02.642Z]    1536 as_strided_ops = [
[2025-05-06T16:55:02.642Z]    1537     torch.ops.aten.as_strided.default,
[2025-05-06T16:55:02.642Z]    1538     torch.ops.aten.as_strided_.default,
[2025-05-06T16:55:02.642Z]    (...)
[2025-05-06T16:55:02.642Z]    1541     torch.ops.aten.resize_as.default,
[2025-05-06T16:55:02.642Z]    1542 ]
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/fx/interpreter.py:240, in Interpreter.run_node(self, n)
[2025-05-06T16:55:02.642Z]     238 assert isinstance(args, tuple)
[2025-05-06T16:55:02.642Z]     239 assert isinstance(kwargs, dict)
[2025-05-06T16:55:02.642Z] --> 240 return getattr(self, n.op)(n.target, args, kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/graph.py:1169, in GraphLowering.call_function(self, target, args, kwargs)
[2025-05-06T16:55:02.642Z]    1163             decided_constraint = None  # type: ignore[assignment]
[2025-05-06T16:55:02.642Z]    1165     # for implicitly fallback ops, we conservatively requires
[2025-05-06T16:55:02.642Z]    1166     # contiguous input since some eager kernels does not
[2025-05-06T16:55:02.642Z]    1167     # support non-contiguous inputs. They may silently cause
[2025-05-06T16:55:02.642Z]    1168     # accuracy problems. Check https://github.com/pytorch/pytorch/issues/140452
[2025-05-06T16:55:02.642Z] -> 1169     make_fallback(target, layout_constraint=decided_constraint)
[2025-05-06T16:55:02.642Z]    1171 elif get_decompositions([target]):
[2025-05-06T16:55:02.642Z]    1172     # There isn't a good way to dynamically patch this in
[2025-05-06T16:55:02.642Z]    1173     # since AOT Autograd already ran.  The error message tells
[2025-05-06T16:55:02.642Z]    1174     # the user how to fix it.
[2025-05-06T16:55:02.642Z]    1175     raise MissingOperatorWithDecomp(target, args, kwargs)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] File /usr/local/lib/python3.10/dist-packages/torch/_inductor/lowering.py:2023, in make_fallback(op, layout_constraint, warn, override_decomp)
[2025-05-06T16:55:02.642Z]    2018         torch._dynamo.config.suppress_errors = False
[2025-05-06T16:55:02.642Z]    2019         log.warning(
[2025-05-06T16:55:02.642Z]    2020             "A make_fallback error occurred in suppress_errors config,"
[2025-05-06T16:55:02.642Z]    2021             " and suppress_errors is being disabled to surface it."
[2025-05-06T16:55:02.642Z]    2022         )
[2025-05-06T16:55:02.642Z] -> 2023     raise AssertionError(
[2025-05-06T16:55:02.642Z]    2024         f"make_fallback({op}): a decomposition exists, we should switch to it."
[2025-05-06T16:55:02.642Z]    2025         " To fix this error, either add a decomposition to core_aten_decompositions (preferred)"
[2025-05-06T16:55:02.642Z]    2026         " or inductor_decompositions, and delete the corresponding `make_fallback` line."
[2025-05-06T16:55:02.642Z]    2027         " Get help from the inductor team if unsure, don't pick arbitrarily to unblock yourself.",
[2025-05-06T16:55:02.642Z]    2028     )
[2025-05-06T16:55:02.642Z]    2030 def register_fallback(op_overload):
[2025-05-06T16:55:02.642Z]    2031     add_needs_realized_inputs(op_overload)
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] InductorError: AssertionError: make_fallback(aten.upsample_trilinear3d.default): a decomposition exists, we should switch to it. To fix this error, either add a decomposition to core_aten_decompositions (preferred) or inductor_decompositions, and delete the corresponding `make_fallback` line. Get help from the inductor team if unsure, don't pick arbitrarily to unblock yourself.
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] 
[2025-05-06T16:55:02.642Z] real	7m21.829s
[2025-05-06T16:55:02.642Z] user	8m18.975s
[2025-05-06T16:55:02.642Z] sys	5m22.797s
[2025-05-06T16:55:02.642Z] Check failed!
```
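
From the traceback, the assertion is raised inside PyTorch's TorchInductor backend while lowering `aten.upsample_trilinear3d.default`, not in the notebook or MONAI code itself. Below is a minimal repro sketch outside MONAI, assuming a CUDA device and that any 5D `F.interpolate(..., mode="trilinear")` dispatches to the op named in the error; the function name `upsample` and the tensor shape are illustrative:

```python
import torch
import torch.nn.functional as F


def upsample(x):
    # A (N, C, D, H, W) input with mode="trilinear" dispatches to
    # aten.upsample_trilinear3d.default, the op named in the InductorError.
    return F.interpolate(x, scale_factor=2, mode="trilinear", align_corners=False)


compiled = torch.compile(upsample)
x = torch.randn(1, 1, 8, 8, 8, device="cuda")
# Expected to raise the same make_fallback AssertionError on torch 2.7.0.
compiled(x)
```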
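
Until this is fixed on the PyTorch side (or the CI pins an earlier torch, assuming the regression is specific to 2.7.0), one possible mitigation is to bypass inductor. Note that `torch/_inductor/lowering.py` in the traceback above forcibly disables `torch._dynamo.config.suppress_errors` for make_fallback errors, so error suppression is unlikely to help; switching the `torch.compile` backend is the workable knob. A hedged sketch, with a placeholder model standing in for the notebook's network:

```python
import torch
import torch.nn as nn

# Placeholder for the notebook's model; in torch_compile.ipynb this is a MONAI
# network whose decoder upsamples with mode="trilinear".
model = nn.Sequential(
    nn.Conv3d(1, 1, 3, padding=1),
    nn.Upsample(scale_factor=2, mode="trilinear"),
)

# "aot_eager" runs Dynamo graph capture and AOTAutograd but skips inductor
# codegen, so aten.upsample_trilinear3d is never lowered and the failing
# make_fallback assertion is never reached. backend="eager" skips AOTAutograd too.
model_opt = torch.compile(model, backend="aot_eager")
```

Either backend trades away inductor's codegen speedups, so the notebook's compiled-vs-eager timing comparison would become less meaningful; this is a stopgap to keep the CI green, not a fix.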

Labels: bug