metal : need help debugging a kernel and setting up Xcode

Recently, we found out that we can run the Metal code in debug mode with shader validation enabled:

```bash
make -j tests && MTL_DEBUG_LAYER=1 MTL_SHADER_VALIDATION=1 MTL_SHADER_VALIDATION_REPORT_TO_STDERR=1 MTL_SHADER_VALIDATION_FAIL_MODE=allow MTL_DEBUG_LAYER_VALIDATE_STORE_ACTIONS=1 MTL_DEBUG_LAYER_VALIDATE_LOAD_ACTIONS=1 ./tests/test-backend-ops -b Metal -o MUL_MAT
```

This has been a useful way to debug the Metal shaders in `ggml-metal.metal`. The above command runs the `ggml_mul_mat` operator in isolation and compares the results with the reference CPU result.
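To make the comparison step concrete, here is a minimal sketch of how a backend result might be checked against a CPU reference. It is not copied from `test-backend-ops.cpp`: the helper names (`nmse`, `first_non_finite`, `outputs_match`), the normalized mean squared error metric, and the `1e-7` tolerance are all assumptions for illustration. The point it shows is that non-finite values need an explicit check, because any comparison involving a NaN error evaluates to false and could otherwise hide the problem.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `out` relative to `ref`.
// Hypothetical helper; the metric is an assumption, not the project's code.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    // Fall back to the raw squared error when the reference is all zeros.
    return norm > 0.0 ? err / norm : err;
}

// Index of the first NaN or Inf in `v`, or -1 if every value is finite.
static std::ptrdiff_t first_non_finite(const std::vector<float> & v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (!std::isfinite(v[i])) {
            return (std::ptrdiff_t) i;
        }
    }
    return -1;
}

// True when `out` has the same size as `ref`, contains only finite values,
// and stays within `max_err` (an assumed tolerance) of the reference.
static bool outputs_match(const std::vector<float> & ref,
                          const std::vector<float> & out,
                          double max_err = 1e-7) {
    if (ref.size() != out.size()) {
        return false;
    }
    if (first_non_finite(out) >= 0) {
        return false; // reject NaN/Inf before computing the error
    }
    return nmse(ref, out) <= max_err;
}
```

Reporting the index from `first_non_finite` alongside the failure can help narrow down which output rows or columns of the kernel are affected.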

Without `MTL_SHADER_VALIDATION=1`, the command succeeds.
However, when the shader validation instrumentation is enabled, we get some NaNs:

![image](https://github.com/ggerganov/llama.cpp/assets/1991296/07e57680-b200-4e72-a59d-fded8350caa1)

The failures are produced by the matrix-matrix multiplication kernel:

- invoked here:
  https://github.com/ggerganov/llama.cpp/blob/328b83de23b33240e28f4e74900d1d06726f5eb1/ggml-metal.m#L1481-L1520
- implemented here:
  https://github.com/ggerganov/llama.cpp/blob/328b83de23b33240e28f4e74900d1d06726f5eb1/ggml-metal.metal#L3788-L3920

I've been trying to figure out what is causing the NaNs, but without success, so I'm calling for help from people who are more familiar with the Metal debugging pipeline and environment.

Currently, I suspect this is a false positive, because this kernel is used a lot and we have never observed an error from it. Still, I would very much like to make the Metal Validator happy.

I have read that it should be possible to capture this kernel and debug it in Xcode, but I have not been able to figure out exactly how. I was able to create an Xcode project with:

```bash
mkdir build-xcode
cd build-xcode
cmake -G Xcode ..
```

This allows me to run the `test-backend-ops` tool in Xcode in Debug mode.
I can also enable "Shader Validation" from the UI:

![image](https://github.com/ggerganov/llama.cpp/assets/1991296/016584f2-2aef-4870-99c2-4d9c1071846a)

But that is as far as I've gotten. I cannot figure out how to capture some additional data in order to debug the source of these NaNs.

To isolate the tool to a single kernel call, you can apply this patch:

```diff
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index f04b9438..7e363912 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1506,22 +1506,23 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
     for (ggml_type type_a : all_types) {
         for (ggml_type type_b : {GGML_TYPE_F32 /*, GGML_TYPE_F16 */}) {
             // FIXME: CPU crashes on f16xf16
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
 
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, { 1,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
         }
+        break;
     }
 
     for (ggml_type type_a : all_types) {
```

Any help with this would be much appreciated, even if it is just information on how to set up Xcode so that I can obtain some extra debug information.

cc @jhen0409 @bachittle or anyone else, thanks!
