metal : need help debugging a kernel and setting up Xcode #4545

Closed
@ggerganov

Recently, we found out that we can run the Metal code in debug mode with shader validation enabled:

make -j tests && MTL_DEBUG_LAYER=1 MTL_SHADER_VALIDATION=1 MTL_SHADER_VALIDATION_REPORT_TO_STDERR=1 MTL_SHADER_VALIDATION_FAIL_MODE=allow MTL_DEBUG_LAYER_VALIDATE_STORE_ACTIONS=1 MTL_DEBUG_LAYER_VALIDATE_LOAD_ACTIONS=1 ./tests/test-backend-ops -b Metal -o MUL_MAT

This has been a useful way to debug the Metal shaders in ggml-metal.metal. The above command runs the ggml_mul_mat operator in isolation and compares the results with the reference CPU result.

Without MTL_SHADER_VALIDATION=1, the command succeeds.
However, when the shader validation instrumentation is enabled, we get some NaNs:

[image: test output with NaNs]

The failures are produced by the matrix-matrix multiplication kernel in ggml-metal.metal.

I've been trying to figure out what is causing the NaNs, but without success, so I'm calling for help from people who are more familiar with the Metal debugging pipeline and environment.

Currently, I think that this is somehow a false positive because we've never observed an error and this kernel is used a lot. But still, I would very much like to make the Metal Validator happy.

I've read that there should be a way to capture this kernel and debug it in Xcode, but I am not able to figure out exactly how. I was able to create an Xcode project with:

mkdir build-xcode
cd build-xcode
cmake -G Xcode ..

This allows me to run the test-backend-ops tool in Xcode in Debug mode.
I can also enable "Shader Validation" from the UI:

[image: Xcode "Shader Validation" setting]

But that is as far as I've gotten. I cannot figure out how to capture some additional data in order to debug the source of these NaNs.

To isolate the tool to a single kernel call, you can apply this patch:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index f04b9438..7e363912 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1506,22 +1506,23 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
     for (ggml_type type_a : all_types) {
         for (ggml_type type_b : {GGML_TYPE_F32 /*, GGML_TYPE_F16 */}) {
             // FIXME: CPU crashes on f16xf16
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10,  1}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
 
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, { 1,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
-            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10,  1}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
+            //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
         }
+        break;
     }
 
     for (ggml_type type_a : all_types) {

Any help on this would be greatly appreciated, even if it is just information on how to set up Xcode so that I can obtain some extra debug information.

cc @jhen0409 @bachittle or anyone else, thanks!
