Description
Recently, we found out that we can run the Metal code in debug mode with shader validation enabled:
make -j tests && MTL_DEBUG_LAYER=1 MTL_SHADER_VALIDATION=1 MTL_SHADER_VALIDATION_REPORT_TO_STDERR=1 MTL_SHADER_VALIDATION_FAIL_MODE=allow MTL_DEBUG_LAYER_VALIDATE_STORE_ACTIONS=1 MTL_DEBUG_LAYER_VALIDATE_LOAD_ACTIONS=1 ./tests/test-backend-ops -b Metal -o MUL_MAT
This has been a useful way to debug the Metal shaders in ggml-metal.metal. The above command runs the ggml_mul_mat operator in isolation and compares the results with the reference CPU result.
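For readers unfamiliar with this kind of check, here is a minimal sketch of how a backend result could be compared against a CPU reference. It is an illustration only, not the code from test-backend-ops: the helper names `nmse` and `results_match` and the `max_err` threshold are assumptions made for this example. It uses a normalized mean squared error and treats any non-finite error as a failure, which is why a single NaN in the output is enough to fail the comparison.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (name and signature are assumptions for this sketch):
// normalized mean squared error of `out` relative to the reference `ref`.
// A NaN anywhere in either buffer propagates into the returned value.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err  = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    // If norm is zero the division yields inf or NaN, which the caller rejects.
    return err / norm;
}

// Hypothetical check: the results match only if the sizes agree, the error is
// finite, and the error does not exceed max_err.
static bool results_match(const std::vector<float> & ref,
                          const std::vector<float> & out,
                          double max_err) {
    if (ref.size() != out.size()) {
        return false;
    }
    const double err = nmse(ref, out);
    return std::isfinite(err) && err <= max_err;
}
```

With a check of this shape, identical buffers pass, while a buffer containing one NaN fails regardless of the threshold.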
Without MTL_SHADER_VALIDATION=1, the command succeeds.
However, when the shader validation instrumentation is enabled, we get some NaNs.
The failures are produced by the matrix-matrix multiplication kernel:
- invoked here: https://github.com/ggerganov/llama.cpp/blob/328b83de23b33240e28f4e74900d1d06726f5eb1/ggml-metal.m#L1481-L1520
- implemented here: https://github.com/ggerganov/llama.cpp/blob/328b83de23b33240e28f4e74900d1d06726f5eb1/ggml-metal.metal#L3788-L3920
I've been trying to figure out what is causing the NaNs, without success, so I'm asking for help from people who are more familiar with the Metal debugging pipeline and environment.
Currently, I think that this is somehow a false positive, because we've never observed an error and this kernel is used a lot. Still, I would very much like to make the Metal Validator happy.
I have read that it should be possible to capture this kernel and debug it in Xcode, but I have not been able to figure out exactly how. I was able to create an Xcode project with:
mkdir build-xcode
cd build-xcode
cmake -G Xcode ..
This allows me to run the test-backend-ops tool in Xcode in Debug mode.
I can also enable "Shader Validation" from the UI.
But that is as far as I've gotten. I cannot figure out how to capture some additional data in order to debug the source of these NaNs.
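Independently of Xcode, one cheap source of extra data is the position of the NaNs in the output. The sketch below is an assumption-laden illustration, not part of the existing tool: `nan_pos` and `find_nans` are hypothetical names, and it assumes the result has already been copied to host memory as a dense, row-major F32 buffer (for example with ggml_backend_tensor_get) with `row_len` elements per row. Knowing whether the NaNs cluster in particular rows or columns may help narrow down which part of the kernel writes them.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type for this sketch: position of one NaN in a 2D result.
struct nan_pos {
    size_t row;
    size_t col;
};

// Hypothetical helper: return the (row, col) position of every NaN in a dense
// row-major buffer that has row_len elements per row.
static std::vector<nan_pos> find_nans(const std::vector<float> & data, size_t row_len) {
    std::vector<nan_pos> res;
    if (row_len == 0) {
        return res; // nothing sensible to report for an empty row length
    }
    for (size_t i = 0; i < data.size(); ++i) {
        if (std::isnan(data[i])) {
            res.push_back({ i / row_len, i % row_len });
        }
    }
    return res;
}
```

For the 16x16 result of the isolated test case, this would be called with `row_len` set to 16 and the returned positions printed or inspected in the debugger.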
To isolate the tool to a single kernel call, you can apply this patch:
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index f04b9438..7e363912 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1506,22 +1506,23 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
for (ggml_type type_a : all_types) {
for (ggml_type type_b : {GGML_TYPE_F32 /*, GGML_TYPE_F16 */}) {
// FIXME: CPU crashes on f16xf16
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1, 1}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 1}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 1}, {2, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, { 1, 1}, {1, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 1}, {1, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 1}, {2, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {1, 2}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 1, 256, {10, 10}, {2, 2}));
test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, { 1, 1}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 1}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 1}, {2, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
- test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 1}, {1, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 1}, {2, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 1}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {1, 2}));
+ //test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 256, {10, 10}, {2, 2}));
}
+ break;
}
for (ggml_type type_a : all_types) {
Any help on this would be much appreciated, even if it is just information on how to set up Xcode so that I can obtain some extra debug information.
cc @jhen0409 @bachittle or anyone else, thanks!