2025-09-11 17:39:42 lhl: added ROCm long context numbers
AI/llamacpp-performance.md
Most tests use the default llama-bench numbers (pp512/tg128), and these are good for broad comparisons. However, as you can see [even from 1-4096 sweeps](https://github.com/lhl/strix-halo-testing/tree/main/llm-bench/llama-2-7b.Q4_0), different backends and settings have different performance characteristics as context extends.
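For reference, a minimal sketch of both kinds of run (binary and model paths are placeholders, and I'm assuming the sweep varies prompt length via `-p`):

```bash
# Default run: pp512 (prompt processing) and tg128 (token generation).
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf

# Prompt-length sweep like the linked 1-4096 runs; -p takes a comma-separated
# list, and -n 0 skips the generation test so the sweep finishes faster.
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf \
  -p 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -n 0
```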
For example, comparing the AMDVLK and RADV Vulkan drivers: while RADV is about 4% faster on `tg128`, it's 16% slower for `pp512`. On some models, [looking purely at pp512/tg128 numbers](https://kyuz0.github.io/amd-strix-halo-toolboxes/), the `pp512` performance gap is even bigger. This test takes ~5s to run, and results vary wildly between models; for example, for Qwen 3 30B-A3B UD-Q4_K_XL, RADV is actually a bit better:
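If you want to run the same A/B yourself, you can point the Vulkan loader at one driver at a time. A minimal sketch; the ICD manifest paths below are typical locations but vary by distro, and the model filename is a placeholder:

```bash
# Select the Vulkan driver by overriding the loader's ICD list
# (older loaders use VK_ICD_FILENAMES instead of VK_DRIVER_FILES).
RADV=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json
AMDVLK=/usr/share/vulkan/icd.d/amd_icd64.json

VK_DRIVER_FILES=$RADV   ./build/bin/llama-bench -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf
VK_DRIVER_FILES=$AMDVLK ./build/bin/llama-bench -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf
```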
However, running these tests at the end of its 128K context depth gives you a decidedly different picture. For long context, the RADV driver in this case scales **much better**:
These runs took ~2-3h on Strix Halo. You can do this with the ROCm backend as well (I recommend using the rocWMMA compile, and testing with and without `ROCBLAS_USE_HIPBLASLT=1` to compare the rocBLAS vs hipBLASLt kernels), but I leave that as a multi-hour exercise for the reader. This is simply illustrative of the differences you might see. (If you're doing an overnight run, something like `-d 0,5000,10000,50000,100000` might give you a better idea of how things slow down as context grows.)
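For reference, a sketch of what such an overnight run might look like (model path is a placeholder; `-d` and the env var are as described above):

```bash
# Re-run the default pp512/tg128 tests at increasing context depths.
./build/bin/llama-bench -m model.gguf -d 0,5000,10000,50000,100000

# ROCm backend: compare rocBLAS vs hipBLASLt GEMM kernels on the same sweep.
./build/bin/llama-bench -m model.gguf -d 0,50000,100000
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-bench -m model.gguf -d 0,50000,100000
```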
NOTE: `pp` (context processing) is critically important if you are loading previous long conversations/large contexts (agentic flows) or adding many tokens (grounding via search, tools, file attachments, etc). However, in single-user conversational usage, `llama-cli` has a `--prompt-cache-all` flag, and `llama-server` lets you submit `"cache_prompt": true` with your request to reuse previously cached context, similar to vLLM's [automatic prefix caching](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) or SGLang's [radix caching](https://lmsys.org/blog/2024-01-17-sglang/).
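As an illustration, a minimal sketch against a local `llama-server` (default port assumed; the prompt is a placeholder). A follow-up request that shares the same prefix will skip `pp` for the cached tokens:

```bash
curl http://localhost:8080/completion -d '{
  "prompt": "<long shared context> ... latest user turn",
  "n_predict": 128,
  "cache_prompt": true
}'
```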