# llama.cpp Performance

Some tests have been run on Strix Halo that can be useful for getting a ballpark idea of performance:
- [lhl Strix Halo LLM Benchmark Results](https://github.com/lhl/strix-halo-testing/tree/main/llm-bench) - scripted tests from 2025-08 on a number of models, sweeping pp and tg from 1-4096 across several backends
- [kyuz0 AMD Ryzen AI MAX+ 395 "Strix Halo" — Llama.cpp Backend Performance Comparison](https://kyuz0.github.io/amd-strix-halo-toolboxes/) - automated pp512/tg128 tests across a number of backends; may be more up-to-date
## llama-bench Basics
It's important to note that the AMD drivers and libraries (ROCm and Vulkan) as well as llama.cpp itself are constantly being updated, and that different models (and different context lengths!) will perform differently, so if you're looking for the best performance, you should run your own tests to see which backend and settings work best for your use case.

For testing llama.cpp performance, it is best to use the included `llama-bench`. By default it reports a `pp512` and a `tg128` number, running each test 5 times.

Here are the most important flags you should be aware of (a combined example invocation follows the list):
- `-ngl 999` - This specifies how many model layers are loaded to the GPU and defaults to `99` - some models have more layers than that, so you may want to set `999` in those cases. Even though Strix Halo has unified memory, you always want to load all your layers to "GPU" memory as it has full access to memory bandwidth (about 2X faster than the CPU)
- `--mmap 0` - Memory mapping allows lazy loading of a model and is enabled by default, but when loading large models to the GPU it makes loading moderately slower for Vulkan and catastrophically slower for ROCm. You should always disable it with `--mmap 0` or `--no-mmap` (different commands use different flags; check `--help` for the right one)
- `-fa 1` - While you may take a slight hit on small-context performance depending on the backend, you will almost always want to use [Flash Attention](https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/) for long context and real-world usage, so it's best to test with this on unless you are sure you won't be using longer context.
- `-b`, `-ub` - These specify the logical and physical max batch sizes and default to 2048 and 512 respectively. Increasing them can improve longer-context performance (with diminishing returns), but `-b` raises reserved memory and `-ub` raises peak memory usage, so you also risk OOM errors
- `-r` - If you are doing tests at ultra-long context, a single pass can sometimes take an hour or even longer - for these tests setting the repetition count to 1 is advised: `-r 1`. If you want something more statistically valid or want to test under load, you can of course go the other way and set `-r 10` or higher
- `-p`, `-n`, `-pg`, `-d` - These control the various tests. `-p` (default 512) measures `pp` "prompt processing" or prefill speed - this is how many tokens are in your prior context/conversation and is typically compute limited. `-n` (default 128) tests `tg` token generation - this is the speed at which new tokens generate after your prompt has been processed and is typically memory limited. `-pg` lets you measure pp+tg together, and `-d` lets you specify a "depth" or number of tokens already in context before the test - e.g., if you specify `-p 0 -n 128 -d 10000` it will measure the speed at which tokens generate after there are already 10000 tokens of context - the longer the context, the slower this will be.
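
Putting these together, a combined invocation might look like the sketch below (the model path is a placeholder and the batch sizes are just example values - adjust both for your setup):

```
# Fully offload all layers, disable mmap, enable flash attention, run a single repetition,
# optionally bump batch sizes, and measure tg128 at a depth of 10000 tokens
build/bin/llama-bench -m /models/gguf/your-model.gguf \
  -ngl 999 --mmap 0 -fa 1 -r 1 \
  -b 4096 -ub 2048 \
  -p 0 -n 128 -d 10000
```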
## Long Context Length Testing
Most tests use the default llama-bench numbers (pp512/tg128), and these are good for broad comparisons. However, as you can see [even from 1-4096 sweeps](https://github.com/lhl/strix-halo-testing/tree/main/llm-bench/llama-2-7b.Q4_0), different backends and settings have different performance characteristics as context extends.

For example, comparing the AMDVLK and RADV Vulkan backends: while RADV is about 4% faster on `tg128`, it's 16% slower for `pp512`. On some models the `pp512` performance gap is even bigger. This test takes ~5s to run, and based on it you might decide that AMDVLK is the better choice:
```
❯ build/bin/llama-bench -fa 1 -r 1 --mmap 0 -m /models/gguf/Qwen3-30B-A3B-UD-Q4_K_XL.gguf
❯ AMD_VULKAN_ICD=RADV build/bin/llama-bench -fa 1 -r 1 --mmap 0 -m /models/gguf/Qwen3-30B-A3B-UD-Q4_K_XL.gguf
```
| Backend | pp512 (t/s) | tg128 (t/s) |
|---------------|---------------|---------------|
| Vulkan AMDVLK | 660.34 | 79.87 |
| Vulkan RADV | 765.61 | 83.19 |

However, running the same tests near the end of its 128K context window gives a decidedly different picture. For long context, the RADV driver scales much better:
```
❯ build/bin/llama-bench -fa 1 -r 1 --mmap 0 -m /models/gguf/Qwen3-30B-A3B-UD-Q4_K_XL.gguf -d 130560
❯ AMD_VULKAN_ICD=RADV build/bin/llama-bench -fa 1 -r 1 --mmap 0 -m /models/gguf/Qwen3-30B-A3B-UD-Q4_K_XL.gguf -d 130560
```
| Backend | pp512 @ d130560 (t/s) | tg128 @ d130560 (t/s) |
|---------------|---------------|---------------|
| Vulkan AMDVLK | 10.75 | 3.51 |
| Vulkan RADV | 17.24 | 12.54 |

These runs took about 2-3 hours on Strix Halo. You can do the same with the ROCm backend (I recommend using the rocWMMA compile, and testing with and without `ROCBLAS_USE_HIPBLASLT=1` to compare the rocBLAS vs hipBLASLt kernels), but I leave that as a multi-hour exercise for the reader. This is simply illustrative of the differences you might see. If you're doing an overnight run, something like `-d 0,5000,10000,50000,100000` might give you a better idea of how things slow down as context grows.
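
As a rough sketch (reusing the model path from above; the depth values are just illustrative), such an overnight depth sweep might look like:

```
# Measure tg128 at several context depths in a single run; -r 1 keeps total runtime down
build/bin/llama-bench -fa 1 -r 1 --mmap 0 -m /models/gguf/Qwen3-30B-A3B-UD-Q4_K_XL.gguf \
  -p 0 -n 128 -d 0,5000,10000,50000,100000
```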
NOTE: `pp` or context processing is critically important if you are loading previous long conversations/large context (agentic flows) or adding many tokens (grounding via search, tools, file attachments, etc). However, for single-user conversational usage, `llama-cli` has a `--prompt-cache-all` flag and `llama-server` lets you send `"cache_prompt": true` with your request to re-use previously cached context, similar to vLLM's [automatic prefix caching](https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html) or SGLang's [radix caching](https://lmsys.org/blog/2024-01-17-sglang/).
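
As a minimal sketch (the model path, prompt text, and port are placeholders; flag and field names assume a reasonably recent llama.cpp build), prompt caching looks roughly like this:

```
# llama-cli: persist the evaluated prompt to disk so follow-up runs can re-use it
build/bin/llama-cli -m /models/gguf/your-model.gguf \
  --prompt-cache /tmp/session.bin --prompt-cache-all -p "Long system prompt..."

# llama-server: ask the server to re-use cached KV state for a matching prefix
curl http://localhost:8080/completion -H "Content-Type: application/json" \
  -d '{"prompt": "Long system prompt...", "n_predict": 128, "cache_prompt": true}'
```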