Commit 3ecd12

2025-08-11 11:43:18 lhl: llama.cpp and some cleanup
Guides/AI-Capabilities.md ..
@@ 1,18 1,16 @@
# AI Capabilities
> [!WARNING]
- > This guide is a work in progress. Also, there has been ongoing progress with software quality and performance so please be aware that some of this information may become out of date.
+ > This guide is a work in progress. Also, there has been ongoing progress with software quality and performance, so please be aware that some of this information may rapidly become out of date.
## Intro
- Strix Halo can be a capable local LLM inferencing platform. With up to 128GiB of shared system memory (LPDDR5x-8000 on a 256-bit bus), it has a theoretical limit of 256GiB/s, double most PC desktop and APU platforms.
+ Strix Halo can be a capable local LLM inferencing platform. With up to 128GiB of shared system memory (LPDDR5x-8000 on a 256-bit bus) and a theoretical bandwidth limit of 256GiB/s (double that of most PC desktop and APU platforms), it is well suited for running quantized medium-sized (~30B) dense models as well as large 100B+ parameter sparse Mixture-of-Experts (MoE) models.
- That being said, it's important to put this in context. 256GiB/s is still much lower than most mid-range dGPUs. As a point of reference, a 3060 Ti has 448 GiB/s of MBW. Also, the Strix Halo GPU uses an RDNA3.5 architecture (gfx1151), which for AI is pretty sub-optimal architecturally. For compute and memory bandwidth, you can think of the Strix Halo GPU like a [Radeon RX 7600 XT](https://www.techpowerup.com/gpu-specs/radeon-rx-7600-xt.c4190), but with up to 128GiB of VRAM.
+ There have been some wild claims made about Strix Halo's AI capabilities, so it's important to put this in context. 256GiB/s of MBW is still much lower than most mid-range dGPUs. As a point of reference, a 3060 Ti has 448 GiB/s of MBW. Also, the Strix Halo GPU uses an RDNA3.5 architecture (gfx1151), which is architecturally sub-optimal for AI. For compute and memory bandwidth, you can think of the Strix Halo GPU like a [Radeon RX 7600 XT](https://www.techpowerup.com/gpu-specs/radeon-rx-7600-xt.c4190), but with up to 100GiB+ of VRAM.
- Due to limited memory-bandwidth and compute, unless you're very patient, for real-time inferencing, Strix Halo is best for quantized versions of large Mixture-of-Expert (MoE) LLMs that have fewer activations or for having multiple models loaded or models loaded while doing other (non-GPU) tasks.
+ Vulkan works well for Strix Halo on Windows and Linux (both Mesa RADV and AMDVLK), but as of August 2025, its ROCm support is still immature and incomplete, and you may find running other AI/ML tasks (training, image/video generation) slow or impossible.
- The software support is also another issue. Strix Halo's Vulkan works well on Windows and Linux (Mesa RADV and AMDVLK), but its ROCm support is still immature and incomplete, and far under-tuned compared to other RDNA3 platforms like the 7900 series (gfx1100).
-
- If you are doing more than running common desktop inferencing software (llama.cpp, etc), then you will want to do some careful research.
+ If you are doing more than running common desktop inferencing software (llama.cpp, etc), then you will want to do some careful research.
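As a rough sanity check on these claims, note that token generation (`tg`) is mostly memory-bandwidth bound: every generated token has to stream the active weights through the memory bus at least once. Below is a minimal back-of-envelope sketch; the model sizes and ~4.5 bits/weight quant level are illustrative assumptions, not measurements.

```python
# Back-of-envelope ceiling on token generation (tg) speed: each generated token streams
# the active weights once, so tok/s <= memory bandwidth / active weight bytes.
MBW_GIB_S = 256  # theoretical Strix Halo memory bandwidth

def tg_ceiling(active_params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Upper bound on decode tokens/s, assuming weights are read once per token."""
    active_gib = active_params_billions * 1e9 * bits_per_weight / 8 / 1024**3
    return MBW_GIB_S / active_gib

print(f"~30B dense (all params active):    {tg_ceiling(30):.0f} tok/s ceiling")
print(f"100B+ MoE with ~12B active params: {tg_ceiling(12):.0f} tok/s ceiling")
```

Real-world numbers will be lower (KV cache reads, compute overhead, prompt processing), but this is why sparse MoE models with a small number of active parameters are a much better fit than large dense models.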
### GPU Compute
For the 40CU Radeon 8060S at a max clock of 2.9GHz, the 395 Strix Halo should have a peak of 59.4 FP16/BF16 TFLOPS:
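For reference, here is how that peak figure works out. This is just the arithmetic; the 512 FP16 FLOPS/CU/clock factor is the usual RDNA3 accounting (2 SIMD32 per CU x 32 lanes x packed FP16 x dual-issue x 2 FLOPS per FMA).

```python
# Peak FP16/BF16 throughput for the 40CU Radeon 8060S at 2.9 GHz.
cus = 40
clock_ghz = 2.9
fp16_flops_per_cu_per_clock = 512  # RDNA3: 2 SIMD32 x 32 lanes x 2 (packed FP16) x 2 (dual-issue) x 2 (FMA)

peak_tflops = cus * fp16_flops_per_cu_per_clock * clock_ghz / 1000
print(f"{peak_tflops:.2f} FP16/BF16 TFLOPS peak")  # -> 59.39
```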
@@ 33,27 31,50 @@
- https://github.com/ROCm/ROCm/issues/4748
- https://github.com/ROCm/ROCm/issues/4499
+ ### Comparison to other available options
+ There are not that many options for those looking for 96GB+ of VRAM, and among its competition, Strix Halo is well priced:
+
+ | Spec                      | AMD Ryzen AI Max+ 395        | Apple Mac Studio M4 Max          | NVIDIA DGX Spark                             | NVIDIA RTX PRO 6000 |
+ |---------------------------|------------------------------|----------------------------------|---------------------------------------------|---------------------|
+ | **Release Date** | Spring 2025 | March 12, 2025 | August 2025 | May 2025 |
+ | **Price** | $2,000 | $3,499 | $3,000+ | $8,200 |
+ | **Power (Max W)** | 120 | 300+ | ?? | 600 |
+ | **CPU** | 16x Zen5 5.1 GHz | 16x M4 4.5 GHz | 10x Arm Cortex-X925, 10x Arm Cortex-A725 | |
+ | **GPU** | RDNA 3.5 | M4 | Blackwell | Blackwell |
+ | **GPU Cores** | 40 | 40 | 192 | 752 |
+ | **GPU Clock** | 2.9 GHz | 1.8 GHz | 2.5 GHz | 2.6 GHz |
+ | **Memory** | 128GB LPDDR5X-8000 | 128GB LPDDR5X-8533 | 128GB LPDDR5X-8533 | 96GB GDDR7 ECC |
+ | **Memory Bandwidth** | 256 GB/s | 546 GB/s | 273 GB/s | 1792 GB/s |
+ | **FP16 TFLOPS** | 59.39 | 34.08 | 62.5 | 251.90 |
+ | **FP8 TFLOPS** | | | 125 | 503.79 |
+ | **INT8 TOPS (GPU)** | 59.39 (same as FP16) | 34.08 (same as FP16) | 250 | 1007.58 |
+ | **INT4 TOPS (GPU)** | 118.78 | | 500 | 2015.16 |
+ | **INT8 TOPS (NPU)** | 50 | 38 | | |
+ | **Storage** | 2-3x NVMe | 512GB (non-upgradable) + 3x TB5 (120 Gbps) | NVMe | |
+
+ An NVIDIA RTX PRO 6000 (Blackwell) dGPU is included as a point of comparison: if you are willing to pay (significantly) more, you can get much better performance. Of course, there are many other options if you are open to a bigger form factor, power envelope, or price point (usually all three).
+
## Setup
### Memory Limits
- For the Strix Halo GPU, memory is either GART, which is a fixed reserved aperture set exclusively in the BIOS, and GTT, which is a dynamically allocable memory amount of memory. In Windows, this should be automatic (but is limited to 96GB?). In Linux, it can be set via boot configuration.
+ For the Strix Halo GPU, memory is either GART, a fixed reserved aperture set exclusively in the BIOS, or GTT, a dynamically allocatable amount of memory. In Windows, this should be automatic (but is limited to 96GB). In Linux, it can be set via boot configuration up to the point of system instability.
- As long are your software supports using GTT, for AI purposes, you are probably best off setting GART to the minimum (eg, 512MB) and then allocating via GTT. In Linux, you can create a conf in your `/etc/modprobe.d/` (like `/etc/modprobe.d/amdgpu_llm_optimized.conf`):
+ As long as your software supports using GTT, for AI purposes you are probably best off setting GART to the minimum (e.g., 512MB) and then allocating automatically via GTT. In Linux, you can create a conf in your `/etc/modprobe.d/` (like `/etc/modprobe.d/amdgpu_llm_optimized.conf`):
```
- # Maximize GTT for LLM usage on 128GB UMA system
+ # Use up to 120GB GTT with 60GB pre-reserved
options amdgpu gttsize=120000
options ttm pages_limit=31457280
options ttm page_pool_size=15728640
```
- `amdgpu.gttsize` is an [officially deprecated](https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg117333.html) parameter that may be referenced by some software, so it's best to still set it to match, but GTT allocation in linux is now handled by the Translation Table Maps (TTM) memory management subsystem .
+ `amdgpu.gttsize` is an [officially deprecated](https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg117333.html) parameter that may be referenced by some software, so it's best to still set it to match, but GTT allocation in Linux is now actually handled by the Translation Table Maps (TTM) memory management subsystem.
- `pages_limit` sets the maximum number of 4KiB pages that can be used for GPU memory (the sketch below shows how these page counts map to GiB).
- `page_pool_size` pre-caches/allocates the memory for usage by the GPU. (This memory will not be available to the rest of your system.) In theory you could set this to 0, but if you are looking for the maximum performance (minimizing fragmentation), then you can set it to match the `pages_limit` size.
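The `ttm` values are simply counts of 4KiB pages, so a target size in GiB just needs to be divided by the page size. A quick sketch; the example values correspond to the 120GiB limit and 60GiB pool used in the conf above:

```python
# ttm pages_limit / page_pool_size are counts of 4KiB pages.
def ttm_pages(gib: int) -> int:
    return gib * 1024**3 // 4096  # bytes / 4KiB page size

print(ttm_pages(120))  # 31457280 -> ttm pages_limit (120GiB GTT cap)
print(ttm_pages(60))   # 15728640 -> ttm page_pool_size (60GiB pre-reserved)
```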
You can increase the limits as high as you want, but you'll want to make sure you reserve enough memory for your OS/system.
- If you are setting `page_pool_size` lower than the `pages_limit` you may want to try increasing `amdgpu.vm_fragment_size=8` (4=64K default, 9=2M) to allocate in bigger chunks.
+ If you are setting `page_pool_size` lower than `pages_limit`, you may want to try increasing the fragment size, e.g. `amdgpu.vm_fragment_size=8` (default 4=64K, 9=2M), to allocate in bigger chunks.
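After rebooting, it's worth sanity-checking what the kernel actually picked up. A minimal sketch, assuming the standard `amdgpu` sysfs memory attributes (`mem_info_gtt_total` and friends) are exposed for your card:

```python
# Print the GTT/VRAM totals and current usage reported by the amdgpu driver.
from pathlib import Path

for card in sorted(Path("/sys/class/drm").glob("card[0-9]")):
    dev = card / "device"
    for name in ("mem_info_gtt_total", "mem_info_gtt_used",
                 "mem_info_vram_total", "mem_info_vram_used"):
        attr = dev / name
        if attr.exists():
            print(f"{card.name} {name}: {int(attr.read_text()) / 1024**3:.1f} GiB")  # values are in bytes
```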
### ROCm
The latest release version of ROCm (6.4.3 as of this writing) has preliminary rocBLAS support for Strix Halo gfx1151, but it is much slower and buggier (and doesn't have hipBLASlt kernels). For the most up-to-date support, it's currently recommended to use the latest gfx1151 [TheRock/ROCm "nightly" release](https://github.com/ROCm/TheRock/blob/main/RELEASES.md). These can be found at [https://therock-nightly-tarball.s3.amazonaws.com/](https://therock-nightly-tarball.s3.amazonaws.com/) (find the filename) or you can use the helper scripts described in the [Releases page](https://github.com/ROCm/TheRock/blob/main/RELEASES.md).
@@ 79,7 100,7 @@
## LLMs
The recommended way to run LLMs is with [llama.cpp](https://github.com/ggml-org/llama.cpp) or one of the apps that leverage it like [LM Studio](https://lmstudio.ai/) with the Vulkan backend.
- AMD has also been sponsoring the rapidly developing [Lemonade Server](https://lemonade-server.ai/) - an easy-to-install, all-in-one package that leverages the latest builds of software (like llama.cpp) and even provides NPU/hybrid inferencing on Windows. If you're new or unsure on what to run, you might wan to check that out first.
+ AMD has also been sponsoring the rapidly developing [Lemonade Server](https://lemonade-server.ai/) - an easy-to-install, all-in-one package that leverages the latest builds of software (like llama.cpp) and even provides NPU/hybrid inferencing on Windows. If you're new or unsure on what to run, you might want to check that out first.
One of our community members, kyuz0, also maintains the [AMD Strix Halo Llama.cpp Toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) repository, which has Docker builds of all llama.cpp backends.
@@ 87,11 108,17 @@
- https://kyuz0.github.io/amd-strix-halo-toolboxes/ - an interactive viewer of standard pp512/tg128 results
- https://github.com/lhl/strix-halo-testing/tree/main/llm-bench - graphs and sweeps of pp/tg from 1-4096 for a variety of architectures
- TODO: add a list of some models that work well with pp512/tg128, memory usage, model architecture, weight sizes?
+ ### llama.cpp
+ The easiest way to get llama.cpp to work is with the Vulkan backend. This is reliable and relatively performant on both Windows and Linux, and you can either [build it yourself](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#vulkan) or simply [download the latest release](https://github.com/ggml-org/llama.cpp/releases) for your OS.
+ - For Vulkan on Linux, you should install both the AMDVLK and Mesa RADV drivers. When both are installed, AMDVLK will be the default Vulkan driver, which is generally fine as its `pp` can be up to 2X faster than Mesa RADV. You can set `AMD_VULKAN_ICD=RADV` to switch to RADV and compare. The latter tends to have slightly higher `tg` speed, hold up better at long context, and be slightly more reliable, so you should test both and see which works better for you (see the benchmarking sketch at the end of this section).
+
+ If you want to use the ROCm backend (the rocWMMA FA implementation in particular can offer huge, e.g. 2X, `pp` performance advantages vs Vulkan), the easiest way is to use kyuz0's [AMD Strix Halo Llama.cpp Toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes) container builds or Lemonade's [llamacpp-rocm](https://github.com/lemonade-sdk/llamacpp-rocm) builds. If you're looking to build your own, please refer to: [[Guides/AI-Capabilities/llamacpp-with-ROCm]]
- For llama.cpp, for Vulkan, you should install AMDVLK and Mesa RADV. When both are installed AMDVLK will be the default Vulkan driver, which is generally fine as it's `pp` can be 2X faster than Mesa RADV. You can set `AMD_VULKAN_ICD=RADV` to try out RADV though if you run into problems or are curious.
+ If you are using the llama.cpp ROCm backend, you may also want to try the hipBLASlt kernels via the `ROCBLAS_USE_HIPBLASLT=1` environment variable, as they are sometimes faster than the default rocBLAS kernels.
+
+
+ TODO: add a list of some models that work well with pp512/tg128, memory usage, model architecture, weight sizes?
- If you are using the llama.cpp ROCm backend, you should always attempt to use the hipBLASlt kernel `ROCBLAS_USE_HIPBLASLT=1` as it is almost always much (like 2X) faster than the default rocBLAS kernels. Currently ROCm often hangs or crashes on different model architectures and is generally slower than Vulkan (although sometimes the `pp` can be much faster), so unless you're looking to experiment, you'll probably be better off using Vulkan.
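To decide between backends and drivers on your own hardware, the simplest approach is to run the same `llama-bench` pp512/tg128 test under each configuration and compare the reported t/s. A minimal sketch, assuming a `llama-bench` binary in the current directory and a placeholder `model.gguf` path (both are assumptions, substitute your own):

```python
# Run the same pp512/tg128 benchmark under different driver/kernel configurations.
import os
import subprocess

MODEL = "model.gguf"  # placeholder - point this at your own GGUF file

runs = {
    "AMDVLK (default Vulkan ICD)": {},
    "Mesa RADV": {"AMD_VULKAN_ICD": "RADV"},
    # For a ROCm build of llama-bench, also try the hipBLASlt kernels:
    # "ROCm + hipBLASlt": {"ROCBLAS_USE_HIPBLASLT": "1"},
}

for label, extra_env in runs.items():
    print(f"=== {label} ===")
    subprocess.run(
        ["./llama-bench", "-m", MODEL, "-p", "512", "-n", "128"],
        env={**os.environ, **extra_env},
        check=True,
    )
```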
### Additional Resources
- Deep dive into LLM usage on Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
@@ 99,11 126,8 @@
- Ready to use Docker containers: https://github.com/kyuz0/amd-strix-halo-toolboxes
## Image/Video Generation
- For Windows you can give AMUSE is the easiest way to get started quickly: https://www.amuse-ai.com/
+ For Windows you can give AMUSE a try. It's probably the easiest way to get started quickly: https://www.amuse-ai.com/
Here are some instructions for getting ComfyUI up and running on Windows: https://www.reddit.com/r/StableDiffusion/comments/1lmt44b/running_rocmaccelerated_comfyui_on_strix_halo_rx/
Note that while TheRock has [PyTorch nightlies](https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages) available for Strix Halo gfx1151, they *do not* currently have AOTriton or FlashAttention (FA) and may not run very well.
-
- ### Relevant Pages
- - [[Guides/110GB-of-VRAM]]