2025-08-09 13:07:57 lhl:
First pass rewrite of introduction
Guides/AI-Capabilities.md ..
@@ -3,15 +3,47 @@
> [!WARNING]
> This guide is a work in progress, the information might be incomplete or plain wrong.
-### Intro
-It's pretty much as promised - there's a lot of memory, you can run a lot of stuff with it. Still, in my opinion, it doesn't make sense to allocate more than 64G to the gpu for these purposes, because even though you'll have enough memory to put big models into, the memory bandwidth would be a huge bottleneck, making most stuff unusable for realtime applications.
+## Intro
+Strix Halo can be a capable local LLM inferencing platform. With up to 128GiB of shared system memory (LPDDR5x-8000 on a 256-bit bus), it has a theoretical peak memory bandwidth of 256GB/s, roughly double that of most desktop PC and APU platforms.
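As a quick sanity check on that bandwidth figure, here is a minimal back-of-the-envelope sketch (the 8000 MT/s and 256-bit numbers come from the spec above):

```python
# Theoretical peak bandwidth for LPDDR5x-8000 on a 256-bit bus.
transfers_per_sec = 8000e6      # 8000 MT/s
bytes_per_transfer = 256 / 8    # 256-bit bus = 32 bytes per transfer
peak_gb_s = transfers_per_sec * bytes_per_transfer / 1e9
print(f"{peak_gb_s:.0f} GB/s")  # -> 256 GB/s
```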
-One really cool thing is that you can just keep several models in the memory to call instantly when you need them and still be able to play some really demanding games at the same time.
+That being said, it's important to put this in context: 256GB/s is still much lower than most mid-range dGPUs (e.g., a 3060 Ti has 448GB/s). Also, the Strix Halo GPU uses the RDNA3.5 architecture (gfx1151), which is architecturally sub-optimal for AI workloads, and its ROCm support remains far less tuned than other RDNA3 platforms like the 7900 series (gfx1100). For compute and memory bandwidth, you can think of the Strix Halo GPU as a [Radeon RX 7600 XT](https://www.techpowerup.com/gpu-specs/radeon-rx-7600-xt.c4190), but with up to 128GiB of VRAM.
-### LLMs
-Context processing is very slow due to memory speed limitations, adding more than 4k context at once is painful (if you're using models in chat mode, adding context gradually, then it's fine of course), using FlashAttention and KV cache quantization is highly recommended. BLAS batch size of 512 seems to be optimal.
+Due to limited memory bandwidth and compute, unless you're very patient, Strix Halo is best suited for real-time inferencing with quantized versions of large Mixture-of-Experts (MoE) LLMs that have relatively few active parameters, or for keeping multiple models loaded (or a model loaded while doing other, non-GPU tasks).
+
+Software support is another issue. Vulkan works well on Strix Halo on both Windows and Linux (Mesa RADV and AMDVLK), but its ROCm support is still immature and incomplete. If you are doing more than running common desktop inferencing software (llama.cpp, etc.), you will want to do some careful research.
+
+### GPU Compute
+For the 40-CU Radeon 8060S at a max clock of 2.9GHz, the 395 Strix Halo should have a theoretical peak of 59.4 FP16/BF16 TFLOPS:
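A minimal sketch of where that 59.4 figure comes from, assuming best-case RDNA3 throughput (2x SIMD32 per CU, an FMA counted as 2 FLOPs, dual-issue, and 2x packed FP16):

```python
# Theoretical peak FP16/BF16 throughput for the 40-CU Radeon 8060S at 2.9GHz.
cus = 40
simds_per_cu = 2      # 2x SIMD32 per CU
lanes_per_simd = 32
fma_flops = 2         # one FMA = 2 FLOPs
dual_issue = 2        # dual-issue (VOPD/WMMA) best case
packed_fp16 = 2       # 2x packed FP16/BF16
clock_ghz = 2.9

peak_tflops = (cus * simds_per_cu * lanes_per_simd * fma_flops
               * dual_issue * packed_fp16 * clock_ghz) / 1000
print(f"{peak_tflops:.1f} FP16/BF16 TFLOPS")  # -> 59.4
```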
+Note that this theoretical max assumes perfect utilization of the [dual-issue capabilities of the WGPs](https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture), which can be [hard in practice](https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/#tinygrad-rdna3-matrix-multiplication-benchmark). The best way to achieve efficient AI acceleration on RDNA3 is [via WMMA](https://gpuopen.com/learn/wmma_on_rdna3/).
+In practice, as of 2025-08, Strix Halo's performance is substantially lower. It may be worth tracking these issues:
+- https://github.com/ROCm/ROCm/issues/4748
+- https://github.com/ROCm/ROCm/issues/4499
+
+## LLMs
+The recommended way to run LLMs is with [llama.cpp](https://github.com/ggml-org/llama.cpp), or one of the apps that leverage it like [LM Studio](https://lmstudio.ai/), using the Vulkan backend.
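Once a model is loaded, llama.cpp's llama-server (and LM Studio) exposes an OpenAI-compatible HTTP API, so standard client code works against it. A minimal sketch, assuming a server is already running locally (8080 is llama-server's default port; LM Studio defaults to 1234):

```python
# Query a locally running llama-server (or LM Studio) through its
# OpenAI-compatible chat completions endpoint.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "One sentence on Strix Halo, please."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # adjust host/port for your setup
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```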
+
+AMD has also been sponsoring the rapidly developing [Lemonade Server](https://lemonade-server.ai/) - an easy-to-install, all-in-one package that leverages the latest builds of software (like llama.cpp) and even provides NPU/hybrid inferencing on Windows. If you're new or unsure of what to run, you might want to check that out first.
+
+One of our community members, kyuz0, also maintains [AMD Strix Halo Llama.cpp Toolboxes](https://github.com/kyuz0/amd-strix-halo-toolboxes), which provides Docker builds of all the llama.cpp backends.
+
+Performance has been improving, and several of our community members have been running extensive LLM inferencing benchmarks on many models (standard pp512/tg128 runs; see the sketch after this list):
+- https://kyuz0.github.io/amd-strix-halo-toolboxes/ - an interactive viewer of standard pp512/tg128 results
+- https://github.com/lhl/strix-halo-testing/tree/main/llm-bench - graphs and sweeps of pp/tg from 1-4096 for a variety of architectures
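For reference, pp512/tg128 are llama-bench's default prompt-processing and token-generation test sizes. A minimal sketch of reproducing them, assuming a Vulkan build of llama.cpp's `llama-bench` is on your PATH and `model.gguf` is a placeholder for the model you want to test:

```python
# Run llama.cpp's llama-bench with the standard pp512/tg128 test sizes.
import subprocess

subprocess.run(
    ["llama-bench", "-m", "model.gguf", "-p", "512", "-n", "128"],
    check=True,
)
```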
+
+TODO: add a list of models that work well, with their pp512/tg128 numbers, memory usage, model architecture, and weight sizes?
-MoE models work very well, hopefully we'll see some 50-70B ones in the future, they could be the real sweet spot for this hardware.
Some test results (KoboldCPP 1.96.2, Vulkan, full GPU offloading, [example config file](./gemma-3-27b.kcpps)) **on Windows**:
@@ -33,12 +65,12 @@
All tests were run 3 times and the best result was picked. Generation speed is pretty stable, prompt processing speed fluctuates a bit (5-10%).
-#### Additional Resources
+### Additional Resources
- Deep dive into LLM usage on Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Newbie Linux inference guide: https://github.com/renaudrenaud/local_inference
- Ready to use Docker containers: https://github.com/kyuz0/amd-strix-halo-toolboxes
-### Image/Video Generation
+## Image/Video Generation
Didn't play with it too much yet, but looks like here the memory bandwidth and GPU performance limitations strike the most. With SDXL you can generate an image every 4-5 seconds, but going to something like Flux will lead to wait times of several minutes.