# AI Capabilities

> [!WARNING]
> This guide is a work in progress; the information might be incomplete or plain wrong.

### Intro
It's pretty much as promised: there's a lot of memory, and you can run a lot of stuff with it. Still, in my opinion it doesn't make sense to allocate more than 64 GB to the GPU for these purposes, because even though you'd have enough memory to fit big models, the memory bandwidth would be a huge bottleneck, making most stuff unusable for realtime applications.
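
To put some rough numbers on that (assuming the typical Strix Halo configuration of LPDDR5X-8000 on a 256-bit bus, which works out to about 256 GB/s of theoretical memory bandwidth): for dense models, every generated token has to read essentially the whole model from memory, so generation speed is capped at roughly bandwidth divided by model size. A ~40 GB Q4 quant of a 70B model therefore can't do much better than 256 / 40 ≈ 6 t/s in theory, and the real-world 4.1 t/s for Llama 3.3 70B in the table below is in that ballpark. More memory lets you fit bigger models, but it doesn't make them any faster to run.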

One really cool thing is that you can keep several models loaded in memory, ready to call instantly when you need them, and still be able to play some really demanding games at the same time.

### LLMs
Context processing is very slow due to memory speed limitations, so feeding in more than 4k of context at once is painful (if you're using models in chat mode and adding context gradually, it's fine, of course). Using FlashAttention and KV cache quantization is highly recommended, and a BLAS batch size of 512 seems to be optimal.
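
For reference, a KoboldCPP launch with those settings could look roughly like the sketch below. The model path, context size and layer count are just placeholders, and flag names occasionally change between KoboldCPP releases, so double-check them against `--help`:

```bash
# A minimal sketch, not a verified command: model path, context size and
# layer count are placeholders; verify flag names with ./koboldcpp --help.
./koboldcpp \
  --model ./gemma-3-27b-Q5_K_M.gguf \
  --usevulkan \
  --gpulayers 999 \
  --contextsize 8192 \
  --flashattention \
  --quantkv 1 \
  --blasbatchsize 512
```

Setting `--gpulayers` higher than the model's actual layer count simply offloads everything to the GPU, and `--quantkv 1` should correspond to an 8-bit KV cache; the example config file linked below is probably the better reference for the exact settings behind the numbers in the table.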

MoE models work very well; hopefully we'll see some 50-70B ones in the future, as they could be the real sweet spot for this hardware.

Some real-life examples (KoboldCPP, Vulkan, full GPU offloading, [example config file](./gemma-3-27b.kcpps)):

| Model                 | Quantization | Prompt Processing | Generation Speed |
| --------------------- | ------------ | ----------------- | ---------------- |
| Llama 4 Scout 17B 16E | Q4_K_XL      | 106.0 t/s         | 12.7 t/s         |
| Llama 3.3 70B         | Q4_K_M       | 51.1 t/s          | 4.1 t/s          |
| Gemma 3 27B           | Q5_K_M       | 94.4 t/s          | 6.2 t/s          |
| Qwen3 30B A3B         | Q5_K_M       | 94.5 t/s          | **27.8 t/s**     |
| GLM 4 9B              | Q5_K_M       | **273.7 t/s**     | 15.0 t/s         |

#### Additional Resources
- Deep dive into LLM usage on Strix Halo: https://llm-tracker.info/_TOORG/Strix-Halo
- Newbie Linux inference guide: https://github.com/renaudrenaud/local_inference

### Image/Video Generation
I didn't play with it too much yet, but it looks like this is where the memory bandwidth and GPU performance limitations hit the hardest. With SDXL you can generate an image every 4-5 seconds, but moving to something like Flux leads to wait times of several minutes.

### Relevant Pages
- [[Guides/110GB-of-VRAM]]