Commit b5d1d4

2025-08-09 19:15:02 lhl: fixes to describing tensor math/limits
Guides/AI-Capabilities.md ..
@@ -20,7 +20,7 @@
512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
```
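The peak-throughput arithmetic above can be sanity-checked with a short script (a minimal sketch; the 512 ops/clock/CU, 40 CU, and 2.9 GHz figures are the ones from the formula above):

```python
# Verify the theoretical FP16 peak from the formula above:
# 512 FP16 ops/clock/CU * 40 CUs * 2.9 GHz boost clock
ops_per_clock_per_cu = 512
num_cus = 40
clock_hz = 2.9e9

peak_fp16_tflops = ops_per_clock_per_cu * num_cus * clock_hz / 1e12
print(peak_fp16_tflops)  # 59.392
```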
- Note that this theoretical max involves perfect utilization of the [dual issue capabilities of the WGPs](https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture) which can be [hard in practice](https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/#tinygrad-rdna3-matrix-multiplication-benchmark). The best way to achieve efficient AI acceleration on RDNA3 is [via WMMA](https://gpuopen.com/learn/wmma_on_rdna3/).
+ For more information on how RDNA3 does its tensor math, see this article on [WMMA on RDNA3](https://gpuopen.com/learn/wmma_on_rdna3/).
| Type | RDNA3 | RDNA4/sparse |
|------------|--------------------|------------------------|
@@ -29,7 +29,7 @@
| INT8 | 512 ops/cycle | 2048/4096 ops/cycle |
| INT4 | 1024 ops/cycle | 4096/8192 ops/cycle |
- In practice, as of 2025-08, Strix Halo's performance is substantially lower. It may be worth tracking these issues:
+ In practice, as of 2025-08, Strix Halo's tested performance is substantially lower than what should be theoretically achievable. You can track performance progress via these issues:
- https://github.com/ROCm/ROCm/issues/4748
- https://github.com/ROCm/ROCm/issues/4499
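The per-type ops/cycle figures in the table translate into theoretical peaks the same way as the FP16 calculation. A small helper to illustrate (a sketch assuming the same 40 CU count and 2.9 GHz clock used above; `peak_tops` is a hypothetical name, not a real API):

```python
# Hypothetical helper: convert a per-CU ops/cycle figure from the table
# into a theoretical peak in TOPS, assuming 40 CUs at 2.9 GHz as above.
def peak_tops(ops_per_cycle_per_cu, num_cus=40, clock_hz=2.9e9):
    return ops_per_cycle_per_cu * num_cus * clock_hz / 1e12

print(peak_tops(512))   # RDNA3 INT8: 59.392
print(peak_tops(1024))  # RDNA3 INT4: 118.784
```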