vLLM
@lhl got the first public vLLM builds working, and shortly after Discord member @ssweens created Arch-based Dockerfiles. @kyuz0 adapted these into amd-strix-halo-vllm-toolboxes, which is currently (2025-09-13) probably the easiest way to bring up vLLM on Strix Halo.
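If you just want a working environment, the toolbox route is roughly the following. This is only a sketch: the image name and tag below are placeholders, not verified; take the real ones from the amd-strix-halo-vllm-toolboxes README.

```bash
# Placeholder image reference - use the actual image listed in the toolbox repo's README.
toolbox create --image docker.io/kyuz0/amd-strix-halo-vllm-toolboxes:latest vllm
toolbox enter vllm
```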
Current Status
While vLLM can be built and runs basic models like llama2, newer models (gpt-oss, for example) or models with different kernel/code dependencies may not work.
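A quick way to check whether a given model works on your build is to serve it and hit the OpenAI-compatible API. A minimal smoke test might look like this (the model name is just an example; any small model you have access to will do):

```bash
# Serve a small model and confirm the server answers.
vllm serve meta-llama/Llama-2-7b-chat-hf --max-model-len 2048

# In another shell:
curl http://localhost:8000/v1/models
```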
Build Steps
A general description of the process if you want to build vLLM yourself. You should work in a clean environment if possible:
- First, have ROCm set up (the latest production release or a TheRock build should be fine); a quick check is shown after this list
- Build PyTorch with AOTriton support - currently, TheRock PyTorch builds do not include this, so you need to build your own (see the sketch after this list)
- Build vLLM - there are many patches you might need to apply (see the build sketch after this list):
  - run use_existing_torch.py so the build uses your locally built PyTorch
  - add gfx1151 to CMakeLists.txt
  - build isolation fixes
  - patch out amdsmi (not gfx1151 compatible)
  - then install with: pip install -e . --no-build-isolation
- Either set constraints or reinstall torch/triton afterwards, as some intermediate steps may try to overwrite them; alternatively you can pass --no-deps or install some packages manually, but life is short (see the sketch after this list)
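Before starting the PyTorch build, confirm ROCm is set up and actually sees the Strix Halo GPU:

```bash
# gfx1151 should show up as an agent; if it does not, fix the ROCm install first.
rocminfo | grep -i gfx1151
```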
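A rough sketch of the PyTorch source build targeting gfx1151. Exact flags differ across PyTorch/ROCm versions; USE_FLASH_ATTENTION=1 is assumed here to be what keeps the AOTriton-backed attention kernels in the build, so double-check against the PyTorch version you use.

```bash
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python tools/amd_build/build_amd.py   # HIPify the CUDA sources for ROCm
export USE_ROCM=1
export PYTORCH_ROCM_ARCH=gfx1151
export USE_FLASH_ATTENTION=1          # assumption: keeps the AOTriton SDPA/flash-attention path enabled
pip install -r requirements.txt
python setup.py develop
```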
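The vLLM build itself, sketched under the assumption that you are building against the torch you just made. Patch details (and the requirements file name/location) move around between vLLM versions.

```bash
git clone https://github.com/vllm-project/vllm
cd vllm
python use_existing_torch.py          # drop the pinned torch deps, reuse your local build
# edit CMakeLists.txt: add gfx1151 to the list of supported ROCm archs
# patch out / stub the amdsmi calls (amdsmi is not gfx1151 compatible)
pip install -r requirements/rocm.txt  # assumption: file name/location varies by vLLM version
pip install -e . --no-build-isolation
```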
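To stop intermediate steps from replacing your torch/triton, one option is a pip constraints file; the other is --no-deps plus manual installs. A sketch of both:

```bash
# Pin the locally built versions so pip will not swap them out.
printf 'torch==%s\ntriton==%s\n' \
  "$(python -c 'import torch; print(torch.__version__)')" \
  "$(python -c 'import triton; print(triton.__version__)')" > constraints.txt
PIP_CONSTRAINT=constraints.txt pip install -e . --no-build-isolation

# Or skip dependency resolution entirely and install anything missing by hand.
pip install -e . --no-build-isolation --no-deps
```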