Status: Preview support in ROCm 6.4.4; full official support expected Q2 2026 with ROCm 7.2.2
| Version | Status | Notes |
|---|---|---|
| ROCm 6.4.4 | ✅ Preview/Stable | AMD’s stable stepping stone; PyTorch support on Windows/Linux |
| ROCm 6.5.0rc | ⚠️ Community | scottt’s self-contained wheels (no system ROCm needed) |
| ROCm 7.0.x | ⚠️ Regressions | Some models slower than 6.4.4 |
| ROCm 7.1.1 | ⚠️ Batch norm bug | ROCm#5339 reported issues |
| ROCm 7.2.0 | ✅ Released | Counter collection for gfx1150/gfx1151; rocWMMA gfx1150 support; JAX 0.8.0 |
| ROCm 7.2.2 | 🔜 Q2 2026 | Full official support with AMD’s “Ryzen AI Halo” dev kit |
| TheRock 7.11 | ✅ Nightlies | Best native gfx1151 support; 2x faster transformers vs gfx1100 fallback |
AMD Official Statement (ROCm#5339):
> “The 6.4.4 PyTorch releases are stepping stones to getting full support with ROCm 7.x + Strix Halo.”
| Library | Status | Notes |
|---|---|---|
| rocBLAS | ✅ Full | gfx1151 TensileLibraries included |
| hipBLASLt | ✅ Full | `ROCBLAS_USE_HIPBLASLT=1` is the older guidance; see the SoftAb finding below (`=0` is faster on Strix Halo) |
| rocWMMA | ✅ Added | Via PR #538; needed for Flash Attention |
| AOTriton | ⚠️ Experimental | Custom builds required for FA |
| MIOpen | ⚠️ Partial | Some convolution issues |
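To confirm that an installed ROCm tree actually ships gfx1151 kernels, you can look for the architecture in the rocBLAS Tensile library files. This is a quick heuristic, and the `lib/rocblas/library` path is an assumption based on typical ROCm install layouts:

```bash
# Look for gfx1151 Tensile kernels in the rocBLAS library directory
# (path layout assumed from common ROCm installs; adjust ROCM_PATH as needed)
ls "${ROCM_PATH:-/opt/rocm}/lib/rocblas/library" | grep -i gfx1151 | head
```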
For best compatibility with the latest kernels, use TheRock nightly builds. Download a gfx1151 nightly tarball from https://github.com/ROCm/TheRock/releases/ and extract it:

```bash
# Untar to any folder (e.g., /opt/rocm or ~/therock/rocm-7.0)
tar xf rocm-*.tar.xz -C ~/therock/
```
Environment Setup Script (required to use the extracted tarball):
```bash
# ---- ROCm nightly from extracted tarball ----
export ROCM_PATH=$HOME/therock/rocm-7.0   # adjust to your path
export HIP_PLATFORM=amd
export HIP_PATH=$ROCM_PATH
export HIP_CLANG_PATH=$ROCM_PATH/llvm/bin
export HIP_INCLUDE_PATH=$ROCM_PATH/include
export HIP_LIB_PATH=$ROCM_PATH/lib
export HIP_DEVICE_LIB_PATH=$ROCM_PATH/lib/llvm/amdgcn/bitcode

# Search paths -- prepend
export PATH="$ROCM_PATH/bin:$HIP_CLANG_PATH:$PATH"
export LD_LIBRARY_PATH="$HIP_LIB_PATH:$ROCM_PATH/lib:$ROCM_PATH/lib64:$ROCM_PATH/llvm/lib:${LD_LIBRARY_PATH:-}"
export LIBRARY_PATH="$HIP_LIB_PATH:$ROCM_PATH/lib:$ROCM_PATH/lib64:${LIBRARY_PATH:-}"
export CPATH="$HIP_INCLUDE_PATH:${CPATH:-}"
export PKG_CONFIG_PATH="$ROCM_PATH/lib/pkgconfig:${PKG_CONFIG_PATH:-}"
```
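After sourcing the variables above, a quick sanity check confirms the extracted toolchain is the one being picked up and that the APU is visible (both tools ship in the tarball):

```bash
# Should print the HIP version from the extracted tree, not a system install
hipconfig --version
# Should list the gfx1151 agent
rocminfo | grep -m1 gfx1151
```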
Compiling llama.cpp with ROCm 7.0 Nightlies:
One `HIP_VERSION` change is required; see https://www.reddit.com/r/LocalLLaMA/comments/1m6b151/comment/n4jlc3z for details.
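With that change applied, a typical build looks like the sketch below. The flag names follow llama.cpp's upstream HIP documentation; `GGML_HIP_ROCWMMA_FATTN` enables the rocWMMA Flash Attention path and assumes the rocWMMA headers from the library table above are present:

```bash
# Build llama.cpp against the extracted ROCm tree (ROCM_PATH from the setup script above)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$ROCM_PATH/llvm/bin/clang++" HIP_PATH="$ROCM_PATH" \
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```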
```bash
# REQUIRED for stability
export HSA_ENABLE_SDMA=0   # Fix checkerboard artifacts

# For PyTorch (IMPORTANT: disable hipBLASLt for better performance!)
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True"
export ROCBLAS_USE_HIPBLASLT=0   # +20% GEMM performance on Strix Halo
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

# For llama.cpp (hipBLASLt may help or hurt - benchmark your model)
# export ROCBLAS_USE_HIPBLASLT=1   # Test both 0 and 1
```
⚠️ SoftAb finding:
`ROCBLAS_USE_HIPBLASLT=0` improves PyTorch GEMM by ~20% on Strix Halo (32.7 vs 27.2 TFLOPS). This contradicts some older guidance recommending `ROCBLAS_USE_HIPBLASLT=1`.
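A minimal way to reproduce this comparison is a GEMM timing loop with the variable toggled. This is an illustrative sketch, not SoftAb's harness; the matrix size and dtype are arbitrary choices:

```bash
for v in 0 1; do
  echo "ROCBLAS_USE_HIPBLASLT=$v"
  ROCBLAS_USE_HIPBLASLT=$v python - <<'EOF'
import time, torch
n = 8192
a = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
b = torch.randn(n, n, dtype=torch.bfloat16, device="cuda")
for _ in range(3):          # warm-up
    a @ b
torch.cuda.synchronize()
t0 = time.time()
for _ in range(20):
    a @ b
torch.cuda.synchronize()
# 2*n^3 FLOPs per GEMM, 20 iterations
print(f"{2 * n**3 * 20 / (time.time() - t0) / 1e12:.1f} TFLOPS")
EOF
done
```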
| Use Case | Recommended Backend |
|---|---|
| General inference | Vulkan (AMDVLK for pp, RADV for stability) |
| Long context (8K+) | ROCm + rocWMMA + FA |
| PyTorch training | ROCm (TheRock nightlies) |
| Maximum stability | Vulkan RADV |
| Windows | Vulkan (ROCm not well supported) |
Important: The best backend for prompt processing (pp) often differs from the best for token generation (tg); if you are optimizing for a specific workload, benchmark both phases separately. Backends also decay differently as context length grows: some have a higher peak but drop off faster at long context.
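llama-bench already reports pp and tg as separate rows, so one invocation per backend covers both phases (the model path is a placeholder; `-p`/`-n` accept comma-separated lists):

```bash
# Vulkan build, forcing the RADV ICD
AMD_VULKAN_ICD=RADV ./build/bin/llama-bench -m model.gguf -p 512,2048 -n 128
# ROCm build with the rocWMMA Flash Attention path
./build/bin/llama-bench -m model.gguf -fa 1 -p 512,2048 -n 128
```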
Token Generation (Vulkan AMDVLK) - gpt-oss-120b Q8/MXFP4:
| Context | DGX Spark (t/s) | Strix Halo (t/s) | Spark Advantage |
|---|---|---|---|
| 2K | 52.87 | 50.05 | +5.6% |
| 8K | 48.46 | 43.15 | +12.3% |
| 32K | 38.76 | 31.54 | +22.9% |
Token Generation (ROCm + rocWMMA, lhl tuned):
| Context | DGX Spark (t/s) | Strix Halo (t/s) | Spark Advantage |
|---|---|---|---|
| 2K | 52.87 | 48.97 | +8.0% |
| 8K | 48.46 | 43.55 | +11.3% |
| 32K | 38.76 | 36.43 | +6.4% |
Key Insight: lhl’s tuned rocWMMA branch significantly improves long-context performance, closing the gap with DGX Spark.
Performance varies by model size on Strix Halo’s unified memory architecture:
| Model | Size | Backend | pp512 (t/s) | Notes |
|---|---|---|---|---|
| TinyLlama 1.1B Q4_K_M | ~0.6 GiB | Vulkan RADV | 5369 | Small models = highest t/s |
| Llama 2 7B Q4_0 | ~3.5 GiB | Vulkan RADV | 908-1012 | Mid-size models (FA impact visible) |
| Qwen3MoE 30B Q5_K | ~20 GiB | Vulkan AMDVLK | 525-735 | Large models (context-dependent) |
Key Insight: Smaller models achieve much higher tokens/sec because they need less memory bandwidth per token, make better use of cache, and require less compute per token.
```bash
# Option 1: gfx1151-specific nightlies (recommended)
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --pre torch torchaudio torchvision

# Option 2: Official ROCm wheels
pip install torch torchvision torchaudio -f https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/

# Option 3: scottt/rocm-TheRock wheels (recommended for CV work)
pip install "https://github.com/scottt/rocm-TheRock/releases/download/v6.5.0rc-pytorch/torch-2.7.0a0+gitbfd8155-cp311-cp311-linux_x86_64.whl"
```
| Source | Version | Native gfx1151? | hipBLASLt | Peak TFLOPS | Notes |
|---|---|---|---|---|---|
| PyPI rocm6.2 | 2.5.1+rocm6.2 | ❌ Fails | N/A | 0 | “invalid device function” |
| PyPI rocm6.2 + HSA_OVERRIDE | 2.5.1+rocm6.2 | ⚠️ gfx1100 fallback | Disabled | 25.6 (43%) | Works but suboptimal |
| TheRock 7.11 | 2.11.0a0 | ✅ Native gfx1151 | Disabled | 32.7 (55%) | Best config - disable hipBLASLt! |
| TheRock 7.11 + hipBLASLt | 2.11.0a0 | ✅ Native gfx1151 | Enabled | 27.2 (46%) | hipBLASLt hurts performance |
Key Findings (SoftAb benchmark 2026-02-02):
- Wheels from download.pytorch.org/whl/rocm6.2 do NOT include gfx1151 kernels
- Set `ROCBLAS_USE_HIPBLASLT=0` for the best GEMM throughput on gfx1151

| Model | TheRock gfx1151 | ROCm 6.2 gfx1100 | Speedup |
|---|---|---|---|
| ResNet-18 | 4697 img/s | 4665 img/s | 1.0x |
| ResNet-50 | 1083 img/s | 1085 img/s | 1.0x |
| ViT-B/16 | 725 img/s | 317 img/s | 2.3x |
| BERT-base (seq=128) | 1193 seq/s | 709 seq/s | 1.7x |
| BERT-base (seq=512) | 271 seq/s | 125 seq/s | 2.2x |
Key Insight: CNNs perform similarly, but attention-based models are 2x faster on native gfx1151.
📊 Full benchmark data: See docker/pytorch/BENCHMARKS.md for complete results, image compatibility matrix, and recommended configurations.
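To spot-check a number from the table on your own machine, a simple throughput loop like the sketch below works. It assumes torchvision is installed; the table's figures come from SoftAb's harness, so absolute numbers will differ:

```bash
python - <<'EOF'
import time, torch, torchvision
# ResNet-50 inference throughput, bf16, batch 64, channels-last
model = torchvision.models.resnet50().eval().cuda().to(memory_format=torch.channels_last)
x = torch.randn(64, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    for _ in range(5):      # warm-up
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(20):
        model(x)
    torch.cuda.synchronize()
print(f"{64 * 20 / (time.time() - t0):.0f} img/s")
EOF
```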
```bash
# Forces gfx1100 kernels - WORKS but with caveats
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HSA_ENABLE_SDMA=0
# Results in: 30.9 TFLOPS (52% utilization)
# Warning: hipBLASLt disabled, may crash on long runs
```
⚠️ Not recommended for production - causes eventual MES/kernel errors.
```bash
# Clone lhl's build scripts
git clone https://github.com/lhl/strix-halo-testing
cd strix-halo-testing/torch-therock

# Follow setup script
./00-setup-env.sh

# Build with AOTriton
export PYTORCH_ROCM_ARCH="gfx1151"
export USE_AOTRITON=1 BUILD_AOTRITON=1
```
Related resources:

- github.com/scottt/aotriton/commits/gfx1151-therock
- docker.io/kyuz0/vllm-therock-gfx1151
- github.com/lhl/strix-halo-testing/tree/main/torch-therock

| Driver | Optimal `-ub` | Prompt (pp) | Generation (tg) | Stability |
|---|---|---|---|---|
| AMDVLK | 512 | Better | Good | Good |
| Mesa RADV | 1024 | Lower | Better | Best |
```bash
# Use RADV explicitly
AMD_VULKAN_ICD=RADV llama-cli ...

# Use AMDVLK (default when both installed)
llama-cli ...
```
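To double-check which ICD a process actually picked up (vulkaninfo ships with vulkan-tools):

```bash
AMD_VULKAN_ICD=RADV vulkaninfo --summary | grep -iE 'driverName|deviceName'
```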
| Project | Notes |
|---|---|
| llama.cpp | Mature Vulkan backend, KHR_coopmat support |
| Kompute | Linux Foundation GPU compute framework |
| Ollama | `OLLAMA_VULKAN=1` for Vulkan mode |
| LM Studio | Vulkan works (ROCm broken on gfx1151) |
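For example, per the Ollama row above (this assumes the `OLLAMA_VULKAN` toggle noted in the table applies to the server process; treat it as experimental):

```bash
OLLAMA_VULKAN=1 ollama serve
```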
Back to: KNOWLEDGE_BASE.md