AMD Strix Halo (gfx1151) AI Software Stack - Knowledge Base
Last Updated: February 2026
Primary Sources: lhl/strix-halo-testing, Strix Halo Wiki, llm-tracker.info
Note: This knowledge base is reference material curated from community sources. The SoftAb project’s ablation scope is narrower: we test software stack configurations (drivers, ROCm versions, installation methods, backends) - not model comparisons, quantization sweeps, or inference parameter tuning. See README.md for project scope.
SoftAb’s Approach to Dependency Hell
AMD Strix Halo’s software ecosystem is rapidly evolving with frequent breaking changes:
- Kernel updates break ROCm packages
- ROCm version X works with kernel Y but not Z
- Firmware versions can break GPU detection
- Different installation methods (Fedora repos, AMD repos, TheRock nightlies) have different compatibility
Instead of manually troubleshooting, SoftAb provides:
- Pre-configured Docker images for every major combination
- Automated test scripts that quickly identify what works
- Structured documentation of failures (error patterns, root causes)
When you upgrade your kernel, run an experiment to test all configs. Most will fail - document the failures to build a compatibility matrix for your environment.
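For illustration, a minimal sketch of such a sweep, assuming hypothetical image tags and a placeholder smoke-test entrypoint (SoftAb's actual images and scripts may differ):

```bash
#!/usr/bin/env bash
# Hypothetical compatibility sweep: run each pre-built image's smoke test
# on the current kernel and record which software stacks still work.
# Image tags and /opt/softab/smoke-test.sh are placeholders.
set -u

images=(
  "softab/llama-cpp:vulkan-radv"
  "softab/llama-cpp:rocm-6.4.4"
  "softab/llama-cpp:rocm-therock-7.11"
)

for img in "${images[@]}"; do
  if docker run --rm --device /dev/kfd --device /dev/dri "$img" \
       /opt/softab/smoke-test.sh >/dev/null 2>&1; then
    echo "PASS  $img  (kernel $(uname -r))"
  else
    echo "FAIL  $img  (kernel $(uname -r))"
  fi
done
```

Passing /dev/kfd and /dev/dri into the container is the standard way to expose the AMD GPU; the loop simply turns the pass/fail results into the compatibility matrix described above.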
Executive Summary
AMD Strix Halo (Ryzen AI Max+ 395) offers 128GB unified memory enabling 70B+ parameter models on a single chip. The software stack is maturing but requires careful configuration.
Key Findings:
- Vulkan outperforms ROCm by 2-2.5x for prompt processing in most scenarios
- ROCm + rocWMMA + Flash Attention excels at long context (8K+)
- Fedora 43 is recommended over Ubuntu 24.04 (simpler ROCm install, newer kernel)
- Kernel 6.15+ required, 6.16.9+ fixes VRAM visibility bugs
- ⚠️ On the current kernel (6.18.6), standard ROCm is broken - you MUST use TheRock nightlies
- Full official ROCm support expected Q2 2026 with ROCm 7.2.2
- The old advice that gfx1100 builds are faster than gfx1151 is OUTDATED: native gfx1151 (TheRock 7.11) is now 2x faster for transformers
- Disable hipBLASLt (ROCBLAS_USE_HIPBLASLT=0) for +20% PyTorch performance (see the sketch below)
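A minimal sketch of applying these settings before a PyTorch run (the script name is a placeholder; the variables themselves come from the findings above):

```bash
# Route GEMMs through rocBLAS instead of hipBLASLt on gfx1151
# (community-reported ~+20% PyTorch throughput)
export ROCBLAS_USE_HIPBLASLT=0
# Recommended stability/allocator settings from this knowledge base
export HSA_ENABLE_SDMA=0
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True"
python train_or_benchmark.py   # placeholder script
```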
Quick Reference
| Topic | Quick Answer | Details |
| --- | --- | --- |
| Best backend? | Vulkan RADV for general use; ROCm for long context | ROCm Support |
| Best distro? | Fedora 43 | Applications |
| Kernel version? | 6.18.3-200 (avoid 6.18.4+) | Troubleshooting |
| ROCm version? | TheRock 7.11 (best), 7.2.0 (official), or 6.4.4 (stable fallback) | ROCm Support |
| Required flags? | --no-mmap -ngl 999 -fa 1 for llama.cpp | Troubleshooting |
| GPU not detected? | Check kernel 6.16.9+, firmware not 20251125 | Troubleshooting |
| Turnkey solution? | kyuz0 toolboxes for llama.cpp | Applications |
Documentation Structure
- GPU compute specifications (59.4 TFLOPS peak FP16)
- Memory architecture (128GB unified LPDDR5X)
- Comparison with competition (M4 Max, DGX Spark, RTX PRO)
- Model capacity guide (70B comfortable, 235B Q3_K near limit)
- Performance baselines
- Official support timeline (6.4.4 stable, 7.2.2 coming Q2 2026)
- Library support (rocBLAS, hipBLASLt, rocWMMA, AOTriton)
- TheRock nightlies installation
- Critical environment variables (HSA_ENABLE_SDMA=0)
- Backend performance comparison (Vulkan vs HIP)
- PyTorch compatibility and installation
- Triton/AOTriton support
- Linux distribution recommendations (Fedora 43 vs Ubuntu 24.04)
- Application support status:
- llama.cpp (Vulkan ✅, ROCm HIP ✅)
- Ollama (Vulkan mode supported)
- vLLM (official support via PR #25908)
- whisper.cpp (ROCm 7.0.1+ working)
- pyannote-audio (GPU acceleration working)
- Turnkey solutions:
- kyuz0 Strix Halo Toolboxes (recommended for llama.cpp)
- Ryzers (AMD Research framework)
- Ryzen AI SDK 1.6.1
- GAIA, Lemonade Server
- Reference Dockerfiles for gfx1151
- Known issues and workarounds:
- VRAM visibility bug (upgrade to kernel 6.16.9+)
- GPU stuck at low clocks (echo high fix)
- MES/kernel hangs (amdgpu.cwsr_enable=0)
- Bad firmware (avoid 20251125)
- Kernel 6.18.4+ compatibility (use 6.18.3-200)
- Container permission denied
- Performance optimization:
- System-level (tuned, IOMMU disable, force high perf)
- llama.cpp specific (--no-mmap, -fa 1, -ngl 999)
- Flash Attention impact (+11.5% pp, +8.8% tg)
- Environment variables reference
- Vulkan driver selection and batch sizes
- Diagnostic commands
- Primary documentation (Strix Halo Wiki, lhl’s repo, llm-tracker.info)
- Interactive tools (benchmark viewer, forums, Discord)
- Key repositories (kyuz0, scottt, lhl, lemonade-sdk)
- Lemonade Server (AMD official)
- Kernel and container considerations
- Glossary of terms
- Version history
Hardware Target
- APU: AMD Ryzen AI Max+ 395 (Strix Halo)
- GPU: Radeon 8060S (40 CU, gfx1151)
- Architecture: RDNA 3.5 / Zen 5
- Memory: 128GB unified LPDDR5X-8000 (256 GB/s GPU bandwidth)
- Peak FP16: 59.4 TFLOPS theoretical
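For reference, the headline figures follow from the CU count and memory configuration (assuming a ~2.9 GHz boost clock and RDNA 3.5's dual-issue packed-FP16 path; treat this as back-of-the-envelope arithmetic, not a spec sheet):

```latex
% Peak FP16: CUs x lanes/CU x dual-issue x FMA x packed FP16 x clock
40 \times 64 \times 2 \times 2 \times 2 \times 2.9\,\mathrm{GHz} \approx 59.4\ \mathrm{TFLOPS}
% Memory bandwidth: 256-bit LPDDR5X-8000 bus
8000\,\mathrm{MT/s} \times 256\,\mathrm{bit} / 8 = 256\,\mathrm{GB/s}
```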
Essential Commands
# Required environment variables
export HSA_ENABLE_SDMA=0
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True"
# llama.cpp critical flags
llama-cli --no-mmap -ngl 999 -fa 1 -m model.gguf
# Check GPU
rocminfo | grep gfx
rocm-smi --showmeminfo vram
# Force high performance
echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
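A follow-up sanity check, assuming the iGPU enumerates as card0 (adjust the path if another DRM device comes first):

```bash
# Confirm the forced performance level took effect
cat /sys/class/drm/card0/device/power_dpm_force_performance_level

# Memory the amdgpu driver exposes to the GPU (VRAM carve-out vs. GTT);
# relevant when chasing the VRAM visibility bug fixed in kernel 6.16.9+
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total
```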
Quick Links
Contributing
This knowledge base is community-driven. If you find issues or have updates:
- Test your findings on Strix Halo hardware
- Document your system configuration (kernel, ROCm version, etc.)
- Submit issues or PRs to SoftAb GitHub
- Share benchmarks at llm-tracker.info
License
Knowledge base content is MIT licensed. Community contributions acknowledged.
Last major update: 2026-02-02 (Added ROCm 7.2 info, PyTorch NN benchmark data, hipBLASLt findings)
Software stack status: Rapidly evolving - verify current status before major decisions