Hardware Target: AMD Ryzen AI Max+ 395 (Strix Halo) with Radeon 8060S iGPU
| Metric | Value |
|---|---|
| Architecture | RDNA 3.5 (gfx1151) |
| Compute Units | 40 |
| Max Clock | 2.9 GHz |
| Peak FP16/BF16 | 59.4 TFLOPS |
| Peak INT8 | 59.4 TOPS |
| Peak INT4 | 118.8 TOPS |
| Memory | 128GB LPDDR5X-8000 (unified) |
| Memory Bandwidth | 256 GB/s theoretical, ~215 GB/s measured (84% efficiency) |
Peak FP16 Calculation: 512 FP16 ops/clock/CU × 40 CU × 2.9 GHz = 59.4 TFLOPS
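The INT8, INT4, and bandwidth figures in the table follow from the same style of arithmetic. A quick sanity-check sketch (the 256-bit bus width and the INT8/INT4 per-clock rates are inferred from the table above, not independently verified):

```python
# Sanity check of the spec-table figures above.
# Assumptions (not measured here): INT8 runs at the same per-clock rate as FP16,
# INT4 at twice that rate, and the memory interface is 256-bit LPDDR5X.

CUS = 40
CLOCK_HZ = 2.9e9
FP16_OPS_PER_CLOCK_PER_CU = 512

peak_fp16_tflops = FP16_OPS_PER_CLOCK_PER_CU * CUS * CLOCK_HZ / 1e12
peak_int8_tops = peak_fp16_tflops        # same per-clock rate as FP16
peak_int4_tops = peak_fp16_tflops * 2    # double-rate INT4

# LPDDR5X-8000: 8000 MT/s over a 256-bit (32-byte) bus
mem_bw_gbs = 8000e6 * (256 / 8) / 1e9

print(f"Peak FP16: {peak_fp16_tflops:.1f} TFLOPS")  # ~59.4
print(f"Peak INT8: {peak_int8_tops:.1f} TOPS")      # ~59.4
print(f"Peak INT4: {peak_int4_tops:.1f} TOPS")      # ~118.8
print(f"Memory BW: {mem_bw_gbs:.0f} GB/s")          # 256
```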
| Spec | Strix Halo 395 | Mac Studio M4 Max | DGX Spark | RTX PRO 6000 |
|---|---|---|---|---|
| Price | $2,000 | $3,499 | $4,000 | $8,200 |
| Power (Max W) | 120 | 300+ | 240 | 600 |
| Memory | 128GB | 128GB | 128GB | 96GB GDDR7 |
| Memory BW | 256 GB/s | 546 GB/s | 273 GB/s | 1792 GB/s |
| FP16 TFLOPS | 59.4 | 34.1 | 62.5 | 251.9 |
Key Advantage: Strix Halo trades memory bandwidth for capacity, which makes it well suited to large models that need the 128GB unified memory pool (see the model table below).
Create /etc/modprobe.d/amdgpu-ai.conf:
```
# GTT allocation via TTM (pages_limit × 4KB = total bytes)
# 31457280 pages × 4KB = 120 GiB
options ttm pages_limit=31457280

# Optional: Pre-allocate to reduce fragmentation
# Set equal to pages_limit for AI-dedicated systems
options ttm page_pool_size=31457280

# Legacy parameter (still set for compatibility)
options amdgpu gttsize=122800
```
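Both knobs encode the same target size: `ttm.pages_limit` counts 4 KiB pages, while `amdgpu.gttsize` is given in MiB. A small conversion helper (illustrative only, not part of any existing tool) reproduces the values used in the GRUB options below:

```python
# Convert a target GTT size into the two kernel parameters used here.
# ttm.pages_limit counts 4 KiB pages; amdgpu.gttsize is specified in MiB.

def gtt_params(target_gib: int) -> dict:
    total_bytes = target_gib * 1024**3
    return {
        "ttm.pages_limit": total_bytes // 4096,     # 4 KiB TTM pages
        "amdgpu.gttsize": total_bytes // 1024**2,   # MiB
    }

print(gtt_params(124))  # {'ttm.pages_limit': 32505856, 'amdgpu.gttsize': 126976}
print(gtt_params(128))  # {'ttm.pages_limit': 33554432, 'amdgpu.gttsize': 131072}
```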
Add to GRUB_CMDLINE_LINUX_DEFAULT:
```
# Option 1: Maximum memory (kyuz0 toolboxes - 124GB GPU, 4GB OS)
amd_iommu=off           # May improve perf (unverified, see experiment-kernel-config.md)
amdgpu.gttsize=126976   # 124GB GTT size
ttm.pages_limit=32505856  # 124GB in pages (4KB pages)

# Option 2: Alternative full allocation (~128GB GPU)
amd_iommu=off           # May improve perf (unverified, see experiment-kernel-config.md)
amdgpu.gttsize=131072   # 128GB GTT size
ttm.pages_limit=33554432  # ~128GB in pages
amdgpu.cwsr_enable=0    # Fix MES hangs (add if needed)
```
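After rebooting, the allocation that actually took effect can be read back from the amdgpu sysfs counters. A minimal check (assumes the iGPU is `card0`; adjust the index if other DRM devices are present):

```python
# Read the GTT/VRAM sizes the driver actually applied (sysfs values are in bytes).
from pathlib import Path

dev = Path("/sys/class/drm/card0/device")  # adjust card index if needed

def read_gib(name: str) -> float:
    return int((dev / name).read_text()) / 1024**3

print(f"GTT total:  {read_gib('mem_info_gtt_total'):.1f} GiB")
print(f"GTT used:   {read_gib('mem_info_gtt_used'):.1f} GiB")
print(f"VRAM total: {read_gib('mem_info_vram_total'):.1f} GiB")
```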
| Model | Size | Status |
|---|---|---|
| 70B Q4_K | ~37 GiB | ✅ Comfortable |
| 109B MoE (Llama 4 Scout) | ~58 GiB | ✅ Works well |
| 235B MoE Q3_K | ~97 GiB | ⚠️ Near limit |
VRAM Estimation (using kyuz0’s gguf-vram-estimator.py):
```
# Llama-4-Scout (109B MoE, 17B active) Q4_K at different context lengths
#   4K context:  61.6 GB
#   32K context: 74.8 GB
#   1M context:  108.8 GB

# Qwen3-235B Q3_K (97 GiB model)
#   4K context:   ~97 GiB
#   32K context:  ~110 GiB
#   130K context: ~130 GiB (max on a 128GB system)
```
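These estimates are dominated by two terms: the weight file itself and the KV cache, which grows linearly with context length. A simplified approximation (this is not kyuz0's script; it ignores compute buffers and other runtime overhead, and the example parameters are illustrative rather than taken from a specific model):

```python
# Rough VRAM estimate: weights + KV cache.  The KV cache stores keys and values
# for every layer: 2 * n_layers * n_kv_heads * head_dim * context * bytes/element.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

def total_vram_gib(model_file_gib: float, **kv) -> float:
    return model_file_gib + kv_cache_gib(**kv)

# Illustrative values only -- read the real ones from the model's GGUF metadata.
print(f"{total_vram_gib(37, n_layers=80, n_kv_heads=8, head_dim=128, context_len=32768):.1f} GiB")
```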
| Workload | Achieved | Theoretical | Utilization |
|---|---|---|---|
| GEMM (hipBLASLt, native gfx1151) | 36.9 TFLOPS | 59.4 TFLOPS | 62% |
| GEMM (HSA_OVERRIDE gfx1100) | 30.9 TFLOPS | 59.4 TFLOPS | 52% |
| GEMM (rocBLAS only) | 5.1 TFLOPS | 59.4 TFLOPS | 9% |
| Flash Attention | ~1 TFLOPS | - | Poor |
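A number like the hipBLASLt row can be approximated with a timed FP16 matmul. A minimal sketch, assuming a ROCm build of PyTorch (which exposes the iGPU through the `cuda` device API); the gfx1100 row presumably corresponds to forcing gfx1100 kernels via `HSA_OVERRIDE_GFX_VERSION=11.0.0`:

```python
# Time a large FP16 GEMM and convert to TFLOPS (2*N^3 FLOPs per N x N matmul).
import time
import torch

N = 8192
a = torch.randn(N, N, dtype=torch.float16, device="cuda")
b = torch.randn(N, N, dtype=torch.float16, device="cuda")

for _ in range(3):          # warm-up
    a @ b
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

print(f"{2 * N**3 / dt / 1e12:.1f} TFLOPS")
```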
| Backend | Prompt (t/s) | Generation (t/s) | Notes |
|---|---|---|---|
| Vulkan AMDVLK | 884 | 53 | Best prompt processing |
| Vulkan RADV | 729 | 55 | Slightly better generation, more stable |
| ROCm HIP baseline | 349 | 49 | Without optimizations |
| ROCm + rocWMMA + FA | 344 | 51 | Better at long context |
| CPU only | 294 | 29 | Fallback |
Key Finding: Vulkan outperforms ROCm by roughly 2-2.5x on prompt processing in most scenarios, while generation speed is comparable across the GPU backends.
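Single-stream generation is largely memory-bandwidth bound: each token has to stream the active weights from memory, so the ~215 GB/s measured bandwidth sets a rough ceiling. This is also why MoE models generate much faster than dense models of the same total size. A back-of-the-envelope sketch (an approximation that ignores KV-cache traffic and cache effects):

```python
# Rough bandwidth-bound ceiling on generation speed: bytes read per token is
# approximately the active weight footprint (full model for dense models,
# active experts only for MoE).

def max_generation_tps(active_weights_gib: float, measured_bw_gbs: float = 215.0) -> float:
    bytes_per_token = active_weights_gib * 1024**3
    return measured_bw_gbs * 1e9 / bytes_per_token

# e.g. a 70B Q4_K model (~37 GiB of weights, all active every token)
print(f"{max_generation_tps(37):.1f} t/s ceiling")
```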
Back to: KNOWLEDGE_BASE.md