
# AMD Strix Halo Hardware Specifications

**Hardware Target:** AMD Ryzen AI Max+ 395 (Strix Halo) with Radeon 8060S iGPU

## GPU Compute Specifications

| Metric | Value |
|--------|-------|
| Architecture | RDNA 3.5 (gfx1151) |
| Compute Units | 40 |
| Max Clock | 2.9 GHz |
| Peak FP16/BF16 | 59.4 TFLOPS |
| Peak INT8 | 59.4 TOPS |
| Peak INT4 | 118.8 TOPS |
| Memory | 128 GB LPDDR5X-8000 (unified) |
| Memory Bandwidth | 256 GB/s theoretical, ~215 GB/s measured (~84% efficiency) |

**Peak FP16 calculation:** 512 FP16 ops/clock/CU × 40 CU × 2.9 GHz = 59.392 TFLOPS (rounded to 59.4 in the table)
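
The same arithmetic in a few lines of Python. This is just a sketch of the table above: INT8 shares the FP16 per-clock rate and INT4 runs at double rate, per the spec table.

```python
# Reproduce the peak-throughput figures from the spec table.
# Assumes 512 FP16 ops/clock/CU (the rate used in the calculation above);
# INT8 shares that rate and INT4 doubles it.
cus = 40
clock_hz = 2.9e9
fp16_ops_per_clock_per_cu = 512

fp16_tflops = fp16_ops_per_clock_per_cu * cus * clock_hz / 1e12
print(f"FP16/BF16: {fp16_tflops:.1f} TFLOPS")    # 59.4
print(f"INT8:      {fp16_tflops:.1f} TOPS")      # 59.4
print(f"INT4:      {fp16_tflops * 2:.1f} TOPS")  # 118.8
```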

## Comparison with Competition

| Spec | Strix Halo 395 | Mac Studio M4 Max | DGX Spark | RTX PRO 6000 |
|------|----------------|-------------------|-----------|--------------|
| Price | $2,000 | $3,499 | $4,000 | $8,200 |
| Max Power (W) | 120 | 300+ | 240 | 600 |
| Memory | 128GB | 128GB | 128GB | 96GB GDDR7 |
| Memory BW | 256 GB/s | 546 GB/s | 273 GB/s | 1792 GB/s |
| FP16 TFLOPS | 59.4 | 34.1 | 62.5 | 251.9 |

**Key Advantage:** Strix Halo trades bandwidth for capacity, which makes it ideal for:

- Large quantized models (70B dense up to ~235B MoE; see the Model Capacity Guide below) that exceed the VRAM of conventional discrete GPUs
- Budget- and power-constrained builds ($2,000 at 120 W versus the alternatives above)

## Memory Architecture

### Understanding GTT vs GART

GTT (Graphics Translation Table) is the pool of ordinary system RAM that the amdgpu driver, via the kernel's TTM allocator, can hand to the GPU; on a unified-memory APU like Strix Halo, nearly all model data lives in GTT rather than in a small VRAM carve-out. GART (Graphics Address Remapping Table) is the hardware page table that presents those scattered system pages to the GPU as a contiguous address range. The settings below raise the GTT limit so the GPU can use most of the 128 GB.

### Linux Memory Configuration

Create `/etc/modprobe.d/amdgpu-ai.conf`:

```
# GTT allocation via TTM (pages_limit × 4 KiB pages = total bytes)
# 31457280 pages × 4 KiB = 120 GiB
options ttm pages_limit=31457280

# Optional: pre-allocate the page pool to reduce fragmentation.
# Set equal to pages_limit for AI-dedicated systems.
options ttm page_pool_size=31457280

# Legacy parameter, in MiB (still set for compatibility)
options amdgpu gttsize=122800
```
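
A quick converter for deriving these values (`gtt_params` is a hypothetical helper for illustration, not part of any tool): `ttm.pages_limit` counts 4 KiB pages, while `amdgpu.gttsize` is specified in MiB.

```python
# Derive ttm/amdgpu parameter values for a target GTT size in GiB.
# Hypothetical helper for illustration only.
def gtt_params(target_gib: int) -> dict:
    return {
        "ttm.pages_limit": target_gib * 2**30 // 4096,  # 4 KiB pages
        "amdgpu.gttsize": target_gib * 1024,            # MiB
    }

print(gtt_params(120))  # pages_limit=31457280, gttsize=122880
                        # (the file above uses 122800, slightly under 120 GiB)
print(gtt_params(124))  # pages_limit=32505856, gttsize=126976
```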

### Kernel Parameters

Add to `GRUB_CMDLINE_LINUX_DEFAULT`:

```
# Option 1: Maximum memory (kyuz0 toolboxes: 124 GB for the GPU, 4 GB for the OS)
amd_iommu=off              # May improve perf (unverified; see experiment-kernel-config.md)
amdgpu.gttsize=126976      # 124 GiB GTT (value in MiB)
ttm.pages_limit=32505856   # 124 GiB in 4 KiB pages

# Option 2: Alternative full allocation (128 GiB for the GPU)
amd_iommu=off              # May improve perf (unverified; see experiment-kernel-config.md)
amdgpu.gttsize=131072      # 128 GiB GTT (value in MiB)
ttm.pages_limit=33554432   # 128 GiB in 4 KiB pages
amdgpu.cwsr_enable=0       # Fix MES hangs (add if needed)
```
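
To confirm the new limit took effect after a reboot, read amdgpu's GTT counters from sysfs. A minimal sketch; the values are in bytes, and `card0` may be a different index on your system.

```python
# Check the effective GTT limit via amdgpu's sysfs files.
from pathlib import Path

dev = Path("/sys/class/drm/card0/device")  # adjust card index if needed
total = int((dev / "mem_info_gtt_total").read_text())
used = int((dev / "mem_info_gtt_used").read_text())
print(f"GTT total: {total / 2**30:.1f} GiB, used: {used / 2**30:.1f} GiB")
```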

## Model Capacity Guide

| Model | Size | Status |
|-------|------|--------|
| 70B Q4_K | ~37 GiB | ✅ Comfortable |
| 109B MoE (Llama 4 Scout) | ~58 GiB | ✅ Works well |
| 235B MoE Q3_K | ~97 GiB | ⚠️ Near limit |

**VRAM Estimation** (using kyuz0’s `gguf-vram-estimator.py`):

```
# Llama-4-Scout (109B MoE, 17B active) Q4_K at different context lengths
# 4K context:   61.6 GB
# 32K context:  74.8 GB
# 1M context:  108.8 GB

# Qwen3-235B Q3_K (97 GiB model)
# 4K context:   ~97 GiB
# 32K context:  ~110 GiB
# 130K context: ~130 GiB max on 128GB system
```
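
Why context length dominates these numbers: the weights are a fixed cost, while the FP16 KV cache grows linearly with context. Below is a first-order sketch of that relationship, not kyuz0's estimator; the model dimensions are illustrative placeholders, and real runtimes add compute buffers and per-backend overhead on top.

```python
# First-order VRAM estimate: quantized weights + FP16 KV cache.
def estimate_gib(weights_gib, n_layers, n_kv_heads, head_dim, ctx):
    # K and V caches: 2 tensors × layers × kv_heads × head_dim × ctx × 2 bytes (FP16)
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * 2
    return weights_gib + kv_bytes / 2**30

# Hypothetical 48-layer model (55 GiB of weights, 8 KV heads of dim 128)
for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} ctx: {estimate_gib(55.0, 48, 8, 128, ctx):.1f} GiB")
```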

## Performance Baseline

### Measured Performance

| Workload | Achieved | Theoretical | Utilization |
|----------|----------|-------------|-------------|
| GEMM (hipBLASLt, native gfx1151) | 36.9 TFLOPS | 59.4 TFLOPS | 62% |
| GEMM (HSA_OVERRIDE gfx1100) | 30.9 TFLOPS | 59.4 TFLOPS | 52% |
| GEMM (rocBLAS only) | 5.1 TFLOPS | 59.4 TFLOPS | 9% |
| Flash Attention | ~1 TFLOPS | - | Poor |

### llama.cpp Benchmarks (Llama-2-7B Q4_0, pp512/tg128)

| Backend | Prompt (t/s) | Generation (t/s) | Notes |
|---------|--------------|------------------|-------|
| Vulkan AMDVLK | 884 | 53 | Best prompt processing |
| Vulkan RADV | 729 | 55 | Slightly better tg, more stable |
| ROCm HIP baseline | 349 | 49 | Without optimizations |
| ROCm + rocWMMA + FA | 344 | 51 | Better at long context |
| CPU only | 294 | 29 | Fallback |

**Key Finding:** Vulkan outperforms ROCm by 2-2.5× on prompt processing in most scenarios.
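
The generation column also follows from a simple roofline: single-stream decoding streams every weight once per token, so it is bound by memory bandwidth rather than compute. A sketch using the measured 215 GB/s from the spec table; the ~3.9 GB Q4_0 weight size is an approximation.

```python
# Bandwidth roofline for token generation: t/s ≈ usable bandwidth / bytes per token.
MEASURED_BW_GBS = 215  # measured bandwidth from the spec table
MODEL_BYTES_GB = 3.9   # ~7B params at ~4.5 bits/param (approximate)

print(f"Roofline tg estimate: {MEASURED_BW_GBS / MODEL_BYTES_GB:.0f} t/s")
# ~55 t/s, close to the 49-55 t/s measured above. Prompt processing is
# compute-bound instead, which is where the backend differences dominate.
```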


Back to: KNOWLEDGE_BASE.md