PyTorch Benchmarks on AMD Strix Halo (gfx1151)

Last Updated: 2026-02-02
Hardware: AMD Ryzen AI Max+ 395, Radeon 8060S (40 CU, gfx1151), 128GB unified memory

Executive Summary

GEMM Performance (4096x4096 FP16)

Configuration                  GFX Target   TFLOPS   % of Peak (59.4 TFLOPS)
TheRock 7.11 + hipblaslt=0     gfx1151      32.7     55%
TheRock 7.11 (default)         gfx1151      27.2     46%
ROCm 6.2 (gfx1100 fallback)    gfx1100      25.6     43%

Key finding: setting ROCBLAS_USE_HIPBLASLT=0 raises FP16 GEMM throughput on TheRock 7.11 from 27.2 to 32.7 TFLOPS (~20%).
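The percentages in the table follow from the measured TFLOPS and the 59.4 TFLOPS peak figure. A minimal sketch of that arithmetic is below; the helper names (gemm_tflops, pct_of_peak) are illustrative and not part of the benchmark scripts.

```python
PEAK_TFLOPS = 59.4  # FP16 peak figure used in the table above


def gemm_tflops(n: int, iters: int, seconds: float) -> float:
    """TFLOPS for `iters` square n x n matmuls (2 * n^3 FLOPs each)."""
    return 2 * n**3 * iters / seconds / 1e12


def pct_of_peak(tflops: float, peak: float = PEAK_TFLOPS) -> int:
    """Measured throughput as a whole-number percentage of peak."""
    return round(100 * tflops / peak)


# Measured values copied from the GEMM table
results = {
    "TheRock 7.11 + hipblaslt=0": 32.7,
    "TheRock 7.11 (default)": 27.2,
    "ROCm 6.2 (gfx1100 fallback)": 25.6,
}

for name, tflops in results.items():
    print(f"{name}: {tflops} TFLOPS, {pct_of_peak(tflops)}% of peak")

# Relative gain from disabling hipBLASLt on the same TheRock build
gain = results["TheRock 7.11 + hipblaslt=0"] / results["TheRock 7.11 (default)"] - 1
print(f"hipBLASLt off vs default: +{gain:.0%}")
```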

Neural Network Throughput (batch=32, FP16)

Model                  TheRock gfx1151   ROCm 6.2 gfx1100   Speedup
ResNet-18              4697 img/s        4665 img/s         1.0x
ResNet-50              1083 img/s        1085 img/s         1.0x
ViT-B/16               725 img/s         317 img/s          2.3x
BERT-base (seq=128)    1193 seq/s        709 seq/s          1.7x
BERT-base (seq=512)    271 seq/s         125 seq/s          2.2x

Key finding: CNN throughput is the same on both stacks, but attention-based models run 1.7x to 2.3x faster on native gfx1151.
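The Speedup column is the ratio of the two throughput columns, rounded to one decimal place. A short sketch that recomputes it from the table values (the speedup helper is illustrative only):

```python
# (TheRock gfx1151, ROCm 6.2 gfx1100) throughput copied from the table above
throughput = {
    "ResNet-18": (4697, 4665),
    "ResNet-50": (1083, 1085),
    "ViT-B/16": (725, 317),
    "BERT-base (seq=128)": (1193, 709),
    "BERT-base (seq=512)": (271, 125),
}


def speedup(native: float, fallback: float) -> float:
    """Native-over-fallback throughput ratio, rounded to one decimal."""
    return round(native / fallback, 1)


speedups = {model: speedup(*pair) for model, pair in throughput.items()}

for model, ratio in speedups.items():
    print(f"{model}: {ratio}x")
```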

Image Compatibility Matrix

Tested 75 PyTorch images on 2026-02-02:

Status         Count   Description
PASS           40      Working (use python3.12 not python3)
INVALID_FUNC   17      Kernels fail at runtime
NO_PY312       12      Image lacks python3.12
FAIL/OTHER     6       Various failures

Native gfx1151 (best for transformers):

softab:pytorch-fedora-rocm          # TheRock 7.11, gfx1151
softab:pytorch-therock-gfx1151      # TheRock 7.11, gfx1151
softab:pytorch-therock-pip-gfx1151  # TheRock 7.11, gfx1151
softab:pytorch-mismatch-fwd-*       # TheRock 7.11, gfx1151

gfx1100 fallback (stable, slower for transformers):

softab:pytorch-rocm644-official     # ROCm 6.2, gfx1100 fallback
softab:pytorch-ablation-*           # ROCm 6.2, gfx1100 fallback
softab:pytorch-official-v3          # ROCm 6.2, gfx1100 fallback

Images That Fail

INVALID_FUNC (detect gfx1151 but lack kernels):

softab:pytorch-rocm72-official      # Detects gfx1151, crashes on compute
softab:pytorch-nightly-*            # Same issue
softab:pytorch-official-gfx115*     # Same issue
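One way to sort a container smoke-test run into the statuses used in the matrix is sketched below. It rests on two assumptions that are not taken from this project's scripts: that podman exits with code 127 when the contained command (here python3.12) is not found, and that a missing-kernel crash surfaces the HIP error text "invalid device function" on stderr. The classify_run helper is hypothetical.

```python
def classify_run(returncode: int, stderr: str) -> str:
    """Map one smoke-test run to a status from the compatibility matrix.

    Assumed inputs: the exit code and captured stderr of a
    `podman run ... python3.12 -c <small matmul>` invocation.
    """
    if returncode == 0:
        return "PASS"
    if returncode == 127:
        # Assumption: podman reports "contained command not found" as 127
        return "NO_PY312"
    if "invalid device function" in stderr.lower():
        # Assumption: HIP reports a missing gfx1151 kernel with this message
        return "INVALID_FUNC"
    return "FAIL/OTHER"
```

The order of the checks matters: a missing interpreter is ruled out before the stderr text is inspected, so an image without python3.12 is never misreported as a kernel failure.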

Environment Variables

# REQUIRED for best performance
export ROCBLAS_USE_HIPBLASLT=0              # +20% GEMM performance
export HSA_ENABLE_SDMA=0                    # Stability fix
export PYTORCH_HIP_ALLOC_CONF="backend:native,expandable_segments:True"

# NOT recommended (outdated advice)
# export ROCBLAS_USE_HIPBLASLT=1            # Actually HURTS performance
# export HSA_OVERRIDE_GFX_VERSION=11.0.0    # Native gfx1151 now faster
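When the shell environment cannot be changed (for example, in a notebook), the same settings can be applied from Python. This is a sketch under the assumption that the ROCm libraries read these variables when they initialize, so it has to run before `import torch`; the helper names are hypothetical.

```python
import os

# Values copied from the recommended exports above
RECOMMENDED_ENV = {
    "ROCBLAS_USE_HIPBLASLT": "0",
    "HSA_ENABLE_SDMA": "0",
    "PYTORCH_HIP_ALLOC_CONF": "backend:native,expandable_segments:True",
}


def apply_recommended_env(env=os.environ, override=False):
    """Set the recommended variables; return the names that were written.

    Existing values are kept unless override=True.
    """
    changed = []
    for name, value in RECOMMENDED_ENV.items():
        if override or name not in env:
            env[name] = value
            changed.append(name)
    return changed


def outdated_settings(env=os.environ):
    """Return names of variables set to values this document advises against."""
    flagged = []
    if env.get("ROCBLAS_USE_HIPBLASLT") == "1":
        flagged.append("ROCBLAS_USE_HIPBLASLT")
    if "HSA_OVERRIDE_GFX_VERSION" in env:
        flagged.append("HSA_OVERRIDE_GFX_VERSION")
    return flagged
```

Calling apply_recommended_env() followed by outdated_settings() at the top of a script, before torch is imported, gives a quick check that an inherited environment is not silently re-enabling hipBLASLt or the gfx1100 override.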

Outdated Claims (Corrected)

Old Claim                             Reality (2026-02)
“gfx1100 kernels are 2-6x faster”     FALSE - Native gfx1151 is 2x faster for transformers
“Use HSA_OVERRIDE for better perf”    FALSE - Causes instability, native is faster
“Enable hipBLASLt for speed”          FALSE - Disabling it gives +20% on Strix Halo

Running Benchmarks

# GEMM benchmark
./benchmarks/pytorch-gemm.sh softab:pytorch-fedora-rocm

# Neural network throughput
./benchmarks/pytorch-nn-bench.sh softab:pytorch-fedora-rocm 32

# Quick ablation sweep (requires python3.12 in container)
for img in softab:pytorch-fedora-rocm softab:pytorch-rocm644-official; do
  podman run --rm \
    --device=/dev/kfd --device=/dev/dri \
    --ipc=host \
    --security-opt seccomp=unconfined \
    --security-opt label=disable \
    -e ROCBLAS_USE_HIPBLASLT=0 \
    "$img" python3.12 -c "
import time
import torch

print(f'PyTorch: {torch.__version__}')
print(f'GFX: {torch.cuda.get_device_properties(0).gcnArchName}')
a = torch.randn(4096, 4096, dtype=torch.float16, device='cuda')
b = torch.randn(4096, 4096, dtype=torch.float16, device='cuda')

# Warm up so one-time setup cost is excluded from the timing
for _ in range(10):
    torch.matmul(a, b)
torch.cuda.synchronize()

# Time 100 matmuls; synchronize so all queued GPU work has finished
start = time.perf_counter()
for _ in range(100):
    torch.matmul(a, b)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Each 4096x4096 matmul costs 2*N^3 FLOPs
print(f'TFLOPS: {2 * 4096**3 * 100 / elapsed / 1e12:.1f}')
"
done

TheRock vs Official ROCm

Aspect             TheRock 7.11 Nightlies   Official ROCm 6.2/7.2
gfx1151 target     ✅ Compiled               ❌ Not included
Transformer perf   1.7-2.3x faster          Baseline
Stability          Good (nightlies)         Stable
Installation       Pip/tarball              System packages

Why TheRock is faster: the source code is the same, but TheRock nightlies are compiled with gfx1151 as a build target. Official releases ship kernels only for “supported” GPUs, so on those builds gfx1151 falls back to gfx1100 kernels, which miss architecture-specific optimizations.
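Whether a given PyTorch build was compiled for the GPU it is running on can be checked from inside the container. The sketch below compares the device's gcnArchName (the same property the ablation sweep prints) against torch.cuda.get_arch_list(), on the assumption that ROCm builds report their compiled gfx targets through that call. The base_arch and has_native_kernels helpers are illustrative.

```python
def base_arch(gcn_arch_name: str) -> str:
    """Drop feature flags: 'gfx90a:sramecc+:xnack-' becomes 'gfx90a'."""
    return gcn_arch_name.split(":", 1)[0]


def has_native_kernels(device_arch: str, compiled_archs) -> bool:
    """True if the build's target list includes the device architecture."""
    return base_arch(device_arch) in {base_arch(a) for a in compiled_archs}


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        device = torch.cuda.get_device_properties(0).gcnArchName
        built_for = torch.cuda.get_arch_list()
        print(f"Device arch:    {device}")
        print(f"Compiled archs: {built_for}")
        print(f"Native kernels: {has_native_kernels(device, built_for)}")
```

A build whose target list lacks gfx1151 would report False here, which lines up with the fallback and INVALID_FUNC cases described earlier.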


See also: