softab

Audio Pipeline Benchmarks on AMD Strix Halo (gfx1151)

Last Updated: 2026-02-02
Hardware: AMD Ryzen AI Max+ 395, Radeon 8060S (40 CU, gfx1151), 128GB unified memory
Container: softab:audio-pipeline (ROCm 6.2 + Vulkan whisper.cpp)

Executive Summary

Single-container audio processing pipeline combining:

  - Silero VAD (CPU) for voice activity detection
  - whisper.cpp (Vulkan RADV) for transcription
  - pyannote (ROCm 6.2) for speaker diarization

Why this combination?

Benchmark Results

Short Audio (11s JFK Speech)

Stage          Time     Realtime Factor   Backend
VAD (Silero)   1.9s     5.8x              CPU
Whisper        0.6s     19x               Vulkan RADV
Pyannote       9.9s     1.1x              ROCm 6.2
Total          12.4s    0.9x              -

Long Audio (71 min / 4268s Meeting Recording)

Stage          Time     Realtime Factor   Backend
VAD (Silero)   23.2s    184x              CPU
Whisper        64.6s    66x               Vulkan RADV
Pyannote       119.4s   35.7x             ROCm 6.2
Total          207.3s   20.6x             -

Key Finding: Longer audio amortizes model loading overhead significantly. The 71-minute file achieves 20.6x realtime vs 0.9x for the 11s clip.
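The amortization can be made concrete with a two-point linear fit to the pipeline totals above. This is an illustration, not a measured breakdown: it assumes total time splits into a fixed startup cost plus a per-second processing rate.

```python
# Fit a simple cost model, time = overhead + rate * duration, to the two
# pipeline totals reported above (11s -> 12.4s, 4268s -> 207.3s).
short_dur, short_time = 11.0, 12.4
long_dur, long_time = 4268.0, 207.3

rate = (long_time - short_time) / (long_dur - short_dur)  # compute seconds per audio second
overhead = short_time - rate * short_dur                  # fixed startup cost (model loading etc.)

def realtime_factor(duration_s: float) -> float:
    """Predicted realtime factor for a clip of the given length."""
    return duration_s / (overhead + rate * duration_s)

print(f"fixed overhead ~ {overhead:.1f}s, steady-state throughput ~ {1/rate:.1f}x realtime")
for d in (11, 600, 4268):
    print(f"{d:>5}s audio -> {realtime_factor(d):.1f}x realtime")
```

Under this fit, roughly 12s of every run is fixed overhead, which is why the 11s clip cannot beat ~0.9x realtime while long recordings approach ~22x.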

Notes:

Stage Breakdown

Stage 1: Voice Activity Detection (Silero)

Silero VAD detects speech segments before transcription, so silent regions never have to reach the heavier Whisper and pyannote stages.
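As an illustration of how VAD output can gate the later stages, the sketch below merges speech segments separated by short gaps and measures how much of the file actually needs transcription. The segment times are made up; this is not Silero's API or output.

```python
# Hypothetical post-processing of VAD output: merge (start, end) speech
# segments whose silence gap is short, then compute the speech ratio.

def merge_segments(segments, max_gap=0.5):
    """Merge speech segments separated by less than max_gap seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

segments = [(0.0, 2.1), (2.3, 5.0), (9.0, 10.5)]   # made-up VAD hits on an 11s clip
merged = merge_segments(segments)
speech = sum(end - start for start, end in merged)
print(merged)                                # [(0.0, 5.0), (9.0, 10.5)]
print(f"speech ratio: {speech / 11.0:.0%}")  # 59%
```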

Stage 2: Transcription (Whisper.cpp Vulkan)

Vulkan is used instead of ROCm HIP because:

  - the fastest HIP result (447ms, see the comparison below) was measured under ROCm 7.2, while this container ships ROCm 6.2 for pyannote
  - the Vulkan RADV backend is nearly as fast (547ms) and does not depend on the container's ROCm userspace version

Stage 3: Speaker Diarization (Pyannote)

This is the slowest stage. Diarization involves:

  - segmenting the audio into speaker turns
  - extracting a speaker embedding for each segment
  - clustering the embeddings to assign speaker labels
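The clustering step can be pictured with a toy stdlib sketch: assign fixed-length "embeddings" to speakers by cosine similarity against the first embedding seen for each cluster. Real pyannote uses learned speaker embeddings and agglomerative clustering; the vectors and threshold here are made up.

```python
# Toy greedy clustering: open a new speaker whenever no existing cluster
# is similar enough. Not pyannote's algorithm, just the general idea.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(embeddings, threshold=0.9):
    """Label each embedding; the first member of a cluster acts as its centroid."""
    centroids, labels = [], []
    for emb in embeddings:
        scores = [cosine(emb, c) for c in centroids]
        if scores and max(scores) >= threshold:
            labels.append(scores.index(max(scores)))
        else:
            centroids.append(emb)
            labels.append(len(centroids) - 1)
    return labels

embs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.15), (0.2, 0.9)]
print(cluster(embs))   # [0, 0, 1, 0, 1] -> two speakers
```

The embedding comparisons are why this stage scales with the number of speech segments rather than raw audio length.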

Usage

# Basic usage
podman run --rm \
  --device=/dev/kfd --device=/dev/dri \
  --ipc=host \
  --security-opt seccomp=unconfined \
  --security-opt label=disable \
  -e HF_TOKEN="$HF_TOKEN" \
  -v /data/models:/models:ro \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  softab:audio-pipeline /models/audio.wav -m /models/ggml-base.en.bin

# Skip specific stages
podman run ... softab:audio-pipeline audio.wav --skip-vad
podman run ... softab:audio-pipeline audio.wav --skip-pyannote

# Output to JSON
podman run ... softab:audio-pipeline audio.wav -o /output/result.json

Environment Variables

Variable          Value          Purpose
HSA_ENABLE_SDMA   0              Required for ROCm stability on Strix Halo
AMD_VULKAN_ICD    RADV           Use Mesa RADV driver for Vulkan
HF_TOKEN          (your token)   Required for pyannote model access

Comparison with Standalone Containers

Configuration                    Whisper Time   Pyannote Time   Notes
Combined Pipeline                580ms          9.9s            Single container
Standalone Whisper (ROCm 7.2)    447ms          -               HIP backend, fastest
Standalone Whisper (Vulkan)      547ms          -               RADV driver
Standalone Pyannote (ROCm 6.2)   -              2.5s            Cached model

Observation: Pyannote takes longer in the combined pipeline (~10s) vs standalone (~2.5s). This may be due to:

  - model loading and warmup being included in the combined timing (the standalone figure was measured with a cached model)
  - GPU and memory contention with the preceding Whisper Vulkan stage

Known Issues

  1. First run is slower - Model warmup for both Silero and Pyannote
  2. HF_TOKEN required - Pyannote uses gated HuggingFace models
  3. ROCm warnings - hipBLASLt disabled (expected on gfx1151 fallback)
  4. TF32 disabled - Pyannote disables for reproducibility

Building the Container

cd docker/audio-pipeline
podman build \
  --security-opt seccomp=unconfined \
  --security-opt label=disable \
  -t softab:audio-pipeline \
  -f Dockerfile.rocm62-vulkan .

Future Improvements


See also: