AMD GPU Monitoring Tool for Virtual Function | Generated by AI

2026.05

Here’s the comprehensive breakdown of what I did:


Goal

Create an nvidia-smi equivalent for an AMD GPU on a remote server where the standard rocm-smi couldn’t detect the GPU.


Phase 1: Remote Access & Discovery

SSH in — key-based auth already set up, no friction.

ssh root@134.199.199.108

First attempt — ran rocm-smi directly. Got command not found. This is the obvious AMD counterpart to nvidia-smi, but it wasn’t installed.

Searched for alternatives — checked for amd-smi, rocminfo, clinfo, and looked under /opt/rocm*/. Nothing. Only libdrm-amdgpu1 was installed (a minimal DRM userspace library).
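A minimal sketch of that search in Python, assuming the tools would be on PATH if installed. The function name find_gpu_tools is made up for illustration; only the tool names and the /opt/rocm* location come from the search described above.

```python
import glob
import shutil

# Tool names taken from the search described above.
CANDIDATE_TOOLS = ("rocm-smi", "amd-smi", "rocminfo", "clinfo")


def find_gpu_tools(tools=CANDIDATE_TOOLS, rocm_glob="/opt/rocm*"):
    """Return (tools found on PATH, ROCm install directories found)."""
    found = {}
    for name in tools:
        path = shutil.which(name)  # None when the tool is not on PATH
        if path:
            found[name] = path
    rocm_dirs = sorted(glob.glob(rocm_glob))
    return found, rocm_dirs


if __name__ == "__main__":
    found, rocm_dirs = find_gpu_tools()
    print("tools on PATH:", found or "none")
    print("ROCm directories:", rocm_dirs or "none")
```

On the server described here, both results would have come back empty.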


Phase 2: Hardware Identification

lspci scan — the critical discovery. Found the GPU at 83:00.0:

Processing accelerators: AMD/ATI Aqua Vanjaram [Instinct MI300X VF]

Key detail: it’s a VF (Virtual Function), not a PF (Physical Function). This is SR-IOV — the physical GPU is partitioned by a hypervisor, and this VM gets a virtual slice.

Checked /sys/class/drm/ — only card0 and renderD128. No amdgpu-specific sysfs entries (no gpu_busy_percent, no mem_info_vram_*, no hwmon). This confirmed the VF limitation — no management interface exposed.
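The same check can be scripted. This is a sketch under assumptions: probe_card is a hypothetical helper, and mem_info_vram_total and mem_info_vram_used stand in for the mem_info_vram_* files mentioned above. The card directory is passed in as a parameter.

```python
from pathlib import Path

# Telemetry files the amdgpu driver normally exposes under <card>/device.
AMDGPU_ATTRS = ("gpu_busy_percent", "mem_info_vram_total", "mem_info_vram_used")


def probe_card(card_dir):
    """Report which amdgpu telemetry entries exist for one DRM card."""
    device = Path(card_dir) / "device"
    report = {name: (device / name).is_file() for name in AMDGPU_ATTRS}
    # Sensors, when present, appear as device/hwmon/hwmonN.
    hwmon = device / "hwmon"
    report["hwmon"] = hwmon.is_dir() and any(hwmon.glob("hwmon*"))
    return report


if __name__ == "__main__":
    for name, present in probe_card("/sys/class/drm/card0").items():
        print(f"{name}: {'present' if present else 'missing'}")
```

On this VF every entry would report as missing, which matches the manual check.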

Checked /sys/class/kfd/ topology — only a CPU node, no GPU node. KFD (Kernel Fusion Driver, AMD’s compute driver) doesn’t enumerate SR-IOV VFs.
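For reference, a rough sketch of how the topology can be inspected programmatically. It assumes the usual layout of /sys/class/kfd/kfd/topology/nodes/<n>/properties with "key value" lines, and it treats simd_count and cpu_cores_count as the distinguishing fields. The helper names are illustrative only.

```python
from pathlib import Path

# Assumed location of the KFD topology nodes.
KFD_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text):
    """Parse 'key value' lines from a KFD node properties file."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def classify_node(props):
    """Nodes with SIMDs are GPUs; nodes with only CPU cores are CPUs."""
    if props.get("simd_count", 0) > 0:
        return "gpu"
    if props.get("cpu_cores_count", 0) > 0:
        return "cpu"
    return "unknown"


def list_kfd_nodes(nodes_dir=KFD_NODES):
    """Map node name -> 'cpu' / 'gpu' / 'unknown'; empty if KFD is absent."""
    nodes_dir = Path(nodes_dir)
    kinds = {}
    if not nodes_dir.is_dir():
        return kinds
    for node in sorted(nodes_dir.iterdir()):
        try:
            text = (node / "properties").read_text()
        except OSError:
            continue  # unreadable node, skip it
        kinds[node.name] = classify_node(parse_properties(text))
    return kinds


if __name__ == "__main__":
    print(list_kfd_nodes() or "no KFD topology found")
```

A result containing only "cpu" entries corresponds to what was observed on this VM.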

Read PCI sysfs directly — this is where the real data lived:

lspci -vv — gave the memory BAR layout: 256 GB BAR at 0x4000000000, confirming the full HBM3 allocation.
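The BAR sizes can also be read without lspci. The sketch below assumes the standard sysfs resource file, where each line holds start, end, and flags in hex. The sample text is constructed to match the 256 GB BAR at 0x4000000000 reported above. Its flags value is a placeholder, not captured output.

```python
from pathlib import Path


def parse_resource_table(text):
    """Return [(region_index, start, size_bytes)] for populated regions."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region slot
        regions.append((index, start, end - start + 1))
    return regions


def read_bars(bdf, root="/sys/bus/pci/devices"):
    """Read the BAR layout for one device; empty list if unreadable."""
    try:
        text = (Path(root) / bdf / "resource").read_text()
    except OSError:
        return []
    return parse_resource_table(text)


# Illustrative sample: one 256 GiB region at 0x4000000000, one unused slot.
SAMPLE = (
    "0x0000004000000000 0x0000007fffffffff 0x000000000014220c\n"
    "0x0000000000000000 0x0000000000000000 0x0000000000000000\n"
)

if __name__ == "__main__":
    for index, start, size in parse_resource_table(SAMPLE):
        print(f"region {index}: start={start:#x} size={size // 2**30} GiB")
```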


Phase 3: Installing rocm-smi (partial success)

Added the AMD ROCm apt repo: https://repo.radeon.com/rocm/apt/latest noble. The “noble” (Ubuntu 24.04) repo works on 25.10 because the packages are mostly userspace.

Installed rocm-smi: apt install rocm-smi. This succeeded.

Ran it: rocm-smi and rocm-smi --alldevices --showhw both returned “No AMD GPUs specified” with empty tables. This supported the hypothesis: rocm-smi relies on the KFD topology (/sys/class/kfd/), which lists no GPU node for this VF.


Phase 4: Building amd-smi (the custom tool)

Since rocm-smi couldn’t see the GPU, I built a Python script that reads PCI sysfs and lspci output directly, bypassing the KFD requirement entirely.

Data sources used:

Info                    Source
GPU name, BARs          lspci -s <bdf> -vv
Device/vendor IDs       /sys/bus/pci/devices/<bdf>/device, vendor, subsystem_*
PCIe link speed/width   /sys/bus/pci/devices/<bdf>/current_link_speed, current_link_width, max_link_*
NUMA node               /sys/bus/pci/devices/<bdf>/numa_node
Power state             /sys/bus/pci/devices/<bdf>/power_state
IRQ                     /sys/bus/pci/devices/<bdf>/irq
Resizable BAR           /sys/bus/pci/devices/<bdf>/resource0_resize
Driver version          /sys/module/amdgpu/version
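A hedged sketch of how the sysfs sources listed above might be collected. This is not the actual script. read_attr and collect_pci_info are hypothetical names, subsystem_* and max_link_* are expanded to the standard sysfs file names, and the 0000: domain prefix on the BDF is an assumption.

```python
from pathlib import Path

PCI_ROOT = "/sys/bus/pci/devices"

# Per-device attribute files listed in the data sources above.
PCI_ATTRS = (
    "vendor", "device", "subsystem_vendor", "subsystem_device",
    "current_link_speed", "current_link_width",
    "max_link_speed", "max_link_width",
    "numa_node", "power_state", "irq",
)


def read_attr(path, default="N/A"):
    """Read one sysfs file, falling back to a default if it is missing."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return default


def collect_pci_info(bdf, root=PCI_ROOT):
    """Gather the attributes for one PCI device into a dict."""
    dev = Path(root) / bdf
    info = {name: read_attr(dev / name) for name in PCI_ATTRS}
    info["driver_version"] = read_attr("/sys/module/amdgpu/version")
    return info


if __name__ == "__main__":
    # Assumed full address of the device found at 83:00.0.
    for key, value in collect_pci_info("0000:83:00.0").items():
        print(f"{key:20s} {value}")
```

Falling back to a default rather than raising keeps the output usable on VFs, where some files may not exist.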

Device filtering logic — iterates /sys/bus/pci/devices/*, checks:

  1. vendor == 0x1002 (AMD)
  2. class starts with 0x03 (display) or 0x12 (processing accelerator)

This catches both GPUs and compute accelerators like the MI300X.
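The filter could look roughly like this. It is a sketch of the two checks above, not the original code. The function names are illustrative, and it assumes the sysfs class file holds a value such as 0x120000.

```python
from pathlib import Path

AMD_VENDOR = "0x1002"
# 0x03 = display controller, 0x12 = processing accelerator.
GPU_CLASS_PREFIXES = ("0x03", "0x12")


def is_amd_gpu(vendor, pci_class):
    """Apply the vendor and class-prefix checks to raw sysfs strings."""
    return (
        vendor.strip().lower() == AMD_VENDOR
        and pci_class.strip().lower().startswith(GPU_CLASS_PREFIXES)
    )


def find_amd_gpus(root="/sys/bus/pci/devices"):
    """Return the PCI addresses of AMD display/accelerator devices."""
    root = Path(root)
    found = []
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "vendor").read_text()
            pci_class = (dev / "class").read_text()
        except OSError:
            continue  # entry without readable vendor/class files
        if is_amd_gpu(vendor, pci_class):
            found.append(dev.name)
    return found


if __name__ == "__main__":
    print(find_amd_gpus() or "no AMD GPU or accelerator found")
```

Matching on the class prefix rather than the full value keeps the check independent of the subclass and programming-interface bytes.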

Deployment hurdles:

Installed to /usr/local/bin/amd-smi with chmod +x.


Phase 5: Bug Fix

The first run showed Device ID: 0x0x74b5 — the sysfs files already include the 0x prefix, and my code was adding another. Fixed by adding .removeprefix("0x") to the reads, patched locally, and re-scp’d.
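A small reconstruction of the bug and the fix, for illustration. Both function names are hypothetical and the buggy version is a guess at the shape of the original mistake. Note that str.removeprefix requires Python 3.9 or newer.

```python
def format_pci_id_buggy(raw):
    # The sysfs value already carries its own 0x prefix, so this doubles it.
    return f"0x{raw.strip()}"


def format_pci_id(raw):
    # Strip whitespace, then the existing prefix, before adding one back.
    digits = raw.strip().lower().removeprefix("0x")
    return f"0x{digits}"


print(format_pci_id_buggy("0x74b5\n"))  # 0x0x74b5
print(format_pci_id("0x74b5\n"))        # 0x74b5
```

The fixed version also accepts a value without a prefix, since removeprefix leaves such strings unchanged.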


Key Learnings

  1. SR-IOV VFs are invisible to KFD/rocm-smi — the standard AMD tooling relies on KFD topology which doesn’t enumerate virtual functions. You have to fall back to raw PCI sysfs.

  2. PCI class 0x12 = processing accelerator: the MI300X doesn’t identify as a display controller (class 0x03), so GPU detection must include the accelerator class.

  3. SSH heredocs are fragile with nested quoting: write_file + scp is the reliable pattern for deploying scripts to remote hosts.

  4. rocm-smi from apt works on non-LTS Ubuntu — AMD packages for “noble” (24.04) install fine on 25.10 since they’re userspace libs without kernel module dependencies.

  5. The 0x prefix in PCI sysfs values: /sys/bus/pci/devices/*/vendor and /device return strings like 0x1002, not bare hex digits. Strip the prefix before formatting.

