AMD / ROCm Integration

GPUFlight supports AMD GPUs via ROCm, including HIP kernel tracing, system telemetry, occupancy analysis, and ISA disassembly.

Profiling engines: AMD vs. NVIDIA

On AMD today, only ProfilingEngine::Monitor / Trace (system monitoring + kernel timing) and the dispatch-counter path are supported. The NVIDIA-specific engines — PcSampling, SassMetrics, RangeProfiler, Deep — are not yet implemented on the ROCm backend. Setting them on AMD falls back to the dispatch-counter path after a startup warning. AMD parity is on the roadmap; for now, AMD users get kernel timing, occupancy, ISA disassembly, and system telemetry, but not PC-level stall sampling.

Prerequisites

  • ROCm 6.x or later
  • HIP runtime
  • ROCm SMI library
  • rocprofiler-sdk
  • CMake 3.31+

Build Setup

Fetch the library via CMake FetchContent (see Installation for the canonical boilerplate), then enable the AMD backend and link your HIP target:

# AMD backend is opt-in — set BEFORE FetchContent_MakeAvailable(gpufl)
# so the gpufl-client subproject configures with these flags.
set(GPUFL_ENABLE_AMD ON CACHE BOOL "" FORCE)
set(GPUFL_ENABLE_NVIDIA OFF CACHE BOOL "" FORCE)

hip_add_executable(my_app my_app.cpp)
target_link_libraries(my_app PRIVATE gpufl::gpufl hip::host)

HIP Example

#include <gpufl/gpufl.hpp>
#include <hip/hip_runtime.h>

// Element count (example value; a multiple of the 256-thread block size).
constexpr int N = 1 << 20;

__global__ void scaleKernel(int* data, int scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= scale;
}

int main() {
    gpufl::InitOptions opts;
    opts.app_name = "hip_demo";
    opts.continuous_system_sampling = true;
    opts.system_sample_rate_ms = 50;
    opts.profiling_engine = gpufl::ProfilingEngine::Trace; // capture kernels (default is Monitor = telemetry only)
    gpufl::init(opts);

    int* d_data = nullptr;
    hipMalloc(&d_data, N * sizeof(int));
    hipMemset(d_data, 0, N * sizeof(int)); // initialize so the kernel does not read garbage

    GFL_SCOPE("scale_loop") {
        for (int i = 0; i < 50; ++i) {
            scaleKernel<<<N / 256, 256>>>(d_data, 2, N);
        }
        hipDeviceSynchronize(); // wait for all launches before the scope closes
    }

    hipFree(d_data);
    gpufl::shutdown();
    gpufl::generateReport();
    return 0;
}

Extended AMD Metrics

GPUFlight collects additional metrics on AMD GPUs via ROCm SMI:

Metric                 Description
Junction Temperature   GPU junction (hotspot) temperature
Memory Temperature     VRAM temperature
Fan Speed              Fan speed percentage
Voltage                GFX voltage in millivolts
Energy                 Cumulative energy consumption
PCIe Bandwidth         Combined PCIe read+write throughput
ECC Errors             Correctable and uncorrectable error counts

These appear automatically in the system metrics section of the report when available.

Occupancy on AMD

GPUFlight computes theoretical occupancy for AMD kernels using:

  • Wavefront size (typically 32 for RDNA, 64 for CDNA)
  • Max wavefronts per CU from the GPU architecture
  • VGPR usage per kernel (from rocprofiler code object metadata)
  • LDS (shared memory) usage per workgroup

The limiting resource is identified as "waves", "registers", or "shared_mem".

Note: AMD occupancy uses the architecture VGPR count only (not combined SGPR+VGPR). SGPRs have a separate allocation pool and don't limit VGPR occupancy.
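
As a rough illustration of how the inputs listed in this section combine, here is a minimal sketch of a theoretical-occupancy calculation. It is not GPUFlight's implementation: the struct names, field names, and the per-SIMD accounting are assumptions, and any hardware limits passed in are the caller's responsibility (the numbers in the usage note are illustrative, not taken from a specific GPU). SGPRs are deliberately ignored, in line with the note above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical per-architecture limits. All divisor fields must be non-zero.
struct ArchLimits {
    uint32_t max_waves_per_simd;  // hardware wave slots per SIMD
    uint32_t simds_per_cu;        // SIMD units per compute unit
    uint32_t vgprs_per_simd;      // vector registers available per lane on one SIMD
    uint32_t vgpr_granule;        // VGPR allocation granularity
    uint32_t lds_bytes_per_cu;    // LDS capacity per compute unit
};

// Hypothetical per-kernel resource usage.
struct KernelUsage {
    uint32_t vgpr_count;              // architecture VGPRs per work-item
    uint32_t lds_bytes_per_workgroup; // LDS requested by one workgroup
    uint32_t waves_per_workgroup;     // workgroup size / wavefront size
};

struct Occupancy {
    uint32_t waves_per_simd;  // achievable resident waves per SIMD
    double fraction;          // waves_per_simd / max_waves_per_simd
    std::string limiter;      // "waves", "registers", or "shared_mem"
};

inline Occupancy theoreticalOccupancy(const ArchLimits& a, const KernelUsage& k) {
    uint32_t waves = a.max_waves_per_simd;
    std::string limiter = "waves";

    // Register limit: round the request up to the allocation granule.
    if (k.vgpr_count > 0) {
        uint32_t alloc =
            (k.vgpr_count + a.vgpr_granule - 1) / a.vgpr_granule * a.vgpr_granule;
        uint32_t by_regs = a.vgprs_per_simd / alloc;
        if (by_regs < waves) { waves = by_regs; limiter = "registers"; }
    }

    // LDS limit: whole workgroups that fit on a CU, spread across its SIMDs.
    if (k.lds_bytes_per_workgroup > 0) {
        uint32_t groups = a.lds_bytes_per_cu / k.lds_bytes_per_workgroup;
        uint32_t by_lds = groups * k.waves_per_workgroup / a.simds_per_cu;
        if (by_lds < waves) { waves = by_lds; limiter = "shared_mem"; }
    }

    return {waves, static_cast<double>(waves) / a.max_waves_per_simd, limiter};
}
```

For example, with illustrative limits `ArchLimits{8, 4, 512, 8, 65536}`, a kernel using 128 VGPRs and no LDS is limited by "registers" at 4 of 8 waves per SIMD, while a kernel using 32 VGPRs and 32 KiB of LDS per 4-wave workgroup is limited by "shared_mem" at 2 of 8.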

ISA Disassembly

AMD ISA disassembly is captured automatically when GPU code objects are loaded. GPUFlight:

  1. Captures the ELF code object during the CODE_OBJECT_LOAD callback
  2. Computes a CRC32 for deduplication
  3. Disassembles using llvm-objdump (from the ROCm LLVM toolchain)
  4. Emits per-function instruction listings with PC offsets

The disassembly appears in the web UI under the "ISA" column (vs "SASS" for NVIDIA).
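
Step 2 of the pipeline above (CRC32 deduplication) can be sketched as follows. This is an assumption-laden illustration rather than GPUFlight's code: `crc32Ieee` is a plain bitwise implementation of the standard reflected CRC-32 (polynomial 0xEDB88320), and `CodeObjectCache` is a hypothetical helper that decides whether a captured code object still needs to be disassembled.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Standard reflected CRC-32 (IEEE 802.3, polynomial 0xEDB88320), bit by bit.
inline uint32_t crc32Ieee(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Apply the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical cache: remembers checksums of code objects already processed.
class CodeObjectCache {
public:
    // Returns true the first time a blob with this checksum is seen,
    // false for any later blob with the same checksum.
    bool shouldDisassemble(const std::vector<uint8_t>& blob) {
        return seen_.insert(crc32Ieee(blob.data(), blob.size())).second;
    }

private:
    std::unordered_set<uint32_t> seen_;
};
```

A 32-bit checksum can collide, so a stricter variant of this sketch might also compare blob sizes before skipping a code object.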

Known Limitations

  • No PC sampling on RDNA consumer GPUs: PC sampling requires MI200+ (CDNA) hardware
  • No SASS-equivalent metrics: Instruction-level metric collection is not yet available via rocprofiler-sdk for RDNA
  • Integrated GPU filtering: The integrated GPUs on AMD APUs (Ryzen with integrated graphics) are automatically excluded from telemetry so that they do not pollute the metrics