AMD / ROCm Integration
GPUFlight supports AMD GPUs via ROCm, including HIP kernel tracing, system telemetry, occupancy analysis, and ISA disassembly.
On AMD today, only ProfilingEngine::Monitor / Trace (system
monitoring + kernel timing) and the dispatch-counter path are
supported. The NVIDIA-specific engines — PcSampling,
SassMetrics, RangeProfiler, Deep — are not yet implemented on
the ROCm backend. Setting them on AMD falls back to the
dispatch-counter path after a startup warning. AMD parity is on the
roadmap; for now, AMD users get kernel timing, occupancy, ISA
disassembly, and system telemetry, but not PC-level stall sampling.
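If you would rather choose a supported engine explicitly than rely on the fallback and its warning, an application-side guard could look like the sketch below. The `Engine` enum is a local stand-in that mirrors the engine names mentioned above, and `supported_on_amd` and `choose_for_amd` are hypothetical helpers, not part of the GPUFlight API. Defaulting to Trace is an assumption of this sketch and differs from the library's own fallback to the dispatch-counter path.

```cpp
// Local stand-in mirroring the engine names in the text above.
// In real code you would use gpufl::ProfilingEngine instead.
enum class Engine { Monitor, Trace, PcSampling, SassMetrics, RangeProfiler, Deep };

// True for the engines the text lists as supported on the ROCm backend.
constexpr bool supported_on_amd(Engine e) {
    return e == Engine::Monitor || e == Engine::Trace;
}

// Hypothetical application-side policy: keep a supported request as-is,
// otherwise pick Trace explicitly instead of relying on the fallback.
constexpr Engine choose_for_amd(Engine requested) {
    return supported_on_amd(requested) ? requested : Engine::Trace;
}
```

The result of `choose_for_amd` would then be translated to the matching `gpufl::ProfilingEngine` value before calling `gpufl::init`.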
Prerequisites
- ROCm 6.x or later
- HIP runtime
- ROCm SMI library
- rocprofiler-sdk
- CMake 3.31+
Build Setup
Fetch the library via CMake FetchContent (see Installation for the canonical boilerplate), then enable the AMD backend and link your HIP target:
# AMD backend is opt-in — set BEFORE FetchContent_MakeAvailable(gpufl)
# so the gpufl-client subproject configures with these flags.
set(GPUFL_ENABLE_AMD ON CACHE BOOL "" FORCE)
set(GPUFL_ENABLE_NVIDIA OFF CACHE BOOL "" FORCE)
hip_add_executable(my_app my_app.cpp)
target_link_libraries(my_app PRIVATE gpufl::gpufl hip::host)
HIP Example
#include <gpufl/gpufl.hpp>
#include <hip/hip_runtime.h>

constexpr int N = 1 << 20;  // element count (example value)

// Multiplies each element of data by scale.
__global__ void scaleKernel(int* data, int scale, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= scale;
}

int main() {
    gpufl::InitOptions opts;
    opts.app_name = "hip_demo";
    opts.continuous_system_sampling = true;
    opts.system_sample_rate_ms = 50;
    opts.profiling_engine = gpufl::ProfilingEngine::Trace;  // capture kernels (default is Monitor = telemetry only)
    gpufl::init(opts);

    int* d_data = nullptr;
    hipMalloc(&d_data, N * sizeof(int));
    hipMemset(d_data, 0, N * sizeof(int));  // avoid operating on uninitialized memory

    GFL_SCOPE("scale_loop") {
        for (int i = 0; i < 50; ++i) {
            // Round the grid size up so every element is covered.
            scaleKernel<<<(N + 255) / 256, 256>>>(d_data, 2, N);
        }
        hipDeviceSynchronize();
    }

    hipFree(d_data);
    gpufl::shutdown();
    gpufl::generateReport();
    return 0;
}
Extended AMD Metrics
GPUFlight collects additional metrics on AMD GPUs via ROCm SMI:
| Metric | Description |
|---|---|
| Junction Temperature | GPU junction (hotspot) temperature |
| Memory Temperature | VRAM temperature |
| Fan Speed | Fan speed percentage |
| Voltage | GFX voltage in millivolts |
| Energy | Cumulative energy consumption |
| PCIe Bandwidth | Combined PCIe read+write throughput |
| ECC Errors | Correctable and uncorrectable error counts |
These appear automatically in the system metrics section of the report when available.
Occupancy on AMD
GPUFlight computes theoretical occupancy for AMD kernels using:
- Wavefront size (typically 32 for RDNA, 64 for CDNA)
- Max wavefronts per CU from the GPU architecture
- VGPR usage per kernel (from rocprofiler code object metadata)
- LDS (shared memory) usage per workgroup
The limiting resource is identified as "waves", "registers", or "shared_mem".
The AMD occupancy calculation uses the architectural VGPR count only (not combined SGPR+VGPR). SGPRs are allocated from a separate pool, so they are not counted against the VGPR limit.
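The inputs listed above can be combined roughly as in the following sketch. It is a simplified model and not GPUFlight's actual implementation: the struct names, the per-SIMD register accounting, and the allocation granule are assumptions, and every hardware limit is a caller-supplied value that you would look up for your target architecture.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Hardware limits (all caller-supplied assumptions; must be nonzero).
struct CuLimits {
    std::uint32_t max_waves_per_cu;   // wavefront slots per compute unit
    std::uint32_t simds_per_cu;       // SIMD units per compute unit
    std::uint32_t vgprs_per_simd;     // VGPRs available per lane on one SIMD
    std::uint32_t vgpr_granule;       // VGPR allocation granularity
    std::uint32_t lds_bytes_per_cu;   // LDS capacity per compute unit
};

// Per-kernel resource usage.
struct KernelUsage {
    std::uint32_t wave_size;                // e.g. 32 or 64 (must be nonzero)
    std::uint32_t workgroup_size;           // work-items per workgroup
    std::uint32_t vgprs_per_work_item;      // architectural VGPRs only
    std::uint32_t lds_bytes_per_workgroup;  // LDS bytes per workgroup
};

struct Occupancy {
    std::uint32_t active_waves;  // resident waves per CU under this model
    double fraction;             // active_waves / max_waves_per_cu
    std::string limiter;         // "waves", "registers", or "shared_mem"
};

inline Occupancy estimate_occupancy(const CuLimits& hw, const KernelUsage& k) {
    // Waves needed to cover one workgroup, rounded up.
    const std::uint32_t waves_per_wg =
        (k.workgroup_size + k.wave_size - 1) / k.wave_size;

    const std::uint32_t by_waves = hw.max_waves_per_cu;

    // Register limit: round the request up to the granule, then count how
    // many waves fit on each SIMD. A kernel using no VGPRs is not limited.
    std::uint32_t by_regs = hw.max_waves_per_cu;
    if (k.vgprs_per_work_item > 0) {
        const std::uint32_t alloc =
            (k.vgprs_per_work_item + hw.vgpr_granule - 1) /
            hw.vgpr_granule * hw.vgpr_granule;
        by_regs = (hw.vgprs_per_simd / alloc) * hw.simds_per_cu;
    }

    // LDS limit: whole workgroups that fit, expressed in waves.
    std::uint32_t by_lds = hw.max_waves_per_cu;
    if (k.lds_bytes_per_workgroup > 0) {
        by_lds = (hw.lds_bytes_per_cu / k.lds_bytes_per_workgroup) * waves_per_wg;
    }

    const std::uint32_t active = std::min({by_waves, by_regs, by_lds});

    // Name the tightest limit; ties resolve to "waves", then "registers".
    std::string limiter = "waves";
    if (by_regs < by_waves && by_regs <= by_lds) {
        limiter = "registers";
    } else if (by_lds < by_waves && by_lds < by_regs) {
        limiter = "shared_mem";
    }

    return {active, static_cast<double>(active) / hw.max_waves_per_cu, limiter};
}
```

For brevity the sketch applies the register limit per wave and does not round it down to whole workgroups, so a real calculation may report a slightly lower figure for large workgroups.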
ISA Disassembly
AMD ISA disassembly is captured automatically when GPU code objects are loaded. GPUFlight:
- Captures the ELF code object during the CODE_OBJECT_LOAD callback
- Computes a CRC32 for deduplication
- Disassembles using llvm-objdump (from the ROCm LLVM toolchain)
- Emits per-function instruction listings with PC offsets
The disassembly appears in the web UI under the "ISA" column (vs "SASS" for NVIDIA).
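The CRC32 deduplication step can be illustrated with a small sketch. It assumes the standard reflected CRC-32 polynomial (the one used by zlib and PNG), which the text above does not specify. `crc32_bytes` and `CodeObjectCache` are hypothetical names for illustration, not GPUFlight symbols.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Bitwise CRC-32 over a byte buffer (reflected polynomial 0xEDB88320).
inline std::uint32_t crc32_bytes(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Apply the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical cache that remembers which code object images were seen.
class CodeObjectCache {
public:
    // Returns true the first time an image's checksum is seen,
    // false for any later image with the same checksum.
    bool should_disassemble(const std::vector<std::uint8_t>& elf_image) {
        return seen_.insert(crc32_bytes(elf_image.data(), elf_image.size())).second;
    }

private:
    std::unordered_set<std::uint32_t> seen_;
};
```

A 32-bit checksum can collide, so a stricter variant might also compare image sizes or full contents before skipping a code object.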
Known Limitations
- No PC sampling on RDNA consumer GPUs: PC sampling requires MI200+ (CDNA) hardware, and the PcSampling engine is not yet implemented on the ROCm backend in any case
- No SASS-equivalent metrics: Instruction-level metric collection is not yet available via rocprofiler-sdk for RDNA
- CPU iGPU filtering: Integrated GPUs on AMD APUs (Ryzen with integrated graphics) are automatically excluded from telemetry so they do not pollute the metrics