Installation

GPUFlight supports both NVIDIA CUDA and AMD ROCm backends. You need the C++ library for integration into your GPU application, and optionally the Python library for analysis and visualization.

C++ Library (Integration)

The recommended way to integrate gpufl into your C++ project is via CMake's FetchContent.

NVIDIA Prerequisites

CMake 3.31 or higher
CUDA Toolkit 13.x or later (including CUPTI)
A C++17 compatible compiler

AMD Prerequisites

CMake 3.31 or higher
ROCm 6.x with HIP runtime
ROCm SMI library
rocprofiler-sdk
A C++17 compatible compiler

CMake Integration

Add the following to your CMakeLists.txt:

include(FetchContent)

FetchContent_Declare(
    gpufl
    GIT_REPOSITORY https://github.com/gpu-flight/gpufl-client.git
    GIT_TAG        main
)

FetchContent_MakeAvailable(gpufl)

For NVIDIA targets:

target_link_libraries(my_app PRIVATE gpufl::gpufl CUDA::cudart)

# Windows: stage the CUPTI / NVPERF runtime DLLs next to the executable.
# Provided by gpufl; a no-op on Linux. Required — see the warning below.
gpufl_copy_runtime_dlls(my_app)

Windows: you must stage the CUPTI runtime DLLs

gpufl links the CUDA Toolkit's CUPTI (plus NVPERF / pcsamplingutil for PC sampling). On Windows those DLLs ship with the toolkit but are not on the system PATH, so an executable that links gpufl silently fails to start: it exits immediately with code 0xC0000135 (STATUS_DLL_NOT_FOUND) — printing nothing and writing no logs.

Call gpufl_copy_runtime_dlls(<target>) (provided by gpufl) for every executable that links gpufl. As a post-build step it copies the matching cupti64_*.dll, nvperf_host*.dll, nvperf_target*.dll, and pcsamplingutil*.dll next to your binary. The function is a no-op on non-Windows platforms, so it is safe to call unconditionally.

For AMD/HIP targets:

# Enable AMD backend
set(GPUFL_ENABLE_AMD ON CACHE BOOL "" FORCE)
set(GPUFL_ENABLE_NVIDIA OFF CACHE BOOL "" FORCE)

target_link_libraries(my_app PRIVATE gpufl::gpufl hip::host)

Build Options

Option	Default	Description
`GPUFL_ENABLE_NVIDIA`	`ON`	Enable NVIDIA backends (CUDA + NVML)
`GPUFL_ENABLE_AMD`	`OFF`	Enable AMD backends (ROCm + HIP)
`BUILD_TESTING`	`ON`	Build test suite
`BUILD_PYTHON`	`OFF`	Build Python bindings

Python Library (Analysis)

The Python library provides tools for analyzing, reporting, and visualizing the logs generated by the C++ library. It requires Python 3.12 or later.

Basic Installation

pip install gpufl

This installs the in-process upload API (gpufl.upload_logs(...)) and the cross-platform uploader CLI (python -m gpufl.cli upload <log_path>).

pip install gpufl does not create a gpufl command

The gpufl name on your PATH belongs to the native launcher binary (gpufl trace / gpufl monitor / gpufl upload), not the Python package. The Python uploader is invoked as python -m gpufl.cli upload.

Optional Extras

Extra	Pulls in	For
`analyzer`	`pandas`, `rich`	The `GpuFlightSession` terminal dashboard + report generation
`viz`	`pandas`, `matplotlib`	The `gpufl.viz` matplotlib timeline plots
`torch`	`torch`, `requests`	First-class PyTorch op-level capture (TorchDispatchMode → NVTX)
`cupy` / `jax` / `triton` / `numba`	framework deps	Framework integrations — stubs in v1.2 (they import cleanly and raise a `NotImplementedError` pointing to the manual-NVTX workaround; full support ships post-launch)
`all`	everything above	Convenience "install everything" extra

pip install "gpufl[analyzer]"   # terminal analyzer + reports
pip install "gpufl[torch]"      # PyTorch integration
pip install "gpufl[all]"        # everything

The Python library works with logs from both NVIDIA and AMD sessions — no backend-specific installation is needed for analysis.

Next: capture a trace

If you installed the native gpufl launcher, the quickest smoke test is to run an existing CUDA program under gpufl trace:

gpufl trace -- python train.py

This does not require linking the SDK into your application. GPUFlight injects into the launched process, writes local NDJSON logs, and records kernel timing, launch metadata, memory copies, synchronization events, and system metrics.

For machine-level telemetry without kernel profiling, use:

gpufl monitor --interval=1000

Continue with Capturing traces for the launcher workflow, then Sending data to the dashboard when you are ready to upload the local logs.

C++ Library (Integration)​

NVIDIA Prerequisites​

AMD Prerequisites​

CMake Integration​

Build Options​

Python Library (Analysis)​

Basic Installation​

Optional Extras​

Next: capture a trace​