
API Reference

C++ API

#include <gpufl/gpufl.hpp>

Initialization

namespace gpufl {

struct InitOptions {
// ── Identity ────────────────────────────────────────────────────
std::string app_name = "gpufl";
std::string log_path = ""; // base directory for logs; empty → "<app_name>"

// ── Cloud upload ────────────────────────────────────────────────
std::string backend_url = ""; // e.g. "https://api.gpuflight.com" (host only)
std::string api_key = ""; // sent as `Authorization: Bearer <key>`
std::string api_path = ""; // empty → "/api/v1"; override for proxy mounts
bool remote_upload = false; // DEPRECATED v1.1, removed v1.2 (no-op; see api-reference)

// ── What to capture ─────────────────────────────────────────────
bool enable_stack_trace = false; // capture CPU stacks on launch + sync events
bool enable_source_collection = true; // collect source for SASS correlation
bool enable_external_correlation = true; // honor framework-pushed external IDs (PyTorch/JAX/XLA)
bool enable_synchronization = true; // CUDA sync events (host-blocked time)
bool enable_memory_tracking = false; // cudaMalloc / cudaFree timing — opt-in
bool enable_cuda_graphs_tracking = false; // per-launch cudaGraphLaunch timing — opt-in

// ── Sampling ────────────────────────────────────────────────────
int system_sample_rate_ms = 0; // 0 = disabled; ~50–100 typical
int kernel_sample_rate_ms = 0; // DEPRECATED (1.0.1) — no longer has any effect
bool continuous_system_sampling = false; // renamed from sampling_auto_start

// ── Profiling engine ────────────────────────────────────────────
BackendKind backend = BackendKind::Auto;
ProfilingEngine profiling_engine = ProfilingEngine::Monitor; // telemetry only; opt up the ladder

// ── Advanced ────────────────────────────────────────────────────
std::string config_file = ""; // local JSON config applied after defaults
bool flush_logs_always = false;
bool enable_debug_output = false;
bool enabled = true; // global kill switch; false → init is a no-op
};

bool init(const InitOptions& opts);
void shutdown();
void generateReport(const std::string& output_path = "");
}

Field reference

Identity

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| app_name | string | "gpufl" | Shown in the dashboard. |
| log_path | string | "" (= "<app_name>") | Base directory for NDJSON logs. Each run writes into a per-session subdirectory: <log_path>/<session_id>/{device,scope,system}.log. The gpufl-agent daemon tails these. |
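To make the directory layout concrete, here is a minimal Python sketch that builds the three per-session file paths from the pattern above. The helper name session_log_paths and the sample session ID are hypothetical and not part of gpufl; the fallback to app_name for an empty log_path follows the Default column.

```python
from pathlib import PurePosixPath

# File kinds taken from the documented pattern {device,scope,system}.log.
LOG_KINDS = ("device", "scope", "system")


def session_log_paths(app_name, session_id, log_path=""):
    """Hypothetical helper: return the expected per-session log paths."""
    # Assumption: an empty log_path falls back to a directory named after the app.
    base = PurePosixPath(log_path or app_name)
    session_dir = base / session_id
    return {kind: session_dir / f"{kind}.log" for kind in LOG_KINDS}


# "example-session" is a made-up ID; real IDs are generated by gpufl.
for kind, path in session_log_paths("gpufl", "example-session").items():
    print(kind, path)
```

A helper like this can be handy when pointing an external log shipper at the same files the gpufl-agent daemon tails.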

Cloud upload (see Sending data to the dashboard)

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| backend_url | string | "" | Backend host. Do not include /api/v1. |
| api_key | string | "" | Workspace API key (gpfl_xxx). |
| api_path | string | "" | Empty resolves to /api/v1. Override for reverse-proxy mounts. |
| remote_upload | bool | false | DEPRECATED in v1.1; removed in v1.2. Live HTTP streaming was removed. The flag stays as a one-release deprecation shim: Python customers see a DeprecationWarning and get an atexit handler that calls upload_logs() at interpreter exit; C++ customers see a deprecation log line at init and must call gpufl::uploadLogs() explicitly. New code should call gpufl::uploadLogs(opts) (C++) or gpufl.upload_logs(...) (Python) directly, or use with gpufl.session(backend_url=..., api_key=...): to orchestrate it. |
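The relationship between backend_url, api_path, and api_key can be sketched in a few lines of Python. This is an illustration of the documented rules only (host-only URL, empty api_path resolving to /api/v1, key sent as a Bearer token); the function names are hypothetical, and the slash normalization is an assumption rather than confirmed client behavior.

```python
DEFAULT_API_PATH = "/api/v1"  # documented fallback when api_path is empty


def api_base_url(backend_url, api_path=""):
    """Hypothetical helper: join the host-only backend_url with the API path."""
    path = api_path or DEFAULT_API_PATH
    # Assumption: stray slashes on either side are normalized away.
    return backend_url.rstrip("/") + "/" + path.strip("/")


def auth_headers(api_key):
    """Build the Authorization header described in the InitOptions comment."""
    return {"Authorization": f"Bearer {api_key}"}


print(api_base_url("https://api.gpuflight.com"))
print(api_base_url("https://proxy.example.com", "/gpufl/api/v1"))
print(auth_headers("gpfl_xxx"))
```

The second call shows the reverse-proxy case: the host stays in backend_url and only the mount path moves into api_path.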

What to capture

| Field | Default | Notes |
| --- | --- | --- |
| enable_stack_trace | false | Capture CPU stacks at kernel launch and sync points. Powers per-line attribution in the dashboard. |
| enable_source_collection | true | Read source files referenced in stacks; needed for SASS/source correlation. |
| enable_external_correlation | true | Honor PyTorch/JAX/XLA-pushed external IDs so kernels are tagged with their framework op. |
| enable_synchronization | true | Capture cudaStreamSynchronize / cudaDeviceSynchronize / etc. Time spent here = host blocked on GPU. |
| enable_memory_tracking | false | cudaMalloc / cudaFree / cudaMallocAsync timing. Opt-in due to high event volume in TF eager mode. |
| enable_cuda_graphs_tracking | false | Per-launch cudaGraphLaunch timing. Opt-in pending validation on Blackwell. |

Sampling

| Field | Default | Notes |
| --- | --- | --- |
| system_sample_rate_ms | 0 | 0 = disabled. ~50–100 ms typical for monitoring. |
| kernel_sample_rate_ms | 0 | Deprecated (1.0.1); has no effect. It previously throttled kernel activity-record processing, but that corrupted kernel GPU-time totals (durations were over-counted on host-bound workloads), so it was disabled. All kernel activity records are now always captured. Still accepted (won't error) for backward compatibility; will be removed in a future major release. |
| continuous_system_sampling | false | Policy for the system-metric sampler. true: sample continuously from init() to shutdown() regardless of scopes; use for fleet monitoring, dashboards, or any always-on use case. false: the sampler is idle by default and activates only while inside a GFL_SCOPE region (auto-bracketing) or between explicit systemStart() / systemStop() calls; outside those windows zero system-metric events are emitted. Renamed from sampling_auto_start in 1.0.4. The old kwarg is still accepted from Python with a DeprecationWarning; C++ callers must use the new name. |
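The continuous_system_sampling policy amounts to a small state machine. The Python sketch below models only the documented rule for when the sampler is active; the class SamplerPolicy and its method names are hypothetical and say nothing about how gpufl implements the sampler internally. Nested scopes are handled with a depth counter, which is an assumption.

```python
class SamplerPolicy:
    """Toy model of when the system-metric sampler emits events."""

    def __init__(self, continuous):
        self.continuous = continuous   # mirrors continuous_system_sampling
        self.scope_depth = 0           # number of open scope regions (assumed nestable)
        self.open_windows = set()      # names passed to an explicit start call

    def enter_scope(self):
        self.scope_depth += 1

    def exit_scope(self):
        self.scope_depth = max(0, self.scope_depth - 1)

    def system_start(self, name):
        self.open_windows.add(name)

    def system_stop(self, name):
        self.open_windows.discard(name)

    @property
    def active(self):
        # Continuous mode is always on; otherwise a scope or explicit window is required.
        return self.continuous or self.scope_depth > 0 or bool(self.open_windows)


policy = SamplerPolicy(continuous=False)
print(policy.active)      # idle outside any window
policy.enter_scope()
print(policy.active)      # active while a scope is open
policy.exit_scope()
print(policy.active)      # idle again
```

With continuous=True the active property is true from construction onward, which matches the always-on fleet-monitoring case.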

Profiling engine

See Profiling Engines below.

Advanced

| Field | Default | Notes |
| --- | --- | --- |
| config_file | "" | Local JSON file applied after the built-in defaults. |
| flush_logs_always | false | fsync after every write. Diagnostics; avoid in production. |
| enable_debug_output | false | Verbose stderr logs from gpufl-client. |
| enabled | true | Global kill switch. When false, init() returns immediately without spawning any backend, opening a logger, or touching CUPTI/NVML, and every later call (Scope, shutdown, upload_logs, systemStart/systemStop) becomes a no-op. Lets you toggle gpufl off without removing the call sites. The GPUFL_DISABLED env var (see below) forces the same behavior and takes precedence over this field. |

Environment variable overrides

These environment variables supply values for their corresponding InitOptions fields. They apply only when the field is left at its default; an option you set explicitly in code wins. The one exception is GPUFL_DISABLED, which takes precedence over the enabled field.

| Env var | Field |
| --- | --- |
| GPUFL_DISABLED | enabled (inverted). Set to 1 / true / yes / on (case-insensitive) to disable gpufl entirely: init() no-ops and every later call is inert. Takes precedence over the enabled kwarg, so you can switch profiling off for a one-off run without editing code: GPUFL_DISABLED=1 python train.py. |
| GPUFL_BACKEND_URL | backend_url |
| GPUFL_API_KEY | api_key |
| GPUFL_API_PATH | api_path |
| GPUFL_CONFIG_FILE | config_file |
| GPUFL_REMOTE_UPLOAD | remote_upload. DEPRECATED v1.1, removed v1.2. Still read in v1.1 (routes through the Python atexit shim). Drop from container manifests when convenient. |
| GPUFL_PROFILING_ENGINE | profiling_engine. Accepts engine names (Monitor, Trace, PcSampling, SassMetrics, RangeProfiler, Deep). |
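One way to read the precedence rule for the ordinary string-valued variables is sketched below in Python. The function resolve_option is hypothetical, and the sketch assumes that "left at default" means the field still equals its built-in default. GPUFL_DISABLED is deliberately left out, because it is documented to win over the enabled field rather than yield to it.

```python
def resolve_option(value, default, env_name, environ):
    """Hypothetical resolver: explicit value, then env var, then default."""
    if value != default:
        # The caller set the field in code, so the environment is ignored.
        return value
    # Field left at default: use the env var if present, else the default.
    return environ.get(env_name, default)


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"GPUFL_BACKEND_URL": "https://api.gpuflight.com"}

print(resolve_option("", "", "GPUFL_BACKEND_URL", fake_env))
print(resolve_option("https://proxy.example.com", "", "GPUFL_BACKEND_URL", fake_env))
print(resolve_option("", "", "GPUFL_API_PATH", fake_env))
```

The first call picks up the environment value, the second keeps the value set in code, and the third falls through to the empty default.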

Scoping

// Macro-based (recommended)
GFL_SCOPE("name") {
// kernels launched here are attributed to "name"
}

// Object-based
{
gpufl::ScopedMonitor scope("name");
// ...
}

// Lambda-based
gpufl::monitor("name", [&]() {
// ...
});

System Monitoring

gpufl::systemStart("phase_name");
// ... GPU work ...
gpufl::systemStop("phase_name");

Backend Kind

enum class BackendKind { Auto, Nvidia, Amd, None };

Profiling engines

Profiling depth is one setting, profiling_engine, chosen from a six-level ladder. The default is Monitor (health metrics only, no CUPTI). Step up the ladder for more detail at higher cost; the production-safe sweet spot is PcSampling, and Deep is the full development-time profile.

Modes are additive

Each level layers on top of the one below. From Trace upward the SDK captures the full activity trace — kernel events (timing, grid/block dimensions, registers, theoretical occupancy), memcpy/memset, and sync events — plus NVML system metrics and host metrics; the higher levels add sampling data on top. Monitor is the exception: it runs no CUPTI at all, so it emits only NVML/host telemetry (and is therefore the lowest-overhead, safest mode).

enum class ProfilingEngine {
Monitor, // Health metrics only — no CUPTI. The default.
Trace, // + activity trace: kernels, memcpy, sync (no sampling)
PcSampling, // + PC-level stall-reason sampling
SassMetrics, // + per-instruction SASS counters
RangeProfiler, // + hardware throughput counters (Perfworks)
Deep, // PcSampling + SassMetrics in one run
};
| Mode | NVIDIA | AMD | Overhead | What it captures |
| --- | --- | --- | --- | --- |
| Monitor (default) | yes | yes | Minimal | NVML system metrics + host metrics only. No CUPTI. |
| Trace | yes | yes | Low | + activity trace: kernel events (timing, grid/block, registers, occupancy), memcpy/memset, sync |
| PcSampling (production-safe) | yes | no | Low | + stall reasons per PC, hot-PC distribution, function/source-line correlation per sample |
| SassMetrics | yes | no | Significant | + per-instruction execution counts, memory coalescing efficiency, divergence analysis |
| RangeProfiler | yes | no | Moderate, per scope | + hardware counter exports per scope (achieved occupancy, DRAM throughput). Niche. |
| Deep | yes | no | Significant kernel slowdown | PcSampling + SassMetrics together; the deepest single-run profile |
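The layering described above can be modeled as sets of captured data. The Python sketch below is a reading aid only: the capability labels are paraphrased from this page, the function captured is hypothetical, and the model assumes each level from Trace upward adds only its own sampling data on top of the shared activity trace.

```python
# Telemetry every engine emits, including Monitor.
TELEMETRY = {"nvml_system_metrics", "host_metrics"}

# Activity trace captured from Trace upward (requires CUPTI).
ACTIVITY_TRACE = {"kernel_events", "memcpy_memset", "sync_events"}

# Extra data each engine adds on top of telemetry + activity trace.
EXTRAS = {
    "Monitor": set(),
    "Trace": set(),
    "PcSampling": {"pc_stall_samples"},
    "SassMetrics": {"sass_instruction_counters"},
    "RangeProfiler": {"range_hw_counters"},
    "Deep": {"pc_stall_samples", "sass_instruction_counters"},
}


def captured(engine):
    """Return the labels of everything the given engine is described as capturing."""
    if engine not in EXTRAS:
        raise ValueError(f"unknown engine: {engine}")
    if engine == "Monitor":
        return set(TELEMETRY)          # no CUPTI, so no activity trace
    return TELEMETRY | ACTIVITY_TRACE | EXTRAS[engine]


for name in EXTRAS:
    print(name, sorted(captured(name)))
```

In this model Deep is exactly the union of PcSampling and SassMetrics, matching the enum comment, while RangeProfiler data is not part of Deep.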
Deep-mode overhead is intrinsic

The Deep-mode kernel slowdown comes from instrumenting every executed SASS instruction with counter increments. The same constraint applies to any tool that collects per-instruction counters (including NVIDIA Nsight Compute, which addresses it with kernel replay rather than a single slower pass). Use Deep mode for the specific kernel you are investigating, not for fleet-wide deployment.

Deep / SassMetrics / RangeProfiler must be initialized early

These three use CUPTI's Profiler API, which must initialize against a clean CUDA context. Call gpufl::init() (gpufl.init() in Python) before your first CUDA kernel; in PyTorch, that means right after import torch. Initializing mid-program, after the framework has loaded modules and run kernels, can make the Profiler API fail to start. GPUFlight degrades gracefully (Deep → PcSampling) and logs the cause, but early init is the fix. Monitor / Trace / PcSampling have no such constraint. See CUDA integration → Profiling Engines.

AMD users

On AMD today only Monitor / Trace and the dispatch-counter path are supported. PcSampling, SassMetrics, RangeProfiler, and Deep are NVIDIA-only — on an AMD backend they fall back to the dispatch-counter path after a startup warning. AMD parity is on the roadmap.


Python API

Core Functions

import gpufl as gfl

# Function-style init — every InitOptions field is a kwarg.
gfl.init(
app_name="my_app",
continuous_system_sampling=True,
system_sample_rate_ms=50,
backend=gfl.BackendKind.Auto,
profiling_engine=gfl.ProfilingEngine.PcSampling,
)

with gfl.Scope("phase_name"):
# GPU work here
pass

gfl.system_start("sampling")
gfl.system_stop("sampling")

gfl.shutdown()

Turning gpufl off without removing the call

init() takes an enabled kwarg (default True). Pass enabled=False to make init() — and every subsequent gpufl call — a no-op, so you can leave the instrumentation in place but switch it off:

gfl.init(app_name="my_app", enabled=False)   # init returns False; Scope/upload_logs/etc. all no-op

The GPUFL_DISABLED environment variable does the same thing and wins over the kwarg, which is handy for a one-off run you don't want to profile without touching the code:

GPUFL_DISABLED=1 python train.py

Truthy values are 1, true, yes, on (case-insensitive). When disabled, gpufl.upload_logs(...) returns an empty result (success=True, events_uploaded=0) and performs no network I/O.
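The truthy-value rule and the precedence of GPUFL_DISABLED over the enabled kwarg can be expressed as a short Python sketch. The function names are hypothetical, and this mirrors only the documented behavior, not gpufl's actual parsing code.

```python
# Documented truthy spellings, compared case-insensitively.
TRUTHY = {"1", "true", "yes", "on"}


def env_disabled(environ):
    """True when GPUFL_DISABLED holds one of the documented truthy values."""
    return environ.get("GPUFL_DISABLED", "").lower() in TRUTHY


def effective_enabled(enabled_kwarg, environ):
    """The env var wins: a truthy GPUFL_DISABLED disables gpufl regardless of the kwarg."""
    return enabled_kwarg and not env_disabled(environ)


# Plain dicts stand in for os.environ.
print(effective_enabled(True, {"GPUFL_DISABLED": "YES"}))   # False
print(effective_enabled(True, {"GPUFL_DISABLED": "0"}))     # True
print(effective_enabled(False, {}))                         # False
```

Note that values such as 0 or an empty string are not in the truthy set, so they leave the enabled kwarg in control.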

BackendKind.None_ — the Python keyword workaround

BackendKind has a value literally named None. Because None is a reserved keyword in Python, you cannot write gfl.BackendKind.None — that's a SyntaxError. The bindings expose a trailing-underscore alias:

# "No backend" — for stub / test sessions.
gfl.init(app_name="m", backend=gfl.BackendKind.None_)

The alias points at the same value as the C++ BackendKind::None constant (mirrors the class_ / type_ pattern pybind11 uses elsewhere).
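The keyword problem is easy to reproduce without gpufl installed. In the sketch below, the BackendKind class is a stand-in written for illustration (it is not the real binding, and the value 3 is arbitrary); the built-in compile() is used to check whether each spelling even parses.

```python
def parses(source):
    """Return True if the expression is syntactically valid Python."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


class BackendKind:
    """Stand-in for the real enum, for illustration only."""
    Auto = 0


# An attribute literally named "None" can exist, but only setattr/getattr can reach it.
setattr(BackendKind, "None", 3)
# The trailing-underscore alias refers to the same value.
BackendKind.None_ = getattr(BackendKind, "None")

print(parses("BackendKind.None"))    # False: None is a keyword
print(parses("BackendKind.None_"))   # True: an ordinary identifier
print(BackendKind.None_ == getattr(BackendKind, "None"))
```

The getattr form works against any object, but the alias is the readable option in application code.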

ProfilingEngine needs no such alias — its lowest level is Monitor (telemetry only, no CUPTI), a normal identifier you write directly: gfl.ProfilingEngine.Monitor.

Migrating from v0.1.0 / v0.1.1

The Python init() signature was trimmed in v1.0.0-prep to drop three legacy parameters that duplicated newer, more expressive ones. If you're upgrading from a pre-v1.0 release and see a TypeError: init() got an unexpected keyword argument …, swap as follows:

| Old kwarg | New equivalent |
| --- | --- |
| enable_profiling=False | profiling_engine=gpufl.ProfilingEngine.Monitor |
| enable_profiling=True (default) | profiling_engine=gpufl.ProfilingEngine.PcSampling (the default is now Monitor, so set this explicitly) |
| enable_perf_scope=True | profiling_engine=gpufl.ProfilingEngine.RangeProfiler |
| remote_config="https://…" | backend_url="https://…" (same meaning) |
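For codebases with many call sites, the table can be applied mechanically. The Python sketch below is a hypothetical migration helper, not something gpufl ships. It returns engine names as strings so it runs without gpufl installed, and it assumes that enable_perf_scope=True should win when both legacy flags are present.

```python
def migrate_init_kwargs(old_kwargs):
    """Translate legacy init() kwargs into their documented replacements."""
    new = dict(old_kwargs)
    engine = None

    if "enable_profiling" in new:
        # True mapped to PcSampling, False to Monitor.
        engine = "PcSampling" if new.pop("enable_profiling") else "Monitor"

    if new.pop("enable_perf_scope", False):
        # Assumption: the perf-scope flag takes priority over enable_profiling.
        engine = "RangeProfiler"

    if "remote_config" in new:
        new["backend_url"] = new.pop("remote_config")

    if engine is not None:
        # Do not overwrite an engine the caller already chose explicitly.
        new.setdefault("profiling_engine", engine)
    return new


legacy = {"app_name": "my_app", "enable_profiling": True,
          "remote_config": "https://api.gpuflight.com"}
print(migrate_init_kwargs(legacy))
```

The string can then be turned into the enum value with getattr(gpufl.ProfilingEngine, name) before calling init(). Call sites that relied on the old implicit enable_profiling=True default pass no legacy kwarg, so they still need profiling_engine set by hand.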

Analyzer

from gpufl.analyzer import GpuFlightSession

session = GpuFlightSession(log_dir, log_prefix="my_app", session_id=None)

session.print_summary() # Executive summary
session.inspect_scopes() # Scope timing analysis
session.inspect_hotspots(top_n=5) # Top kernels by GPU time
session.inspect_stalls() # PC sampling stall distribution
session.inspect_profile_samples() # SASS/PC sample details
session.inspect_perf_metrics() # Hardware counter results

Parsed DataFrames

After construction, GpuFlightSession exposes pandas DataFrames:

| Attribute | Description |
| --- | --- |
| session.kernels | Kernel events with timing and occupancy |
| session.memcpy | Memory transfer events |
| session.scopes | Profile sample data (SASS/PC) |
| session.scope_events | Scope begin/end pairs |
| session.device_metrics | GPU utilization, temp, power samples |
| session.host_metrics | CPU and RAM utilization samples |
| session.perf | Hardware performance counter results |

Report

from gpufl.report import generate_report, TextReport

# One-liner
text = generate_report(log_dir, log_prefix="my_app", top_n=10, output_path=None)

# Class-based
from gpufl.analyzer import GpuFlightSession
session = GpuFlightSession(log_dir, log_prefix="my_app")
report = TextReport(session, top_n=10)
report.print() # stdout
report.save("report.txt") # file
text = report.generate() # string

Visualization

import gpufl.viz as viz   # v1.0.0+ only — see warning below

viz.init("./logs/*.log")
viz.show()
Broken in v0.1.x

gpufl.viz silently drops every batch row in releases 0.1.0 through 0.1.4 (the data layer was never updated for the columnar wire format the C++ client emits). Use the analyzer for visualization-grade insight until the v1.0.0 rewrite ships. See the Python Analysis guide for the full context.