Capturing traces with gpufl trace
gpufl trace is the fastest way to try GPUFlight on an existing
CUDA program. You do not need to link the SDK or edit application code.
The launcher starts your program with GPUFlight injected, writes local
NDJSON logs, and can optionally upload the result when the process exits.
A trace session captures the launched process. If an application
already embeds GPUFlight and calls gpufl::init() itself, run it
normally and configure it through the embedded SDK. Do not assume
that such an application can also be wrapped by gpufl trace.
Basic run
gpufl trace -- python train.py
On Windows:
gpufl.exe trace -- python train.py
GPUFlight creates a local session directory under ~/.gpufl/traces/
on Linux/macOS or %USERPROFILE%\.gpufl\traces\ on Windows. The
directory contains one subfolder per session id, with device.log,
scope.log, and system.log files.
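To sanity-check a capture without the dashboard, you can count the records in each log. The sketch below is a minimal example and assumes the three files hold one JSON object per line (NDJSON), as described above. It does not assume any particular field names, and count_records and latest_session are hypothetical helpers, not part of GPUFlight.

```python
import json
from pathlib import Path

# File names taken from the session layout described above.
LOG_NAMES = ("device.log", "scope.log", "system.log")


def count_records(session_dir):
    """Return {log name: number of parseable JSON lines} for one session folder."""
    counts = {}
    for name in LOG_NAMES:
        path = Path(session_dir) / name
        if not path.is_file():
            continue  # skip logs that were not written for this session
        parsed = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    continue  # e.g. a partially written final line
                parsed += 1
        counts[name] = parsed
    return counts


def latest_session(traces_root=None):
    """Return the most recently modified session subfolder, or None."""
    # Assumption: the default root is ~/.gpufl/traces, as stated above.
    root = Path(traces_root) if traces_root else Path.home() / ".gpufl" / "traces"
    if not root.is_dir():
        return None
    sessions = [p for p in root.iterdir() if p.is_dir()]
    return max(sessions, key=lambda p: p.stat().st_mtime, default=None)
```

For example, count_records(latest_session()) reports the record counts for the newest session, provided at least one session exists.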
Name and output directory
gpufl trace --name=resnet50-smoke --output=./runs/resnet50 -- python train.py
The output directory is the source of truth. You can inspect it locally,
upload it later with gpufl upload, or let gpufl-agent tail it in a
production setup.
Pick capture passes
By default, gpufl trace runs one Trace pass, which captures kernels,
memcpy/memset operations, synchronization, launch details, and system
metrics.
For explicit control, pass --passes:
# Same as the default.
gpufl trace --passes=Trace -- python train.py
# Timeline plus PM hardware-counter sampling in isolated passes.
gpufl trace --passes=Trace,PmSampling -- python train.py
# Deep shorthand: Trace,PcSampling,SassMetrics.
gpufl trace --passes=Deep -- python train.py
The PcSampling, SassMetrics, PmSampling, and Range Profiler engines
use NVIDIA performance-counter APIs. On Linux, the NVIDIA driver may
restrict those counters to administrator users until you relax the
driver setting. See Linux Configuration.
For monitoring-only GPU/host telemetry, use gpufl monitor instead of
gpufl trace.
Add application context with NVTX
Raw kernel names are often not enough, especially for LLM inference or
large training scripts where the same kernels repeat across many phases.
When using gpufl trace, the lightweight way to add structure is NVTX:
#include <nvtx3/nvToolsExt.h>
nvtxRangePushA("prefill");
run_prefill();
nvtxRangePop();
nvtxRangePushA("decode");
run_decode();
nvtxRangePop();
NVTX ranges let the trace show application phases without linking the
GPUFlight SDK. If your application embeds GPUFlight directly, you can
also use GFL_SCOPE / gpufl.Scope for GPUFlight-owned scope events.
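For a Python workload such as train.py, the same push/pop pattern can be sketched with NVIDIA's nvtx Python package. This is a minimal sketch under two assumptions: the nvtx package is installed (pip install nvtx), and ranges emitted from Python are picked up by the trace in the same way as the C calls above. The phase helper is a hypothetical name, and it degrades to a no-op when nvtx is missing.

```python
from contextlib import contextmanager

try:
    import nvtx  # NVIDIA's NVTX bindings for Python (assumed to be installed)
except ImportError:
    nvtx = None  # fall back to a no-op so the script still runs


@contextmanager
def phase(name):
    """Wrap a block of work in an NVTX range named `name`."""
    if nvtx is not None:
        nvtx.push_range(name)
    try:
        yield
    finally:
        # Pop even if the body raises, so ranges stay balanced.
        if nvtx is not None:
            nvtx.pop_range()
```

Usage mirrors the C example: put run_prefill() inside a with phase("prefill"): block and run_decode() inside a with phase("decode"): block.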
Multi-pass profiling
Some CUPTI engines cannot safely run together in one CUDA context.
gpufl trace handles this by running the same command several times,
one engine per pass, then tagging the passes so the backend can merge
them into one analysis.
gpufl trace --passes=Trace,PcSampling,RangeProfilerKernelReplay -- python train.py
Think of the merged result as a union of capabilities:
- Trace owns canonical kernel timing and launch metadata.
- PcSampling adds stall samples that do not overlap with trace timing.
- RangeProfilerKernelReplay adds per-kernel hardware counters.
Multi-pass profiling requires the command to run more than once. It is not a live attach mode for an already-running service.
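The "union of capabilities" idea can be illustrated with a toy merge. This is not GPUFlight's backend logic: the record shape, the field names, and the merge_passes function are all hypothetical, chosen only to show each pass contributing its own fields while Trace keeps ownership of anything it also reports.

```python
def merge_passes(passes):
    """Merge per-pass kernel records into one record per kernel.

    passes: {pass_name: {kernel_name: {field: value}}}  (hypothetical shape)
    Returns {kernel_name: {field: value}}.
    """
    merged = {}
    for pass_name, kernels in passes.items():
        for kernel, fields in kernels.items():
            entry = merged.setdefault(kernel, {})
            for key, value in fields.items():
                # Trace is authoritative: it may overwrite earlier values,
                # while other passes only fill in fields that are still missing.
                if key in entry and pass_name != "Trace":
                    continue
                entry[key] = value
    return merged


example = merge_passes({
    "PcSampling": {"gemm": {"duration_us": 999, "stall_samples": 40}},
    "Trace": {"gemm": {"duration_us": 120, "grid": (64, 1, 1)}},
    "RangeProfilerKernelReplay": {"gemm": {"sm_throughput_pct": 71.5}},
})
```

In this toy example the merged gemm record keeps the timing from Trace and gains the stall samples and the counter value from the other two passes.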
Upload after capture
For a one-off run:
gpufl trace --upload -- python train.py
--upload requires:
export GPUFL_BACKEND_URL=https://api.gpuflight.com
export GPUFL_API_KEY=gpfl_xxxxx
Under the hood --upload runs gpufl-agent alongside your program and
streams the trace live, so the session appears on the dashboard while it
is still running. The launcher must be able to find the agent: either a
gpufl-agent binary on PATH, or --agent-jar / GPUFL_AGENT_JAR
pointing at the fat JAR (with java on PATH). To tail a trace with an
agent you already run, such as a daemon or a shared box, see
Live upload with gpufl-agent.
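Before launching a long run with --upload, it can help to confirm these prerequisites up front. The following is a minimal pre-flight sketch, not a GPUFlight tool. upload_problems is a hypothetical helper that checks only the two environment variables and the agent lookup described above. It cannot see an --agent-jar flag passed on the command line, and it does not verify that the JAR file exists.

```python
import os
import shutil


def upload_problems(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # Both variables are required by --upload.
    for var in ("GPUFL_BACKEND_URL", "GPUFL_API_KEY"):
        if not env.get(var):
            problems.append(f"{var} is not set")

    # The agent is found either as a binary on PATH, or as a JAR named by
    # GPUFL_AGENT_JAR together with java on PATH.
    has_binary = which("gpufl-agent") is not None
    has_jar = bool(env.get("GPUFL_AGENT_JAR")) and which("java") is not None
    if not (has_binary or has_jar):
        problems.append(
            "no gpufl-agent on PATH and no GPUFL_AGENT_JAR with java on PATH"
        )
    return problems
```

Calling upload_problems() with no arguments inspects the current environment. The env and which parameters exist only so the check can be exercised without changing the real environment.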
You can also upload later:
gpufl upload ./runs/resnet50
See Sending data to the dashboard for all upload paths,
including gpufl-agent, browser upload, and gpufl monitor --upload.
Limits
Multi-pass trace runs the command more than once, so it only makes
sense for workloads that can be replayed. gpufl monitor is the
separate command for GPU/host telemetry and does not attach CUPTI to
another running process.