qtimlonnx is not available in the current build. Support for this plugin will be enabled in a future release.
Overview
qtimlonnx is a GStreamer inference element that executes ONNX models as part of AI and multimedia pipelines. The element operates entirely in tensor mode: it accepts input tensors on its sink pad and produces output tensors on its source pad according to the model’s input and output specifications.
The element is limited to model execution. It does not perform preprocessing, tensor reshaping, batching, layout conversion, or model-specific post-processing. Adjacent elements in the pipeline are expected to handle these functions: upstream elements must provide tensors that already match the model requirements, and downstream elements must interpret the output tensors produced by inference.
qtimlonnx supports multiple ONNX Runtime execution providers, including CPU, GPU, and Qualcomm AI accelerator / NPU execution through the QNN runtime. This allows the same pipeline structure to be deployed across different hardware targets and optimized for different performance, latency, and power requirements. The element is intended for real-time and embedded AI pipelines where inference is one stage in a larger modular processing flow.
Key Responsibilities
qtimlonnx is responsible for:
- Loading and executing an ONNX model via the ONNX Runtime
- Accepting preformatted input tensors from upstream elements
- Producing output tensors that match the model output signature
- Negotiating tensor data types and dimensions with adjacent elements
- Propagating tensor metadata required by downstream elements
- Automatically extracting quantization parameters (scale and zero-point) from the ONNX model graph and dequantizing quantized outputs to FLOAT32 when required
- Automatically detecting NCHW or NHWC memory layout for 4-D output tensors and advertising the layout in output caps
Example Pipeline
Download Required Files
| File | Download | Save as |
|---|---|---|
| YOLOX W8A8 model | Qualcomm AI Hub — YOLOX | yolo_x_w8a8.onnx |
| Detection labels | yolov8.json | yolov8.json |
| Sample video | Input video | ai_demo_sample.mp4 |
If any downloaded file is a .zip archive, extract it on your host machine before copying:
unzip filename.zip
Hierarchy
GObject
GstObject
GstElement
GstBaseTransform
qtimlonnx
Pad Templates
sink
| Capabilities | |
|---|---|
| neural-network/tensors | format: { INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32 } |
| Availability: Always | |
| Direction: sink | |
src
| Capabilities | |
|---|---|
| neural-network/tensors | format: { INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32 } |
| Availability: Always | |
| Direction: source | |
Element Properties
| Property | Description |
|---|---|
| backend-path | Absolute path to the QNN backend shared library, such as libQnnHtp.so. Used only when execution-provider=qnn. Determines which Qualcomm hardware accelerator is targeted. Type: String. Default: NULL. Flags: readable/writable (construct). |
| execution-provider | Selects the ONNX Runtime Execution Provider (EP) used for inference. Type: Enum. Default: 0, "cpu". Range: (0) cpu: default ONNX Runtime CPU execution, runs all operations on the host CPU; (1) qnn: Qualcomm QNN Execution Provider, offloads inference to Qualcomm hardware via the QNN SDK and requires backend-path to be set. Flags: readable/writable. |
| htp-performance-mode | Controls the power and performance trade-off on the Qualcomm Hexagon HTP. Applicable only when execution-provider=qnn. Type: Enum. Default: 0, "default". Range: (0) default: default performance mode; (1) burst: maximum performance with highest power consumption; (2) balanced: balanced performance and power; (3) low-balanced: lower balanced performance; (4) high-performance: high performance mode; (5) extreme-power: extreme power performance; (6) low-power: lowest power consumption; (7) sustained-high-performance: sustained high performance without throttling. Flags: readable/writable. |
| model | Path to the ONNX model file. This property is required and must reference a valid .onnx model file. The model is loaded when the element transitions from NULL to READY state. Type: String. Default: NULL. Flags: readable/writable (construct). |
| optimization-level | Controls the ONNX Runtime graph optimization level applied when loading the model. Higher levels may reduce inference latency but increase model load time. Type: Enum. Default: 2, "enable-extended". Range: (0) disable-all: disable all graph optimizations; (1) enable-basic: basic optimizations such as constant folding; (2) enable-extended: extended optimizations including operator fusion; (3) enable-all: all optimizations including layout and memory optimizations. Flags: readable/writable. |
| threads | Number of intra-operation threads assigned to the ONNX Runtime session. Primarily affects CPU execution and may have limited impact when using the QNN execution provider. Type: Unsigned Integer. Default: 1. Range: 1 - 16. Flags: readable/writable (construct). |
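As a rough illustration of how these properties combine, the sketch below assembles a qtimlonnx element description of the kind passed to gst-launch-1.0. The helper name and the file paths are hypothetical placeholders. Only the property names, the enum nicknames, the threads range, and the rule that qnn needs backend-path come from the table above.

```python
# Hypothetical helper, not part of the plugin: formats a qtimlonnx element
# description from the documented properties and applies the documented rules.
def qtimlonnx_description(model, execution_provider="cpu", backend_path=None,
                          htp_performance_mode=None, threads=1,
                          optimization_level="enable-extended"):
    if not model:
        raise ValueError("model is required")
    if not 1 <= threads <= 16:
        raise ValueError("threads must be in the range 1 - 16")
    if execution_provider == "qnn" and not backend_path:
        raise ValueError("execution-provider=qnn requires backend-path")

    props = [
        ("model", model),
        ("execution-provider", execution_provider),
        ("optimization-level", optimization_level),
        ("threads", str(threads)),
    ]
    # backend-path and htp-performance-mode only apply to the QNN provider.
    if execution_provider == "qnn":
        props.append(("backend-path", backend_path))
        if htp_performance_mode is not None:
            props.append(("htp-performance-mode", htp_performance_mode))
    return "qtimlonnx " + " ".join(f"{key}={value}" for key, value in props)


# Placeholder paths; substitute the locations used on the target device.
print(qtimlonnx_description("/path/to/yolo_x_w8a8.onnx",
                            execution_provider="qnn",
                            backend_path="/path/to/libQnnHtp.so",
                            htp_performance_mode="burst"))
```

The sketch does no quoting, so paths that contain spaces would need to be quoted before the string is used in a shell.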
Input and Output Behavior
Input Tensors
qtimlonnx exposes a single sink pad, but it supports both single-input and multi-input models. For multi-input models, all required tensors are delivered through the same sink pad as a tensor set.
Input tensors must be fully prepared before they reach qtimlonnx. Expected tensor layout, shape, data type, and batch size are determined by:
- the ONNX model input signature (read at engine initialization)
- caps negotiation with upstream elements
Preparation is typically handled by upstream elements such as:
- qtimlvconverter for scaling, color conversion, normalization, and quantization
- qtibatch for batch construction
Output Tensors
qtimlonnx exposes a single source pad and produces output tensors according to the model output signature. The single source pad does not limit the element to a single tensor. Models with multiple output tensors are fully supported, and all outputs are emitted together on the same pad. The element supports:
- single-output and multi-output models
- arbitrary tensor ranks, including batch and depth dimensions
- quantized and floating-point outputs
Quantization and Dequantization
qtimlonnx can optionally dequantize quantized output tensors (such as UINT8 or INT8) into FLOAT32. This conversion uses quantization parameters (scale and zero-point) extracted directly from the ONNX model graph at engine initialization time.
Quantization Parameter Extraction
At initialization, qtimlonnx parses the ONNX model graph and locates QuantizeLinear nodes whose outputs match the model’s declared output tensor names. For each such node, the plugin reads:
- scale: from the second input of the QuantizeLinear node (a graph initializer)
- zero-point: from the third input of the QuantizeLinear node (a graph initializer)
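The lookup described above can be sketched in plain Python. This is only an illustration: the graph is modelled with dictionaries rather than the real ONNX protobuf types, and the function name, the field names, and the sample tensor names are all hypothetical.

```python
# Illustrative stand-in for an ONNX graph: nodes are dicts, and initializers
# map a tensor name to its constant value. None of this is the plugin's code.
def find_output_quant_params(nodes, initializers, output_names):
    """Return {output_name: (scale, zero_point)} for quantized model outputs."""
    params = {}
    for node in nodes:
        if node["op_type"] != "QuantizeLinear":
            continue
        inputs = node["inputs"]
        if len(inputs) < 3:
            continue  # scale or zero-point input is absent
        scale = initializers.get(inputs[1])       # second input
        zero_point = initializers.get(inputs[2])  # third input
        if scale is None or zero_point is None:
            continue  # parameters are not stored as graph initializers
        for name in node["outputs"]:
            if name in output_names:
                params[name] = (scale, zero_point)
    return params


nodes = [
    {"op_type": "Conv", "inputs": ["image", "weights"], "outputs": ["features"]},
    {"op_type": "QuantizeLinear",
     "inputs": ["features", "boxes_scale", "boxes_zero_point"],
     "outputs": ["boxes"]},
]
initializers = {"boxes_scale": 0.25, "boxes_zero_point": 128}
params = find_output_quant_params(nodes, initializers, {"boxes"})
print(params)
```

Outputs without a matching QuantizeLinear node are simply absent from the result, which corresponds to the case where no quantization metadata is available.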
Conditional Output Dequantization
Dequantization is performed only when the downstream path requires FLOAT32 tensors. In practice, this is enabled when downstream caps negotiation indicates that floating-point output is needed.
When dequantization is applied, qtimlonnx:
- reads the per-tensor scale and zero_point extracted from the model graph
- applies the standard dequantization formula: real_value = scale × (quantized_value - zero_point)
- produces FLOAT32 tensors for downstream processing
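A minimal sketch of per-tensor dequantization, assuming the standard linear scheme in which each real value equals scale multiplied by the quantized value minus zero_point. The function name and the sample numbers are illustrative and are not taken from the plugin.

```python
def dequantize(quantized_values, scale, zero_point):
    """Apply real = scale * (q - zero_point) to every element of a tensor."""
    return [scale * (q - zero_point) for q in quantized_values]


# Illustrative UINT8 values with scale 0.5 and zero_point 128.
print(dequantize([0, 128, 255], scale=0.5, zero_point=128))
```

The quantized value equal to zero_point maps to exactly 0.0, so zero is represented without error.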
When Dequantization Is Skipped
Dequantization is not performed when:
- downstream elements accept only quantized tensor types
- no downstream element negotiates FLOAT32
- the model output tensor is not produced by a QuantizeLinear node with valid quantization metadata
Supported Data Types
qtimlonnx supports the tensor data types provided by the ONNX Runtime and the selected execution provider, subject to caps negotiation with adjacent elements. Supported data types include: INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32.
NCHW / NHWC Layout Detection
For 4-D output tensors, qtimlonnx automatically detects the memory layout (NCHW or NHWC) at engine initialization by traversing the ONNX model graph backwards from each output tensor and inspecting Transpose node permutations:
- A Transpose with perm=[0,2,3,1] indicates the output is NHWC (the node converted NCHW→NHWC).
- A Transpose with perm=[0,3,1,2] indicates the output is NCHW (the node converted NHWC→NCHW).
- If no Transpose node is found, the output defaults to NCHW (the ONNX standard).
When the detected layout is NCHW, qtimlonnx adds layout=nchw to the source pad caps. Downstream elements (such as qtimlpostprocess) use this field to perform the necessary in-place NCHW→NHWC transpose before processing.
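The permutation rules above can be expressed as a small classifier. This sketch assumes that the first recognised Transpose found while walking back from an output decides the layout. The function names are hypothetical, and the plugin's real traversal may differ.

```python
NCHW_TO_NHWC = [0, 2, 3, 1]
NHWC_TO_NCHW = [0, 3, 1, 2]


def layout_from_transposes(perms_from_output):
    """Classify a 4-D output from the Transpose perms met walking backwards."""
    for perm in perms_from_output:
        if list(perm) == NCHW_TO_NHWC:
            return "NHWC"
        if list(perm) == NHWC_TO_NCHW:
            return "NCHW"
    return "NCHW"  # no recognised Transpose: fall back to the default


def transpose_shape(shape, perm):
    """Shape after Transpose: output axis i takes input axis perm[i]."""
    return tuple(shape[axis] for axis in perm)


# An NCHW shape (N=1, C=3, H=224, W=224) permuted with [0, 2, 3, 1].
print(transpose_shape((1, 3, 224, 224), NCHW_TO_NHWC))
print(layout_from_transposes([NCHW_TO_NHWC]))
```

The shape helper shows why perm=[0,2,3,1] yields NHWC: the channel axis moves from position 1 to the last position.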
Batch and Depth Model Support
qtimlonnx supports models with batch and multi-dimensional tensor inputs and outputs, including tensors with explicit batch and depth dimensions. Examples include:
- batched tensors: N × H × W × C
- multi-dimensional tensors: N × D × H × W × C
qtimlonnx does not assemble batches itself; batch construction is handled upstream by qtibatch.
This behavior keeps inference predictable across single-frame, batched, and higher-dimensional workflows.
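To make the effect of the batch and depth dimensions concrete, the sketch below computes the byte size of a tensor from its shape and data type. The helper name and the example shapes are illustrative assumptions. The element sizes are the standard widths of the listed types.

```python
from math import prod

# Standard element widths in bytes for the supported tensor data types.
BYTES_PER_ELEMENT = {
    "INT8": 1, "UINT8": 1,
    "INT16": 2, "UINT16": 2, "FLOAT16": 2,
    "INT32": 4, "UINT32": 4, "FLOAT32": 4,
    "INT64": 8, "UINT64": 8,
}


def tensor_nbytes(shape, dtype):
    """Bytes needed for a densely packed tensor of the given shape and type."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


# Hypothetical shapes: a batch of four N x H x W x C frames, and a
# five-dimensional N x D x H x W x C tensor.
print(tensor_nbytes((4, 640, 640, 3), "UINT8"))
print(tensor_nbytes((1, 16, 112, 112, 3), "FLOAT32"))
```

Buffer size grows linearly with N and D, which is worth checking when sizing buffers for batched or volumetric models.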
Execution Providers
An ONNX Runtime Execution Provider (EP) defines the hardware backend used to run a model. Execution providers allow qtimlonnx to offload inference from the default CPU interpreter to an optimized backend such as Qualcomm’s HTP/DSP.
qtimlonnx supports two execution providers. The provider is selected through the execution-provider property and controls how the ONNX Runtime dispatches model operations during inference.
CPU
Runs the model on the default ONNX Runtime CPU interpreter.
- Backend: CPU
- Use case: reference execution, debugging, or systems without hardware acceleration
- Additional configuration: none required
QNN (Qualcomm Neural Network)
Offloads inference to the AI Accelerator / NPU via the ONNX Runtime’s QNN Execution Provider.
- Backend: Qualcomm AI Accelerator / NPU
- Use case: hardware-accelerated inference on Qualcomm SoCs for quantized and floating-point models
- Additional configuration required:
  - backend-path: absolute path to the QNN backend shared library (e.g. libQnnHtp.so)
  - htp-performance-mode: optional power/performance trade-off setting
When execution-provider=qnn, the plugin performs the following initialization sequence:
- Registers libonnxruntime_providers_qnn_abi.so with the ONNX Runtime environment using RegisterExecutionProviderLibrary.
- Enumerates available QNN EP devices using GetEpDevices.
- Attaches the QNN EP to the session options using SessionOptionsAppendExecutionProvider_V2, passing backend_path and htp_performance_mode as provider options.
- Creates the ONNX Runtime session with the configured QNN EP.
Runtime Memory Behavior and GAP Handling
qtimlonnx operates within the memory model of the ONNX Runtime. Input buffers from the pipeline are mapped read-only and wrapped directly as ONNX Runtime input tensors using CreateTensorWithDataAsOrtValue, avoiding a copy of the input data. Output tensors are written into DMA-backed output buffers allocated from the element’s GstMLBufferPool.
ONNX Runtime Memory Model
The ONNX Runtime manages its own internal memory for:
- intermediate activation tensors
- output tensors (allocated by the runtime and then copied into the pipeline output buffer)
GAP Buffer Handling
qtimlonnx is GAP-aware and correctly handles input buffers marked with GST_BUFFER_FLAG_GAP.
When a GAP buffer is received, the element skips inference and forwards the buffer downstream. This preserves timing and synchronization while explicitly indicating that no valid inference input is available for that timestamp.
GAP buffers commonly appear in conditional AI pipelines, such as cascaded workflows where later inference stages run only when earlier stages produce valid regions of interest.
Usage
Single-Stage AI Inference on Live Camera Stream
This example demonstrates real-time ONNX inference on a live camera stream using a single qtimlonnx instance with the QNN execution provider. Inference results are attached to each GstBuffer as MLMeta, allowing downstream elements to access synchronized metadata directly from the frame. An overlay stage renders annotations such as bounding boxes, labels, or keypoints before display.
Download Required Files
| File | Download | Save as |
|---|---|---|
| YOLOX W8A8 model | Qualcomm AI Hub — YOLOX | yolo_x_w8a8.onnx |
| Detection labels | yolov8.json | yolov8.json |
If any downloaded file is a .zip archive, extract it on your host machine before copying:
unzip filename.zip
Two-Stage Daisy Chain AI Inference on File Stream
This example demonstrates a two-stage ONNX inference workflow using two qtimlonnx instances. The first model operates on full video frames after preprocessing by a qtimlvconverter configured for full-frame input. Inference results, such as detected objects, are attached to the corresponding video buffer and propagated downstream. The second model runs once for each object detected by the first stage. A second qtimlvconverter, configured for ROI-based processing, crops each detected region from the input frame and prepares it as input for the second qtimlonnx instance.
Download Required Files
| File | Download | Save as |
|---|---|---|
| Detection model | Qualcomm AI Hub — YOLOX | yolo_x_w8a8.onnx |
| Detection labels | yolov8.json | yolov8.json |
| Classification model (InceptionV3) | Qualcomm AI Hub — InceptionV3 | mobilenet_v2_w8a8.onnx |
| Classification labels | mobilenet.json | mobilenet.json |
| Sample video | Input video | ai_demo_sample.mp4 |
If any downloaded file is a .zip archive, extract it on your host machine before copying:
unzip filename.zip