qtimlonnx is not available in the current build. Support for this plugin will be enabled in a future release.

Overview

qtimlonnx is a GStreamer inference element that executes ONNX models as part of AI and multimedia pipelines. The element operates entirely in tensor mode: it accepts input tensors on its sink pad and produces output tensors on its source pad according to the model’s input and output specifications.

The element is limited to model execution. It does not perform preprocessing, tensor reshaping, batching, layout conversion, or model-specific post-processing. These functions are expected to be handled by adjacent elements in the pipeline. As a result, upstream elements must provide tensors that already match the model requirements, and downstream elements must interpret the output tensors produced by inference.

qtimlonnx supports multiple ONNX Runtime execution providers, including CPU, GPU, and Qualcomm AI accelerator / NPU execution through the QNN runtime. This allows the same pipeline structure to be deployed across different hardware targets and optimized for different performance, latency, and power requirements.

The element is intended for real-time and embedded AI pipelines where inference is one stage in a larger modular processing flow.

Key Responsibilities

qtimlonnx is responsible for:
  • Loading and executing an ONNX model via the ONNX Runtime
  • Accepting preformatted input tensors from upstream elements
  • Producing output tensors that match the model output signature
  • Negotiating tensor data types and dimensions with adjacent elements
  • Propagating tensor metadata required by downstream elements
  • Automatically extracting quantization parameters (scale and zero-point) from the ONNX model graph and dequantizing quantized outputs to FLOAT32 when required
  • Automatically detecting NCHW or NHWC memory layout for 4-D output tensors and advertising the layout in output caps
In practice, qtimlonnx serves as the inference stage in the pipeline, while tensor preparation and result interpretation are handled externally.

Example Pipeline

1. Download Required Files

File | Download | Save as
YOLOX W8A8 model | Qualcomm AI Hub - YOLOX | yolo_x_w8a8.onnx
Detection labels | yolov8.json | yolov8.json
Sample video | Input video | ai_demo_sample.mp4
If any downloaded file is a .zip archive, extract it on your host machine before copying: unzip filename.zip
2. Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.
# Run from your host machine — replace <user> and <device-ip>

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media,media/output}"
scp yolo_x_w8a8.onnx          <user>@<device-ip>:$HOME/models/
scp yolov8.json                  <user>@<device-ip>:$HOME/labels/
scp ai_demo_sample.mp4   <user>@<device-ip>:$HOME/media/
3. Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>
4. Set environment variables

Run the following commands on your device:
export MODEL_NAME=yolo_x_w8a8.onnx
export LABELS_NAME=yolov8.json
export SRC_VIDEO_NAME=ai_demo_sample.mp4
5. Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME ! qtdemux ! h264parse ! \
v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw,format=NV12 ! queue ! \
tee name=t ! qtimetamux name=obj_mux ! qtivoverlay ! waylandsink fullscreen=true sync=false \
t. ! queue ! qtimlvconverter ! queue ! qtimlonnx model=$HOME/models/$MODEL_NAME execution-provider=qnn backend-path="/usr/lib/libQnnHtp.so" ! queue ! \
qtimlpostprocess module=yolov8 labels=$HOME/labels/$LABELS_NAME settings="{\"confidence\": 51.0}" ! text/x-raw ! queue ! obj_mux.

Hierarchy

GObject
   GstObject
      GstElement
         GstBaseTransform
            qtimlonnx

Pad Templates

sink

Capabilities
neural-network/tensors
    format: { INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32 }
Availability: Always
Direction: sink

src

Capabilities
neural-network/tensors
    format: { INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32 }
Availability: Always
Direction: source

Element Properties

Property | Description
backend-path | Absolute path to the QNN backend shared library, such as libQnnHtp.so. Used only when execution-provider=qnn. Determines which Qualcomm hardware accelerator is targeted.

Type: String
Default: NULL
Flags: readable/writable (construct)
execution-provider | Selects the ONNX Runtime Execution Provider (EP) used for inference.

Type: Enum
Default: 0, "cpu"
Range:
    (0): cpu - Default ONNX Runtime CPU execution. Runs all operations on the host CPU
    (1): qnn - Qualcomm QNN Execution Provider. Offloads inference to Qualcomm hardware via the QNN SDK. Requires backend-path to be set
Flags: readable/writable
htp-performance-mode | Controls the power and performance trade-off on the Qualcomm Hexagon HTP. Applicable only when execution-provider=qnn.

Type: Enum
Default: 0, "default"
Range:
    (0): default - Default performance mode
    (1): burst - Maximum performance with highest power consumption
    (2): balanced - Balanced performance and power
    (3): low-balanced - Lower balanced performance
    (4): high-performance - High performance mode
    (5): extreme-power - Extreme power performance
    (6): low-power - Lowest power consumption
    (7): sustained-high-performance - Sustained high performance without throttling
Flags: readable/writable
model | Path to the ONNX model file. This property is required and must reference a valid .onnx model file. The model is loaded when the element transitions from NULL to READY state.

Type: String
Default: NULL
Flags: readable/writable (construct)
optimization-level | Controls the ONNX Runtime graph optimization level applied when loading the model. Higher levels may reduce inference latency but increase model load time.

Type: Enum
Default: 2, "enable-extended"
Range:
    (0): disable-all - Disable all graph optimizations
    (1): enable-basic - Basic optimizations such as constant folding
    (2): enable-extended - Extended optimizations including operator fusion
    (3): enable-all - All optimizations including layout and memory optimizations
Flags: readable/writable
threads | Number of intra-operation threads assigned to the ONNX Runtime session. Primarily affects CPU execution and may have limited impact when using the QNN execution provider.

Type: Unsigned Integer
Default: 1
Range: 1 - 16
Flags: readable/writable (construct)

Input and Output Behavior

Input Tensors

qtimlonnx exposes a single sink pad, but it supports both single-input and multi-input models. For multi-input models, all required tensors are delivered through the same sink pad as a tensor set. Input tensors must be fully prepared before they reach qtimlonnx. Expected tensor layout, shape, data type, and batch size are determined by:
  • the ONNX model input signature (read at engine initialization)
  • caps negotiation with upstream elements
Typical upstream elements include qtimlvconverter for preprocessing and qtibatch for batch construction. qtimlonnx does not modify, reshape, batch, or reinterpret incoming tensors. It wraps them directly as ONNX Runtime input tensors (zero-copy) and passes them to the runtime as received.
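To make the "fully prepared" requirement concrete, the following is a minimal sketch of how a tensor delivered by an upstream element could be checked against a model input signature. The TensorSpec type, the helper names, and the 1 × 640 × 640 × 3 UINT8 shape are hypothetical and used only for illustration; they are not part of the plugin.

```python
from dataclasses import dataclass
from math import prod

# Bytes per element for the tensor formats listed in the pad templates.
ITEM_SIZE = {
    "INT8": 1, "UINT8": 1, "INT16": 2, "UINT16": 2, "FLOAT16": 2,
    "INT32": 4, "UINT32": 4, "FLOAT32": 4, "INT64": 8, "UINT64": 8,
}


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical description of one tensor: element format and dimensions."""
    fmt: str
    dims: tuple


def matches(expected: TensorSpec, fmt: str, dims) -> bool:
    """True only if format and every dimension are identical (no reshaping is attempted)."""
    return expected.fmt == fmt and tuple(expected.dims) == tuple(dims)


def byte_size(spec: TensorSpec) -> int:
    """Number of bytes a densely packed tensor with this spec occupies."""
    return prod(spec.dims) * ITEM_SIZE[spec.fmt]


# Hypothetical model input: one 640x640 RGB image in NHWC order, 8-bit quantized.
model_input = TensorSpec("UINT8", (1, 640, 640, 3))
print(matches(model_input, "UINT8", (1, 640, 640, 3)))   # True
print(matches(model_input, "FLOAT32", (1, 640, 640, 3)))  # False: wrong data type
print(byte_size(model_input))                              # 1228800
```

A mismatch in either the data type or any dimension would have to be resolved upstream, because the element passes the buffer to the runtime as received.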

Output Tensors

qtimlonnx exposes a single source pad and produces output tensors according to the model output signature. The single source pad does not limit the element to a single tensor. Models with multiple output tensors are fully supported, and all outputs are emitted together on the same pad. The element supports:
  • single-output and multi-output models
  • arbitrary tensor ranks, including batch and depth dimensions
  • quantized and floating-point outputs
Output tensors are typically consumed by downstream post-processing elements, which decode model-specific results such as classification scores, detection boxes, segmentation masks, landmarks, or other structured outputs.

Quantization and Dequantization

qtimlonnx can optionally dequantize quantized output tensors (such as UINT8 or INT8) into FLOAT32. This conversion uses quantization parameters (scale and zero-point) extracted directly from the ONNX model graph at engine initialization time.

Quantization Parameter Extraction

At initialization, qtimlonnx parses the ONNX model graph and locates QuantizeLinear nodes whose outputs match the model’s declared output tensor names. For each such node, the plugin reads:
  • scale — from the second input of the QuantizeLinear node (a graph initializer)
  • zero-point — from the third input of the QuantizeLinear node (a graph initializer)
These values are stored per output tensor and used during inference to perform dequantization.
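The lookup described above can be sketched as follows. This is an illustration only: the graph is modeled with plain dictionaries rather than the real ONNX protobuf structures, and the function name, the dictionary keys, and the tensor names are hypothetical.

```python
def extract_quant_params(nodes, initializers, output_names):
    """Return {output_name: (scale, zero_point)} for outputs produced by QuantizeLinear.

    nodes        -- list of dicts with "op_type", "inputs", "outputs" (stand-in for graph nodes)
    initializers -- dict mapping initializer name to its constant value
    output_names -- names of the model's declared output tensors
    """
    params = {}
    for node in nodes:
        if node["op_type"] != "QuantizeLinear" or len(node["inputs"]) < 3:
            continue
        scale_name, zero_point_name = node["inputs"][1], node["inputs"][2]
        # Both values must be graph initializers, as described above.
        if scale_name not in initializers or zero_point_name not in initializers:
            continue
        for out in node["outputs"]:
            if out in output_names:
                params[out] = (initializers[scale_name], initializers[zero_point_name])
    return params


# Hypothetical two-output model: "boxes" is quantized, "scores" is not.
nodes = [
    {"op_type": "Conv", "inputs": ["x", "w"], "outputs": ["conv_out"]},
    {"op_type": "QuantizeLinear", "inputs": ["conv_out", "boxes_scale", "boxes_zp"], "outputs": ["boxes"]},
    {"op_type": "Sigmoid", "inputs": ["conv_out"], "outputs": ["scores"]},
]
initializers = {"boxes_scale": 0.25, "boxes_zp": 128}
print(extract_quant_params(nodes, initializers, {"boxes", "scores"}))
# {'boxes': (0.25, 128)}
```

Outputs without an entry in the returned mapping have no quantization metadata, which corresponds to the case where dequantization is skipped.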

Conditional Output Dequantization

Dequantization is performed only when the downstream path requires FLOAT32 tensors. In practice, this is enabled when downstream caps negotiation indicates that floating-point output is needed. When dequantization is applied, qtimlonnx:
  • reads the per-tensor scale and zero_point extracted from the model graph
  • applies the standard dequantization formula:
output_float = (quantized_value - zero_point) × scale
  • produces FLOAT32 tensors for downstream processing
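As a worked example of the formula above, the following sketch dequantizes a few UINT8 values in plain Python. The scale and zero-point values are made up for illustration, and the function name is hypothetical; the plugin performs the equivalent operation internally on whole tensors.

```python
def dequantize(values, scale, zero_point):
    """Apply output_float = (quantized_value - zero_point) * scale to each element."""
    return [(q - zero_point) * scale for q in values]


# Hypothetical per-tensor parameters: scale = 0.5, zero_point = 128.
print(dequantize([0, 128, 255], scale=0.5, zero_point=128))
# [-64.0, 0.0, 63.5]
```

The zero-point is the quantized value that maps to 0.0, and the scale is the step in floating-point units between two adjacent quantized values.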

When Dequantization Is Skipped

Dequantization is not performed when:
  • downstream elements accept only quantized tensor types
  • no downstream element negotiates FLOAT32
  • the model output tensor is not produced by a QuantizeLinear node with valid quantization metadata
In these cases, the output tensor is forwarded in its original quantized representation. This behavior allows the same downstream processing path to support both quantized and floating-point models where applicable, while avoiding unnecessary conversion.

Supported Data Types

qtimlonnx supports the tensor data types provided by the ONNX Runtime and the selected execution provider, subject to caps negotiation with adjacent elements. Supported data types include:
  • INT8
  • UINT8
  • INT16
  • UINT16
  • INT32
  • UINT32
  • INT64
  • UINT64
  • FLOAT16
  • FLOAT32
The element does not impose additional data-type restrictions beyond those required by the ONNX Runtime, the selected execution provider, and the negotiated pipeline caps.

NCHW / NHWC Layout Detection

For 4-D output tensors, qtimlonnx automatically detects the memory layout (NCHW or NHWC) at engine initialization by traversing the ONNX model graph backwards from each output tensor and inspecting Transpose node permutations:
  • A Transpose with perm=[0,2,3,1] indicates the output is NHWC (the node converted NCHW→NHWC).
  • A Transpose with perm=[0,3,1,2] indicates the output is NCHW (the node converted NHWC→NCHW).
  • If no Transpose node is found, the output defaults to NCHW (the ONNX standard).
When any output tensor is detected as NCHW, the plugin adds layout=nchw to the source pad caps. Downstream elements (such as qtimlpostprocess) use this field to perform the necessary in-place NCHW→NHWC transpose before processing.
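The detection rules above can be sketched as a backward walk over the graph. This is a simplified illustration: nodes are plain dictionaries instead of ONNX protobuf objects, the function name is hypothetical, and the walk follows only the first input of each node, which is an assumption and may differ from the traversal the plugin actually performs.

```python
def detect_layout(output_name, nodes):
    """Walk backwards from an output tensor and classify it as "NCHW" or "NHWC"."""
    # Map every tensor name to the node that produces it.
    producers = {out: node for node in nodes for out in node["outputs"]}
    name, visited = output_name, set()
    while name in producers and name not in visited:
        visited.add(name)
        node = producers[name]
        if node["op_type"] == "Transpose":
            if node.get("perm") == [0, 2, 3, 1]:
                return "NHWC"  # the node converted NCHW to NHWC
            if node.get("perm") == [0, 3, 1, 2]:
                return "NCHW"  # the node converted NHWC to NCHW
        if not node["inputs"]:
            break
        name = node["inputs"][0]  # assumption: follow the first input only
    return "NCHW"  # no Transpose found: fall back to the default


# Hypothetical graph: Conv -> Transpose(perm=[0,2,3,1]) -> "out"
nodes = [
    {"op_type": "Conv", "inputs": ["x", "w"], "outputs": ["conv_out"]},
    {"op_type": "Transpose", "inputs": ["conv_out"], "outputs": ["out"], "perm": [0, 2, 3, 1]},
]
print(detect_layout("out", nodes))       # NHWC
print(detect_layout("conv_out", nodes))  # NCHW
```

In the first call the Transpose directly in front of the output decides the result; in the second call no Transpose lies on the path, so the default applies.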

Batch and Depth Model Support

qtimlonnx supports models with batch and multi-dimensional tensor inputs and outputs, including tensors with explicit batch and depth dimensions. Examples include:
  • batched tensors: N × H × W × C
  • multi-dimensional tensors: N × D × H × W × C
The element treats these dimensions transparently and passes tensors to the ONNX Runtime according to the negotiated shape. It does not construct batches, reshape tensors, or reinterpret tensor dimensions internally. Batch construction must be handled by upstream elements such as qtibatch. This behavior keeps inference predictable across single-frame, batched, and higher-dimensional workflows.

Execution Providers

An ONNX Runtime Execution Provider (EP) defines the hardware backend used to run a model. Execution providers allow qtimlonnx to offload inference from the default CPU interpreter to an optimized backend such as Qualcomm’s HTP/DSP. qtimlonnx supports two execution providers. The provider is selected through the execution-provider property and controls how the ONNX Runtime dispatches model operations during inference.

CPU

Runs the model on the default ONNX Runtime CPU interpreter.
  • Backend: CPU
  • Use case: reference execution, debugging, or systems without hardware acceleration
  • Additional configuration: none required

QNN (Qualcomm Neural Network)

Offloads inference to the AI Accelerator / NPU via the ONNX Runtime’s QNN Execution Provider.
  • Backend: Qualcomm AI Accelerator / NPU
  • Use case: hardware-accelerated inference on Qualcomm SoCs for quantized and floating-point models
  • Additional configuration required:
    • backend-path — absolute path to the QNN backend shared library (e.g. libQnnHtp.so)
    • htp-performance-mode — optional power/performance trade-off setting
When execution-provider=qnn, the plugin performs the following initialization sequence:
  1. Registers libonnxruntime_providers_qnn_abi.so with the ONNX Runtime environment using RegisterExecutionProviderLibrary.
  2. Enumerates available QNN EP devices using GetEpDevices.
  3. Attaches the QNN EP to the session options using SessionOptionsAppendExecutionProvider_V2, passing backend_path and htp_performance_mode as provider options.
  4. Creates the ONNX Runtime session with the configured QNN EP.
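For comparison, the same choice of provider and options can be expressed with the ONNX Runtime Python API. The sketch below does not reproduce the registration sequence above; it uses the simpler providers list accepted by onnxruntime.InferenceSession, and it only builds that list so it can run without the runtime installed. The build_providers helper is hypothetical, and passing the htp-performance-mode value through unchanged is an assumption, since the mapping between the element's enum values and the QNN option strings is not documented here.

```python
def build_providers(execution_provider, backend_path=None, htp_performance_mode=None):
    """Translate element-style settings into an ONNX Runtime `providers` list."""
    if execution_provider == "cpu":
        return ["CPUExecutionProvider"]
    if execution_provider == "qnn":
        if not backend_path:
            # Mirrors the documented rule: qnn requires backend-path to be set.
            raise ValueError("backend-path must be set when execution-provider=qnn")
        options = {"backend_path": backend_path}
        if htp_performance_mode is not None:
            options["htp_performance_mode"] = htp_performance_mode
        return [("QNNExecutionProvider", options)]
    raise ValueError(f"unknown execution provider: {execution_provider!r}")


providers = build_providers("qnn", backend_path="/usr/lib/libQnnHtp.so",
                            htp_performance_mode="burst")
print(providers)

# On a device with a QNN-enabled onnxruntime build, the list could be used as:
#   import onnxruntime as ort
#   session = ort.InferenceSession("model.onnx", providers=providers)
```

Keeping the option names identical to the provider options named in step 3 (backend_path and htp_performance_mode) makes it easier to compare a desktop experiment with the behavior of the element.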

Runtime Memory Behavior and GAP Handling

qtimlonnx operates within the memory model of the ONNX Runtime. Input buffers from the pipeline are mapped read-only and wrapped directly as ONNX Runtime input tensors using CreateTensorWithDataAsOrtValue, avoiding a copy of the input data. Output tensors are written into DMA-backed output buffers allocated from the element’s GstMLBufferPool.

ONNX Runtime Memory Model

The ONNX Runtime manages its own internal memory for:
  • intermediate activation tensors
  • output tensors (allocated by the runtime and then copied into the pipeline output buffer)

GAP Buffer Handling

qtimlonnx is GAP-aware and correctly handles input buffers marked with GST_BUFFER_FLAG_GAP. When a GAP buffer is received, the element skips inference and forwards the buffer downstream. This preserves timing and synchronization while explicitly indicating that no valid inference input is available for that timestamp. GAP buffers commonly appear in conditional AI pipelines, such as cascaded workflows where later inference stages run only when earlier stages produce valid regions of interest.
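The GAP behavior can be illustrated with a small model of the transform step. The Buffer class and the transform function below are hypothetical stand-ins for GstBuffer and the element's transform callback; they show only the control flow, not the real GStreamer API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Buffer:
    """Stand-in for a GstBuffer: timestamp, payload, and a GAP marker."""
    pts: int
    data: Optional[bytes]
    gap: bool = False


def transform(buf: Buffer, run_inference: Callable[[bytes], bytes]) -> Buffer:
    """Skip inference for GAP buffers and forward them with their timestamp intact."""
    if buf.gap:
        return buf
    return Buffer(pts=buf.pts, data=run_inference(buf.data))


calls = []

def fake_inference(data: bytes) -> bytes:
    """Placeholder for model execution; records that it was called."""
    calls.append(data)
    return data[::-1]


out_valid = transform(Buffer(pts=0, data=b"abc"), fake_inference)
out_gap = transform(Buffer(pts=33, data=None, gap=True), fake_inference)
print(out_valid.data, out_gap.gap, len(calls))  # b'cba' True 1
```

Downstream elements still receive one buffer per timestamp, so synchronization with the video branch is preserved even when no inference ran.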

Usage

Single-Stage AI Inference on Live Camera Stream

This example demonstrates real-time ONNX inference on a live camera stream using a single qtimlonnx instance with the QNN execution provider. Inference results are attached to each GstBuffer as MLMeta, allowing downstream elements to access synchronized metadata directly from the frame. An overlay stage renders annotations such as bounding boxes, labels, or keypoints before display.
1. Download Required Files

File | Download | Save as
YOLOX W8A8 model | Qualcomm AI Hub - YOLOX | yolo_x_w8a8.onnx
Detection labels | yolov8.json | yolov8.json
If any downloaded file is a .zip archive, extract it on your host machine before copying: unzip filename.zip
2. Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.
# Run from your host machine — replace <user> and <device-ip>

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels}"
scp yolo_x_w8a8.onnx            <user>@<device-ip>:$HOME/models/
scp yolov8.json   <user>@<device-ip>:$HOME/labels/
3. Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>
4. Set environment variables

Run the following commands on your device:
export MODEL_NAME=yolo_x_w8a8.onnx
export LABELS_NAME=yolov8.json
5. Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
qticamsrc ! video/x-raw,width=1920,height=1080,format=NV12,framerate=30/1 ! queue ! tee name=t \
t. ! queue ! qtimetamux name=metamux ! queue ! qtivoverlay ! waylandsink fullscreen=true sync=false \
t. ! queue ! qtimlvconverter ! queue ! qtimlonnx model=$HOME/models/$MODEL_NAME execution-provider=qnn backend-path="libQnnHtp.so" ! queue ! \
qtimlpostprocess results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME ! text/x-raw ! queue ! metamux.

Two-Stage Daisy Chain AI Inference on File Stream

This example demonstrates a two-stage ONNX inference workflow using two qtimlonnx instances. The first model operates on full video frames after preprocessing by a qtimlvconverter configured for full-frame input. Inference results, such as detected objects, are attached to the corresponding video buffer and propagated downstream. The second model runs once for each object detected by the first stage. A second qtimlvconverter, configured for ROI-based processing, crops each detected region from the input frame and prepares it as input for the second qtimlonnx instance.
1. Download Required Files

File | Download | Save as
Detection model | Qualcomm AI Hub - YOLOX | yolo_x_w8a8.onnx
Detection labels | yolov8.json | yolov8.json
Classification model (InceptionV3) | Qualcomm AI Hub - InceptionV3 | mobilenet_v2_w8a8.onnx
Classification labels | mobilenet.json | mobilenet.json
Sample video | Input video | ai_demo_sample.mp4
If any downloaded file is a .zip archive, extract it on your host machine before copying: unzip filename.zip
2. Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.
# Run from your host machine — replace <user> and <device-ip>

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media}"
scp yolo_x_w8a8.onnx          <user>@<device-ip>:$HOME/models/
scp yolov8.json               <user>@<device-ip>:$HOME/labels/
scp mobilenet_v2_w8a8.onnx    <user>@<device-ip>:$HOME/models/
scp mobilenet.json                <user>@<device-ip>:$HOME/labels/
scp ai_demo_sample.mp4   <user>@<device-ip>:$HOME/media/
3. Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>
4. Set environment variables

Run the following commands on your device:
export MODEL_NAME_1=yolo_x_w8a8.onnx
export LABELS_NAME_1=yolov8.json
export MODEL_NAME_2=mobilenet_v2_w8a8.onnx
export LABELS_NAME_2=mobilenet.json
export SRC_VIDEO_NAME=ai_demo_sample.mp4
5. Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
qtimlvconverter name=stage_01_preproc \
qtimlonnx model=$HOME/models/$MODEL_NAME_1 execution-provider=qnn backend-path="libQnnHtp.so" htp-performance-mode=1 name=stage_01_inference \
qtimlpostprocess name=stage_01_postproc results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME_1 \
qtimlvconverter name=stage_02_preproc \
qtimlonnx model=$HOME/models/$MODEL_NAME_2 execution-provider=qnn backend-path="libQnnHtp.so" htp-performance-mode=1 name=stage_02_inference \
qtimlpostprocess name=stage_02_postproc module=mobilenet labels=$HOME/labels/$LABELS_NAME_2 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME ! qtdemux ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw,format=NV12 ! queue ! tee name=t_split_1 \
t_split_1. ! queue ! metamux_1. \
t_split_1. ! queue ! stage_01_preproc. stage_01_preproc. ! queue ! stage_01_inference. stage_01_inference. ! queue ! stage_01_postproc. stage_01_postproc. ! text/x-raw ! queue ! metamux_1. \
qtimetamux name=metamux_1 ! queue ! qtiobjtracker algo=bytetrack ! queue ! tee name=t_split_2 \
t_split_2. ! queue ! metamux_2. \
t_split_2. ! queue ! stage_02_preproc. stage_02_preproc. ! queue ! stage_02_inference. stage_02_inference. ! queue ! stage_02_postproc. stage_02_postproc. ! text/x-raw ! queue ! metamux_2. \
qtimetamux name=metamux_2 ! queue ! qtivoverlay ! queue ! waylandsink fullscreen=true