Overview

The qtimlvconverter element is an essential component of the AI pipeline, responsible for AI preprocessing—specifically, preparing video frame data for neural network inference. It efficiently processes incoming buffers in YUV or RGB formats and converts them into tensors that are compatible with machine learning models.

This transformation encompasses several critical operations:

Cropping: Selects a specific region of the input frame to focus on. This uses provided ROI metadata to determine the crop region for each frame.
Rescaling: Adjusts the spatial dimensions of input frames to match the expected tensor size for the target model to ensure compatibility and consistent performance.
Format Conversion: Translates pixel data between supported formats (e.g., YUV to RGB) to meet the input requirements of the target model.
Batching: Aggregates multiple frames or images into batches to optimize inference throughput and leverage parallel processing capabilities.
Normalization: Applies pixel value scaling and normalization techniques (such as mean subtraction and standard deviation division) to standardize input data for improved model accuracy.
Temporal Batching: Prepares input tensors with shape NDHWC by stacking multiple temporally consecutive frames along the depth dimension. This enables compatibility with video-based models, which require temporal context across frames.

To perform the transformation efficiently, qtimlvconverter runs on the GPU and executes these steps in a chain. It resizes the entire input frame or the cropped region to fit the target tensor dimensions while preserving the full field of view. In this step, only the resolution changes; the format and data range remain the same. By default, it maintains the original aspect ratio, so the geometric shapes of objects remain undistorted in the output. If the input frame’s aspect ratio differs from that of the target tensor, the resized frame is positioned in the top-left corner and any remaining area is filled with a black background. This method ensures that the entire tensor is populated, preserves the integrity of the original image content, and avoids cropping or distortion, which makes it well suited for models that require consistent spatial representation.

The plugin also analyzes the dimensional order, count, and size to automatically select the appropriate tensor layout. Output tensors may be produced in either interleaved (NHWC) or planar (NCHW) formats, supporting RGBA, RGB, or GRAYSCALE pixel arrangements. This selection requires no manual configuration. In addition to conventional four-dimensional tensor outputs, the plugin also supports five-dimensional tensors in which the additional dimension represents depth (NDHWC); this configuration is also detected and applied automatically.

To accommodate models trained with alternative pixel arrangements, the default tensor format (typically RGBA, RGB, or GRAYSCALE) can be modified via the subpixel-layout property. This enables support for formats such as BGRA or BGR.
Through this automated and highly configurable approach, the plugin streamlines the transformation of video data into neural network-ready tensors, facilitating robust and efficient inference across varied deployment scenarios.

More details about the tensor options are available later on this page.

Example Pipeline

Download Required Files

File	Download	Save as
YOLOX W8A8 model	Qualcomm AI Hub — YOLOX	`yolo_x_w8a8.tflite`
Detection labels	yolov8.json	`yolov8.json`
Sample video	Input video	`ai_demo_sample.mp4`

If any downloaded file is a .zip archive, extract it on your host machine before copying: unzip filename.zip

Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.
# Run from your host machine — replace <user> and <device-ip>

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media,media/output}"
scp yolo_x_w8a8.tflite           <user>@<device-ip>:$HOME/models/
scp yolov8.json                  <user>@<device-ip>:$HOME/labels/
scp ai_demo_sample.mp4    <user>@<device-ip>:$HOME/media/

Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>

Set environment variables

Run the commands below on your device:

export MODEL_NAME=yolo_x_w8a8.tflite
export LABELS_NAME=yolov8.json
export SRC_VIDEO_NAME=ai_demo_sample.mp4

Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME ! qtdemux ! h264parse ! \
v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw,format=NV12 ! queue ! \
tee name=t ! qtimetamux name=obj_mux ! qtivoverlay ! waylandsink fullscreen=true sync=false \
t. ! queue ! qtimlvconverter ! queue ! \
qtimltflite model=$HOME/models/$MODEL_NAME delegate=external external-delegate-path=libQnnTFLiteDelegate.so \
external-delegate-options="QNNExternalDelegate,backend_type=htp,log_level=(string)1;" ! queue ! \
qtimlpostprocess module=yolov8 labels=$HOME/labels/$LABELS_NAME \
settings="{\"confidence\": 51.0}" ! text/x-raw ! queue ! obj_mux.

Hierarchy

GObject
   GstObject
      GstElement
         GstBaseTransform
            qtimlvconverter

Pad Templates

sink

Capabilities
`video/x-raw`	`format: { RGBA, BGRA, ABGR, ARGB, RGBx, BGRx, xRGB, xBGR, BGR, RGB, GRAY8, NV12, NV21, YUY2, UYVY, NV12_Q08C }` `width: [1, 32767]` `height: [1, 32767]` `framerate: [0/1, 255/1]`
Availability: Always
Direction: sink

src

Capabilities
`neural-network/tensors`	`format: { INT8, UINT8, INT16, UINT16, INT32, UINT32, INT64, UINT64, FLOAT16, FLOAT32 }` `width: [1, 32767]` `height: [1, 32767]` `framerate: [0/1, 255/1]`
Availability: Always
Direction: source

Element Properties

Property	Description
`engine`	Engine backend used for the conversion operations. `Type: Enum` `Default: 2, "gles"` `Range:` `(0): none - No backend used` `(2): gles - Use OpenGLES based video converter` `(3): fcv - Use FastCV based video converter` `(4): ocv - Use OpenCV based video converter` `Flags: readable/writable`
`image-disposition`	Aspect ratio and placement of the image inside the output tensor. `Type: Enum` `Default: 0, "top-left"` `Range:` `(0): top-left - Preserve aspect ratio during resize and place the image in the top-left corner of the output tensor` `(1): centre - Preserve aspect ratio during resize and place the image in the centre of the output tensor` `(2): stretch - Ignore the aspect ratio and, if required, stretch the image to fit completely inside the output tensor` `(3): centre-crop - Ignore the aspect ratio and, if required, crop the source around its centre to fit completely inside the output tensor` `Flags: readable/writable`
`mean`	Channel mean subtraction values for FLOAT tensors such as `{R, G, B}`, `{R, G, B, A}`, or `{<G>}`. `Type: GstValueArray of type gdouble` `Default: "< >"` `Flags: readable/writable`
`mode`	Conversion mode. `Type: Enum` `Default: 0, "image-batch-non-cumulative"` `Range:` `(0): image-batch-non-cumulative - ROI metadata is ignored. Immediately process incoming buffers regardless of whether there are enough image memory blocks to fill the requested tensor batch size.` `(1): image-batch-cumulative - ROI metadata is ignored. Accumulate buffers until there are enough image memory blocks to fill the requested tensor batch size. Accumulation is interrupted early if a GAP buffer is received.` `(2): roi-batch-non-cumulative - Use only ROI metas to fill tensor batch size. Immediately process incoming buffers regardless of whether there are enough ROI metas to fill the requested tensor batch size. In case no ROI meta is present a GAP buffer will be produced.` `(3): roi-batch-cumulative - Use only ROI metas to fill tensor batch size. Accumulate buffers until there are enough ROI metas to fill the requested tensor batch size. Accumulation is interrupted early if a GAP buffer is received or if there are no ROI metas present inside the received buffer.` `Flags: readable/writable`
`sigma`	Channel divisor values for FLOAT tensors such as `{R, G, B}`, `{R, G, B, A}`, or `{<G>}`. `Type: GstValueArray of type gdouble` `Default: "< >"` `Flags: readable/writable`
`subpixel-layout`	Arrangement of the image pixels in the output tensor. `Type: Enum` `Default: 0, "regular"` `Range:` `(0): regular - RGB, RGBA, RGBx` `(1): reverse - BGR, BGRA, BGRx` `Flags: readable/writable`

Image/Video Tensor Characteristics

An image or video tensor has several key characteristics that define how data is represented and processed in machine learning pipelines:

🛈 Layouts
 N - Tensor batch size. The number of frames the model can process at the same time.
 D - Tensor depth. Represents the number of history frames from a single source/stream.
 H - Tensor height in pixels.
 W - Tensor width in pixels.
 C - Number of pixel components (4 == RGBA/BGRA, 3 == RGB/BGR, 1 == GRAYSCALE).

Tensor Dimensions / Descriptor

This defines the shape and layout of the tensor. Common descriptors include:

NCHW (Batch, Channels, Height, Width): All pixels for one channel are stored contiguously. Preferred by many deep learning frameworks (e.g., PyTorch, Caffe) because it optimizes convolution operations on GPUs. For example, in RGB:
- First all Red values for the entire image
- Then all Green values
- Then all Blue values
NHWC (Batch, Height, Width, Channels): Channels are interleaved per pixel. Common in TensorFlow and IM SDK because it aligns better with memory access patterns for certain hardware accelerators. For example, in RGB:
- Pixel 1: R, G, B
- Pixel 2: R, G, B
- and so on
NDHWC (Batch, Depth, Height, Width, Channels): This format extends the NHWC layout by adding a Depth (D) dimension, making it suitable for video sequences or volumetric data (e.g., temporal stacks for action recognition). Frames are stored sequentially, and within each frame, pixels are arranged as in NHWC (interleaved channels per pixel). For example:
- Frame 1: R,G,B for each pixel
- Frame 2: R,G,B for each pixel
- and so on.
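To make the three layouts concrete, the following sketch computes the position of a single value inside the flat tensor buffer for each layout. It is an illustration only: the function names are hypothetical and not part of qtimlvconverter, and it assumes a densely packed, row-major buffer with no padding between rows or planes.

```python
def nhwc_offset(n, h, w, c, shape):
    """Flat index of element (n, h, w, c) in an interleaved NHWC tensor."""
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c


def nchw_offset(n, c, h, w, shape):
    """Flat index of element (n, c, h, w) in a planar NCHW tensor."""
    _, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def ndhwc_offset(n, d, h, w, c, shape):
    """Flat index of element (n, d, h, w, c) in an NDHWC tensor."""
    _, D, H, W, C = shape
    return (((n * D + d) * H + h) * W + w) * C + c


# Green value (channel 1) of the pixel at row 0, column 1:
print(nhwc_offset(0, 0, 1, 1, (1, 480, 640, 3)))  # 4: right after R,G,B of pixel 0 and R of pixel 1
print(nchw_offset(0, 1, 0, 1, (1, 3, 480, 640)))  # 307201: after the whole 480x640 red plane

# First value of the second frame in a 4-frame sequence:
print(ndhwc_offset(0, 1, 0, 0, 0, (1, 4, 480, 640, 3)))  # 921600
```

In NHWC the three channel values of one pixel sit next to each other, while in NCHW the same values are a full image plane apart.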

NCHW and NHWC formats contain the same data but organize it differently. The order of the letters indicates how the data is arranged within the tensor. Take the tensors [1,3,480,640] and [1,480,640,3] as an example. In both cases the batch size is 1, the width is 640, the height is 480, and the number of channels is 3. Assuming the image format is RGB, the key difference between these formats is how the RGB pixels are arranged within the buffer.

Data Format

Specifies the type of values stored in the tensor, such as UINT8, INT8, UINT16, FLOAT32, or FLOAT16.

Data Range

Defines the expected value range of the tensor elements:

0-255 - Typical for raw UINT8 image data
-128-127 - Typical for raw INT8 image data
0.0-1.0 - Common for normalized floating-point data for FLOAT16 and FLOAT32 tensors
-1.0-1.0 - Used in some models that expect centered data; the user must explicitly configure the normalization parameters (offset and scale)
Custom ranges may be used depending on the model’s training setup

Color Format

Indicates how color information is represented:

RGB – Red, Green, Blue (standard format for most models)
BGR – Blue, Green, Red (used by some OpenCV-based models)
Grayscale – Single-channel format for monochrome images

Supported Tensors by qtimlvconverter

Property	Description
Tensor Shape	NHWC, NCHW, NDHWC
Data Format	uint8, int8, float32, float16
Data Range	any
Color Format	RGB, BGR, Grayscale

Image Placement Matters in Model Performance

When working with computer vision models, we often focus on the content of the image, but where the image is placed inside the input tensor can also make a big difference. By default, qtimlvconverter maintains the original aspect ratio, ensuring that the geometric shapes of objects remain undistorted in the output. If the input frame’s aspect ratio differs from that of the target tensor, the resized frame is positioned in the top-left corner, and any remaining area is filled with a black background. This method ensures that the entire tensor is populated, preserves the integrity of the original image content, and avoids cropping or distortion, making it well-suited for models that require consistent spatial representation.

However, some models perform better when the image is centered or stretched to fill the entire tensor without padding. To support such use cases, qtimlvconverter provides four configurable placement modes through the image-disposition property:

top-left (default) - Keeps the original aspect ratio of the image and places it in the top-left corner of the output tensor.
centre - Also preserves the aspect ratio, but centers the image within the tensor. This is a common choice for models that expect the main object to be in the middle.
stretch - Ignores the original aspect ratio and stretches the image to completely fill the tensor. This can introduce distortion but ensures full coverage, which some models require.
centre-crop - Ignores the source image aspect ratio and, if required, crops the source around its center to fit completely inside the output tensor.
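The geometry behind these modes can be sketched as follows. This is an illustration, not the plugin's implementation: the function names are hypothetical, and rounding fractional sizes down is an assumption (the element may round differently).

```python
def place_image(src_w, src_h, dst_w, dst_h, disposition="top-left"):
    """Return (x, y, width, height) of the image region inside the output tensor."""
    if disposition in ("stretch", "centre-crop"):
        # Both modes cover the whole tensor; they differ in how the source is sampled.
        return (0, 0, dst_w, dst_h)
    if disposition not in ("top-left", "centre"):
        raise ValueError(f"unknown disposition: {disposition}")
    # Integer cross-multiplication avoids floating-point rounding surprises.
    if src_w * dst_h >= src_h * dst_w:
        out_w, out_h = dst_w, src_h * dst_w // src_w   # width is the limiting side
    else:
        out_w, out_h = src_w * dst_h // src_h, dst_h   # height is the limiting side
    if disposition == "centre":
        return ((dst_w - out_w) // 2, (dst_h - out_h) // 2, out_w, out_h)
    return (0, 0, out_w, out_h)


def centre_crop_window(src_w, src_h, dst_w, dst_h):
    """Return (x, y, width, height) of the source area kept by centre-crop."""
    if src_w * dst_h >= src_h * dst_w:
        crop_w, crop_h = src_h * dst_w // dst_h, src_h  # trim left and right
    else:
        crop_w, crop_h = src_w, src_w * dst_h // dst_w  # trim top and bottom
    return ((src_w - crop_w) // 2, (src_h - crop_h) // 2, crop_w, crop_h)


# A 1920x1080 frame converted into a 640x640 tensor:
print(place_image(1920, 1080, 640, 640, "top-left"))  # (0, 0, 640, 360)
print(place_image(1920, 1080, 640, 640, "centre"))    # (0, 140, 640, 360)
print(place_image(1920, 1080, 640, 640, "stretch"))   # (0, 0, 640, 640)
print(centre_crop_window(1920, 1080, 640, 640))       # (420, 0, 1080, 1080)
```

With top-left and centre, the rows of the tensor not covered by the returned region are the black padding described above.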

What if You Need to Crop Instead?

In some cases, rather than resizing or repositioning the entire image, you may need to crop a specific region - for example, a random or targeted part of the input image - and feed only that portion to the model. This approach is useful when the model is trained to focus on localized features. While image-disposition helps with placement and aspect ratio handling, cropping is a separate preprocessing step that gives you more control over which part of the image the model sees. To achieve this, you can insert a qtivtransform step before qtimlvconverter to perform the necessary cropping on the input image before it reaches the converter.

Here’s how each component works:

source: This is your input stream.
qtivtransform: This stage allows you to apply transformations to the image, such as cropping, resizing, or rotating. In this context, you can define a crop region via properties passed directly to qtivtransform, enabling you to extract a specific part of the input image before it reaches the model.

⚠️ Important: The output resolution of qtivtransform must be specified manually.
To avoid stretching or distortion, the aspect ratio of the output resolution
must match the aspect ratio of the crop window. If you're unsure what
resolution to choose, you can simply set the crop width and height as
the output resolution - this ensures a 1:1 mapping without scaling.

qtimlvconverter: After cropping, this component prepares the image for inference. It handles scaling, color conversion, normalization and positioning based on the image-disposition property (e.g., top-left, centre, stretch).
sink: This is the next element, which receives the tensor.

By combining input cropping via qtivtransform with placement control via image-disposition in qtimlvconverter, you can achieve any desired transformation - whether it’s focusing on a specific region, preserving aspect ratio, or aligning the image within the tensor. This flexibility is especially valuable when adapting to different model requirements or optimizing inference performance. The image below demonstrates how a specific region of the input image can be converted into a tensor.

This example showcases how to convert a predefined region. While qtivtransform supports runtime crop window updates, this solution is not always scalable. In more advanced use cases, qtimlvconverter can perform cropping on its own, without relying on qtivtransform. For instance, if you have two models working sequentially - a detection model followed by a pose estimation model - qtimlvconverter can automatically crop and generate a tensor for each bounding box detected by the first stage.

Normalization

Normalization of pixel values is performed automatically, tailored to the negotiated tensor data type. For floating-point tensor types (FLOAT16, FLOAT32), pixel values are normalized to the range [0.0, 1.0]. For signed and unsigned integer types (such as INT8, UINT8, INT16, UINT16), normalization is applied according to the full value range of each type: for example, INT8 is normalized to [−128, 127], UINT8 to [0, 255], and INT16 to [−32,768, 32,767]. Beyond this automatic normalization, the plugin provides further customization through the mean and sigma properties, enabling per-channel normalization using the following formula:

🛈 Normalization

normalized_value = (<Pixel Channel Value> - mean) * sigma

🛈 Example

An INT16 RGB image with values R=2000, G=-344, B=0 and property values 
mean="<1280.0, 1560.0, -1240.0>" and sigma="<0.75, 0.34, 0.02>" would
have the following channel transformations:
R=(2000 - 1280.0) x 0.75, G=(-344 - 1560.0) x 0.34, B=(0 + 1240.0) x 0.02.

This custom normalization is applied subsequently, after the initial range-based normalization, allowing precise adjustment of pixel values to meet the requirements of specific neural network models or training regimes. This dual-stage normalization approach ensures that tensor data is both type-appropriate and optimally conditioned for inference, supporting robust and accurate model performance across diverse deployment scenarios.
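The per-channel formula can be checked with a few lines of Python. This sketch only reproduces the arithmetic of the INT16 example above (with G=-344 as stated in the input values); the function name is hypothetical, and sigma is applied as a multiplier, exactly as the formula is written.

```python
def normalize_pixel(pixel, mean, sigma):
    """Apply (value - mean) * sigma to each channel of one pixel."""
    if not (len(pixel) == len(mean) == len(sigma)):
        raise ValueError("pixel, mean and sigma must have one entry per channel")
    return [(value - m) * s for value, m, s in zip(pixel, mean, sigma)]


# Values taken from the example: R=2000, G=-344, B=0
result = normalize_pixel(
    pixel=[2000, -344, 0],
    mean=[1280.0, 1560.0, -1240.0],
    sigma=[0.75, 0.34, 0.02],
)
print([round(v, 2) for v in result])  # [540.0, -647.36, 24.8]
```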

Batching

Batching is a key optimization technique in AI inference pipelines, allowing multiple frames to be processed simultaneously. This improves throughput and enables parallel execution on hardware accelerators. In the context of qtimlvconverter, incoming video buffers may originate from:

A single source (e.g., one camera or video file), where each GstBuffer contains a single GstMemory block.
A multiplexed stream (e.g., multiple cameras or video files), where each GstBuffer contains multiple GstMemory blocks. This configuration is enabled via the qtibatch plugin, which aggregates frames from different sources into a single batched buffer.

Batching is especially useful when working with models trained to process multiple inputs simultaneously (e.g., batch size = 4). It allows the inference engine to utilize GPU resources more efficiently and reduces per-frame processing overhead.

How It Works:

Each GstMemory block in the buffer represents one frame
qtimlvconverter iterates over all memory blocks and applies preprocessing steps (cropping, resizing, format conversion, normalization) to each frame individually.
The processed frames are then packed into a single tensor with shape [N, H, W, C] or [N, C, H, W], depending on the selected layout.
The batch size N is inferred from the number of memory blocks in the buffer.
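The packing step can be pictured with a small sketch. It is a simplified model, assuming each frame has already been preprocessed into a flat HWC byte buffer of identical size; `pack_batch` is a hypothetical helper, not an API of the plugin.

```python
def pack_batch(frames, height, width, channels):
    """Concatenate preprocessed HWC frames into one NHWC tensor buffer."""
    frame_size = height * width * channels
    tensor = bytearray()
    for index, frame in enumerate(frames):
        if len(frame) != frame_size:
            raise ValueError(
                f"frame {index} has {len(frame)} bytes, expected {frame_size}"
            )
        tensor += frame
    # N is simply the number of frames (memory blocks) that were packed.
    return (len(frames), height, width, channels), bytes(tensor)


# Four tiny 2x2 RGB frames, each filled with its own index value
frames = [bytes([i]) * (2 * 2 * 3) for i in range(4)]
shape, data = pack_batch(frames, 2, 2, 3)
print(shape)      # (4, 2, 2, 3)
print(len(data))  # 48
```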

Benefits:

Parallel Inference: Enables models to process multiple frames simultaneously, improving throughput and latency.
Efficient GPU Utilization: Reduces per-frame overhead and maximizes hardware acceleration.
Flexible Input Handling: Supports both single-stream and multi-stream scenarios without requiring manual configuration.

Tensor Layouts: Depending on the model and configuration, the output tensor may use:

NHWC: Interleaved layout (e.g., [4, 480, 640, 3] for batch size 4, RGB).
NCHW: Planar layout (e.g., [4, 3, 480, 640]), often used in CNNs.
NDHWC: For models requiring temporal depth (e.g., [1, 4, 480, 640, 3]), where D represents history frames.

qtimlvconverter automatically selects the appropriate layout based on the model’s input signature and the number of frames in the batch.

Temporal Batching

Temporal batching extends the concept of standard batching by grouping multiple frames from the same stream over time into a single tensor. This approach is essential for models that require temporal context, such as:

Action recognition (e.g., detecting gestures or activities across consecutive frames).
Object tracking (e.g., maintaining identity across frames).
3D CNNs or RNN-based vision models that process sequences rather than single images.

How It Works:

Instead of aggregating frames from different sources, temporal batching collects D consecutive frames from the same stream.
These frames are packed into a tensor with shape:
- NDHWC: [N, D, H, W, C]
  - N = batch size (number of sequences processed together)
  - D = depth (number of frames per sequence)
  - H, W = spatial dimensions
  - C = channels (e.g., RGB)
qtimlvconverter automatically handles:
- Frame accumulation based on the configured depth (D).
- Preprocessing for each frame (resize, normalization, format conversion).
- Tensor assembly in the correct layout for the model.

Benefits:

Preserves temporal continuity, enabling models to learn motion patterns.
Improves inference accuracy for tasks that depend on frame-to-frame relationships.
Fully utilizes GPU resources by processing sequences in parallel.

Example:

For a model expecting 4-frame sequences:
- Tensor shape: [1, 4, 480, 640, 3] (batch size = 1, depth = 4, RGB).
- Frames are normalized and resized individually, then stacked in temporal order.
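Frame accumulation for a depth of D can be modelled as below. This is a sketch under a stated assumption: it groups frames into non-overlapping sequences of D frames, whereas the text above does not say whether consecutive sequences overlap. The class name is hypothetical.

```python
class TemporalBatcher:
    """Collect D consecutive frames of one stream into a single sequence."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self.depth = depth
        self.pending = []

    def push(self, frame):
        """Add one preprocessed frame; return a full sequence or None."""
        self.pending.append(frame)
        if len(self.pending) < self.depth:
            return None  # still accumulating
        sequence, self.pending = self.pending, []
        return sequence  # frames stay in temporal order


batcher = TemporalBatcher(depth=4)
outputs = [batcher.push(f"frame{i}") for i in range(8)]
print(outputs[3])  # ['frame0', 'frame1', 'frame2', 'frame3']
print(outputs[7])  # ['frame4', 'frame5', 'frame6', 'frame7']
```

Each returned sequence corresponds to one entry along the D axis of a [1, 4, H, W, C] tensor.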

Multi-Stage Inference Pipelines with ROI-Based Processing

In advanced AI pipelines, inference is often performed in multiple stages - each consisting of preprocessing, inference, and postprocessing. These stages may operate on the full image or on specific regions of interest (ROIs) identified by earlier stages. To support this, qtimlvconverter can process buffers containing multiple GstVideoRegionOfInterestMeta entries. These ROIs are typically generated by upstream ML components or custom plugins and represent targeted regions within the input frame that require further analysis. By default, qtimlvconverter processes the entire image, ignoring any ROI metadata. However, when configured via the mode property, it can switch to ROI-based processing, enabling selective transformation of only the regions marked for further inference.

ROI Processing Modes

The mode property defines how input regions are handled and batched before being converted into tensors. There are two main categories:

Image Batch Modes:
- image-batch-non-cumulative: Processes each buffer immediately, regardless of batch size.
- image-batch-cumulative: Accumulates full-frame inputs until batch size is met.
ROI Batch Modes:
- roi-batch-non-cumulative: Processes ROI metadata immediately, discarding excess entries.
- roi-batch-cumulative: Accumulates ROI entries until batch size is met.

These modes allow fine-grained control over latency and resource utilization, depending on the model’s batch size (N) and the expected input rate.

Submode Behaviour

Non-Cumulative: In non-cumulative submode, incoming buffers are processed immediately upon receipt, regardless of whether the number of image memory blocks or ROI metadata entries meets the model’s specified tensor batch size (N). This approach is recommended when the number of multiplexed streams and/or ROI metadata is not expected to exceed the batch size (N) of the model. Any ROI metadata or multiplexed GstMemory blocks exceeding the batch size (N) are discarded. A potential drawback of this submode is that if the batch is not fully populated - for example, if the batch size is set to N=4 but only three positions are filled - the inference plugin will still process the entire batch, resulting in resource inefficiency due to unutilized positions. However, the primary advantage of non-cumulative submode is the elimination of processing latency: buffers are not held back to accumulate additional inputs to fulfill the batch size requirement. This ensures prompt processing and rapid generation of prediction results, making it suitable for real-time applications where minimal delay is critical.
Cumulative: In cumulative submode, incoming buffers are aggregated until the number of ROI metadata entries and/or multiplexed GstMemory blocks meets the model’s required tensor batch size (N). Accumulation may be interrupted prematurely if a GAP buffer is received, or, in ROI batch modes, if the incoming buffer contains no ROI metadata. This accumulation strategy introduces variable latency into the processing pipeline, influenced by factors such as frame production intervals and the number of ROIs or multiplexed images received per buffer. The principal advantage of cumulative submode is that all incoming ROI metadata and multiplexed GstMemory blocks are processed, ensuring that none are discarded due to batch size constraints. This mode is particularly recommended when the model’s batch size (N) exceeds one and inference processing time is sufficiently low, allowing for efficient utilization of resources and comprehensive processing of available input regions.
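The difference between the two submodes can be simulated in a few lines. The sketch below is a simplified model, not the plugin's code: each incoming buffer is represented as a list of entries (ROI metas or memory blocks), an empty list stands for a GAP buffer or a buffer without ROIs, and carrying leftover entries into the next batch in cumulative mode is an assumption based on the statement that no entries are discarded.

```python
def non_cumulative(buffers, batch_size):
    """One batch per buffer; entries beyond batch_size are discarded."""
    # An empty batch here stands for the GAP buffer produced in ROI mode.
    return [list(entries[:batch_size]) for entries in buffers]


def cumulative(buffers, batch_size):
    """Accumulate entries across buffers until batch_size is reached."""
    batches, pending = [], []
    for entries in buffers:
        if not entries:
            # GAP buffer or no ROIs: stop accumulating and flush early.
            if pending:
                batches.append(pending)
                pending = []
            continue
        pending.extend(entries)
        while len(pending) >= batch_size:
            batches.append(pending[:batch_size])
            pending = pending[batch_size:]
    return batches  # entries still pending are waiting for more input


buffers = [["a", "b", "c"], [], ["d", "e"]]
print(non_cumulative(buffers, 2))  # [['a', 'b'], [], ['d', 'e']]
print(cumulative(buffers, 2))      # [['a', 'b'], ['c'], ['d', 'e']]
```

In the non-cumulative result entry "c" is lost because it exceeds the batch size, while the cumulative result keeps it but emits it later, in a partially filled batch triggered by the empty buffer.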

Preprocessing Metadata

In some cases, post-processing requires information about how each input frame was pre-processed - such as its placement within the input tensor, its dimensions, or which frames in a batch are valid. To support this, qtimlvconverter attaches metadata describing the preprocessing details to each tensor using GstProtectionMeta. The inference plugin then propagates this metadata from its input to its output, enabling the post-processing plugin to access the necessary information about how each frame was handled during preprocessing. The GstProtectionMeta includes the following fields:

Input Tensor Dimensions:
- input-tensor-width [G_TYPE_UINT] - Specifies the width (in pixels) of the tensor that the frame was mapped into. This is the final width after any resizing operations performed by qtimlvconverter.
- input-tensor-height [G_TYPE_UINT] - Specifies the height (in pixels) of the tensor after preprocessing. This value reflects the target model’s expected input dimensions.
Region Occupied by Actual Data within the Input Tensor:
- input-region-x [G_TYPE_INT] - The X-coordinate (horizontal offset) of the region within the tensor where the actual image data is placed. Useful for determining padding or positioning when aspect ratio is preserved.
- input-region-y [G_TYPE_INT] - The Y-coordinate (vertical offset) of the region within the tensor where the image data starts.
- input-region-width [G_TYPE_INT] - The width of the actual image content inside the tensor. This may differ from input-tensor-width if padding was applied.
- input-region-height [G_TYPE_INT] - The height of the actual image content inside the tensor. Indicates how much of the tensor is occupied by real image data.
Batch Sequence Information:
- sequence-index [G_TYPE_UINT] - The index of this entry within the current batch. For example, in a batch of size 4, valid values are 0–3.
- sequence-num-entries [G_TYPE_UINT] - The total number of entries in the batch. This helps post-processing plugins understand the batch context.
Timestamp of the buffer:
- timestamp [G_TYPE_UINT64] - The timestamp of the buffer when it was processed by qtimlvconverter. Used for synchronization and latency measurements.
The stream ID from which this batch entry was produced:
- stream-id [G_TYPE_INT] [Optional] - Identifies the source stream that produced this batch entry. Useful in multi-stream pipelines for correlating inference results with their origin.
Timestamp of the stream buffer from which this batch entry was produced:
- stream-timestamp [G_TYPE_UINT64] [Optional] - The original timestamp of the frame from the source stream before preprocessing. Preserves temporal context for tracking or analytics.
The ID of the ROI meta from which this batch entry was produced:
- source-region-id [G_TYPE_INT] [Optional] - If the frame was derived from an ROI, this field contains the ID of the GstVideoRegionOfInterestMeta entry that defined the crop region. Enables downstream components to link inference results back to the original ROI.
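A typical use of the region fields is mapping a model output from tensor coordinates back to the source frame. The sketch below is illustrative only: it represents the metadata as a plain Python dict keyed by the field names listed above (in a real pipeline they live in a GstProtectionMeta structure), the helper name is hypothetical, and it assumes the whole source frame was scaled into the region and that the caller knows the source resolution.

```python
def tensor_to_source(x, y, meta, src_w, src_h):
    """Map a point from tensor coordinates to source-frame coordinates."""
    # Remove the offset of the image region inside the tensor ...
    rel_x = (x - meta["input-region-x"]) / meta["input-region-width"]
    rel_y = (y - meta["input-region-y"]) / meta["input-region-height"]
    # ... then scale the relative position to the source resolution.
    return (rel_x * src_w, rel_y * src_h)


# A 1920x1080 frame placed with aspect ratio preserved and centred
# in a 640x640 tensor occupies the region (0, 140, 640, 360).
meta = {
    "input-tensor-width": 640,
    "input-tensor-height": 640,
    "input-region-x": 0,
    "input-region-y": 140,
    "input-region-width": 640,
    "input-region-height": 360,
}
print(tensor_to_source(320, 320, meta, 1920, 1080))  # (960.0, 540.0)
```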

Usage

Single Camera Stream — Save Tensors to File

Single camera stream with manually set UINT8 ML GstCaps and output tensors saved in separate files. Common data types:

UINT8 — typical for quantized models, range 0–255
INT8 — used in signed quantized models, range −128 to 127
FLOAT16 / FLOAT32 — for models requiring high precision, normalized to 0.0–1.0 or −1.0–1.0 (requires offset and scale)

Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>

Create the output directory

Run the command below on your device:

mkdir -p $HOME/media/output

Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
qticamsrc ! video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! queue ! \
qtimlvconverter ! neural-network/tensors,type=UINT8,dimensions="<<1,300,300,3>>" ! multifilesink location=$HOME/media/output/tensor_u8_%d.bin
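A quick way to sanity-check the saved files is to compare their size with the size implied by the negotiated caps. The sketch below assumes each file holds the raw tensor bytes with no header or padding, which is not confirmed here; the helper and the lookup table are hypothetical.

```python
import math

# Bytes per element for the data types listed above
BYTES_PER_ELEMENT = {"UINT8": 1, "INT8": 1, "FLOAT16": 2, "FLOAT32": 4}


def tensor_size_bytes(dimensions, data_type):
    """Size in bytes of a densely packed tensor with the given dimensions."""
    return math.prod(dimensions) * BYTES_PER_ELEMENT[data_type]


# Caps from the pipeline above: type=UINT8, dimensions=<<1,300,300,3>>
print(tensor_size_bytes((1, 300, 300, 3), "UINT8"))    # 270000
print(tensor_size_bytes((1, 300, 300, 3), "FLOAT32"))  # 1080000
```

Under that assumption, each tensor_u8_%d.bin file from this pipeline should be 270000 bytes.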

Two-Stage Person Detection and Pose Estimation on Live Camera Stream

Demo pipeline with 2 ML stages running on a live camera stream. The first stage performs person detection and attaches the results to each frame. qtiobjtracker then associates detected persons across frames and adds persistent tracking IDs. The second stage uses ROI-based preprocessing to crop each tracked person and runs pose estimation, producing skeleton keypoints overlaid on the display.

Download Required Files

File	Download	Save as
Person foot detection model	Qualcomm AI Hub — Person Foot Detection	`foot_track_net_w8a8.tflite`
Person detection labels	foot_track_net.json	`foot_track_net.json`
Foot track net settings	foot_track_net_settings.json	`foot_track_net_settings.json`
HRNet pose model	Qualcomm AI Hub — HRNet Pose	`hrnet_pose_w8a8.tflite`
Pose labels	hrnet.json	`hrnet.json`
HRNet settings	hrnet_settings.json	`hrnet_settings.json`

If any downloaded file is a .zip archive, extract it on your host machine before copying: unzip filename.zip

Copy files to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels}"
scp foot_track_net_w8a8.tflite         <user>@<device-ip>:$HOME/models/
scp foot_track_net.json                <user>@<device-ip>:$HOME/labels/
scp foot_track_net_settings.json       <user>@<device-ip>:$HOME/labels/
scp hrnet_pose_w8a8.tflite             <user>@<device-ip>:$HOME/models/
scp hrnet.json                         <user>@<device-ip>:$HOME/labels/
scp hrnet_settings.json                <user>@<device-ip>:$HOME/labels/

Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>

Set environment variables

Run the commands below on your device:

export MODEL_NAME_1=foot_track_net_w8a8.tflite
export LABELS_NAME_1=foot_track_net.json
export LABELS_NAME_2=foot_track_net_settings.json
export MODEL_NAME_2=hrnet_pose_w8a8.tflite
export LABELS_NAME_3=hrnet.json
export LABELS_NAME_4=hrnet_settings.json

Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
qtimlvconverter name=stage_01_preproc mode=image-batch-non-cumulative \
qtimltflite name=stage_01_inference delegate=external external-delegate-path=libQnnTFLiteDelegate.so \
external-delegate-options="QNNExternalDelegate,backend_type=htp,log_level=(string)1;" \
model=$HOME/models/$MODEL_NAME_1 \
qtimlpostprocess name=stage_01_postproc results=10 module=qpd labels=$HOME/labels/$LABELS_NAME_1 \
settings=$HOME/labels/$LABELS_NAME_2 \
qtimlvconverter name=stage_02_preproc image-disposition=centre mode=roi-batch-cumulative \
qtimltflite name=stage_02_inference delegate=external external-delegate-path=libQnnTFLiteDelegate.so \
external-delegate-options="QNNExternalDelegate,backend_type=htp,htp_performance_mode=(string)2,log_level=(string)1;" \
model=$HOME/models/$MODEL_NAME_2 \
qtimlpostprocess name=stage_02_postproc results=1 module=hrnet labels=$HOME/labels/$LABELS_NAME_3 \
settings=$HOME/labels/$LABELS_NAME_4 \
qticamsrc ! video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! queue ! tee name=t_split_1 \
t_split_1. ! queue ! metamux_1. \
t_split_1. ! queue ! stage_01_preproc. stage_01_preproc. ! queue ! stage_01_inference. stage_01_inference. ! queue ! \
stage_01_postproc. stage_01_postproc. ! text/x-raw ! queue ! metamux_1. \
qtimetamux name=metamux_1 ! queue ! tee name=t_split_2 \
t_split_2. ! queue ! metamux_2. \
t_split_2. ! queue ! stage_02_preproc. stage_02_preproc. ! queue ! stage_02_inference. stage_02_inference. ! queue ! \
stage_02_postproc. stage_02_postproc. ! text/x-raw ! queue ! metamux_2. \
qtimetamux name=metamux_2 ! queue ! qtivoverlay ! queue ! waylandsink fullscreen=true sync=false async=false

Four-Source Batched Object Detection with Compositor

Demo pipeline that batches four video sources. A single detection inference runs on the batch, and the per-source results are overlaid on the corresponding video tiles by the compositor (qtivcomposer) before display.

Download Required Files

File	Download	Save as
Yolov8 Detection W8A8 Batch 4 model	Export from Qualcomm AI Hub	`yolov8_det_w8a8_batch_4.tflite`
Detection labels	yolov8.json	`yolov8.json`
Sample video	Input video	`ai_demo_sample.mp4`

Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.
# Run from your host machine — replace <user> and <device-ip>

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media,media/output}"
scp yolov8_det_w8a8_batch_4.tflite     <user>@<device-ip>:$HOME/models/
scp yolov8.json                        <user>@<device-ip>:$HOME/labels/
scp ai_demo_sample.mp4                 <user>@<device-ip>:$HOME/media/

Connect to device

# Run from your host machine — replace <user> and <device-ip>
ssh <user>@<device-ip>

Set environment variables

Run the commands below on your device.

export MODEL_NAME=yolov8_det_w8a8_batch_4.tflite
export LABELS_NAME=yolov8.json
export SRC_VIDEO_NAME_1=ai_demo_sample.mp4
export SRC_VIDEO_NAME_2=ai_demo_sample.mp4
export SRC_VIDEO_NAME_3=ai_demo_sample.mp4
export SRC_VIDEO_NAME_4=ai_demo_sample.mp4

Run the pipeline

gst-launch-1.0 -e --gst-debug=2 \
qtimltflite name=inference delegate=external external-delegate-path=libQnnTFLiteDelegate.so external-delegate-options="QNNExternalDelegate,backend_type=htp,htp_performance_mode=(string)2;" model=$HOME/models/$MODEL_NAME \
filesrc location=$HOME/media/$SRC_VIDEO_NAME_1 ! qtdemux ! queue ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw ! qtivtransform ! video/x-raw,format=NV12,width=640,height=360 ! queue ! tee name=tee_0 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME_2 ! qtdemux ! queue ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw ! qtivtransform ! video/x-raw,format=NV12,width=640,height=360 ! queue ! tee name=tee_1 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME_3 ! qtdemux ! queue ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw ! qtivtransform ! video/x-raw,format=NV12,width=640,height=360 ! queue ! tee name=tee_2 \
filesrc location=$HOME/media/$SRC_VIDEO_NAME_4 ! qtdemux ! queue ! h264parse ! v4l2h264dec capture-io-mode=4 output-io-mode=4 ! video/x-raw ! qtivtransform ! video/x-raw,format=NV12,width=640,height=360 ! queue ! tee name=tee_3 \
tee_0. ! video/x-raw,format=NV12 ! mixer. \
tee_0. ! video/x-raw,format=NV12 ! batch. \
tee_1. ! video/x-raw,format=NV12 ! mixer. \
tee_1. ! video/x-raw,format=NV12 ! batch. \
tee_2. ! video/x-raw,format=NV12 ! mixer. \
tee_2. ! video/x-raw,format=NV12 ! batch. \
tee_3. ! video/x-raw,format=NV12 ! mixer. \
tee_3. ! video/x-raw,format=NV12 ! batch. \
qtibatch name=batch ! queue ! qtimlvconverter ! queue ! inference. inference. ! queue ! qtimldemux name=mldemux \
mldemux. ! queue ! qtimlpostprocess results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME settings="{\"confidence\": 70.0}" ! video/x-raw,width=640,height=360 ! queue ! mixer. \
mldemux. ! queue ! qtimlpostprocess results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME settings="{\"confidence\": 70.0}" ! video/x-raw,width=640,height=360 ! queue ! mixer. \
mldemux. ! queue ! qtimlpostprocess results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME settings="{\"confidence\": 70.0}" ! video/x-raw,width=640,height=360 ! queue ! mixer. \
mldemux. ! queue ! qtimlpostprocess results=10 module=yolov8 labels=$HOME/labels/$LABELS_NAME settings="{\"confidence\": 70.0}" ! video/x-raw,width=640,height=360 ! queue ! mixer. \
qtivcomposer name=mixer \
sink_0::position="<0, 0>" sink_0::dimensions="<960, 540>" \
sink_1::position="<960, 0>" sink_1::dimensions="<960, 540>" \
sink_2::position="<0, 540>" sink_2::dimensions="<960, 540>" \
sink_3::position="<960, 540>" sink_3::dimensions="<960, 540>" \
sink_4::position="<0, 0>" sink_4::dimensions="<960, 540>" \
sink_5::position="<960, 0>" sink_5::dimensions="<960, 540>" \
sink_6::position="<0, 540>" sink_6::dimensions="<960, 540>" \
sink_7::position="<960, 540>" sink_7::dimensions="<960, 540>" \
mixer. ! video/x-raw,format=NV12 ! queue ! waylandsink sync=false fullscreen=true
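
The eight `sink_N` entries in the pipeline above follow a regular pattern: a 2x2 grid of 960x540 tiles on a 1920x1080 display, repeated once for the four video pads and once for the four overlay pads so that each overlay lands on its own video tile. As a sketch of that arithmetic, the hypothetical helper `composer_grid` below (an illustration, not part of the plugin or SDK) prints the property strings for a given display size, grid shape, and number of layers. It assumes the display size divides evenly by the grid and that pads are requested in the same order as in the pipeline above.

```shell
# Hypothetical helper: print qtivcomposer sink pad properties for a grid layout.
# Arguments: display width, display height, columns, rows, layers.
# Every layer reuses the same tile positions, so pad i and pad i + (cols * rows)
# cover the same region of the screen.
composer_grid() {
    _w=$1 _h=$2 _cols=$3 _rows=$4 _layers=$5
    _tile_w=$((_w / _cols))
    _tile_h=$((_h / _rows))
    _tiles=$((_cols * _rows))
    _i=0
    while [ "$_i" -lt $((_tiles * _layers)) ]; do
        _cell=$((_i % _tiles))                   # tile index within the layer
        _x=$(( (_cell % _cols) * _tile_w ))      # column offset in pixels
        _y=$(( (_cell / _cols) * _tile_h ))      # row offset in pixels
        printf 'sink_%d::position="<%d, %d>" sink_%d::dimensions="<%d, %d>"\n' \
            "$_i" "$_x" "$_y" "$_i" "$_tile_w" "$_tile_h"
        _i=$((_i + 1))
    done
}

# 1920x1080 display, 2 columns, 2 rows, 2 layers (video + overlay)
composer_grid 1920 1080 2 2 2
```

With these arguments the helper prints eight lines, `sink_0` through `sink_7`, with the same positions and dimensions as the hand-written properties in the pipeline. Changing the grid shape (for example 3 columns by 3 rows for nine sources) would also require matching changes to the number of sources and to the batch size of the model.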

