Building a Real-Time Hand Gesture Recognition AI Pipeline Using the QIM SDK

Build a four-stage real-time hand gesture recognition pipeline using the Qualcomm IM SDK, covering palm detection, hand landmark estimation, gesture embedding, and gesture classification, all running on-device with hardware-accelerated inference.

QIMSDK Team · Jun 10, 2026

Introduction

Real-time gesture recognition is redefining how humans interact with machines, enabling touchless, natural interfaces across augmented reality, gaming, accessibility, and industrial control. But accurately interpreting hand gestures demands more than simple object detection; it requires a multi-stage AI pipeline capable of progressively refining raw visual input into high-level, actionable intent.

The QIM SDK brings this capability directly to the edge. By routing compute-intensive tasks through hardware-accelerated GStreamer plugins, the SDK offloads video decoding, frame preparation, multi-stage inference, and encoding entirely to dedicated hardware blocks, delivering low-latency, power-efficient execution even for complex multi-model workloads.

At the core of this use case is a four-stage sequential inference pipeline:

Stage 1: Palm Detection
Stage 2: Hand Landmark Estimation
Stage 3: Gesture Embedding
Stage 4: Gesture Classification

Each stage builds directly on the output of the previous one, progressively transforming raw video frames into structured, high-level gesture data. Between stages, intermediate transformations, such as rotating and cropping detected hand regions for correct spatial alignment, keep results consistent and accurate across varying hand positions and orientations.

Metadata is hierarchically structured throughout the pipeline. Each detected hand establishes a root metadata entry, with landmark detections, embeddings, and gesture classifications attached as child metadata linked to the originating detection. Visualization is fully hardware-accelerated: bounding boxes, keypoints, and gesture labels are composited directly onto video frames via optimized overlay rendering.

The pipeline accepts input from USB cameras, RTSP streams, ISP (on-device) cameras, and local video files. It delivers results through real-time on-screen visualization or remote streaming over RTSP and WebRTC, with structured inference metadata transmitted in parallel. The complete application source code is linked under Build the Application.

Use Case Overview

Source

The pipeline accepts continuous video input from a USB camera, RTSP stream, ISP (on-device) camera, or local video file.

Palm Detection

Each incoming frame is processed by the palm detection model, which identifies the presence, location, and rotation angle of any hands in the scene.

Hand Landmark Detection

For each detected hand, the hand region is first rotated and cropped via an affine transformation for correct spatial alignment. The landmark model then identifies 21 keypoints (fingertips, joints, and wrist), capturing the full structural pose.

Gesture Embedding

The detected landmarks are encoded into a compact numerical representation that summarizes the hand pose in a form optimized for classification.

Gesture Classification

The embedding is passed to the gesture classification model, which maps the hand pose to a predefined gesture label — such as open hand, fist, or thumbs up.

Metadata Synchronization

Results from all inference stages are structured hierarchically. qtimetamux synchronizes this structured metadata with the original video frames, maintaining per-frame consistency throughout the pipeline.

Output

Annotated frames are H.264-encoded and delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.

Pipeline diagram

Elements used in pipeline

Element	Description
`source`	Accepts video input from a USB camera, ISP camera, RTSP stream, or local video file.
`tee`	Splits the stream into multiple parallel branches for simultaneous processing.
`qtimlvconverter`	Hardware-accelerated resize, YUV→RGB conversion, and pixel normalization to meet each model’s input requirements.
`qtimltflite`	Executes TFLite inference models, producing raw output tensors.
`qtimlpostprocess`	Decodes raw tensors into structured bounding boxes, keypoints, labels, and confidence scores via dynamically loaded modules.
`qtimetamux`	Synchronizes inference results with the original video stream as structured per-frame metadata.
`qtimetatransform`	Transforms metadata as it flows through the pipeline — modifying coordinate systems to ensure compatibility with downstream elements.
`qtivoverlay`	Composites bounding boxes, keypoints, and labels onto video frames using hardware-accelerated overlay rendering.
`qtimlmetaparser`	Serializes per-frame inference metadata into JSON for integration with external systems.
`v4l2h264enc` / `h264parse`	Hardware-accelerated H.264 encoding of the processed video stream (`v4l2h264enc`) and parsing of the encoded stream (`h264parse`).
`waylandsink`	Renders the output to the local display via the Wayland compositor.

How It Works

Stage 1 — Palm Detection

The first model processes the full video frame, identifying the location, orientation, and bounding box of each detected hand. This stage also produces key points used to estimate the hand’s rotation angle. A qtimetatransform element uses this information to compute an affine transformation matrix, which is attached as metadata and carried forward to the next stage.

Affine Crop Generation

The hand landmark model requires a normalized, upright view of the hand — not raw bounding box coordinates. A dedicated qtimlvconverter instance consumes the affine transformation matrix and applies it to crop and rotate the detected hand region from the original frame, producing a correctly aligned input for the next stage.

Stage 2 — Hand Landmark Detection

The landmark model is invoked once per detected hand, processing the cropped and aligned region to produce 21 keypoints — fingertips, joints, and wrist — along with handedness and confidence scores. A post-processing step then applies a reverse transformation to remap the keypoints accurately back onto the original frame coordinate space.
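The forward and reverse mappings described above can be sketched in a few lines of Python. This is a minimal illustration, not the SDK's implementation: it assumes the nine-element `affine-matrix` reported in the JSON metadata is a row-major 3×3 matrix, and the post does not state which direction it maps (frame to crop or crop to frame). The helper names `apply_affine` and `invert_affine` and the sample point are hypothetical.

```python
import math

# Row-major 3x3 matrix copied from the sample metadata in this post.
# Assumption: it maps 2D points as [x', y', 1]^T = M @ [x, y, 1]^T.
M = [0.42060701741537004, 1.2300771264034409, 38.53241177825402,
     -1.2300771264034409, 0.42060701741537004, 1313.4940252711156,
     0.0, 0.0, 1.0]


def apply_affine(m, x, y):
    """Map the point (x, y) through a row-major 3x3 affine matrix."""
    return (m[0] * x + m[1] * y + m[2],
            m[3] * x + m[4] * y + m[5])


def invert_affine(m):
    """Return the inverse of an affine matrix whose last row is 0, 0, 1."""
    a, b, tx, c, d, ty = m[:6]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # Inverse translation is -(A^-1 @ t).
    return [ia, ib, -(ia * tx + ib * ty),
            ic, id_, -(ic * tx + id_ * ty),
            0.0, 0.0, 1.0]


# For this sample the 2x2 part has the form [[a, b], [-b, a]],
# i.e. a rotation combined with a uniform scale.
scale = math.hypot(M[0], M[3])
angle_deg = math.degrees(math.atan2(M[3], M[0]))

# Round trip: the forward mapping followed by the inverse returns the input.
point = (100.0, 200.0)  # hypothetical point
mapped = apply_affine(M, *point)
restored = apply_affine(invert_affine(M), *mapped)
print(f"scale={scale:.3f} angle={angle_deg:.1f} restored={restored}")
```

The inverse step mirrors what the landmark post-processing has to do conceptually: keypoints predicted in the aligned crop are pushed back through the opposite transform so they line up with the original frame.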

Stage 3 — Gesture Embedding

The landmark keypoints are passed to the gesture embedder, which encodes the hand pose into a compact numerical embedding vector. This stage operates directly on tensor data — no image cropping or geometric transformation is required.

Stage 4 — Gesture Classification

The embedding vector is passed to the gesture classifier, which maps the hand pose representation to a predefined gesture label — such as open hand, fist, or thumbs up. Like the embedder, this stage operates purely on tensors with no additional preprocessing.

Hierarchical Metadata

To preserve logical relationships across all four stages, the pipeline employs a hierarchical metadata model based on unique IDs and parent IDs. The palm detection stage creates a root metadata entry for each detected hand; each subsequent stage — landmark detection, embedding, and classification — attaches its results as child metadata referencing the parent. This ensures every classified gesture is explicitly traceable to its originating hand detection.
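The ID and parent-ID scheme can be pictured with a small Python sketch. The field names (`id`, `parent_id`, `kind`, `value`) and the sample entries are hypothetical and chosen only for illustration; they are not the SDK's actual metadata schema.

```python
# Hypothetical flat metadata entries for one frame. Field names are
# illustrative only and do not reflect the SDK's real structures.
entries = [
    {"id": 1, "parent_id": None, "kind": "palm", "value": "palm"},
    {"id": 2, "parent_id": 1, "kind": "landmarks", "value": "21 keypoints"},
    {"id": 3, "parent_id": 1, "kind": "embedding", "value": "vector"},
    {"id": 4, "parent_id": 1, "kind": "classification", "value": "Open Palm"},
]


def build_tree(entries):
    """Group entries into root nodes with nested children, keyed by ID."""
    by_id = {e["id"]: {**e, "children": []} for e in entries}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_id"])
        if parent is None:
            roots.append(node)          # no known parent: treat as a root
        else:
            parent["children"].append(node)
    return roots, by_id


def root_of(entry_id, by_id):
    """Follow parent links upward until the originating detection is reached."""
    node = by_id[entry_id]
    while node["parent_id"] is not None:
        node = by_id[node["parent_id"]]
    return node


roots, by_id = build_tree(entries)
print(root_of(4, by_id)["kind"])                     # kind of the root entry
print([child["kind"] for child in roots[0]["children"]])
```

Walking parent links upward is what makes a classified gesture traceable to the hand that produced it, even if several detections were present in the same frame.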

Output

Annotated frames with bounding boxes, keypoints, and gesture labels are rendered via hardware-accelerated overlay. The H.264-encoded stream is delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.

Run application on device

Setup Requirements

Hardware

Component	Description
Edge Device	RB3 Gen 2, IQ8, or IQ9 — Primary processing unit for AI inference and video composition.
Camera Source	IP/RTSP camera, ISP (on-device) camera, or USB camera. A local file source may be substituted if no physical camera is available.
HDMI Display Monitor	Connected to the edge device for rendering and visualizing pipeline output.
PoE Switch	Powers IP/RTSP cameras and provides network connectivity over a single Ethernet cable per camera. (Required for IP/RTSP camera setups only.)
Local Network	Ensures the edge device, RTSP camera, and host machine are reachable on the same network. (Required when using RTSP camera input or streaming results via RTSP or WebRTC.)

Software

Flash your Qualcomm edge device by following the device setup and flashing instructions. Once your device is ready, follow the steps below to set up the Gesture Recognition AI Pipeline:

AI Model and config files

Download the gesture recognizer models from Google MediaPipe:

# Download the gesture recognizer task bundle
wget https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task

# Extract the top-level task bundle
# → hand_landmarker.task, hand_gesture_recognizer.task
unzip gesture_recognizer.task

# Extract the hand landmarker models and rename them to the names used in this guide
unzip hand_landmarker.task
mv hand_detector.tflite palm_detection.tflite
mv hand_landmarks_detector.tflite hand_landmark.tflite

# Extract the gesture recognizer models
# → gesture_embedder.tflite, canned_gesture_classifier.tflite
unzip hand_gesture_recognizer.task

These are floating-point models (the float16 variant, per the download URL).

File	Download	Save as
Palm detection model	See download steps above	`palm_detection.tflite`
Hand landmark model	See download steps above	`hand_landmark.tflite`
Gesture embedder model	See download steps above	`gesture_embedder.tflite`
Gesture classifier model	See download steps above	`canned_gesture_classifier.tflite`
Palm detection labels	palmd_labels.json	`palmd_labels.json`
Palm detection settings	palmd_settings.json	`palmd_settings.json`
Hand landmark labels	hlandmark_labels.json	`hlandmark_labels.json`
Hand landmark settings	hlandmark_settings.json	`hlandmark_settings.json`
Gesture labels	gesture_labels.json	`gesture_labels.json`
Sample video	Input video	`video.mp4`

Copy files to device

# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# $HOME inside the double-quoted ssh command is expanded on the host, not on the device,
# so set it explicitly and make sure the files land in the correct location on the device.

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media}"
scp palm_detection.tflite              <user>@<device-ip>:$HOME/models/
scp hand_landmark.tflite               <user>@<device-ip>:$HOME/models/
scp gesture_embedder.tflite            <user>@<device-ip>:$HOME/models/
scp canned_gesture_classifier.tflite   <user>@<device-ip>:$HOME/models/
scp palmd_labels.json                  <user>@<device-ip>:$HOME/labels/
scp palmd_settings.json                <user>@<device-ip>:$HOME/labels/
scp hlandmark_labels.json              <user>@<device-ip>:$HOME/labels/
scp hlandmark_settings.json            <user>@<device-ip>:$HOME/labels/
scp gesture_labels.json                <user>@<device-ip>:$HOME/labels/
scp video.mp4                          <user>@<device-ip>:$HOME/media/

Connect to device

ssh <user>@<device-ip>

Run the Gesture Recognition Application

A display must be connected to the device. If no display is available, use the --no-display flag to run in headless mode.

Use the following base path for model and label files (the --model-base-path option), based on your OS: /root for QLI, /home/ubuntu for Ubuntu.

USB camera

gst-gesture-recognition \
  --input-type=usb \
  --input-config=/dev/video0 \
  --output-type=rtsp \
  --output-config=8900

RTSP camera

gst-gesture-recognition \
  --input-type=rtsp \
  --input-config=rtsp://<ip>:<port>/stream \
  --output-type=rtsp \
  --output-config=8900

File input

gst-gesture-recognition \
  --input-type=file \
  --input-config=$HOME/media/video.mp4 \
  --output-type=rtsp \
  --output-config=8900

Headless (no display)

gst-gesture-recognition \
  --input-type=file \
  --input-config=$HOME/media/video.mp4 \
  --output-type=rtsp \
  --output-config=8900 \
  --no-display

Note: This example uses an offline video file as input. To use an IP/RTSP camera or USB camera instead, update the --input-type argument accordingly — refer to the Command-Line Options section below for details.

The application produces two outputs: an AI-annotated video stream and a JSON metadata stream. To visualize them, refer to the Host-Side Visualization section below.

Visualize the Results - Host-Side Visualization (Windows + WSL)

This section describes how to run the visualization client on a Windows host machine using WSL (Windows Subsystem for Linux). The client renders the live video stream alongside a real-time AI metadata panel. The visualization client script can be downloaded as rtsp_webrtc_client.zip.

It displays:

Left panel — Live video stream with AI overlays (bounding boxes, keypoints, gesture labels)
Right panel — Real-time AI metadata (JSON): object detections, bounding boxes, and confidence scores.

Step 1: Install WSL and Ubuntu

If WSL is not already installed, run the following from a Windows terminal:

wsl --install Ubuntu-24.04

Once installed, open the Ubuntu terminal and update the system:

sudo apt update && sudo apt upgrade -y

Step 2: Install System Dependencies

The visualization script requires GStreamer and Python GObject Introspection (GI) bindings. Install all required packages with:

sudo apt install -y \
  python3 python3-pip python3-gi python3-gi-cairo \
  gir1.2-gstreamer-1.0 \
  gir1.2-gst-plugins-base-1.0 \
  gir1.2-gst-plugins-bad-1.0 \
  gstreamer1.0-tools \
  gstreamer1.0-plugins-base \
  gstreamer1.0-plugins-good \
  gstreamer1.0-plugins-bad \
  gstreamer1.0-plugins-ugly \
  gstreamer1.0-libav \
  python3-websocket \
  libnice10 \
  libnice-dev \
  gstreamer1.0-nice

Step 3: Run the Visualization Client Script

Navigate to the directory containing the script and run:

RTSP

python3 rtsp_webrtc_client.py rtsp://<DEVICE_IP>:8900/live

WebRTC

python3 rtsp_webrtc_client.py --source webrtc --signalling-server wss://webrtc.nirbheek.in:8443 --peer-id 1010

Step 4: Expected Output

Once the client connects, the UI will display:

Panel	Content
Left	Real-time decoded video stream with gesture overlays
Right	Live AI metadata — detected hands, keypoints, and gesture labels

After following these steps, the video and metadata streams should be up and running.

The Gesture Recognition AI Pipeline is configured for single-hand gesture recognition, the most common interaction scenario, where one dominant gesture is performed at a time. This keeps computational overhead low while maintaining reliable detection and classification. When multiple hands are present, the pipeline prioritizes the detection with the highest confidence score, which keeps gesture recognition stable in dynamic conditions.

The pipeline generates structured JSON metadata in the following format:

{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94279479980469,
      "color": 16711935,
      "rectangle": {
        "x": 0.3484375,
        "y": 0.15555555555555556,
        "width": 0.22708333333333333,
        "height": 0.40370370370370373
      },
      "xtraparams": {
        "affine-matrix": [
          0.42060701741537004,
          1.2300771264034409,
          38.53241177825402,
          -1.2300771264034409,
          0.42060701741537004,
          1313.4940252711156,
          0.0,
          0.0,
          1.0
        ]
      },
      "video_landmarks": [
        {
          "keypoints": [
            { "keypoint": "wrist", "x": 0.5411458333333333, "y": 0.42407407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb cmc", "x": 0.5432291666666667, "y": 0.3648148148148148, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb mcp", "x": 0.5359375, "y": 0.3037037037037037, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb ip", "x": 0.5348958333333333, "y": 0.24537037037037038, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb tip", "x": 0.5395833333333333, "y": 0.1925925925925926, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "index finger mcp", "x": 0.47760416666666666, "y": 0.32962962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger pip", "x": 0.43854166666666666, "y": 0.29907407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger dip", "x": 0.4109375, "y": 0.2814814814814815, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger tip", "x": 0.3848958333333333, "y": 0.26666666666666666, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "middle finger mcp", "x": 0.4635416666666667, "y": 0.37407407407407406, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger pip", "x": 0.41822916666666665, "y": 0.36203703703703705, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger dip", "x": 0.3880208333333333, "y": 0.35648148148148145, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger tip", "x": 0.36041666666666666, "y": 0.3537037037037037, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "ring finger mcp", "x": 0.46197916666666666, "y": 0.41759259259259257, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger pip", "x": 0.4192708333333333, "y": 0.4212962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger dip", "x": 0.39114583333333336, "y": 0.425, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger tip", "x": 0.36614583333333334, "y": 0.42777777777777776, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "pinky mcp", "x": 0.46979166666666666, "y": 0.45462962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky pip", "x": 0.43802083333333336, "y": 0.475, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky dip", "x": 0.41822916666666665, "y": 0.49074074074074076, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky tip", "x": 0.39947916666666666, "y": 0.5027777777777778, "confidence": 99.70703125, "color": 16711935 }
          ],
          "links": [
            { "start": 0, "end": 17 },
            { "start": 1, "end": 0 },
            { "start": 2, "end": 1 },
            { "start": 3, "end": 2 },
            { "start": 4, "end": 3 },
            { "start": 5, "end": 0 },
            { "start": 6, "end": 5 },
            { "start": 7, "end": 6 },
            { "start": 8, "end": 7 },
            { "start": 9, "end": 5 },
            { "start": 10, "end": 9 },
            { "start": 11, "end": 10 },
            { "start": 12, "end": 11 },
            { "start": 13, "end": 9 },
            { "start": 14, "end": 13 },
            { "start": 15, "end": 14 },
            { "start": 16, "end": 15 },
            { "start": 17, "end": 13 },
            { "start": 18, "end": 17 },
            { "start": 19, "end": 18 },
            { "start": 20, "end": 19 }
          ]
        }
      ],
      "image_classification": [
        {
          "label": "Open Palm",
          "confidence": 0.7060546875,
          "color": 4294902015
        }
      ]
    }
  ],
  "parameters": {
    "timestamp": "28356149027"
  }
}
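A client that wants to draw the hand skeleton itself needs pixel coordinates and line segments. The sketch below shows one way to derive them from the metadata above. It rests on two assumptions that the post does not state explicitly: that `x` and `y` are normalized to the frame size (the sample values are consistent with a 1920×1080 frame), and that `start` and `end` in `links` are indices into the `keypoints` array. Only three keypoints and two links from the sample are reproduced here, and the function names are hypothetical.

```python
FRAME_W, FRAME_H = 1920, 1080  # assumed frame size

# First three keypoints and the links between them, copied from the sample.
keypoints = [
    {"keypoint": "wrist", "x": 0.5411458333333333, "y": 0.42407407407407405},
    {"keypoint": "thumb cmc", "x": 0.5432291666666667, "y": 0.3648148148148148},
    {"keypoint": "thumb mcp", "x": 0.5359375, "y": 0.3037037037037037},
]
links = [{"start": 1, "end": 0}, {"start": 2, "end": 1}]


def to_pixels(keypoint, width, height):
    """Convert a normalized keypoint to integer pixel coordinates."""
    return (round(keypoint["x"] * width), round(keypoint["y"] * height))


def skeleton_segments(keypoints, links, width, height):
    """Return one (start_point, end_point) pixel pair per link."""
    points = [to_pixels(kp, width, height) for kp in keypoints]
    return [(points[link["start"]], points[link["end"]]) for link in links]


segments = skeleton_segments(keypoints, links, FRAME_W, FRAME_H)
for start, end in segments:
    print(start, "->", end)
```

Each segment can then be handed to whatever drawing routine the client uses; the on-device overlay already renders the same skeleton, so this is only needed for custom host-side rendering.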

Beyond the default setup, the application offers flexible input and output configurations that can be tailored via command-line options, as described below:

Command-Line Options

--input-type

Selects the video input source for the pipeline.

Value	Description
`usb`	USB camera. Requires `--input-config=/dev/video0`.
`isp`	Built-in ISP (on-device) camera. Optionally specify a camera ID via `--input-config=0`.
`rtsp`	External IP/RTSP camera or stream. Requires `--input-config=rtsp://...`.
`file`	Local H.264-encoded video file. Requires `--input-config=/path/to/video.mp4`.

--input-config

Specifies the input source configuration corresponding to the selected --input-type.

Input Type	Value
USB	`/dev/videoX`
ISP	`<camera ID>`
RTSP	`rtsp://<ip-or-url>`
File	`/path/to/video.mp4`

--output-type

Defines how the processed output stream is delivered.

Value	Description
`none`	No video output (headless mode).
`file`	Save encoded output to a file. Requires `--output-config=/path/to/output.mp4`.
`rtsp`	Stream over RTSP. Requires `--output-config=<port>`. Access at `rtsp://<device-ip>:<port>/live`.
`webrtc`	Stream over WebRTC. Requires `--output-config=ws://<signalling-server>:<port>`.

--output-config

Specifies the output destination configuration corresponding to the selected --output-type.

Output Type	Value
File	`/path/to/output.mp4`
RTSP	`<port>`
WebRTC	`ws://<signalling-server>:<port>`

--model-base-path

Root directory for model, label, and config files. The application resolves assets automatically:

Asset type	Resolved path
Model files (`*.tflite`)	`<base-path>/models/<file>`
Label / settings files (`*.json`)	`<base-path>/labels/<file>`

--model-base-path=/root        # QLI
--model-base-path=/home/ubuntu # Ubuntu
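The resolution rule in the table can be expressed as a short Python helper. This is a sketch of the rule as documented here, not the application's code; `resolve_asset` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Sub-directory per file type, as listed in the table above.
SUBDIR_BY_SUFFIX = {".tflite": "models", ".json": "labels"}


def resolve_asset(base_path, filename):
    """Return the on-device path for a model or label/settings file."""
    suffix = PurePosixPath(filename).suffix
    try:
        subdir = SUBDIR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported asset type: {filename}") from None
    return str(PurePosixPath(base_path) / subdir / filename)


print(resolve_asset("/root", "palm_detection.tflite"))
print(resolve_asset("/home/ubuntu", "gesture_labels.json"))
```

This matches the directory layout created in the "Copy files to device" step, where models go under `models/` and JSON files under `labels/`.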

--no-display

Disables local on-screen rendering. Recommended for headless deployments, remote streaming setups (RTSP/WebRTC), or performance optimization.

--width / --height / --framerate

Sets the raw input video resolution and frame rate. Applicable only to ISP and USB inputs.

--width=1920 --height=1080 --framerate=30

--webrtc-id

Specifies the local WebRTC signaling client ID used for peer connection setup with the signaling server.

--webrtc-id=1010
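When launching the application from a script, the rules in the tables above can be checked before the process starts. The following Python sketch assembles an argument list using only the flags documented in this section; `build_command` is a hypothetical helper, and the "requires a config value" rules are taken from the tables rather than from the application source.

```python
# Whether each type needs a companion *-config value, per the tables above.
INPUT_NEEDS_CONFIG = {"usb": True, "isp": False, "rtsp": True, "file": True}
OUTPUT_NEEDS_CONFIG = {"none": False, "file": True, "rtsp": True, "webrtc": True}


def build_command(input_type, output_type, input_config=None,
                  output_config=None, no_display=False):
    """Return an argv list for gst-gesture-recognition, validating the options."""
    if input_type not in INPUT_NEEDS_CONFIG:
        raise ValueError(f"unknown --input-type: {input_type}")
    if output_type not in OUTPUT_NEEDS_CONFIG:
        raise ValueError(f"unknown --output-type: {output_type}")
    if INPUT_NEEDS_CONFIG[input_type] and input_config is None:
        raise ValueError(f"--input-type={input_type} requires --input-config")
    if OUTPUT_NEEDS_CONFIG[output_type] and output_config is None:
        raise ValueError(f"--output-type={output_type} requires --output-config")

    argv = ["gst-gesture-recognition", f"--input-type={input_type}"]
    if input_config is not None:
        argv.append(f"--input-config={input_config}")
    argv.append(f"--output-type={output_type}")
    if output_config is not None:
        argv.append(f"--output-config={output_config}")
    if no_display:
        argv.append("--no-display")
    return argv


command = build_command("file", "rtsp",
                        input_config="/root/media/video.mp4",
                        output_config="8900", no_display=True)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on the device; failing early on a missing `--input-config` or `--output-config` is usually easier to debug than a pipeline that fails to start.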

JSON Metadata Output

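The pipeline generates structured per-frame metadata. Each detected hand produces a root entry, with landmark and classification results nested under it; the gesture embedding does not appear in the sample. The sample below is an abbreviated version of the full output shown earlier: values are rounded, and only six of the 21 keypoints are listed.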

Sample JSON output

{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94,
      "rectangle": {
        "x": 0.348, "y": 0.155,
        "width": 0.227, "height": 0.403
      },
      "xtraparams": {
        "affine-matrix": [
          0.4206, 1.2300, 38.53,
          -1.2300, 0.4206, 1313.49,
          0.0, 0.0, 1.0
        ]
      },
      "video_landmarks": [
        {
          "keypoints": [
            { "keypoint": "wrist",             "x": 0.541, "y": 0.424, "confidence": 99.7 },
            { "keypoint": "thumb tip",         "x": 0.539, "y": 0.192, "confidence": 99.7 },
            { "keypoint": "index finger tip",  "x": 0.384, "y": 0.266, "confidence": 99.7 },
            { "keypoint": "middle finger tip", "x": 0.360, "y": 0.353, "confidence": 99.7 },
            { "keypoint": "ring finger tip",   "x": 0.366, "y": 0.427, "confidence": 99.7 },
            { "keypoint": "pinky tip",         "x": 0.399, "y": 0.502, "confidence": 99.7 }
          ]
        }
      ],
      "image_classification": [
        { "label": "Open Palm", "confidence": 0.706 }
      ]
    }
  ],
  "parameters": { "timestamp": "28356149027" }
}
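On the receiving side, each JSON message can be reduced to the few fields an application typically acts on. The Python sketch below parses a trimmed copy of the sample and reports the top gesture per detected hand. The key names come from the sample above; `summarize_frame` and the shape of its return value are hypothetical.

```python
import json

# Trimmed copy of the sample output above (two keypoints only).
SAMPLE = """
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94,
      "rectangle": {"x": 0.348, "y": 0.155, "width": 0.227, "height": 0.403},
      "video_landmarks": [
        {"keypoints": [
          {"keypoint": "wrist", "x": 0.541, "y": 0.424, "confidence": 99.7},
          {"keypoint": "thumb tip", "x": 0.539, "y": 0.192, "confidence": 99.7}
        ]}
      ],
      "image_classification": [{"label": "Open Palm", "confidence": 0.706}]
    }
  ],
  "parameters": {"timestamp": "28356149027"}
}
"""


def summarize_frame(text):
    """Return the timestamp and, per hand, the top gesture and keypoint count."""
    frame = json.loads(text)
    hands = []
    for det in frame.get("object_detection", []):
        gestures = det.get("image_classification", [])
        best = max(gestures, key=lambda g: g["confidence"], default=None)
        keypoint_count = sum(len(group.get("keypoints", []))
                             for group in det.get("video_landmarks", []))
        hands.append({
            "detection": det["label"],
            "gesture": best["label"] if best else None,
            "gesture_confidence": best["confidence"] if best else None,
            "keypoints": keypoint_count,
        })
    # The timestamp is serialized as a string; its unit is not stated in this post.
    return {"timestamp": int(frame["parameters"]["timestamp"]), "hands": hands}


summary = summarize_frame(SAMPLE)
print(summary)
```

In the sample, detection and keypoint confidences look like percentages while the classification confidence looks like a fraction, so it is worth checking the scale before applying a shared threshold.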

Implementation Deep-Dive

1. Application Configuration and Runtime Context

The application separates user configuration from runtime state using two structs:

typedef struct GstAppConfig {
  gchar *input_type;
  gchar *input_config;
  gchar *output_type;
  gchar *output_config;
  gchar *model_base_path;
  gboolean no_display;
  gint width, height, framerate, webrtc_id;
} GstAppConfig;

typedef struct GstAppContext {
  GstAppConfig config;
  GstElement *pipeline;
  GMainLoop  *mloop;
  GstElement *webrtc;
  gboolean    is_shutting_down;
} GstAppContext;

2. Pipeline Assembly

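The pipeline is composed of three branches: input, processing, and output. Construction order is deliberate: input first, output second, and the processing branch last, because the processing branch links the tail of the input branch to the heads of the video and metadata output branches and therefore needs both to exist.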

static gboolean gst_app_create_pipe (GstAppContext *appctx) {
  GstElement *input_tail = NULL, *output_head = NULL, *meta_head = NULL;
  appctx->pipeline = gst_pipeline_new ("gst-gesture-recognition");

  if (!gst_app_create_input_pipe  (appctx, &input_tail))  return FALSE;
  if (!gst_app_create_output_pipe (appctx, &output_head, &meta_head)) return FALSE;
  if (!gst_app_create_user_pipe   (appctx, input_tail, output_head, meta_head)) return FALSE;
  return TRUE;
}

3. Multi-Stage Inference Model Configuration

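Each stage is configured with a GPU-delegated qtimltflite instance and a task-specific qtimlpostprocess module. In the excerpt below, gst_app_make_element and gst_element_set_enum_property are helper functions from the application source rather than core GStreamer API, and <base> stands for the --model-base-path value: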

/* Stage 1 — Palm Detection */
palm_inf = gst_app_make_element ("qtimltflite", "palm_inf");
gst_element_set_enum_property (palm_inf, "delegate", "gpu");
g_object_set (palm_inf, "model", "<base>/models/palm_detection.tflite", NULL);

palm_post = gst_app_make_element ("qtimlpostprocess", "palm_post");
gst_element_set_enum_property (palm_post, "module", "palmd");
g_object_set (palm_post, "results", 1, NULL);

/* Stage 2 — Hand Landmark */
hand_pre = gst_app_make_element ("qtimlvconverter", "hand_pre");
gst_element_set_enum_property (hand_pre, "mode", "roi-batch-non-cumulative");
hand_inf = gst_app_make_element ("qtimltflite", "hand_inf");
gst_element_set_enum_property (hand_inf, "delegate", "gpu");

hand_post = gst_app_make_element ("qtimlpostprocess", "hand_post");
gst_element_set_enum_property (hand_post, "module", "hlandmark");
g_object_set (hand_post, "results", 6, NULL);

/* Stage 3 — Gesture Embedding */
gesture_pre = gst_app_make_element ("qtimlpostprocess", "gesture_pre");
gst_element_set_enum_property (gesture_pre, "module", "tensor");
gesture_embed = gst_app_make_element ("qtimltflite", "gesture_embed");
gst_element_set_enum_property (gesture_embed, "delegate", "gpu");

/* Stage 4 — Gesture Classification */
gesture_class = gst_app_make_element ("qtimltflite", "gesture_class");
gst_element_set_enum_property (gesture_class, "delegate", "gpu");
gesture_post = gst_app_make_element ("qtimlpostprocess", "gesture_post");
gst_element_set_enum_property (gesture_post, "module", "mobilenet");
g_object_set (gesture_post, "results", 8, NULL);

4. Linking Palm and Hand Landmark Branches

/* Note: q is shorthand for a separate queue instance on each branch. */
/* input → tee1 → metamux1 (passthrough) + palm detection */
gst_element_link (input_tail, tee1);
gst_element_link_many (tee1, q, metamux1, metatransform, tee2, NULL);
gst_element_link_many (tee1, q, palm_pre, palm_inf, palm_post, caps1, metamux1, NULL);

/* tee2 → overlay branch + hand landmark branch */
gst_element_link_many (tee2, q, metamux2, overlay, tee3, NULL);
gst_element_link_many (tee2, q, hand_pre, hand_inf, tee4, NULL);
gst_element_link_many (tee4, q, hand_post, caps2, metamux2, NULL);

5. Gesture Branch and Output

/* tee4 → embedding → classification → metamux2 */
gst_element_link_many (tee4, q,
    gesture_pre, gesture_embed, gesture_class, gesture_post,
    caps3, metamux2, NULL);

/* tee3 → video output */
gst_element_link_many (tee3, q, output_head, NULL);

/* tee3 → metadata output (optional) */
if (meta_head != NULL)
    gst_element_link_many (tee3, q, parser, meta_head, NULL);

6. WebRTC Signaling

WebRTC communication uses explicit SDP offer/answer exchange and ICE candidate negotiation via WebSocket using libsoup:

/* Create data channel for metadata */
g_signal_emit_by_name (webrtcbin, "create-data-channel", name, NULL, &ch);

/* Send SDP offer */
GstPromise *promise = gst_promise_new_with_change_func (on_offer_created, appctx, NULL);
g_signal_emit_by_name (webrtcbin, "create-offer", NULL, promise);

/* ICE candidates */
g_signal_connect (appctx->webrtc, "on-ice-candidate",
    G_CALLBACK (on_webrtc_ice_candidate), appctx);

Callback	Responsibility
`on_offer_created`	Constructs and sends the SDP offer to the remote peer
`on_webrtc_ice_candidate`	Transmits ICE candidates to the signaling server
`on_ws_message`	Handles incoming signaling messages from the WebSocket

Build the Application

Source code: gst-gesture-recognition
Build instructions: Steps to build custom application

Conclusion

The QIM SDK’s modular architecture enables developers to compose intelligent video analytics pipelines with speed and flexibility. Models run in parallel or sequentially, with each stage’s output automatically attached to the corresponding video frame as structured metadata. Decoupled video and inference processing allows multiple models to execute concurrently — maximizing throughput without sacrificing accuracy — with all results unified through a single GStreamer element for clean, scalable integration. Whether building touchless interfaces or gesture-driven applications, the QIM SDK provides a solid, production-ready foundation for advanced edge AI.
