QIMSDK · Qualcomm
Computer Vision

Build a four-stage real-time hand gesture recognition pipeline using Qualcomm IM SDK — covering palm detection, hand landmark estimation, gesture embedding, and gesture classification, all running on-device with hardware-accelerated inference.

QIMSDK Team · Jun 10, 2026

Introduction

Real-time gesture recognition is redefining how humans interact with machines, enabling touchless, natural interfaces across augmented reality, gaming, accessibility, and industrial control. But accurately interpreting hand gestures demands more than simple object detection; it requires a multi-stage AI pipeline capable of progressively refining raw visual input into high-level, actionable intent.

The QIM SDK brings this capability directly to the edge. By routing compute-intensive tasks through hardware-accelerated GStreamer plugins, the SDK offloads video decoding, frame preparation, multi-stage inference, and encoding entirely to dedicated hardware blocks, delivering low-latency, power-efficient execution even for complex multi-model workloads.

At the core of this use case is a four-stage sequential inference pipeline:

  • Stage 1: Palm Detection
  • Stage 2: Hand Landmark Estimation
  • Stage 3: Gesture Embedding
  • Stage 4: Gesture Classification

Each stage builds directly on the output of the previous one, progressively transforming raw video frames into structured, high-level gesture data. Between stages, intermediate transformations (such as rotating and cropping detected hand regions for correct spatial alignment) ensure consistent, accurate results across varying hand positions and orientations.

Metadata is hierarchically structured throughout the pipeline. Each detected hand establishes a root metadata entry, with landmark detections, embeddings, and gesture classifications attached as child metadata linked to the originating detection.

Visualization is fully hardware-accelerated: bounding boxes, keypoints, and gesture labels are composited directly onto video frames via optimized overlay rendering.

The pipeline accepts input from USB cameras, RTSP streams, ISP (on-device) cameras, and local video files, and delivers results through real-time on-screen visualization or remote streaming over RTSP and WebRTC, with structured inference metadata transmitted in parallel.

The complete application source code is available here.

Use Case Overview

1. Source
   The pipeline accepts continuous video input from a USB camera, RTSP stream, ISP (on-device) camera, or local video file.

2. Palm Detection
   Each incoming frame is processed by the palm detection model, which identifies the presence, location, and rotation angle of any hands in the scene.

3. Hand Landmark Detection
   The hand region is first rotated and cropped via affine transformation for correct spatial alignment. For each detected hand, the landmark model then identifies 21 keypoints (fingertips, joints, and wrist), capturing the full structural pose.

4. Gesture Embedding
   The detected landmarks are encoded into a compact numerical representation that summarizes the hand pose in a form optimized for classification.

5. Gesture Classification
   The embedding is passed to the gesture classification model, which maps the hand pose to a predefined gesture label such as open hand, fist, or thumbs up.

6. Metadata Synchronization
   Results from all inference stages are structured hierarchically. qtimetamux synchronizes this structured metadata with the original video frames, maintaining per-frame consistency throughout the pipeline.

7. Output
   Annotated frames are H.264-encoded and delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.

Pipeline diagram

Hand Gesture Recognition Pipeline

Elements used in pipeline

| Element | Description |
|---|---|
| source | Accepts video input from a USB camera, ISP camera, RTSP stream, or local video file. |
| tee | Splits the stream into multiple parallel branches for simultaneous processing. |
| qtimlvconverter | Hardware-accelerated resize, YUV→RGB conversion, and pixel normalization to meet each model's input requirements. |
| qtimltflite | Executes TFLite inference models, producing raw output tensors. |
| qtimlpostprocess | Decodes raw tensors into structured bounding boxes, keypoints, labels, and confidence scores via dynamically loaded modules. |
| qtimetamux | Synchronizes inference results with the original video stream as structured per-frame metadata. |
| qtimetatransform | Transforms metadata as it flows through the pipeline, modifying coordinate systems to ensure compatibility with downstream elements. |
| qtivoverlay | Composites bounding boxes, keypoints, and labels onto video frames using hardware-accelerated overlay rendering. |
| qtimlmetaparser | Serializes per-frame inference metadata into JSON for integration with external systems. |
| v4l2h264enc / h264parse | Hardware-accelerated H.264 encoding of the processed video stream. |
| waylandsink | Renders the output to the local display via the Wayland compositor. |

How It Works

1. Stage 1: Palm Detection
   The first model processes the full video frame, identifying the location, orientation, and bounding box of each detected hand. This stage also produces keypoints used to estimate the hand's rotation angle. A qtimetatransform element uses this information to compute an affine transformation matrix, which is attached as metadata and carried forward to the next stage.

2. Affine Crop Generation
   The hand landmark model requires a normalized, upright view of the hand, not raw bounding box coordinates. A dedicated qtimlvconverter instance consumes the affine transformation matrix and applies it to crop and rotate the detected hand region from the original frame, producing a correctly aligned input for the next stage.

3. Stage 2: Hand Landmark Detection
   The landmark model is invoked once per detected hand, processing the cropped and aligned region to produce 21 keypoints (fingertips, joints, and wrist) along with handedness and confidence scores. A post-processing step then applies the reverse transformation to map the keypoints back into the coordinate space of the original frame.

4. Stage 3: Gesture Embedding
   The landmark keypoints are passed to the gesture embedder, which encodes the hand pose into a compact numerical embedding vector. This stage operates directly on tensor data; no image cropping or geometric transformation is required.

5. Stage 4: Gesture Classification
   The embedding vector is passed to the gesture classifier, which maps the hand pose representation to a predefined gesture label such as open hand, fist, or thumbs up. Like the embedder, this stage operates purely on tensors with no additional preprocessing.

6. Hierarchical Metadata
   To preserve logical relationships across all four stages, the pipeline employs a hierarchical metadata model based on unique IDs and parent IDs. The palm detection stage creates a root metadata entry for each detected hand; each subsequent stage (landmark detection, embedding, and classification) attaches its results as child metadata referencing the parent. This ensures every classified gesture is explicitly traceable to its originating hand detection.

7. Output
   Annotated frames with bounding boxes, keypoints, and gesture labels are rendered via hardware-accelerated overlay. The H.264-encoded stream is delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.
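
The ID and parent-ID scheme described under Hierarchical Metadata can be illustrated with a small host-side sketch. This is not the SDK's metadata API: the `MetaEntry` class, its field names, and the `trace_to_root` helper are hypothetical and exist only to show how a classified gesture can be traced back to the palm detection it came from.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_next_uid = count(1)  # hypothetical ID generator, one unique ID per entry


@dataclass
class MetaEntry:
    """Illustrative metadata record; not the SDK's actual structure."""
    kind: str                        # e.g. "palm", "landmarks", "gesture"
    payload: dict
    parent_uid: Optional[int] = None  # None marks a root (palm detection) entry
    uid: int = field(default_factory=lambda: next(_next_uid))


def trace_to_root(entry, by_uid):
    """Follow parent_uid links until the root detection is reached."""
    while entry.parent_uid is not None:
        entry = by_uid[entry.parent_uid]
    return entry


# One detected hand (root) with two child results attached to it.
palm = MetaEntry("palm", {"confidence": 85.9})
landmarks = MetaEntry("landmarks", {"keypoints": 21}, parent_uid=palm.uid)
gesture = MetaEntry("gesture", {"label": "Open Palm"}, parent_uid=palm.uid)

by_uid = {e.uid: e for e in (palm, landmarks, gesture)}
root = trace_to_root(gesture, by_uid)
print(root.kind, root.uid)
```

Keeping only a parent reference on each child, rather than a list of children on the root, means any stage can attach its result without modifying the entry created by an earlier stage.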

Run application on device

Setup Requirements

Hardware

HW Setup
| Component | Description |
|---|---|
| Edge Device | RB3 Gen 2, IQ8, or IQ9. Primary processing unit for AI inference and video composition. |
| Camera Source | IP/RTSP camera, ISP (on-device) camera, or USB camera. A local file source may be substituted if no physical camera is available. |
| HDMI Display Monitor | Connected to the edge device for rendering and visualizing pipeline output. |
| PoE Switch | Powers IP/RTSP cameras and provides network connectivity over a single Ethernet cable per camera. (Required for IP/RTSP camera setups only.) |
| Local Network | Ensures the edge device, RTSP camera, and host machine are reachable on the same network. (Required when using RTSP camera input or streaming results via RTSP or WebRTC.) |

Software

Flash your Qualcomm Edge device by following the device setup and flashing instructions here. Once your device is ready, follow the instructions below to set up the Gesture Recognition AI Pipeline:
AI Model and config files
Download the gesture recognizer models from Google MediaPipe:
# Download the gesture recognizer task bundle
wget https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task

# Extract the top-level task
unzip gesture_recognizer.task

# Extract hand landmarker models and rename them to the file names used below
unzip hand_landmarker.task
mv hand_detector.tflite palm_detection.tflite
mv hand_landmarks_detector.tflite hand_landmark.tflite

# Extract gesture recognizer models
unzip hand_gesture_recognizer.task
# → gesture_embedder.tflite, canned_gesture_classifier.tflite
These are FLOAT precision models.
| File | Download | Save as |
|---|---|---|
| Palm detection model | See download steps above | palm_detection.tflite |
| Hand landmark model | See download steps above | hand_landmark.tflite |
| Gesture embedder model | See download steps above | gesture_embedder.tflite |
| Gesture classifier model | See download steps above | canned_gesture_classifier.tflite |
| Palm detection labels | palmd_labels.json | palmd_labels.json |
| Palm detection settings | palmd_settings.json | palmd_settings.json |
| Hand landmark labels | hlandmark_labels.json | hlandmark_labels.json |
| Hand landmark settings | hlandmark_settings.json | hlandmark_settings.json |
| Gesture labels | gesture_labels.json | gesture_labels.json |
| Sample video | Input video | video.mp4 |
Copy files to device
# Replace $HOME with the appropriate device path before running the commands.
# For QLI:    /root
# For Ubuntu: /home/ubuntu
# Modify this based on your platform and ensure files are copied to the correct location on the device.

ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media}"
scp palm_detection.tflite              <user>@<device-ip>:$HOME/models/
scp hand_landmark.tflite               <user>@<device-ip>:$HOME/models/
scp gesture_embedder.tflite            <user>@<device-ip>:$HOME/models/
scp canned_gesture_classifier.tflite   <user>@<device-ip>:$HOME/models/
scp palmd_labels.json                  <user>@<device-ip>:$HOME/labels/
scp palmd_settings.json                <user>@<device-ip>:$HOME/labels/
scp hlandmark_labels.json              <user>@<device-ip>:$HOME/labels/
scp hlandmark_settings.json            <user>@<device-ip>:$HOME/labels/
scp gesture_labels.json                <user>@<device-ip>:$HOME/labels/
scp video.mp4                          <user>@<device-ip>:$HOME/media/
Connect to device
ssh <user>@<device-ip>
Run the Gesture Recognition Application
A display must be connected to the device. If no display is available, use the --no-display flag to run in headless mode.
Model and label files are resolved relative to a base path that depends on your OS (see --model-base-path under Command-Line Options). Example invocations for each input type:
# USB camera input
gst-gesture-recognition \
  --input-type=usb \
  --input-config=/dev/video0 \
  --output-type=rtsp \
  --output-config=8900
# IP/RTSP camera input
gst-gesture-recognition \
  --input-type=rtsp \
  --input-config=rtsp://<ip>:<port>/stream \
  --output-type=rtsp \
  --output-config=8900
# Local video file input
gst-gesture-recognition \
  --input-type=file \
  --input-config=$HOME/media/video.mp4 \
  --output-type=rtsp \
  --output-config=8900
# Local video file input, headless (no local display)
gst-gesture-recognition \
  --input-type=file \
  --input-config=$HOME/media/video.mp4 \
  --output-type=rtsp \
  --output-config=8900 \
  --no-display
Note: To switch between an offline video file, an IP/RTSP camera, and a USB camera, update the --input-type and --input-config arguments accordingly; refer to the Command-Line Options section below for details.
The application produces two outputs: an AI-annotated video stream and a JSON metadata stream. To visualize them, refer to the Host-Side Visualization section below.

Visualize the Results - Host-Side Visualization (Windows + WSL)

This section describes how to run the visualization client on a Windows host machine using WSL (Windows Subsystem for Linux). The client renders the live video stream alongside a real-time AI metadata panel.
The visualization client script can be downloaded here: rtsp_webrtc_client.zip
It displays:
  • Left panel — Live video stream with AI overlays (bounding boxes, keypoints, gesture labels)
  • Right panel — Real-time AI metadata (JSON): object detections, bounding boxes, and confidence scores.
Step 1: Install WSL and Ubuntu
If WSL is not already installed, run the following from a Windows terminal:
wsl --install Ubuntu-24.04
Once installed, open the Ubuntu terminal and update the system:
sudo apt update && sudo apt upgrade -y
Step 2: Install System Dependencies
The visualization script requires GStreamer and Python GObject Introspection (GI) bindings. Install all required packages with:
sudo apt install -y \
  python3 python3-pip python3-gi python3-gi-cairo \
  gir1.2-gstreamer-1.0 \
  gir1.2-gst-plugins-base-1.0 \
  gir1.2-gst-plugins-bad-1.0 \
  gstreamer1.0-tools \
  gstreamer1.0-plugins-base \
  gstreamer1.0-plugins-good \
  gstreamer1.0-plugins-bad \
  gstreamer1.0-plugins-ugly \
  gstreamer1.0-libav \
  python3-websocket \
  libnice10 \
  libnice-dev \
  gstreamer1.0-nice
Step 3: Run the Visualization Client Script
Navigate to the directory containing the script and run the command that matches the output type configured on the device:
# RTSP output
python3 rtsp_webrtc_client.py rtsp://<DEVICE_IP>:8900/live
# WebRTC output
python3 rtsp_webrtc_client.py --source webrtc --signalling-server wss://webrtc.nirbheek.in:8443 --peer-id 1010
Step 4: Expected Output
Once the client connects, the UI will display:
| Panel | Content |
|---|---|
| Left | Real-time decoded video stream with gesture overlays |
| Right | Live AI metadata: detected hands, keypoints, and gesture labels |
After following these steps, the video and metadata streams should be up and running.
The Gesture Recognition AI Pipeline is configured for single-hand gesture recognition, the most common interaction scenario, where one dominant gesture is performed at a time. This keeps computational overhead low while maintaining reliable detection and classification. When multiple hands are present, the pipeline prioritizes the detection with the highest confidence score, which keeps gesture recognition stable even in dynamic conditions.
Expected Output
The pipeline generates structured JSON metadata in the following format:
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94279479980469,
      "color": 16711935,
      "rectangle": {
        "x": 0.3484375,
        "y": 0.15555555555555556,
        "width": 0.22708333333333333,
        "height": 0.40370370370370373
      },
      "xtraparams": {
        "affine-matrix": [
          0.42060701741537004,
          1.2300771264034409,
          38.53241177825402,
          -1.2300771264034409,
          0.42060701741537004,
          1313.4940252711156,
          0.0,
          0.0,
          1.0
        ]
      },
      "video_landmarks": [
        {
          "keypoints": [
            { "keypoint": "wrist", "x": 0.5411458333333333, "y": 0.42407407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb cmc", "x": 0.5432291666666667, "y": 0.3648148148148148, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb mcp", "x": 0.5359375, "y": 0.3037037037037037, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb ip", "x": 0.5348958333333333, "y": 0.24537037037037038, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb tip", "x": 0.5395833333333333, "y": 0.1925925925925926, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "index finger mcp", "x": 0.47760416666666666, "y": 0.32962962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger pip", "x": 0.43854166666666666, "y": 0.29907407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger dip", "x": 0.4109375, "y": 0.2814814814814815, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger tip", "x": 0.3848958333333333, "y": 0.26666666666666666, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "middle finger mcp", "x": 0.4635416666666667, "y": 0.37407407407407406, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger pip", "x": 0.41822916666666665, "y": 0.36203703703703705, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger dip", "x": 0.3880208333333333, "y": 0.35648148148148145, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger tip", "x": 0.36041666666666666, "y": 0.3537037037037037, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "ring finger mcp", "x": 0.46197916666666666, "y": 0.41759259259259257, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger pip", "x": 0.4192708333333333, "y": 0.4212962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger dip", "x": 0.39114583333333336, "y": 0.425, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger tip", "x": 0.36614583333333334, "y": 0.42777777777777776, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "pinky mcp", "x": 0.46979166666666666, "y": 0.45462962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky pip", "x": 0.43802083333333336, "y": 0.475, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky dip", "x": 0.41822916666666665, "y": 0.49074074074074076, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky tip", "x": 0.39947916666666666, "y": 0.5027777777777778, "confidence": 99.70703125, "color": 16711935 }
          ],
          "links": [
            { "start": 0, "end": 17 },
            { "start": 1, "end": 0 },
            { "start": 2, "end": 1 },
            { "start": 3, "end": 2 },
            { "start": 4, "end": 3 },
            { "start": 5, "end": 0 },
            { "start": 6, "end": 5 },
            { "start": 7, "end": 6 },
            { "start": 8, "end": 7 },
            { "start": 9, "end": 5 },
            { "start": 10, "end": 9 },
            { "start": 11, "end": 10 },
            { "start": 12, "end": 11 },
            { "start": 13, "end": 9 },
            { "start": 14, "end": 13 },
            { "start": 15, "end": 14 },
            { "start": 16, "end": 15 },
            { "start": 17, "end": 13 },
            { "start": 18, "end": 17 },
            { "start": 19, "end": 18 },
            { "start": 20, "end": 19 }
          ]
        }
      ],
      "image_classification": [
        {
          "label": "Open Palm",
          "confidence": 0.7060546875,
          "color": 4294902015
        }
      ]
    }
  ],
  "parameters": {
    "timestamp": "28356149027"
  }
}
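
The "affine-matrix" field in the sample above is a row-major 3×3 matrix of the form [[a, b, tx], [-b, a, ty], [0, 0, 1]], which is a rotation combined with a uniform scale and a translation. The sketch below is generic affine math, not SDK code: it recovers the scale and rotation magnitude from the sample values and builds the inverse matrix, the kind of operation needed to map points back from a crop to the original frame. The source does not state which direction the matrix maps (frame to crop or crop to frame), and the sign of the angle depends on the axis convention, so treat both as open assumptions.

```python
import math

# Row-major 3x3 matrix copied from the "affine-matrix" field of the sample.
M = [0.42060701741537004, 1.2300771264034409, 38.53241177825402,
     -1.2300771264034409, 0.42060701741537004, 1313.4940252711156,
     0.0, 0.0, 1.0]


def decompose(m):
    """Return (uniform scale, rotation angle in degrees) of the 2x2 part."""
    a, b = m[0], m[1]
    return math.hypot(a, b), math.degrees(math.atan2(b, a))


def apply(m, x, y):
    """Apply the affine transform to a single point."""
    return (m[0] * x + m[1] * y + m[2],
            m[3] * x + m[4] * y + m[5])


def invert(m):
    """Invert an affine matrix: A' = A^-1, t' = -A^-1 * t."""
    a, b, tx, c, d, ty = m[:6]
    det = a * d - b * c
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    return [ia, ib, -(ia * tx + ib * ty),
            ic, id_, -(ic * tx + id_ * ty),
            0.0, 0.0, 1.0]


scale, angle = decompose(M)
print(f"scale={scale:.3f} angle={angle:.1f} deg")

# Round trip: transform a point, then map it back with the inverse.
forward = apply(M, 100.0, 200.0)
back = apply(invert(M), *forward)
print(back)
```

For this sample the linear part has a uniform scale of about 1.3, which would be consistent with a crop region somewhat larger than the detected palm, although the post does not say so explicitly.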

Beyond the default setup, the application offers flexible input and output configurations that can be tailored via command-line options, as described below:

Command-Line Options

--input-type
Selects the video input source for the pipeline.
| Value | Description |
|---|---|
| usb | USB camera. Requires --input-config=/dev/video0. |
| isp | Built-in ISP (on-device) camera. Optionally specify a camera ID via --input-config=0. |
| rtsp | External IP/RTSP camera or stream. Requires --input-config=rtsp://.... |
| file | Local H.264-encoded video file. Requires --input-config=/path/to/video.mp4. |

--input-config
Specifies the input source configuration corresponding to the selected --input-type.
| Input Type | Value |
|---|---|
| USB | /dev/videoX |
| ISP | <camera ID> |
| RTSP | rtsp://<ip-or-url> |
| File | /path/to/video.mp4 |

--output-type
Defines how the processed output stream is delivered.
| Value | Description |
|---|---|
| none | No video output (headless mode). |
| file | Save encoded output to a file. Requires --output-config=/path/to/output.mp4. |
| rtsp | Stream over RTSP. Requires --output-config=<port>. Access at rtsp://<device-ip>:<port>/live. |
| webrtc | Stream over WebRTC. Requires --output-config=ws://<signalling-server>:<port>. |

--output-config
Specifies the output destination configuration corresponding to the selected --output-type.
| Output Type | Value |
|---|---|
| File | /path/to/output.mp4 |
| RTSP | <port> |
| WebRTC | ws://<signalling-server>:<port> |

--model-base-path
Root directory for model, label, and config files. The application resolves assets automatically:
| Asset type | Resolved path |
|---|---|
| Model files (*.tflite) | <base-path>/models/<file> |
| Label / settings files (*.json) | <base-path>/labels/<file> |
--model-base-path=/root        # QLI
--model-base-path=/home/ubuntu # Ubuntu

--no-display
Disables local on-screen rendering. Recommended for headless deployments, remote streaming setups (RTSP/WebRTC), or performance optimization.

--width, --height, --framerate
Sets the raw input video resolution and frame rate. Applicable only to ISP and USB inputs.
--width=1920 --height=1080 --framerate=30

--webrtc-id
Specifies the local WebRTC signaling client ID used for peer connection setup with the signaling server.
--webrtc-id=1010
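
When launching the application from a script, the option rules above can be checked before the process starts. The following is a minimal sketch, not part of the application: `build_command` and `resolve_asset` are hypothetical helper names, and the validation only mirrors the "Requires" notes and path-resolution table in this section.

```python
from pathlib import PurePosixPath

INPUT_TYPES = {"usb", "isp", "rtsp", "file"}
OUTPUT_TYPES = {"none", "file", "rtsp", "webrtc"}


def resolve_asset(base_path, filename):
    """Mirror the documented layout: *.tflite under models/, the rest under labels/."""
    subdir = "models" if filename.endswith(".tflite") else "labels"
    return str(PurePosixPath(base_path) / subdir / filename)


def build_command(input_type, output_type, input_config=None,
                  output_config=None, model_base_path=None, no_display=False):
    """Assemble the argument list for gst-gesture-recognition."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input type: {input_type}")
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output type: {output_type}")
    # Only the ISP camera ID is documented as optional.
    if input_type != "isp" and input_config is None:
        raise ValueError(f"--input-config is required for {input_type}")
    # Every output type except "none" is documented with a required config.
    if output_type != "none" and output_config is None:
        raise ValueError(f"--output-config is required for {output_type}")

    cmd = ["gst-gesture-recognition", f"--input-type={input_type}"]
    if input_config is not None:
        cmd.append(f"--input-config={input_config}")
    cmd.append(f"--output-type={output_type}")
    if output_config is not None:
        cmd.append(f"--output-config={output_config}")
    if model_base_path is not None:
        cmd.append(f"--model-base-path={model_base_path}")
    if no_display:
        cmd.append("--no-display")
    return cmd


cmd = build_command("usb", "rtsp", "/dev/video0", 8900, no_display=True)
print(" ".join(cmd))
print(resolve_asset("/root", "palm_detection.tflite"))
```

The returned list can be passed directly to `subprocess.run` on the device; building a list rather than a single string avoids shell-quoting problems with RTSP URLs.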

JSON Metadata Output

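The pipeline generates structured per-frame metadata. Each detected hand produces a root entry with its landmark and gesture classification results nested beneath it. The example below is abbreviated: it lists six of the 21 keypoints, omits the links and color fields, and shortens the numeric values.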
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94,
      "rectangle": {
        "x": 0.348, "y": 0.155,
        "width": 0.227, "height": 0.403
      },
      "xtraparams": {
        "affine-matrix": [
          0.4206, 1.2300, 38.53,
          -1.2300, 0.4206, 1313.49,
          0.0, 0.0, 1.0
        ]
      },
      "video_landmarks": [
        {
          "keypoints": [
            { "keypoint": "wrist",             "x": 0.541, "y": 0.424, "confidence": 99.7 },
            { "keypoint": "thumb tip",         "x": 0.539, "y": 0.192, "confidence": 99.7 },
            { "keypoint": "index finger tip",  "x": 0.384, "y": 0.266, "confidence": 99.7 },
            { "keypoint": "middle finger tip", "x": 0.360, "y": 0.353, "confidence": 99.7 },
            { "keypoint": "ring finger tip",   "x": 0.366, "y": 0.427, "confidence": 99.7 },
            { "keypoint": "pinky tip",         "x": 0.399, "y": 0.502, "confidence": 99.7 }
          ]
        }
      ],
      "image_classification": [
        { "label": "Open Palm", "confidence": 0.706 }
      ]
    }
  ],
  "parameters": { "timestamp": "28356149027" }
}
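
A host-side consumer usually needs only the gesture label and a few keypoint positions from each frame. The sketch below parses an abbreviated copy of the metadata shown above. It assumes the keypoint coordinates are normalized to the range 0 to 1 (consistent with the sample values) and that the frame is 1920×1080, matching the --width and --height example; the `summarize` function and its output layout are hypothetical.

```python
import json

# Abbreviated copy of one frame of metadata from the sample above.
SAMPLE = """
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94279479980469,
      "video_landmarks": [
        {"keypoints": [
          {"keypoint": "wrist", "x": 0.5411458333333333, "y": 0.42407407407407405},
          {"keypoint": "index finger tip", "x": 0.3848958333333333, "y": 0.26666666666666666}
        ]}
      ],
      "image_classification": [
        {"label": "Open Palm", "confidence": 0.7060546875}
      ]
    }
  ],
  "parameters": {"timestamp": "28356149027"}
}
"""


def summarize(frame_json, frame_width=1920, frame_height=1080):
    """Return one dict per detected hand: top gesture label and pixel keypoints."""
    frame = json.loads(frame_json)
    hands = []
    for det in frame.get("object_detection", []):
        gestures = det.get("image_classification", [])
        top = max(gestures, key=lambda g: g["confidence"], default=None)
        keypoints_px = {}
        for group in det.get("video_landmarks", []):
            for kp in group["keypoints"]:
                # Assumed: x and y are fractions of the frame width and height.
                keypoints_px[kp["keypoint"]] = (round(kp["x"] * frame_width),
                                                round(kp["y"] * frame_height))
        hands.append({"gesture": top["label"] if top else None,
                      "keypoints_px": keypoints_px})
    return hands


hands = summarize(SAMPLE)
print(hands[0]["gesture"], hands[0]["keypoints_px"]["wrist"])
```

Note that the sample reports the palm detection confidence on a 0 to 100 scale and the classification confidence on a 0 to 1 scale, so the two values should not be compared directly.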

Implementation Deep-Dive

The application separates user configuration from runtime state using two structs:
typedef struct GstAppConfig {
  gchar *input_type;
  gchar *input_config;
  gchar *output_type;
  gchar *output_config;
  gchar *model_base_path;
  gboolean no_display;
  gint width, height, framerate, webrtc_id;
} GstAppConfig;

typedef struct GstAppContext {
  GstAppConfig config;
  GstElement *pipeline;
  GMainLoop  *mloop;
  GstElement *webrtc;
  gboolean    is_shutting_down;
} GstAppContext;
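The pipeline is composed of three independent branches: input, processing, and output. Construction order is deliberate: the input and output branches are created first and the processing branch last, because the processing branch links to the tail of the input branch and to the heads of the output and metadata branches.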
static gboolean gst_app_create_pipe (GstAppContext *appctx) {
  GstElement *input_tail = NULL, *output_head = NULL, *meta_head = NULL;
  appctx->pipeline = gst_pipeline_new ("gst-gesture-recognition");

  if (!gst_app_create_input_pipe  (appctx, &input_tail))  return FALSE;
  if (!gst_app_create_output_pipe (appctx, &output_head, &meta_head)) return FALSE;
  if (!gst_app_create_user_pipe   (appctx, input_tail, output_head, meta_head)) return FALSE;
  return TRUE;
}
Each stage is configured with a GPU-delegated qtimltflite instance and a task-specific qtimlpostprocess module:
/* Stage 1 — Palm Detection */
palm_inf = gst_app_make_element ("qtimltflite", "palm_inf");
gst_element_set_enum_property (palm_inf, "delegate", "gpu");
g_object_set (palm_inf, "model", "<base>/models/palm_detection.tflite", NULL);

palm_post = gst_app_make_element ("qtimlpostprocess", "palm_post");
gst_element_set_enum_property (palm_post, "module", "palmd");
g_object_set (palm_post, "results", 1, NULL);

/* Stage 2 — Hand Landmark */
hand_pre = gst_app_make_element ("qtimlvconverter", "hand_pre");
gst_element_set_enum_property (hand_pre, "mode", "roi-batch-non-cumulative");
hand_inf = gst_app_make_element ("qtimltflite", "hand_inf");
gst_element_set_enum_property (hand_inf, "delegate", "gpu");

hand_post = gst_app_make_element ("qtimlpostprocess", "hand_post");
gst_element_set_enum_property (hand_post, "module", "hlandmark");
g_object_set (hand_post, "results", 6, NULL);

/* Stage 3 — Gesture Embedding */
gesture_pre = gst_app_make_element ("qtimlpostprocess", "gesture_pre");
gst_element_set_enum_property (gesture_pre, "module", "tensor");
gesture_embed = gst_app_make_element ("qtimltflite", "gesture_embed");
gst_element_set_enum_property (gesture_embed, "delegate", "gpu");

/* Stage 4 — Gesture Classification */
gesture_class = gst_app_make_element ("qtimltflite", "gesture_class");
gst_element_set_enum_property (gesture_class, "delegate", "gpu");
gesture_post = gst_app_make_element ("qtimlpostprocess", "gesture_post");
gst_element_set_enum_property (gesture_post, "module", "mobilenet");
g_object_set (gesture_post, "results", 8, NULL);
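The stages are then linked as follows. The snippet is schematic: q stands for a separate queue element on each branch, since a single element cannot be linked into more than one chain.
/* input → tee1 → metamux1 (passthrough) + palm detection */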
gst_element_link (input_tail, tee1);
gst_element_link_many (tee1, q, metamux1, metatransform, tee2, NULL);
gst_element_link_many (tee1, q, palm_pre, palm_inf, palm_post, caps1, metamux1, NULL);

/* tee2 → overlay branch + hand landmark branch */
gst_element_link_many (tee2, q, metamux2, overlay, tee3, NULL);
gst_element_link_many (tee2, q, hand_pre, hand_inf, tee4, NULL);
gst_element_link_many (tee4, q, hand_post, caps2, metamux2, NULL);
/* tee4 → embedding → classification → metamux2 */
gst_element_link_many (tee4, q,
    gesture_pre, gesture_embed, gesture_class, gesture_post,
    caps3, metamux2, NULL);

/* tee3 → video output */
gst_element_link_many (tee3, q, output_head, NULL);

/* tee3 → metadata output (optional) */
if (meta_head != NULL)
    gst_element_link_many (tee3, q, parser, meta_head, NULL);
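
To make the branch structure of the link calls above easier to follow, the topology can be written as a plain adjacency map and traversed. This is only an illustrative model, not GStreamer code: queue and capsfilter elements are left out, and `all_paths` is a hypothetical helper. The element names are taken from the snippet.

```python
# Simplified link map of the snippet above; queues and capsfilters omitted.
LINKS = {
    "input_tail": ["tee1"],
    "tee1": ["metamux1", "palm_pre"],
    "palm_pre": ["palm_inf"],
    "palm_inf": ["palm_post"],
    "palm_post": ["metamux1"],
    "metamux1": ["metatransform"],
    "metatransform": ["tee2"],
    "tee2": ["metamux2", "hand_pre"],
    "hand_pre": ["hand_inf"],
    "hand_inf": ["tee4"],
    "tee4": ["hand_post", "gesture_pre"],
    "hand_post": ["metamux2"],
    "gesture_pre": ["gesture_embed"],
    "gesture_embed": ["gesture_class"],
    "gesture_class": ["gesture_post"],
    "gesture_post": ["metamux2"],
    "metamux2": ["overlay"],
    "overlay": ["tee3"],
    "tee3": ["output_head", "parser"],
    "parser": ["meta_head"],
}


def all_paths(graph, src, dst, prefix=()):
    """Enumerate every downstream path from src to dst (the graph is acyclic)."""
    prefix = prefix + (src,)
    if src == dst:
        return [prefix]
    found = []
    for nxt in graph.get(src, []):
        found.extend(all_paths(graph, nxt, dst, prefix))
    return found


paths = all_paths(LINKS, "input_tail", "output_head")
for path in paths:
    print(" -> ".join(path))
```

Every path from input to output passes through both metamux1 and metamux2, which reflects how the video passthrough branch and the inference branches are rejoined before overlay and encoding.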
WebRTC communication uses explicit SDP offer/answer exchange and ICE candidate negotiation via WebSocket using libsoup:
/* Create data channel for metadata */
g_signal_emit_by_name (webrtcbin, "create-data-channel", name, NULL, &ch);

/* Send SDP offer */
GstPromise *promise = gst_promise_new_with_change_func (on_offer_created, appctx, NULL);
g_signal_emit_by_name (webrtcbin, "create-offer", NULL, promise);

/* ICE candidates */
g_signal_connect (appctx->webrtc, "on-ice-candidate",
    G_CALLBACK (on_webrtc_ice_candidate), appctx);
| Callback | Responsibility |
|---|---|
| on_offer_created | Constructs and sends the SDP offer to the remote peer |
| on_webrtc_ice_candidate | Transmits ICE candidates to the signaling server |
| on_ws_message | Handles incoming signaling messages from the WebSocket |
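
The post does not show the wire format of the signaling messages. As a rough illustration of what the callbacks exchange, the sketch below assumes the JSON shapes commonly used with GStreamer's example WebRTC signaling server: an "sdp" object for offers and answers, an "ice" object for candidates, and plain-text control messages such as HELLO. The function names are hypothetical and the application's actual format may differ.

```python
import json


def encode_offer(sdp_text):
    """Wrap an SDP offer in the assumed JSON envelope."""
    return json.dumps({"sdp": {"type": "offer", "sdp": sdp_text}})


def encode_ice(candidate, mline_index):
    """Wrap an ICE candidate in the assumed JSON envelope."""
    return json.dumps({"ice": {"candidate": candidate,
                               "sdpMLineIndex": mline_index}})


def classify_message(raw):
    """Sort an incoming signaling message into (kind, payload)."""
    try:
        msg = json.loads(raw)
    except ValueError:
        # Not JSON: treat as a plain-text control message (e.g. "HELLO").
        return ("control", raw)
    if not isinstance(msg, dict):
        return ("control", raw)
    if "sdp" in msg:
        return (msg["sdp"]["type"], msg["sdp"]["sdp"])
    if "ice" in msg:
        return ("ice", msg["ice"])
    return ("unknown", msg)


print(classify_message(encode_offer("v=0")))
print(classify_message(encode_ice("candidate:1 1 UDP 1 192.0.2.1 5000 typ host", 0)))
print(classify_message("HELLO"))
```

A handler in the role of on_ws_message would dispatch on the first element of the returned tuple, for example setting the remote description when it is "answer" and adding the candidate when it is "ice".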

Build the Application

Conclusion

The QIM SDK’s modular architecture enables developers to compose intelligent video analytics pipelines with speed and flexibility. Models run in parallel or sequentially, with each stage’s output automatically attached to the corresponding video frame as structured metadata. Decoupled video and inference processing allows multiple models to execute concurrently — maximizing throughput without sacrificing accuracy — with all results unified through a single GStreamer element for clean, scalable integration. Whether building touchless interfaces or gesture-driven applications, the QIM SDK provides a solid, production-ready foundation for advanced edge AI.