# Building a Real-Time Hand Gesture Recognition AI Pipeline Using the QIM SDK

> A four-stage sequential AI pipeline — palm detection, hand landmark estimation, gesture embedding, and gesture classification — built with Qualcomm IM SDK for real-time edge deployment.

<div
  style={{
width: "100%", borderRadius: "14px", overflow: "hidden",
backgroundImage: "url('https://mintcdn.com/qimsdk/p8bRJ_K0_Mx14HV0/blogs/images/gesture_title.png?fit=max&auto=format&n=p8bRJ_K0_Mx14HV0&q=85&s=9e8d3e1148f6708870155540ae2cb833')",
backgroundSize: "cover", backgroundPosition: "center",
height: "260px", display: "flex", alignItems: "center", justifyContent: "center",
position: "relative", marginBottom: "1.5rem"
}}
>
  <div
    style={{
position: "absolute", bottom: "16px", left: "50%", transform: "translateX(-50%)",
background: "rgba(255,255,255,0.15)", border: "1px solid rgba(255,255,255,0.4)",
color: "#fff", fontSize: "0.75rem", fontWeight: 700, letterSpacing: "1px",
padding: "5px 14px", borderRadius: "20px", textTransform: "uppercase", whiteSpace: "nowrap",
zIndex: 1
}}
  >
    QIMSDK · Qualcomm
  </div>
</div>

<div style={{ marginBottom: "2rem" }}>
  <div
    style={{
fontSize: "0.72rem", fontWeight: 700, color: "#31017D",
letterSpacing: "1.5px", textTransform: "uppercase", marginBottom: "0.5rem"
}}
  >
    Computer Vision
  </div>

  <p style={{ fontSize: "0.95rem", color: "#555", lineHeight: 1.7, margin: "0 0 0.75rem" }}>
    Build a four-stage real-time hand gesture recognition pipeline using Qualcomm IM SDK — covering
    palm detection, hand landmark estimation, gesture embedding, and gesture classification, all
    running on-device with hardware-accelerated inference.
  </p>

  <div style={{ fontSize: "0.85rem", color: "#888", display: "flex", gap: "0.5rem", flexWrap: "wrap", alignItems: "center" }}>
    <span>QIMSDK Team</span>
    <span>·</span>
    <span>Jun 10, 2026</span>
  </div>
</div>

<hr style={{ border: "none", borderTop: "1px solid #eee", margin: "0 0 2rem" }} />

## Introduction

Real-time gesture recognition is redefining how humans interact with machines — enabling touchless, natural interfaces across augmented reality, gaming, accessibility, and industrial control. But accurately interpreting hand gestures demands more than simple object detection; it requires a multi-stage AI pipeline capable of progressively refining raw visual input into high-level, actionable intent.

The QIM SDK brings this capability directly to the edge. By routing compute-intensive tasks through hardware-accelerated GStreamer plugins, the SDK offloads video decoding, frame preparation, multi-stage inference, and encoding entirely to dedicated hardware blocks — delivering low-latency, power-efficient execution even for complex multi-model workloads.

At the core of this use case is a **four-stage sequential inference pipeline**:

<CardGroup cols={4}>
  <Card title="Stage 1" icon="hand">
    Palm Detection
  </Card>

  <Card title="Stage 2" icon="circle-nodes">
    Hand Landmark Estimation
  </Card>

  <Card title="Stage 3" icon="diagram-project">
    Gesture Embedding
  </Card>

  <Card title="Stage 4" icon="tag">
    Gesture Classification
  </Card>
</CardGroup>

Each stage builds directly on the output of the previous, progressively transforming raw video frames into structured, high-level gesture data. Between stages, intermediate transformations — such as rotating and cropping detected hand regions for correct spatial alignment — ensure consistent, accurate results across varying hand positions and orientations.

Metadata is **hierarchically structured** throughout the pipeline. Each detected hand establishes a root metadata entry, with landmark detections, embeddings, and gesture classifications attached as child metadata linked to the originating detection. Visualization is fully hardware-accelerated: bounding boxes, key-points, and gesture labels are composited directly onto video frames via optimized overlay rendering.

The pipeline accepts input from USB cameras, RTSP streams, ISP (on-device) cameras, and local video files, and delivers results through real-time on-screen visualization or remote streaming over RTSP and WebRTC — with structured inference metadata transmitted in parallel.

The complete application source code is available [here](https://github.com/qualcomm/gst-plugins-imsdk/tree/main/gst-sample-apps/gst-gesture-recognition).

## Use Case Overview

<Steps>
  <Step title="Source">
    The pipeline accepts continuous video input from a USB camera, RTSP stream, ISP (on-device) camera, or local video file.
  </Step>

  <Step title="Palm Detection">
    Each incoming frame is processed by the palm detection model, which identifies the presence, location, and rotation angle of any hands in the scene.
  </Step>

  <Step title="Hand Landmark Detection">
    For each detected hand, the hand region is first rotated and cropped via an affine transformation for correct spatial alignment. The landmark model then identifies 21 keypoints (fingertips, joints, and wrist) that capture the full structural pose of the hand.
  </Step>

  <Step title="Gesture Embedding">
    The detected landmarks are encoded into a compact numerical representation that summarizes the hand pose in a form optimized for classification.
  </Step>

  <Step title="Gesture Classification">
    The embedding is passed to the gesture classification model, which maps the hand pose to a predefined gesture label — such as open hand, fist, or thumbs up.
  </Step>

  <Step title="Metadata Synchronization">
    Results from all inference stages are structured hierarchically. [`qtimetamux`](../plugin-reference/qtimetamux) synchronizes this structured metadata with the original video frames, maintaining per-frame consistency throughout the pipeline.
  </Step>

  <Step title="Output">
    Annotated frames are H.264-encoded and delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.
  </Step>
</Steps>

## Pipeline Diagram

<img src="https://mintcdn.com/qimsdk/p8bRJ_K0_Mx14HV0/blogs/images/gesture_pipelines.png?fit=max&auto=format&n=p8bRJ_K0_Mx14HV0&q=85&s=15030ae1991eccfeb36411895b1adbe0" alt="Hand Gesture Recognition Pipeline" width="1862" height="676" data-path="blogs/images/gesture_pipelines.png" />

## Elements Used in Pipeline

| Element                                                    | Description                                                                                                                           |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `source`                                                   | Accepts video input from a USB camera, ISP camera, RTSP stream, or local video file.                                                  |
| `tee`                                                      | Splits the stream into multiple parallel branches for simultaneous processing.                                                        |
| [`qtimlvconverter`](../plugin-reference/qtimlvconverter)   | Hardware-accelerated resize, YUV→RGB conversion, and pixel normalization to meet each model's input requirements.                     |
| [`qtimltflite`](../plugin-reference/qtimltflite)           | Executes TFLite inference models, producing raw output tensors.                                                                       |
| [`qtimlpostprocess`](../plugin-reference/qtimlpostprocess) | Decodes raw tensors into structured bounding boxes, keypoints, labels, and confidence scores via dynamically loaded modules.          |
| [`qtimetamux`](../plugin-reference/qtimetamux)             | Synchronizes inference results with the original video stream as structured per-frame metadata.                                       |
| `qtimetatransform`                                         | Transforms metadata as it flows through the pipeline — modifying coordinate systems to ensure compatibility with downstream elements. |
| [`qtivoverlay`](../plugin-reference/qtivoverlay)           | Composites bounding boxes, keypoints, and labels onto video frames using hardware-accelerated overlay rendering.                      |
| [`qtimlmetaparser`](../plugin-reference/qtimetaparser)     | Serializes per-frame inference metadata into JSON for integration with external systems.                                              |
| `v4l2h264enc` / `h264parse`                                | Hardware-accelerated H.264 encoding of the processed video stream.                                                                    |
| [`waylandsink`](../plugin-reference/waylandsink)           | Renders the output to the local display via the Wayland compositor.                                                                   |

## How It Works

<Steps>
  <Step title="Stage 1 — Palm Detection">
    The first model processes the full video frame, identifying the location, orientation, and bounding box of each detected hand. This stage also produces key points used to estimate the hand's rotation angle. A `qtimetatransform` element uses this information to compute an **affine transformation matrix**, which is attached as metadata and carried forward to the next stage.
  </Step>

  <Step title="Affine Crop Generation">
    The hand landmark model requires a normalized, upright view of the hand — not raw bounding box coordinates. A dedicated [`qtimlvconverter`](../plugin-reference/qtimlvconverter) instance consumes the affine transformation matrix and applies it to crop and rotate the detected hand region from the original frame, producing a correctly aligned input for the next stage.
  </Step>

  <Step title="Stage 2 — Hand Landmark Detection">
    The landmark model is invoked once per detected hand, processing the cropped and aligned region to produce **21 keypoints** — fingertips, joints, and wrist — along with handedness and confidence scores. A post-processing step then applies a reverse transformation to remap the keypoints accurately back onto the original frame coordinate space.
  </Step>

  <Step title="Stage 3 — Gesture Embedding">
    The landmark keypoints are passed to the gesture embedder, which encodes the hand pose into a compact numerical embedding vector. This stage operates directly on tensor data — no image cropping or geometric transformation is required.
  </Step>

  <Step title="Stage 4 — Gesture Classification">
    The embedding vector is passed to the gesture classifier, which maps the hand pose representation to a predefined gesture label — such as open hand, fist, or thumbs up. Like the embedder, this stage operates purely on tensors with no additional preprocessing.
  </Step>

  <Step title="Hierarchical Metadata">
    To preserve logical relationships across all four stages, the pipeline employs a **hierarchical metadata model** based on unique IDs and parent IDs. The palm detection stage creates a **root metadata entry** for each detected hand; each subsequent stage — landmark detection, embedding, and classification — attaches its results as **child metadata** referencing the parent. This ensures every classified gesture is explicitly traceable to its originating hand detection.
  </Step>

  <Step title="Output">
    Annotated frames with bounding boxes, keypoints, and gesture labels are rendered via hardware-accelerated overlay. The H.264-encoded stream is delivered over RTSP or WebRTC, with structured inference metadata transmitted in parallel as a JSON stream.
  </Step>
</Steps>
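
The alignment in the Affine Crop Generation step and the reverse remapping in Stage 2 can be illustrated with a small, self-contained Python sketch. It assumes a similarity transform (rotation, uniform scale, translation) stored as a row-major 3x3 matrix; the helper names `make_affine`, `apply_affine`, and `invert_affine` are hypothetical and are not part of the QIM SDK, and the numeric values are arbitrary.

```python
import math

def make_affine(angle_deg, scale, tx, ty):
    """Row-major 3x3 matrix: rotate by angle_deg, scale uniformly, then translate."""
    c = scale * math.cos(math.radians(angle_deg))
    s = scale * math.sin(math.radians(angle_deg))
    return [c, -s, tx,
            s,  c, ty,
            0.0, 0.0, 1.0]

def apply_affine(m, x, y):
    """Map a point through the top two rows of the matrix."""
    return (m[0] * x + m[1] * y + m[2],
            m[3] * x + m[4] * y + m[5])

def invert_affine(m):
    """Invert the 2x2 linear part and negate the transformed translation."""
    a, b, tx, c, d, ty = m[:6]
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    return [ia, ib, -(ia * tx + ib * ty),
            ic, id_, -(ic * tx + id_ * ty),
            0.0, 0.0, 1.0]

# Forward: frame coordinates -> aligned crop coordinates (example values only).
forward = make_affine(angle_deg=30.0, scale=1.3, tx=40.0, ty=25.0)
crop_xy = apply_affine(forward, 100.0, 200.0)

# Reverse: a keypoint found in the crop is remapped onto the original frame.
inverse = invert_affine(forward)
frame_xy = apply_affine(inverse, *crop_xy)
```

Applying the inverse to `crop_xy` recovers the original point up to floating-point error, which is the property the post-processing step relies on when it maps keypoints back to frame coordinates.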

## Run Application on Device

### Setup Requirements

#### Hardware

<img src="https://mintcdn.com/qimsdk/p8bRJ_K0_Mx14HV0/blogs/images/gesture_hw-setup.png?fit=max&auto=format&n=p8bRJ_K0_Mx14HV0&q=85&s=e4f80abdafbc2a52b2c1dc80e94cebbd" alt="HW Setup" width="1902" height="807" data-path="blogs/images/gesture_hw-setup.png" />

| Component                | Description                                                                                                                                                                |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Edge Device**          | RB3 Gen 2, IQ8, or IQ9 — Primary processing unit for AI inference and video composition.                                                                                   |
| **Camera Source**        | IP/RTSP camera, ISP (on-device) camera, or USB camera. A local file source may be substituted if no physical camera is available.                                          |
| **HDMI Display Monitor** | Connected to the edge device for rendering and visualizing pipeline output.                                                                                                |
| **PoE Switch**           | Powers IP/RTSP cameras and provides network connectivity over a single Ethernet cable per camera. (Required for IP/RTSP camera setups only.)                               |
| **Local Network**        | Ensures the edge device, RTSP camera, and host machine are reachable on the same network. (Required when using RTSP camera input or streaming results via RTSP or WebRTC.) |

#### Software

**Flash your Qualcomm Edge device** by following the device setup and flashing instructions [here](../installation).

**Once your device is ready**, follow the instructions below to set up the Gesture Recognition AI Pipeline:

##### AI Model and config files

Download the gesture recognizer models from Google MediaPipe:

```bash theme={null}
# Download the gesture recognizer task bundle
wget https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/latest/gesture_recognizer.task

# Extract the top-level task
unzip gesture_recognizer.task

# Extract hand landmarker models
unzip hand_landmarker.task
# Rename the extracted models to the file names used in this guide
mv hand_detector.tflite palm_detection.tflite
mv hand_landmarks_detector.tflite hand_landmark.tflite

# Extract gesture recognizer models
unzip hand_gesture_recognizer.task
# → gesture_embedder.tflite, canned_gesture_classifier.tflite
```

<Note>
  These are floating-point (float16) models, as indicated by the `float16` segment of the download URL.
</Note>
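
If `unzip` is not available on the host, the same extraction can be sketched in Python with the standard `zipfile` module, since the `.task` bundles are zip archives. This is only an illustrative alternative to the shell steps above: the function name `extract_models` is hypothetical, and the member names are the ones listed in those steps.

```python
import zipfile
from pathlib import Path

# Bundle member name -> file name used on the device (from the shell steps above).
RENAMES = {
    "hand_detector.tflite": "palm_detection.tflite",
    "hand_landmarks_detector.tflite": "hand_landmark.tflite",
}

def extract_models(task_path, out_dir):
    """Recursively unpack a .task bundle and collect its .tflite files in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(task_path) as bundle:
        for name in bundle.namelist():
            base = Path(name).name
            if base.endswith(".task"):
                # Nested bundle: write it out temporarily, unpack it, then delete it.
                nested = out_dir / base
                nested.write_bytes(bundle.read(name))
                saved += extract_models(nested, out_dir)
                nested.unlink()
            elif base.endswith(".tflite"):
                target = out_dir / RENAMES.get(base, base)
                target.write_bytes(bundle.read(name))
                saved.append(target.name)
    return sorted(saved)

# Example call (run after downloading the bundle):
# extract_models("gesture_recognizer.task", "models")
```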

| File                     | Download                                                                                                                                               | Save as                            |
| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------- |
| Palm detection model     | See download steps above                                                                                                                               | `palm_detection.tflite`            |
| Hand landmark model      | See download steps above                                                                                                                               | `hand_landmark.tflite`             |
| Gesture embedder model   | See download steps above                                                                                                                               | `gesture_embedder.tflite`          |
| Gesture classifier model | See download steps above                                                                                                                               | `canned_gesture_classifier.tflite` |
| Palm detection labels    | <a href="../labels/palmd_labels.json" download="palmd_labels.json">palmd\_labels.json</a>                                                              | `palmd_labels.json`                |
| Palm detection settings  | <a href="../labels/palmd_settings.json" download="palmd_settings.json">palmd\_settings.json</a>                                                        | `palmd_settings.json`              |
| Hand landmark labels     | <a href="../labels/hlandmark_labels.json" download="hlandmark_labels.json">hlandmark\_labels.json</a>                                                  | `hlandmark_labels.json`            |
| Hand landmark settings   | <a href="../labels/hlandmark_settings.json" download="hlandmark_settings.json">hlandmark\_settings.json</a>                                            | `hlandmark_settings.json`          |
| Gesture labels           | <a href="../labels/gesture_labels.json" download="gesture_labels.json">gesture\_labels.json</a>                                                        | `gesture_labels.json`              |
| Sample video             | <a href="https://github.com/qualcomm/sample-apps-for-qualcomm-linux/raw/refs/heads/main/qualcomm-linux/artifacts/videos/demo_samples/">Input video</a> | `video.mp4`                        |

**Copy files to device**

<CodeGroup>
  ```bash SCP (SSH) theme={null}
  # $HOME is expanded by your host shell, not on the device, so replace it
  # with the device-side path before running these commands:
  #   QLI:    /root
  #   Ubuntu: /home/ubuntu
  # Make sure the files end up in the matching models/, labels/, and media/ directories on the device.

  ssh <user>@<device-ip> "mkdir -p $HOME/{models,labels,media}"
  scp palm_detection.tflite              <user>@<device-ip>:$HOME/models/
  scp hand_landmark.tflite               <user>@<device-ip>:$HOME/models/
  scp gesture_embedder.tflite            <user>@<device-ip>:$HOME/models/
  scp canned_gesture_classifier.tflite   <user>@<device-ip>:$HOME/models/
  scp palmd_labels.json                  <user>@<device-ip>:$HOME/labels/
  scp palmd_settings.json                <user>@<device-ip>:$HOME/labels/
  scp hlandmark_labels.json              <user>@<device-ip>:$HOME/labels/
  scp hlandmark_settings.json            <user>@<device-ip>:$HOME/labels/
  scp gesture_labels.json                <user>@<device-ip>:$HOME/labels/
  scp video.mp4                          <user>@<device-ip>:$HOME/media/
  ```
</CodeGroup>

**Connect to device**

```bash theme={null}
ssh <user>@<device-ip>
```

**Run the Gesture Recognition Application**

<Note>
  A display must be connected to the device. If no display is available, use the `--no-display` flag to run in headless mode.
</Note>

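Choose the command that matches your input source. To point the application at your model and label files, pass `--model-base-path` (`/root` on QLI, `/home/ubuntu` on Ubuntu), as described in the **Command-Line Options** section below.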

<AccordionGroup>
  <Accordion title="USB camera">
    ```bash theme={null}
    gst-gesture-recognition \
      --input-type=usb \
      --input-config=/dev/video0 \
      --output-type=rtsp \
      --output-config=8900
    ```
  </Accordion>

  <Accordion title="RTSP camera">
    ```bash theme={null}
    gst-gesture-recognition \
      --input-type=rtsp \
      --input-config=rtsp://<ip>:<port>/stream \
      --output-type=rtsp \
      --output-config=8900
    ```
  </Accordion>

  <Accordion title="File input">
    ```bash theme={null}
    gst-gesture-recognition \
      --input-type=file \
      --input-config=$HOME/media/video.mp4 \
      --output-type=rtsp \
      --output-config=8900
    ```
  </Accordion>

  <Accordion title="Headless (no display)">
    ```bash theme={null}
    gst-gesture-recognition \
      --input-type=file \
      --input-config=$HOME/media/video.mp4 \
      --output-type=rtsp \
      --output-config=8900 \
      --no-display
    ```
  </Accordion>
</AccordionGroup>

> **Note:** The File input and Headless examples above use an offline video file. To use a USB, ISP, or IP/RTSP camera instead, change `--input-type` and `--input-config` accordingly; see the **Command-Line Options** section below for details.

The application produces two outputs: an AI-annotated video stream and a JSON metadata stream. To visualize them, see the **Host-Side Visualization** section below.

## Visualize the Results - Host-Side Visualization (Windows + WSL)

This section describes how to run the visualization client on a Windows host machine using **WSL (Windows Subsystem for Linux)**. The client renders the live video stream alongside a real-time AI metadata panel.

📥 The visualization client script can be downloaded here: <a href="../labels/rtsp_webrtc_client.zip" download="rtsp_webrtc_client.zip">rtsp\_webrtc\_client.zip</a>

It displays:

* **Left panel**: Live video stream with AI overlays (bounding boxes, keypoints, gesture labels)
* **Right panel**: Real-time AI metadata (JSON), including detected hands, keypoints, and gesture labels

**Step 1 — Install WSL and Ubuntu**

If WSL is not already installed, run the following from a Windows terminal:

```bash theme={null}
wsl --install Ubuntu-24.04
```

Once installed, open the Ubuntu terminal and update the system:

```bash theme={null}
sudo apt update && sudo apt upgrade -y
```

**Step 2 — Install System Dependencies**

The visualization script requires GStreamer and Python GObject Introspection (GI) bindings. Install all required packages with:

```bash theme={null}
sudo apt install -y \
  python3 python3-pip python3-gi python3-gi-cairo \
  gir1.2-gstreamer-1.0 \
  gir1.2-gst-plugins-base-1.0 \
  gir1.2-gst-plugins-bad-1.0 \
  gstreamer1.0-tools \
  gstreamer1.0-plugins-base \
  gstreamer1.0-plugins-good \
  gstreamer1.0-plugins-bad \
  gstreamer1.0-plugins-ugly \
  gstreamer1.0-libav \
  python3-websocket \
  libnice10 \
  libnice-dev \
  gstreamer1.0-nice
```

**Step 3 — Run the Visualization Client Script**

Navigate to the directory containing the script and run:

<AccordionGroup>
  <Accordion title="RTSP">
    ```bash theme={null}
    python3 rtsp_webrtc_client.py rtsp://<DEVICE_IP>:8900/live
    ```
  </Accordion>

  <Accordion title="WebRTC">
    ```bash theme={null}
    python3 rtsp_webrtc_client.py --source webrtc --signalling-server wss://webrtc.nirbheek.in:8443 --peer-id 1010
    ```
  </Accordion>
</AccordionGroup>

**Step 4 — Expected Output**

Once the client connects, the UI will display:

| Panel | Content                                                          |
| ----- | ---------------------------------------------------------------- |
| Left  | Real-time decoded video stream with gesture overlays             |
| Right | Live AI metadata — detected hands, keypoints, and gesture labels |

After following the steps, the video and metadata streams should be up and running.

The Gesture Recognition AI Pipeline is configured for single-hand gesture recognition, the most common interaction scenario, in which one dominant gesture is performed at a time. This keeps computational overhead low while maintaining reliable detection and classification. When multiple hands are present, the pipeline prioritizes the detection with the highest confidence score, which keeps the recognized gesture stable in dynamic scenes.
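
The highest-confidence rule can be expressed in a few lines. The following Python sketch only illustrates the selection logic on a list of detection dictionaries; the function name `select_primary_hand` and the sample values are hypothetical, and the application itself may implement the rule differently.

```python
def select_primary_hand(detections):
    """Return the detection with the highest confidence, or None if there are none."""
    return max(detections, key=lambda d: d["confidence"], default=None)

# Two hypothetical palm detections from the same frame.
hands = [
    {"label": "palm", "confidence": 72.4},
    {"label": "palm", "confidence": 85.9},
]
primary = select_primary_hand(hands)
```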

<img src="https://mintcdn.com/qimsdk/p8bRJ_K0_Mx14HV0/blogs/images/gesture_expected-output.png?fit=max&auto=format&n=p8bRJ_K0_Mx14HV0&q=85&s=66f220136a7b36db292fe0f81b36c81b" alt="Expected Output" width="1683" height="725" data-path="blogs/images/gesture_expected-output.png" />

The pipeline generates structured JSON metadata in the following format:

```json theme={null}
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94279479980469,
      "color": 16711935,
      "rectangle": {
        "x": 0.3484375,
        "y": 0.15555555555555556,
        "width": 0.22708333333333333,
        "height": 0.40370370370370373
      },
      "xtraparams": {
        "affine-matrix": [
          0.42060701741537004,
          1.2300771264034409,
          38.53241177825402,
          -1.2300771264034409,
          0.42060701741537004,
          1313.4940252711156,
          0.0,
          0.0,
          1.0
        ]
      },
      "video_landmarks": [
        {
          "keypoints": [
            { "keypoint": "wrist", "x": 0.5411458333333333, "y": 0.42407407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb cmc", "x": 0.5432291666666667, "y": 0.3648148148148148, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb mcp", "x": 0.5359375, "y": 0.3037037037037037, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb ip", "x": 0.5348958333333333, "y": 0.24537037037037038, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "thumb tip", "x": 0.5395833333333333, "y": 0.1925925925925926, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "index finger mcp", "x": 0.47760416666666666, "y": 0.32962962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger pip", "x": 0.43854166666666666, "y": 0.29907407407407405, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger dip", "x": 0.4109375, "y": 0.2814814814814815, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "index finger tip", "x": 0.3848958333333333, "y": 0.26666666666666666, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "middle finger mcp", "x": 0.4635416666666667, "y": 0.37407407407407406, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger pip", "x": 0.41822916666666665, "y": 0.36203703703703705, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger dip", "x": 0.3880208333333333, "y": 0.35648148148148145, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "middle finger tip", "x": 0.36041666666666666, "y": 0.3537037037037037, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "ring finger mcp", "x": 0.46197916666666666, "y": 0.41759259259259257, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger pip", "x": 0.4192708333333333, "y": 0.4212962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger dip", "x": 0.39114583333333336, "y": 0.425, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "ring finger tip", "x": 0.36614583333333334, "y": 0.42777777777777776, "confidence": 99.70703125, "color": 16711935 },

            { "keypoint": "pinky mcp", "x": 0.46979166666666666, "y": 0.45462962962962963, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky pip", "x": 0.43802083333333336, "y": 0.475, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky dip", "x": 0.41822916666666665, "y": 0.49074074074074076, "confidence": 99.70703125, "color": 16711935 },
            { "keypoint": "pinky tip", "x": 0.39947916666666666, "y": 0.5027777777777778, "confidence": 99.70703125, "color": 16711935 }
          ],
          "links": [
            { "start": 0, "end": 17 },
            { "start": 1, "end": 0 },
            { "start": 2, "end": 1 },
            { "start": 3, "end": 2 },
            { "start": 4, "end": 3 },
            { "start": 5, "end": 0 },
            { "start": 6, "end": 5 },
            { "start": 7, "end": 6 },
            { "start": 8, "end": 7 },
            { "start": 9, "end": 5 },
            { "start": 10, "end": 9 },
            { "start": 11, "end": 10 },
            { "start": 12, "end": 11 },
            { "start": 13, "end": 9 },
            { "start": 14, "end": 13 },
            { "start": 15, "end": 14 },
            { "start": 16, "end": 15 },
            { "start": 17, "end": 13 },
            { "start": 18, "end": 17 },
            { "start": 19, "end": 18 },
            { "start": 20, "end": 19 }
          ]
        }
      ],
      "image_classification": [
        {
          "label": "Open Palm",
          "confidence": 0.7060546875,
          "color": 4294902015
        }
      ]
    }
  ],
  "parameters": {
    "timestamp": "28356149027"
  }
}

```
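
A host-side consumer might turn this metadata into pixel coordinates and a gesture label. The Python sketch below is a minimal illustration, assuming that `rectangle` values are normalized to the frame size (the sample values all fall between 0 and 1) and that the frame is 1920x1080; `to_pixels` and `summarize` are hypothetical helper names, and the embedded sample is a trimmed copy of the output above.

```python
import json

SAMPLE = """
{
  "object_detection": [
    {
      "label": "palm",
      "confidence": 85.94,
      "rectangle": {"x": 0.3484375, "y": 0.15555555555555556,
                    "width": 0.22708333333333333, "height": 0.40370370370370373},
      "image_classification": [{"label": "Open Palm", "confidence": 0.706}]
    }
  ],
  "parameters": {"timestamp": "28356149027"}
}
"""

def to_pixels(rect, frame_w, frame_h):
    """Scale a normalized rectangle to integer pixel values (x, y, width, height)."""
    return (round(rect["x"] * frame_w), round(rect["y"] * frame_h),
            round(rect["width"] * frame_w), round(rect["height"] * frame_h))

def summarize(frame_meta, frame_w, frame_h):
    """Return one {box, gesture} entry per detected hand."""
    results = []
    for det in frame_meta.get("object_detection", []):
        gestures = det.get("image_classification", [])
        best = max(gestures, key=lambda g: g["confidence"], default=None)
        results.append({
            "box": to_pixels(det["rectangle"], frame_w, frame_h),
            "gesture": best["label"] if best else None,
        })
    return results

summary = summarize(json.loads(SAMPLE), 1920, 1080)
```

Note that in the sample the detection and keypoint confidences look like percentages, while the classification confidence is a fraction, so a consumer should not compare the two directly.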

Beyond the default setup, the application supports other input and output configurations through the command-line options described below.

## Command-Line Options

<AccordionGroup>
  <Accordion title="--input-type">
    Selects the video input source for the pipeline.

    | Value  | Description                                                                             |
    | ------ | --------------------------------------------------------------------------------------- |
    | `usb`  | USB camera. Requires `--input-config=/dev/video0`.                                      |
    | `isp`  | Built-in ISP (on-device) camera. Optionally specify a camera ID via `--input-config=0`. |
    | `rtsp` | External IP/RTSP camera or stream. Requires `--input-config=rtsp://...`.                |
    | `file` | Local H.264-encoded video file. Requires `--input-config=/path/to/video.mp4`.           |
  </Accordion>

  <Accordion title="--input-config">
    Specifies the input source configuration corresponding to the selected `--input-type`.

    | Input Type | Value                |
    | ---------- | -------------------- |
    | USB        | `/dev/videoX`        |
    | ISP        | `<camera ID>`        |
    | RTSP       | `rtsp://<ip-or-url>` |
    | File       | `/path/to/video.mp4` |
  </Accordion>

  <Accordion title="--output-type">
    Defines how the processed output stream is delivered.

    | Value    | Description                                                                                      |
    | -------- | ------------------------------------------------------------------------------------------------ |
    | `none`   | No video output (headless mode).                                                                 |
    | `file`   | Save encoded output to a file. Requires `--output-config=/path/to/output.mp4`.                   |
    | `rtsp`   | Stream over RTSP. Requires `--output-config=<port>`. Access at `rtsp://<device-ip>:<port>/live`. |
    | `webrtc` | Stream over WebRTC. Requires `--output-config=ws://<signalling-server>:<port>`.                  |
  </Accordion>

  <Accordion title="--output-config">
    Specifies the output destination configuration corresponding to the selected `--output-type`.

    | Output Type | Value                             |
    | ----------- | --------------------------------- |
    | File        | `/path/to/output.mp4`             |
    | RTSP        | `<port>`                          |
    | WebRTC      | `ws://<signalling-server>:<port>` |
  </Accordion>

  <Accordion title="--model-base-path">
    Root directory for model, label, and config files. The application resolves assets automatically:

    | Asset type                        | Resolved path               |
    | --------------------------------- | --------------------------- |
    | Model files (`*.tflite`)          | `<base-path>/models/<file>` |
    | Label / settings files (`*.json`) | `<base-path>/labels/<file>` |

    ```bash theme={null}
    --model-base-path=/root        # QLI
    --model-base-path=/home/ubuntu # Ubuntu
    ```
  </Accordion>

  <Accordion title="--no-display">
    Disables local on-screen rendering. Recommended for headless deployments, remote streaming setups (RTSP/WebRTC), or performance optimization.
  </Accordion>

  <Accordion title="--width / --height / --framerate">
    Sets the raw input video resolution and frame rate. Applicable only to ISP and USB inputs.

    ```bash theme={null}
    --width=1920 --height=1080 --framerate=30
    ```
  </Accordion>

  <Accordion title="--webrtc-id">
    Specifies the local WebRTC signaling client ID used for peer connection setup with the signaling server.

    ```bash theme={null}
    --webrtc-id=1010
    ```
  </Accordion>
</AccordionGroup>
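
The directory layout expected by `--model-base-path` can be checked on the host before copying files. The Python sketch below mirrors the resolution table in that option's description; `resolve_asset` is a hypothetical helper written for illustration and is not taken from the application source.

```python
from pathlib import PurePosixPath

# File extension -> subdirectory under the base path, per the table above.
SUBDIR_BY_SUFFIX = {".tflite": "models", ".json": "labels"}

def resolve_asset(base_path, filename):
    """Return the device path where an asset is expected under base_path."""
    suffix = PurePosixPath(filename).suffix
    if suffix not in SUBDIR_BY_SUFFIX:
        raise ValueError(f"unsupported asset type: {filename}")
    return str(PurePosixPath(base_path) / SUBDIR_BY_SUFFIX[suffix] / filename)

model_path = resolve_asset("/root", "palm_detection.tflite")
label_path = resolve_asset("/home/ubuntu", "gesture_labels.json")
```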

## JSON Metadata Output

The pipeline generates structured per-frame metadata. Each detected hand produces a root entry with child landmark, embedding, and classification results:

<Accordion title="Sample JSON output">
  ```json theme={null}
  {
    "object_detection": [
      {
        "label": "palm",
        "confidence": 85.94,
        "rectangle": {
          "x": 0.348, "y": 0.155,
          "width": 0.227, "height": 0.403
        },
        "xtraparams": {
          "affine-matrix": [
            0.4206, 1.2300, 38.53,
            -1.2300, 0.4206, 1313.49,
            0.0, 0.0, 1.0
          ]
        },
        "video_landmarks": [
          {
            "keypoints": [
              { "keypoint": "wrist",             "x": 0.541, "y": 0.424, "confidence": 99.7 },
              { "keypoint": "thumb tip",         "x": 0.539, "y": 0.192, "confidence": 99.7 },
              { "keypoint": "index finger tip",  "x": 0.384, "y": 0.266, "confidence": 99.7 },
              { "keypoint": "middle finger tip", "x": 0.360, "y": 0.353, "confidence": 99.7 },
              { "keypoint": "ring finger tip",   "x": 0.366, "y": 0.427, "confidence": 99.7 },
              { "keypoint": "pinky tip",         "x": 0.399, "y": 0.502, "confidence": 99.7 }
            ]
          }
        ],
        "image_classification": [
          { "label": "Open Palm", "confidence": 0.706 }
        ]
      }
    ],
    "parameters": { "timestamp": "28356149027" }
  }
  ```
</Accordion>
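
The `rectangle` and keypoint coordinates in the sample all fall between 0 and 1, which suggests they are normalized to the frame size. Under that assumption, a consumer of the metadata could map a rectangle to pixel units as sketched below. The type and function names are illustrative and do not come from the SDK.

```c
/* Illustrative types; not SDK definitions. */
typedef struct { double x, y, width, height; } NormRect;
typedef struct { int x, y, width, height; } PixelRect;

/* Scales a normalized rectangle to pixel units, rounding to the nearest
 * pixel. Assumes all values are non-negative and normalized to [0, 1]. */
static PixelRect norm_rect_to_pixels (NormRect r, int frame_w, int frame_h) {
  PixelRect p;
  p.x      = (int) (r.x      * frame_w + 0.5);
  p.y      = (int) (r.y      * frame_h + 0.5);
  p.width  = (int) (r.width  * frame_w + 0.5);
  p.height = (int) (r.height * frame_h + 0.5);
  return p;
}
```

For the palm rectangle in the sample and a 1920x1080 frame, this yields a box at (668, 167) that is 436 pixels wide and 435 pixels high.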

## Implementation Deep-Dive

<AccordionGroup>
  <Accordion title="1. Application Configuration and Runtime Context">
    The application separates user configuration from runtime state using two structs:

    ```c theme={null}
    typedef struct GstAppConfig {
      gchar *input_type;
      gchar *input_config;
      gchar *output_type;
      gchar *output_config;
      gchar *model_base_path;
      gboolean no_display;
      gint width, height, framerate, webrtc_id;
    } GstAppConfig;

    typedef struct GstAppContext {
      GstAppConfig config;
      GstElement *pipeline;
      GMainLoop  *mloop;
      GstElement *webrtc;
      gboolean    is_shutting_down;
    } GstAppContext;
    ```
  </Accordion>

  <Accordion title="2. Pipeline Assembly">
    The pipeline is composed of three branches: input, processing, and output. Construction order is deliberate: the input and output branches are created first, and the processing branch last, because it links the input tail to the output and metadata heads.

    ```c theme={null}
    static gboolean gst_app_create_pipe (GstAppContext *appctx) {
      GstElement *input_tail = NULL, *output_head = NULL, *meta_head = NULL;
      appctx->pipeline = gst_pipeline_new ("gst-gesture-recognition");

      if (!gst_app_create_input_pipe  (appctx, &input_tail))  return FALSE;
      if (!gst_app_create_output_pipe (appctx, &output_head, &meta_head)) return FALSE;
      if (!gst_app_create_user_pipe   (appctx, input_tail, output_head, meta_head)) return FALSE;
      return TRUE;
    }
    ```
  </Accordion>

  <Accordion title="3. Multi-Stage Inference Model Configuration">
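    Each stage runs a GPU-delegated `qtimltflite` instance. Palm detection, hand landmark estimation, and gesture classification each add a task-specific `qtimlpostprocess` module. The gesture embedding stage instead uses a `qtimlpostprocess` element with the `tensor` module as its pre-processing step, and its output feeds the classifier directly: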

    ```c theme={null}
    /* Stage 1 — Palm Detection */
    palm_inf = gst_app_make_element ("qtimltflite", "palm_inf");
    gst_element_set_enum_property (palm_inf, "delegate", "gpu");
    g_object_set (palm_inf, "model", "<base>/models/palm_detection.tflite", NULL);

    palm_post = gst_app_make_element ("qtimlpostprocess", "palm_post");
    gst_element_set_enum_property (palm_post, "module", "palmd");
    g_object_set (palm_post, "results", 1, NULL);

    /* Stage 2 — Hand Landmark */
    hand_pre = gst_app_make_element ("qtimlvconverter", "hand_pre");
    gst_element_set_enum_property (hand_pre, "mode", "roi-batch-non-cumulative");
    hand_inf = gst_app_make_element ("qtimltflite", "hand_inf");
    gst_element_set_enum_property (hand_inf, "delegate", "gpu");

    hand_post = gst_app_make_element ("qtimlpostprocess", "hand_post");
    gst_element_set_enum_property (hand_post, "module", "hlandmark");
    g_object_set (hand_post, "results", 6, NULL);

    /* Stage 3 — Gesture Embedding */
    gesture_pre = gst_app_make_element ("qtimlpostprocess", "gesture_pre");
    gst_element_set_enum_property (gesture_pre, "module", "tensor");
    gesture_embed = gst_app_make_element ("qtimltflite", "gesture_embed");
    gst_element_set_enum_property (gesture_embed, "delegate", "gpu");

    /* Stage 4 — Gesture Classification */
    gesture_class = gst_app_make_element ("qtimltflite", "gesture_class");
    gst_element_set_enum_property (gesture_class, "delegate", "gpu");
    gesture_post = gst_app_make_element ("qtimlpostprocess", "gesture_post");
    gst_element_set_enum_property (gesture_post, "module", "mobilenet");
    g_object_set (gesture_post, "results", 8, NULL);
    ```
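
    The per-stage settings above can be summarized in a small lookup table. This is only an illustrative sketch: `StageSummary`, `stage_table`, and `find_stage_by_module` are hypothetical names, and the values are copied from the configuration snippet.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Illustrative summary type; not part of the sample application. */
    typedef struct {
      const char *stage;        /* descriptive stage name */
      const char *post_module;  /* "module" of the stage's post-processor, or NULL */
      int         results;      /* "results" value set above, or 0 if none */
    } StageSummary;

    static const StageSummary stage_table[] = {
      { "palm detection",         "palmd",     1 },
      { "hand landmark",          "hlandmark", 6 },
      { "gesture embedding",      NULL,        0 },  /* no post-processor shown */
      { "gesture classification", "mobilenet", 8 },
    };

    /* Returns the stage whose post-processing module matches, or NULL. */
    static const StageSummary *find_stage_by_module (const char *module) {
      size_t n = sizeof stage_table / sizeof stage_table[0];
      for (size_t i = 0; i < n; i++) {
        const char *m = stage_table[i].post_module;
        if (m != NULL && strcmp (m, module) == 0)
          return &stage_table[i];
      }
      return NULL;
    }
    ```

    Keeping the settings in one table makes the differences between stages easy to see: the landmark stage reports 6 results per hand, which matches the six keypoints in the sample JSON output.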
  </Accordion>

  <Accordion title="4. Linking Palm and Hand Landmark Branches">
    ```c theme={null}
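    /* Note: `q` is shorthand for a separate queue element on each branch;
     * a single queue instance cannot be linked into more than one chain. */
    /* input → tee1 → metamux1 (passthrough) + palm detection */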
    gst_element_link (input_tail, tee1);
    gst_element_link_many (tee1, q, metamux1, metatransform, tee2, NULL);
    gst_element_link_many (tee1, q, palm_pre, palm_inf, palm_post, caps1, metamux1, NULL);

    /* tee2 → overlay branch + hand landmark branch */
    gst_element_link_many (tee2, q, metamux2, overlay, tee3, NULL);
    gst_element_link_many (tee2, q, hand_pre, hand_inf, tee4, NULL);
    gst_element_link_many (tee4, q, hand_post, caps2, metamux2, NULL);
    ```
  </Accordion>

  <Accordion title="5. Gesture Branch and Output">
    ```c theme={null}
    /* Note: `q` is shorthand for a separate queue element on each branch. */
    /* tee4 → embedding → classification → metamux2 */
    gst_element_link_many (tee4, q,
        gesture_pre, gesture_embed, gesture_class, gesture_post,
        caps3, metamux2, NULL);

    /* tee3 → video output */
    gst_element_link_many (tee3, q, output_head, NULL);

    /* tee3 → metadata output (optional) */
    if (meta_head != NULL)
        gst_element_link_many (tee3, q, parser, meta_head, NULL);
    ```
  </Accordion>

  <Accordion title="6. WebRTC Signaling">
    WebRTC signaling uses an explicit SDP offer/answer exchange and ICE candidate negotiation over a WebSocket connection implemented with `libsoup`:

    ```c theme={null}
    /* Create data channel for metadata */
    g_signal_emit_by_name (webrtcbin, "create-data-channel", name, NULL, &ch);

    /* Send SDP offer */
    GstPromise *promise = gst_promise_new_with_change_func (on_offer_created, appctx, NULL);
    g_signal_emit_by_name (webrtcbin, "create-offer", NULL, promise);

    /* ICE candidates */
    g_signal_connect (appctx->webrtc, "on-ice-candidate",
        G_CALLBACK (on_webrtc_ice_candidate), appctx);
    ```

    | Callback                  | Responsibility                                         |
    | ------------------------- | ------------------------------------------------------ |
    | `on_offer_created`        | Constructs and sends the SDP offer to the remote peer  |
    | `on_webrtc_ice_candidate` | Transmits ICE candidates to the signaling server       |
    | `on_ws_message`           | Handles incoming signaling messages from the WebSocket |
  </Accordion>
</AccordionGroup>

## Build the Application

* **Source code:** [gst-gesture-recognition](https://github.com/qualcomm/gst-plugins-imsdk/tree/main/gst-sample-apps/gst-gesture-recognition)
* **Build instructions:** [Steps to build custom application](../advanced/ubuntu-build#steps-to-build-custom-application)

## Conclusion

The QIM SDK's modular architecture lets developers compose video analytics pipelines from reusable GStreamer elements. Models can run in parallel or sequentially, and each stage's output is automatically attached to the corresponding video frame as structured metadata. Because video and inference processing are decoupled, multiple models can execute concurrently without stalling the video path, and all results are unified through a single GStreamer element, which keeps integration clean and scalable.

Whether building touchless interfaces or gesture-driven applications, the QIM SDK provides a solid, production-ready foundation for advanced edge AI.
