Edge Deployment of Qwen3 Series

Introduction

Qwen3 is the latest generation of Large Language Models in the Qwen series, offering a comprehensive suite of dense models and Mixture-of-Experts (MoE) models. Based on large-scale training, Qwen3 has achieved breakthrough progress in reasoning, instruction following, agentic capabilities, and multilingual support.

This chapter demonstrates how to complete the deployment, loading, and conversation workflow for the Qwen3 series on edge devices. Two deployment methods are provided:

  • AidGen C++ API
  • AidGenSE OpenAI API

In this example, LLM inference runs entirely on the device side: the program receives user input and returns conversation results in real time through the relevant API calls.

  • Device: Rhino Pi-X1
  • System: Ubuntu 22.04
  • Model: Qwen3-1.7B

Supported Platforms

Platform        Execution Method
Rhino Pi-X1     Ubuntu 22.04, AidLux

Prerequisites

  1. Rhino Pi-X1 hardware.
  2. Ubuntu 22.04 system or AidLux system.

AidGen Case Deployment

Step 1: Install AidGen SDK

bash
# Install AidGen SDK
sudo aid-pkg update
sudo aid-pkg -i aidgen-sdk

# Copy test code
cd /home/aidlux
cp -r /usr/local/share/aidgen/examples/cpp/aidllm .

Step 2: Download Model Resources

Since Qwen3-1.7B is currently in the Model Farm preview section, it must be obtained via the mms command.

bash
# Login
mms login

# Search for the model
mms list qwen3

# Download the model
mms get -m Qwen3-1.7B -p w4a16 -c qcs8550 -b qnn2.36 -d /home/aidlux/aidllm/qwen3-1.7b

cd /home/aidlux/aidllm/qwen3-1.7b
unzip qnn236_qcs8550_cl2048.zip
mv qnn236_qcs8550_cl2048/* /home/aidlux/aidllm/

Step 3: Create Configuration File

bash
cd /home/aidlux/aidllm
vim qwen3-1.7b-aidgen-config.json

Create the following JSON configuration file:

json
{
    "backend_type": "genie",
    "prefix_path": "kv-cache.primary.qnn-htp",
    "model": {
        "path": [
            "qwen3-1.7b_qnn236_qcs8550_cl2048_1_of_3.serialized.bin.aidem",
            "qwen3-1.7b_qnn236_qcs8550_cl2048_2_of_3.serialized.bin.aidem",
            "qwen3-1.7b_qnn236_qcs8550_cl2048_3_of_3.serialized.bin.aidem"
        ]
    }
}
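A misnamed shard or missing key in this file is a common source of startup failures. The sketch below is a minimal, hypothetical helper (not part of the AidGen SDK) that loads the configuration and reports any model shards that are not present next to it:

```python
import json
import os

# Keys the configuration above is expected to contain
REQUIRED_KEYS = {"backend_type", "prefix_path", "model"}

def check_aidgen_config(path):
    """Load an AidGen config file and report obvious problems.

    Returns the parsed config and a list of shard files that were
    not found in the same directory as the config file.
    """
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    shards = cfg["model"].get("path", [])
    if not shards:
        raise ValueError("model.path must list at least one shard")
    base = os.path.dirname(os.path.abspath(path))
    absent = [s for s in shards if not os.path.exists(os.path.join(base, s))]
    return cfg, absent
```

Run it from /home/aidlux/aidllm; an empty `absent` list means all three .aidem shards were found.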

Step 4: Verify Resource Files

After the previous steps, the files under /home/aidlux/aidllm should be laid out as follows:

bash
/home/aidlux/aidllm
├── CMakeLists.txt
├── test_prompt_abort.cpp
├── test_prompt_serial.cpp
├── aidgen_chat_template.txt
├── chat.txt
├── htp_backend_ext_config.json
├── qwen3-1.7b-htp.json
├── qwen3-1.7b-aidgen-config.json
├── kv-cache.primary.qnn-htp
├── qwen3-1.7b-tokenizer.json
├── qwen3-1.7b_qnn236_qcs8550_cl2048_1_of_3.serialized.bin.aidem
├── qwen3-1.7b_qnn236_qcs8550_cl2048_2_of_3.serialized.bin.aidem
└── qwen3-1.7b_qnn236_qcs8550_cl2048_3_of_3.serialized.bin.aidem

Step 5: Set Conversation Template

💡Note

Please refer to the aidgen_chat_template.txt file in the model resource package for the conversation template.

Modify the test_prompt_serial.cpp file according to the LLM template:

cpp
// test_prompt_serial.cpp
// ...
// lines 43-47
    std::string prompt_template_type = "qwen3";
    if(prompt_template_type == "qwen3"){
        prompt_template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{0}/no_think<|im_end|>\n<|im_start|>assistant\n";
    }
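The same substitution can be checked outside C++. The Python sketch below is a hypothetical helper mirroring the template string above: it fills the `{0}` placeholder with the user's message, and the `/no_think` suffix asks Qwen3 to answer without its thinking phase:

```python
# Qwen3 chat template, identical to the string used in test_prompt_serial.cpp
QWEN3_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{0}/no_think<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_prompt(user_text, template=QWEN3_TEMPLATE):
    """Substitute the user's message into the Qwen3 chat template."""
    return template.format(user_text)
```

For example, `build_prompt("Hello")` produces a prompt ending in the assistant turn marker, ready to hand to the model.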

Step 6: Compilation and Execution

bash
# Install dependencies
sudo apt update
sudo apt install libfmt-dev

# Compile
mkdir build && cd build
cmake .. && make

# Run after successful compilation.
# The first `1` enables profiler statistics.
# The second `1` sets the number of inference iterations.

mv test_prompt_serial /home/aidlux/aidllm/
cd /home/aidlux/aidllm/
./test_prompt_serial qwen3-1.7b-aidgen-config.json 1 1

  • Enter your text in the terminal to start the conversation.

AidGenSE Case Deployment

Step 1: Install AidGenSE

bash
sudo aid-pkg update

# Ensure that aidgense is the latest version.
sudo aid-pkg remove aidgense
sudo aid-pkg -i aidgense

Step 2: Model Query & Acquisition

bash
# Check available models
aidllm remote-list api | grep qwen3

#------------------------ Qwen3 series models can be seen ------------------------

Current Soc : 8550

Name                                 Url                                         CreateTime
-----                                ---------                                   ---------
Qwen3-0.6B-8550                      aplux/Qwen3-0.6B-8550                       2025-09-26 09:54:15
Qwen3-1.7B-8550                      aplux/Qwen3-1.7B-8550                       2025-09-26 09:54:15
Qwen3-4B-8550                        aplux/Qwen3-4B-8550                         2025-09-26 09:54:15
Qwen3-8B-8550                        aplux/Qwen3-8B-8550                         2025-09-26 09:54:15
...

# Download qwen3-1.7B-8550
aidllm pull api aplux/Qwen3-1.7B-8550

Step 3: Start HTTP Service

bash
# Start the OpenAI API service for the corresponding model
aidllm start api -m Qwen3-1.7B-8550

# Check status
aidllm status api

# Stop service: aidllm stop api
# Restart service: aidllm restart api

💡Note

The default port number is 8888.
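Once the service reports as running, a plain chat-completion request is a quick way to confirm the endpoint works. The sketch below builds the request payload; the POST itself is commented out because it assumes the aidllm service is listening on the default port 8888:

```python
# Assumed endpoint: the aidllm service on its default port
API_URL = "http://127.0.0.1:8888/v1/chat/completions"

def build_chat_payload(messages, model="Qwen3-1.7B-8550", stream=False):
    """Assemble an OpenAI-style chat-completion payload."""
    return {"model": model, "messages": messages, "stream": stream}

payload = build_chat_payload(
    [{"role": "user", "content": "Hello"}]
)
# With the service running, send it with requests:
# import requests
# r = requests.post(API_URL, json=payload, timeout=60)
# print(r.json()["choices"][0]["message"]["content"])
```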

Step 4: Conversation Test

Chat Test via Web UI

bash
# Install UI frontend service
sudo aidllm install ui

# Start UI service
aidllm start ui

# Check UI service status: aidllm status ui
# Stop UI service: aidllm stop ui

After the UI service starts, visit http://<device-ip>:51104 in a browser, replacing <device-ip> with the board's IP address.

Chat Test via Python

python
import requests
import json

def stream_chat_completion(messages, model="Qwen3-1.7B-8550"):
    """Stream a chat completion from the local service, printing tokens as they arrive."""
    url = "http://127.0.0.1:8888/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": messages,
        "stream": True    # Enable streaming
    }

    # Initiate request with stream=True
    response = requests.post(url, headers=headers, json=payload, stream=True)
    response.raise_for_status()

    # Read line by line and parse SSE format
    for line in response.iter_lines():
        if not line:
            continue
        line_data = line.decode('utf-8')
        # Each SSE line starts with the "data: " prefix
        if line_data.startswith("data: "):
            data = line_data[len("data: "):]
            # End flag
            if data.strip() == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                print("Unable to parse JSON:", data)
                continue

            # Extract the token output by the model
            content = chunk["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)

if __name__ == "__main__":
    # Example conversation
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ]
    print("Assistant:", end=" ")
    stream_chat_completion(messages)
    print()  # New line
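The streaming loop above can be factored into a reusable parser, which also makes it testable without a running service. The sketch below is a hypothetical helper that consumes an iterable of raw SSE lines (such as `response.iter_lines()` produces) and yields only the content tokens:

```python
import json

def iter_sse_content(lines):
    """Yield content tokens from OpenAI-style SSE chat-completion lines."""
    for line in lines:
        if not line:
            continue  # skip SSE keep-alive blank lines
        text = line.decode("utf-8") if isinstance(line, bytes) else line
        if not text.startswith("data: "):
            continue
        data = text[len("data: "):]
        if data.strip() == "[DONE]":
            return  # end-of-stream marker
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # skip malformed chunks
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            yield content
```

With this helper, the body of stream_chat_completion reduces to iterating `iter_sse_content(response.iter_lines())` and printing each token.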