Edge Deployment of Qwen3 Series
Introduction
Qwen3 is the latest generation of Large Language Models in the Qwen series, offering a comprehensive suite of dense models and Mixture-of-Experts (MoE) models. Based on large-scale training, Qwen3 has achieved breakthrough progress in reasoning, instruction following, agentic capabilities, and multilingual support.
This chapter demonstrates how to complete the deployment, loading, and conversation workflow for the Qwen3 series on edge devices. Two deployment methods are provided:
- AidGen C++ API
- AidGenSE OpenAI API
In this case, LLM inference runs entirely on the device: the program receives user input and returns conversation results in real time by calling the relevant interfaces.
- Device: Rhino Pi-X1
- System: Ubuntu 22.04
- Model: Qwen3-1.7B
Supported Platforms
| Platform | Execution Method |
|---|---|
| Rhino Pi-X1 | Ubuntu 22.04, AidLux |
Prerequisites
- Rhino Pi-X1 hardware.
- Ubuntu 22.04 system or AidLux system.
AidGen Case Deployment
Step 1: Install AidGen SDK
# Install AidGen SDK
sudo aid-pkg update
sudo aid-pkg -i aidgen-sdk
# Copy test code
cd /home/aidlux
cp -r /usr/local/share/aidgen/examples/cpp/aidllm .
Step 2: Download Model Resources
Since Qwen3-1.7B is currently in the Model Farm preview section, it must be obtained via the mms command.
# Login
mms login
# Search for the model
mms list qwen3
# Download the model
mms get -m Qwen3-1.7B -p w4a16 -c qcs8550 -b qnn2.36 -d /home/aidlux/aidllm/qwen3-1.7b
cd /home/aidlux/aidllm/qwen3-1.7b
unzip qnn236_qcs8550_cl2048.zip
mv qnn236_qcs8550_cl2048/* /home/aidlux/aidllm/
Step 3: Create Configuration File
cd /home/aidlux/aidllm
vim qwen3-1.7b-aidgen-config.json
Create the following JSON configuration file:
{
"backend_type": "genie",
"prefix_path": "kv-cache.primary.qnn-htp",
"model": {
"path": [
"qwen3-1.7b_qnn236_qcs8550_cl2048_1_of_3.serialized.bin.aidem",
"qwen3-1.7b_qnn236_qcs8550_cl2048_2_of_3.serialized.bin.aidem",
"qwen3-1.7b_qnn236_qcs8550_cl2048_3_of_3.serialized.bin.aidem"
]
}
}
Step 4: Verify Resource Files
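Before running inference, it can help to confirm programmatically that every model shard referenced by the configuration exists on disk. Below is a minimal Python sketch; `missing_model_files` is an illustrative helper, not part of the AidGen SDK, and the config path matches the file created above:

```python
# Sketch: check that all model shards listed in the AidGen config exist.
# missing_model_files is an illustrative helper, not part of the AidGen SDK.
import json
import os

def missing_model_files(config_path):
    """Return the model files from the config that are absent on disk."""
    with open(config_path) as f:
        cfg = json.load(f)
    base = os.path.dirname(os.path.abspath(config_path))
    return [p for p in cfg["model"]["path"]
            if not os.path.exists(os.path.join(base, p))]

if __name__ == "__main__":
    config = "/home/aidlux/aidllm/qwen3-1.7b-aidgen-config.json"
    if os.path.exists(config):
        print("missing shards:", missing_model_files(config) or "none")
```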
The file distribution should be as follows:
/home/aidlux/aidllm
├── CMakeLists.txt
├── test_prompt_abort.cpp
├── test_prompt_serial.cpp
├── aidgen_chat_template.txt
├── chat.txt
├── htp_backend_ext_config.json
├── qwen3-1.7b-htp.json
├── qwen3-1.7b-aidgen-config.json
├── kv-cache.primary.qnn-htp
├── qwen3-1.7b-tokenizer.json
├── qwen3-1.7b_qnn236_qcs8550_cl2048_1_of_3.serialized.bin.aidem
├── qwen3-1.7b_qnn236_qcs8550_cl2048_2_of_3.serialized.bin.aidem
└── qwen3-1.7b_qnn236_qcs8550_cl2048_3_of_3.serialized.bin.aidem
Step 5: Set Conversation Template
💡Note
Please refer to the aidgen_chat_template.txt file in the model resource package for the conversation template.
Modify the test_prompt_serial.cpp file according to the LLM template:
// test_prompt_serial.cpp
// ...
// lines 43-47
std::string prompt_template_type = "qwen3";
if(prompt_template_type == "qwen3"){
prompt_template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{0}/no_think<|im_end|>\n<|im_start|>assistant\n";
}
Step 6: Compilation and Execution
# Install dependencies
sudo apt update
sudo apt install libfmt-dev
# Compile
mkdir build && cd build
cmake .. && make
# Run after successful compilation
# After the config file path, the first `1` enables profiler statistics
# and the second `1` sets the number of inference iterations
mv test_prompt_serial /home/aidlux/aidllm/
cd /home/aidlux/aidllm/
./test_prompt_serial qwen3-1.7b-aidgen-config.json 1 1
- Enter your text in the terminal to start the conversation.
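To see how the template set in Step 5 shapes each request, here is a rough Python equivalent of the substitution performed in test_prompt_serial.cpp; `render_prompt` is illustrative and not part of the SDK:

```python
# Rough Python equivalent of the prompt substitution in
# test_prompt_serial.cpp; render_prompt is illustrative, not SDK API.
TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{0}/no_think<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def render_prompt(user_input):
    # "{0}" is replaced by the user's text; "/no_think" asks Qwen3
    # to skip its thinking phase
    return TEMPLATE.format(user_input)

print(render_prompt("Hello!"))
```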
AidGenSE Case Deployment
Step 1: Install AidGenSE
sudo aid-pkg update
# Ensure that aidgense is the latest version.
sudo aid-pkg remove aidgense
sudo aid-pkg -i aidgense
Step 2: Model Query & Acquisition
# Check available models
aidllm remote-list api | grep qwen3
#------------------------ Qwen3 series models can be seen ------------------------
Current Soc : 8550
Name Url CreateTime
----- --------- ---------
Qwen3-0.6B-8550 aplux/Qwen3-0.6B-8550 2025-09-26 09:54:15
Qwen3-1.7B-8550 aplux/Qwen3-1.7B-8550 2025-09-26 09:54:15
Qwen3-4B-8550 aplux/Qwen3-4B-8550 2025-09-26 09:54:15
Qwen3-8B-8550 aplux/Qwen3-8B-8550 2025-09-26 09:54:15
...
# Download qwen3-1.7B-8550
aidllm pull api aplux/Qwen3-1.7B-8550
Step 3: Start HTTP Service
# Start the OpenAI API service for the corresponding model
aidllm start api -m Qwen3-1.7B-8550
# Check status
aidllm status api
# Stop service: aidllm stop api
# Restart service: aidllm restart api
💡Note
The default port number is 8888.
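A quick way to confirm the service is reachable before chatting is to query its model list. The `/v1/models` route below is assumed from the OpenAI API convention and may differ in AidGenSE; adjust if needed:

```python
# Sketch: probe the local AidGenSE service. The /v1/models endpoint is
# assumed from the OpenAI API convention; adjust if the service differs.
import requests

BASE_URL = "http://127.0.0.1:8888"  # default port per the note above

def models_url(base=BASE_URL):
    return base + "/v1/models"

if __name__ == "__main__":
    try:
        resp = requests.get(models_url(), timeout=5)
        resp.raise_for_status()
        print(resp.json())
    except requests.RequestException as exc:
        print("service not reachable:", exc)
```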
Step 4: Conversation Test
Chat Test via Web UI
# Install UI frontend service
sudo aidllm install ui
# Start UI service
aidllm start ui
# Check UI service status: aidllm status ui
# Stop UI service: aidllm stop ui
After the UI service starts, visit http://<device-ip>:51104 in a browser.
Chat Test via Python
import requests
import json

def stream_chat_completion(messages, model="Qwen3-1.7B-8550"):
    url = "http://127.0.0.1:8888/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": messages,
        "stream": True  # Enable streaming
    }
    # Initiate request with stream=True
    response = requests.post(url, headers=headers, json=payload, stream=True)
    response.raise_for_status()
    # Read line by line and parse the SSE format
    for line in response.iter_lines():
        if not line:
            continue
        line_data = line.decode('utf-8')
        # Each SSE line starts with a "data: " prefix
        if line_data.startswith("data: "):
            data = line_data[len("data: "):]
            # End flag
            if data.strip() == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                print("Unable to parse JSON:", data)
                continue
            # Extract the token output by the model
            content = chunk["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)

if __name__ == "__main__":
    # Example conversation
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ]
    print("Assistant:", end=" ")
    stream_chat_completion(messages)
    print()  # New line
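The script above sends a single turn. For a multi-turn conversation, append each reply to the message history so the model keeps context on the next request; a sketch, where `append_turn` and the reply collection are illustrative helpers rather than SDK API:

```python
# Sketch: maintain multi-turn history for the chat endpoint above.
# append_turn is an illustrative helper, not part of any SDK.
def append_turn(messages, role, content):
    """Return a new history list with one more turn appended."""
    return messages + [{"role": role, "content": content}]

history = [{"role": "system", "content": "You are a helpful assistant."}]
history = append_turn(history, "user", "Hi!")
# After streaming a reply, add it back so the next turn has context:
# history = append_turn(history, "assistant", collected_reply)
print(len(history))  # 2 turns so far
```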