# Deploy LLM HTTP Server with AidGenSE

## Introduction
Deploying a Large Language Model (LLM) on an edge device means compressing and quantizing a model that would normally run in the cloud so that it runs locally, enabling offline, low-latency natural language understanding and generation. This chapter demonstrates how to deploy an LLM HTTP service (compatible with the OpenAI API) on an edge device using the AidGenSE inference engine.
In this setup, LLM inference runs entirely on the device; clients send user input to the service over an HTTP API and receive dialogue results in real time.
- Device: Rhino Pi-X1
- System: Ubuntu 22.04
- Model: Qwen2.5-0.5B-Instruct
## Supported Platforms
| Platform | Supported Systems |
|---|---|
| Rhino Pi-X1 | Ubuntu 22.04, AidLux |
## Preparation
- Rhino Pi-X1 hardware
- Ubuntu 22.04 system or AidLux system
## Case Deployment

### Step 1: Install AidGenSE
```bash
# Update the package index
sudo aid-pkg update
# Install AidGenSE
sudo aid-pkg -i aidgense
```

### Step 2: Model Query & Acquisition
- View supported models
```bash
# View supported models
aidllm remote-list api

#------------------------ Example output is as follows ------------------------
Current Soc : 8550
Name                          Url                                 CreateTime
-----                         ---------                           ---------
qwen2.5-0.5B-Instruct-8550    aplux/qwen2.5-0.5B-Instruct-8550    2025-03-05 14:52:23
qwen2.5-3B-Instruct-8550      aplux/qwen2.5-3B-Instruct-8550      2025-03-05 14:52:37
...
```

- Download Qwen2.5-0.5B-Instruct
```bash
# Download the model
aidllm pull api aplux/qwen2.5-0.5B-Instruct-8550

# View downloaded models
aidllm list api
```

### Step 3: Start the HTTP Service
```bash
# Start the OpenAI-compatible API service for the chosen model
aidllm start api -m qwen2.5-0.5B-Instruct-8550

# Check service status
aidllm status api

# Stop service:    aidllm stop api
# Restart service: aidllm restart api
```

💡 Note: The default port number is 8888.
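Before writing a client, you can sanity-check the endpoint with `curl`. A minimal sketch, assuming the service is running locally on the default port 8888 and, being OpenAI-compatible, also accepts non-streaming requests (the OpenAI default when `stream` is omitted):

```bash
# One-shot (non-streaming) request to the chat completions endpoint
curl -s http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-0.5B-Instruct-8550",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello."}
        ]
      }'
```

If the service is up, the reply is a single JSON object whose `choices[0].message.content` field contains the model's answer.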
### Step 4: Dialogue Test

#### Dialogue Test Using Web UI
```bash
# Install the UI front-end service
sudo aidllm install ui

# Start the UI service
aidllm start ui

# Check UI service status: aidllm status ui
# Stop UI service:         aidllm stop ui
```

After starting the UI service, open `http://<device-ip>:51104` in a browser.
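To confirm the UI is reachable before opening a browser, standard shell tools suffice. This quick check is not part of the aidllm tooling; the port 51104 comes from the URL above:

```bash
# Print the device's IP addresses (pick one reachable from your browser)
hostname -I

# Confirm a process is listening on the UI port
ss -tln | grep 51104

# Fetch the response headers to verify the UI answers HTTP requests
curl -sI http://127.0.0.1:51104
```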
#### Dialogue Test Using Python
```python
import json

import requests


def stream_chat_completion(messages, model="qwen2.5-0.5B-Instruct-8550"):
    url = "http://127.0.0.1:8888/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": messages,
        "stream": True  # Enable streaming
    }

    # Send the request with stream=True so the response body is read incrementally
    response = requests.post(url, headers=headers, json=payload, stream=True)
    response.raise_for_status()

    # Read and parse the SSE-formatted response line by line
    for line in response.iter_lines():
        if not line:
            continue
        line_data = line.decode("utf-8")
        # Each SSE data line starts with the "data: " prefix
        if line_data.startswith("data: "):
            data = line_data[len("data: "):]
            # End-of-stream marker
            if data.strip() == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                # Print and skip lines that fail to parse
                print("Failed to parse JSON:", data)
                continue
            # Extract the token emitted by the model
            content = chunk["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)


if __name__ == "__main__":
    # Example dialogue
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello."}
    ]
    print("Assistant:", end=" ")
    stream_chat_completion(messages)
    print()  # Trailing newline
```
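To inspect the raw SSE stream that this script parses, you can call the endpoint directly with `curl`. A sketch under the same endpoint and model assumptions as the script above; `-N` disables output buffering so chunks print as they arrive:

```bash
curl -sN http://127.0.0.1:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-0.5B-Instruct-8550",
        "messages": [{"role": "user", "content": "Hello."}],
        "stream": true
      }'
```

Each chunk arrives as a `data: {...}` line carrying a `choices[0].delta` payload, and the stream ends with `data: [DONE]`, which is exactly what the Python loop above matches.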