Deploy LLM HTTP Server with AidGenSE

Introduction

Deploying a Large Language Model (LLM) on an edge device means compressing, quantizing, and running a model that would otherwise live in the cloud directly on local hardware, enabling offline, low-latency natural language understanding and generation. This chapter demonstrates how to deploy an LLM HTTP service (compatible with the OpenAI API) on an edge device using the AidGenSE inference engine.

In this example, LLM inference runs entirely on the device; clients call its HTTP API to send user input and receive dialogue results in real time.

  • Device: Rhino Pi-X1
  • System: Ubuntu 22.04
  • Model: Qwen2.5-0.5B-Instruct

Supported Platforms

Platform        Running Method
Rhino Pi-X1     Ubuntu 22.04, AidLux

Preparation

  1. Rhino Pi-X1 hardware
  2. Ubuntu 22.04 system or AidLux system

Case Deployment

Step 1: Install AidGenSE

bash
# Update the aid-pkg package index
sudo aid-pkg update

# Install the AidGenSE package
sudo aid-pkg -i aidgense

Step 2: Model Query & Acquisition

  • View supported models
bash
# View supported models
aidllm remote-list api

#------------------------ Example output ------------------------

Current Soc : 8550

Name                                 Url                                          CreateTime
-----                                ---------                                    ---------
qwen2.5-0.5B-Instruct-8550           aplux/qwen2.5-0.5B-Instruct-8550             2025-03-05 14:52:23
qwen2.5-3B-Instruct-8550             aplux/qwen2.5-3B-Instruct-8550               2025-03-05 14:52:37
...
  • Download Qwen2.5-0.5B-Instruct
bash
# Download the model
aidllm pull api aplux/qwen2.5-0.5B-Instruct-8550

# View downloaded models
aidllm list api

Step 3: Start the HTTP Service

bash
# Start the OpenAI-compatible API service for the specified model
aidllm start api -m qwen2.5-0.5B-Instruct-8550

# Check status
aidllm status api

# Stop service: aidllm stop api

# Restart service: aidllm restart api

💡Note

The default port number is 8888.
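
Before running the dialogue test in the next step, you can confirm the service is reachable with a quick Python check. This is a minimal sketch that assumes the server exposes the standard OpenAI-compatible /v1/models route on the default port 8888; adjust the host or port if you changed them.

python
import requests

# Assumption: the AidGenSE service follows the OpenAI convention and serves
# GET /v1/models on the default port 8888.
resp = requests.get("http://127.0.0.1:8888/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))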

Step 4: Dialogue Test

Dialogue Test Using Web UI

bash
# Install the UI front-end service
sudo aidllm install ui

# Start the UI service
aidllm start ui

# Check UI service status: aidllm status ui

# Stop UI service: aidllm stop ui

After the UI service starts, open http://<device-ip>:51104 in a browser.

Dialogue Test Using Python

python
import json

import requests

def stream_chat_completion(messages, model="qwen2.5-0.5B-Instruct-8550"):
    """Stream a chat completion from the local OpenAI-compatible service."""
    url = "http://127.0.0.1:8888/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": messages,
        "stream": True    # Enable streaming
    }

    # Send request with stream=True
    response = requests.post(url, headers=headers, json=payload, stream=True)
    response.raise_for_status()

    # Read and parse SSE format line by line
    for line in response.iter_lines():
        if not line:
            continue
        # print(line)  # uncomment to inspect the raw SSE stream
        line_data = line.decode('utf-8')
        # Each line of SSE starts with the "data: " prefix
        if line_data.startswith("data: "):
            data = line_data[len("data: "):]
            # End flag
            if data.strip() == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                # Print and skip when parsing fails
                print("Failed to parse JSON:", data)
                continue

            # Extract the token output by the model
            content = chunk["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)

if __name__ == "__main__":
    # Example dialogue
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello."}
    ]
    print("Assistant:", end=" ")
    stream_chat_completion(messages)
    print()  # New line
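
Dialogue Test Using the OpenAI Python SDK

Because the service exposes an OpenAI-compatible API, you can also drive it with the official openai Python SDK instead of hand-rolled requests. The snippet below is a minimal streaming sketch under that assumption; the api_key value is a placeholder, since a local deployment typically does not check it.

python
from openai import OpenAI

# Point the official SDK at the local AidGenSE service.
# The api_key is a placeholder; a local deployment normally ignores it.
client = OpenAI(base_url="http://127.0.0.1:8888/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="qwen2.5-0.5B-Instruct-8550",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello."},
    ],
    stream=True,
)

print("Assistant:", end=" ")
for chunk in stream:
    # Skip keep-alive chunks that carry no choices
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()  # New line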