Deploy LLM HTTP Server with AidGenSE

Introduction

Deploying a Large Language Model (LLM) on an edge device means compressing and quantizing a model that originally ran in the cloud so that it can run on local hardware, which enables offline, low-latency natural language understanding and generation. This chapter demonstrates how to deploy an OpenAI API-compatible LLM HTTP service on an edge device using the AidGenSE inference engine.

In this example, LLM inference runs on the device itself; the HTTP API accepts user input and streams the conversation results back. The example environment is:

  • Device: IQ9075
  • System: Ubuntu 24.04
  • Model: Qwen2.5-0.5B-Instruct

Supported Platforms

Platform    Running Mode
IQ9075      Ubuntu 24.04

Prerequisites

  1. IQ9075 hardware
  2. Ubuntu 24.04 system

System Dependency Configuration

Configure AidLux Dependency Sources

bash
# Download the AidLux repository public key and store it as an apt keyring
sudo wget -O- https://archive.aidlux.com/ubuntu24/public.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/private-aidlux.gpg > /dev/null

# Open the source list file for editing
sudo vim /etc/apt/sources.list.d/private-aidlux.list

# Add the following line to that file (the private repository provided by AidLux).
# It is file content, not a shell command:
deb [arch=arm64 signed-by=/etc/apt/trusted.gpg.d/private-aidlux.gpg] https://archive.aidlux.com/ubuntu24 noble main

# Update cache
sudo apt update

After the update is complete, you can query the official AidLux SDK packages with the following command:

bash
sudo apt list | grep aid | grep unknown

bash
# Install required packages that are not included by default
sudo apt install python3 python3-pip libopencv-dev python3-opencv net-tools

# Must be installed before aidlite
sudo apt install aidlux-aistack-base aidrtcm

# Install aidlite and dependencies
sudo apt install aid-lms aidlms-sdk aidlite-sdk cmake
sudo apt-get install libfmt-dev nlohmann-json3-dev
# Quote the pattern so the shell passes it to apt instead of expanding it against local files
sudo apt install 'aidlite-*'

# DSP support
sudo apt-get install qcom-fastrpc1
sudo apt-get install qcom-fastrpc-dev

# Install aidgen-sdk
sudo apt install aidgen-qnn240-sdk

# Install mms service
sudo apt install aid-mms

# GPU support
sudo apt-add-repository -s ppa:ubuntu-qcom-iot/qcom-ppa
sudo apt install qcom-adreno-cl1
sudo ln -s /usr/lib/aarch64-linux-gnu/libOpenCL.so.1 /usr/lib/aarch64-linux-gnu/libOpenCL.so

After installation, verify that the aidlite and aidgen directories have been added under /usr/local/share.

[Figure: AidLite and AidGen directories in /usr/local/share]
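
The directory check described above can also be scripted. The following is a minimal sketch, assuming the directories are named exactly `aidlite` and `aidgen` (lowercase, as written in the text); the helper name `missing_sdk_dirs` is hypothetical and not part of any AidLux tool.

```python
from pathlib import Path

# Assumption: directory names are taken literally from the verification step.
EXPECTED_DIRS = ("aidlite", "aidgen")


def missing_sdk_dirs(base="/usr/local/share", names=EXPECTED_DIRS):
    """Return the expected directory names that do not exist under `base`."""
    base_path = Path(base)
    return [name for name in names if not (base_path / name).is_dir()]
```

Calling `missing_sdk_dirs()` on the device returns an empty list when both directories are present; any names it returns point to a package that may not have installed correctly.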

Device Authorization

Obtain the Device SN

bash
cat /sys/devices/soc0/serial_number
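
If the SN needs to be collected from a script (for example, when preparing license requests for several devices), the same file can be read from Python. This is a minimal sketch; the function name `read_device_sn` is hypothetical, and only the sysfs path comes from the command above.

```python
from pathlib import Path


def read_device_sn(path="/sys/devices/soc0/serial_number"):
    """Return the serial number with surrounding whitespace removed,
    or None if the file cannot be read (for example, on another platform)."""
    try:
        return Path(path).read_text(encoding="ascii", errors="replace").strip()
    except OSError:
        return None
```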

Obtain the License File

Provide the SN to AidLux technical support so they can generate a device-specific license file, then place it under /etc/opt/aidlux/license/AidLuxLics.

Activate the License

bash
sudo /opt/aidlux/cpf/aid-lms/manager.sh restart

Deployment

Step 1: Install AidGenSE

bash
# Configure the Python virtual environment
sudo apt install -y python3-pip python3-venv > /dev/null 2>&1
sudo python3 -m venv /opt/aidlux/aid-python3

# Create the aid-python3 command
echo '#!/bin/bash
exec /opt/aidlux/aid-python3/bin/python3 "$@"' | sudo tee /usr/bin/aid-python3 > /dev/null
sudo chmod +x /usr/bin/aid-python3

# Create the aid-pip3 command
echo '#!/bin/bash
exec /opt/aidlux/aid-python3/bin/python3 -m pip "$@"' | sudo tee /usr/bin/aid-pip3 > /dev/null
sudo chmod +x /usr/bin/aid-pip3

# Install aidgense
sudo apt install aidgense
sudo aidllm system --sys linux --soc 8550
sudo apt install aid-pkg
sudo aidllm install ui

Step 2: Query and Download the Model

  • Check supported models
bash
# View supported models
aidllm remote-list api

#------------------------ Example output ------------------------

Current Soc : 8550

Name                                 Url                                          CreateTime
-----                                ---------                                    ---------
qwen2.5-0.5B-Instruct-8550           aplux/qwen2.5-0.5B-Instruct-8550             2025-03-05 14:52:23
qwen2.5-3B-Instruct-8550             aplux/qwen2.5-3B-Instruct-8550               2025-03-05 14:52:37
...
  • Download Qwen2.5-0.5B-Instruct
bash
# Download the model
aidllm pull api aplux/qwen2.5-0.5B-Instruct-8550

# View downloaded models
aidllm list api
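
To choose a model programmatically, the table printed by `aidllm remote-list api` can be parsed into records. The sketch below assumes the output always has the three-column layout shown in the example output above (name, URL, and a `YYYY-MM-DD HH:MM:SS` timestamp); the actual format may differ between AidGenSE versions. The function name `parse_model_list` is hypothetical.

```python
import re

# One table row: name, URL, then a timestamp. Header, separator, and
# "Current Soc" lines do not end in a timestamp and are skipped.
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")

# Copied from the example output in this section.
SAMPLE_OUTPUT = """\
Current Soc : 8550

Name                                 Url                                          CreateTime
-----                                ---------                                    ---------
qwen2.5-0.5B-Instruct-8550           aplux/qwen2.5-0.5B-Instruct-8550             2025-03-05 14:52:23
qwen2.5-3B-Instruct-8550             aplux/qwen2.5-3B-Instruct-8550               2025-03-05 14:52:37
"""


def parse_model_list(text):
    """Return a list of {"name", "url", "created"} dicts, one per table row."""
    models = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, url, created = match.groups()
            models.append({"name": name, "url": url, "created": created})
    return models
```

On the device, the text could be obtained with `subprocess.run(["aidllm", "remote-list", "api"], capture_output=True, text=True).stdout`, and the `url` field of a record is the value passed to `aidllm pull api`.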

Step 3: Start the HTTP Service

bash
# Start the OpenAI-compatible API service for the model
aidllm start api -m qwen2.5-0.5B-Instruct-8550

# Check status
aidllm status api

# Stop service: aidllm stop api

# Restart service: aidllm restart api

💡Note

The API service listens on port 8888 by default.
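
Before sending requests, it can help to confirm that something is accepting connections on that port. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` and the 2-second timeout are assumptions for illustration. A successful TCP connection only shows that the port is open, not that the model has finished loading.

```python
import socket


def port_is_open(host="127.0.0.1", port=8888, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

If `port_is_open()` returns False after `aidllm start api`, checking `aidllm status api` is a reasonable next step.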

Step 4: Test the Conversation

Test with the Web UI

bash
# Install the UI frontend service
sudo aidllm install ui

# Start the UI service
aidllm start ui

# Check UI service status: aidllm status ui

# Stop UI service: aidllm stop ui

After the UI service starts, open http://<device-ip>:51104 in your browser, replacing <device-ip> with the IP address of the device.

Test with Python

python
import json
import requests


def stream_chat_completion(messages, model="qwen2.5-0.5B-Instruct-8550"):
    url = "http://127.0.0.1:8888/v1/chat/completions"
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": model,
        "messages": messages,
        "stream": True
    }

    # Timeouts are assumed values: 5 s to connect, 300 s between streamed chunks.
    response = requests.post(url, headers=headers, json=payload, stream=True, timeout=(5, 300))
    response.raise_for_status()

    for line in response.iter_lines():
        if not line:
            continue

        line_data = line.decode("utf-8")
        if line_data.startswith("data: "):
            data = line_data[len("data: "):]
            if data.strip() == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                print("Failed to parse JSON:", data)
                continue

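            # Skip chunks that carry no choices or no text (for example, a role-only chunk).
            choices = chunk.get("choices") or []
            content = choices[0].get("delta", {}).get("content") if choices else None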
            if content:
                print(content, end="", flush=True)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello."}
    ]
    print("Assistant:", end=" ")
    stream_chat_completion(messages)
    print()
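
The parsing logic in the script above can be separated from the network call so that it can be checked without a running service. The sketch below assumes the service emits OpenAI-style server-sent event lines (`data: {...}` followed by `data: [DONE]`), as the script above already does; the function name `collect_reply` is hypothetical, and the sample lines are made up for illustration rather than captured from a device.

```python
import json


def collect_reply(lines):
    """Join the text fragments carried by an iterable of SSE lines (bytes or str)."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data: "):
            continue  # blank keep-alive lines and anything that is not a data line
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore malformed chunks
        choices = chunk.get("choices") or []
        if choices:
            content = choices[0].get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Illustrative input only; not real service output.
SAMPLE_LINES = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
```

Because the function accepts any iterable of lines, it could be passed `response.iter_lines()` from the script above to obtain the whole reply as a single string instead of printing it piece by piece.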