MobileCLIP2-S3 Deployment

Introduction

MobileCLIP2 is an upgraded version of MobileCLIP, an efficient image-text pre-training model designed for mobile devices and low-latency scenarios. With 50–150M parameters and an inference latency of only 3–15 ms, it achieves industry-leading performance on zero-shot tasks. Compared to its predecessor, MobileCLIP2 introduces three major improvements to multi-modal reinforced training: first, it distills from a higher-quality ensemble of CLIP teachers trained on the DFN dataset; second, it improves the captioner teacher and fine-tunes it on multiple high-quality image-text datasets to increase caption diversity and coverage; third, it incorporates synthetic captions from multiple generative models to further improve robustness.

Experimental results show that MobileCLIP2-B improves ImageNet-1k zero-shot classification accuracy by 2.2% over MobileCLIP-B. MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 while being 2x smaller, and runs at 2.5x lower latency than DFN ViT-L/14. We have open-sourced the pre-trained models and data generation tools to facilitate community extension and reproduction.

This chapter demonstrates how to complete the deployment, loading, and inference workflow for MobileCLIP2-S3 on edge devices using the AidLite Python API.

In this case, model inference runs on the device-side NPU computing unit. The code calls the relevant interfaces to receive user input and return results. This example illustrates how a text query is matched against multiple images, making it well suited to text-to-image search or automated image-tagging scenarios.
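The percentage scores shown later in this guide come from a standard CLIP-style scoring step: each text and image is embedded, the embeddings are L2-normalized, cosine similarities are scaled, and a softmax converts them into percentages. The sketch below illustrates only that scoring step with toy embeddings; it is not the actual `run_test.py` implementation, and the scale factor of 100.0 stands in for CLIP's learned logit scale.

```python
# Sketch of the scoring step behind CLIP-style text-to-image matching.
# The embeddings below are toy values, not real MobileCLIP2 outputs.
import math

def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def match_scores(text_emb, image_embs, scale=100.0):
    """Cosine similarity of one text embedding against each image
    embedding, converted to percentages with a softmax.
    `scale` is illustrative; CLIP models learn this logit scale."""
    t = normalize(text_emb)
    logits = []
    for img in image_embs:
        i = normalize(img)
        logits.append(scale * sum(a * b for a, b in zip(t, i)))
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total * 100 for e in exps]

# Toy example: the first "image" points almost the same way as the text query,
# so it receives nearly all of the probability mass.
scores = match_scores([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
```

This mirrors the shape of the similarity tables printed by the demo script: one text query scored against every image in a directory, with the percentages summing to 100.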

Supported Platforms

Platform       Execution Method
--------       ----------------
Rhino Pi-X1    Ubuntu 22.04, AidLux

Prerequisites

  1. Rhino Pi-X1 hardware.
  2. Ubuntu 22.04 system or AidLux system.

Download MobileCLIP2-S3-FP16 Model Resources

bash
mms list MobileClip2

#------------------------ MobileClip2-S3 model available ------------------------
Model           Precision  Chipset          Backend
-----           ---------  -------          -------
MobileClip2-S3  FP16       Qualcomm QCS8550  QNN2.36

# Download MobileCLIP2-S3-FP16
mms get -m MobileClip2-S3 -p fp16 -c qcs8550 -b qnn2.36 -d /home/aidlux/mobileclip2-s3
cd /home/aidlux/mobileclip2-s3
# Unzip
unzip mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite.zip

💡Note

This model is in the Preview section of the Model Farm and can only be obtained via the mms command on AidLux-supported boards.

AidLite SDK Installation

  • Ensure the QNN backend version is ≥ 2.36.
  • Ensure the versions of aidlite-sdk and aidlite-qnnxxx are 2.3.x.
bash
# Check AidLite & QNN versions
dpkg -l | grep aidlite
#------------------------ You should see output similar to the following ------------------------
ii  aidlite-qnn236       2.3.0.230         arm64        aidlux aidlite qnn236 backend plugin
ii  aidlite-sdk           2.3.0.230         arm64        aidlux inference module sdk

Update and install dependencies:

bash
# Install AidLite SDK
sudo aid-pkg update
sudo aid-pkg install aidlite-sdk
sudo aid-pkg install aidlite-qnn236

AidLite Python API Deployment

Scenario 1: Searching for a photo of a cat

bash
# Install preprocessing dependencies
cd /home/aidlux/mobileclip2-s3/model_farm_mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite/python/open_clip
pip install -e .
pip install timm
pip install torch torchvision torchaudio

cd /home/aidlux/mobileclip2-s3/model_farm_mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite

# --imgs_path: Input image directory
# --text: Search query
# --invoke_nums: Loop count
python3 python/run_test.py --imgs_path python/imgs --text "a photo of cat" --invoke_nums 10
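
The three flags above are the script's documented interface. As a hypothetical sketch only (the real `run_test.py` internals may differ), they could be parsed with `argparse` like this:

```python
# Hypothetical sketch of how run_test.py's documented flags might be
# parsed with argparse; not the actual script source.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="MobileCLIP2-S3 text-to-image demo")
    p.add_argument("--imgs_path", required=True, help="input image directory")
    p.add_argument("--text", required=True, help="search query text")
    p.add_argument("--invoke_nums", type=int, default=10,
                   help="number of timed inference loops")
    return p

# Parse the same arguments used in the command above.
args = build_parser().parse_args(
    ["--imgs_path", "python/imgs", "--text", "a photo of cat",
     "--invoke_nums", "10"]
)
```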

Terminal output (inference time in ms and results):

plain
 model load success!
====================================
Text model invoke time:
 --mean_invoke_time is 5.9433698654174805 
 --max_invoke_time is 6.332874298095703 
 --min_invoke_time is 5.860805511474609 
 --var_invoketime is 0.020272773895158025
====================================
====================================
Image model invoke time:
 --mean_invoke_time is 10.306566953659058 
 --max_invoke_time is 10.692834854125977 
 --min_invoke_time is 10.177850723266602 
 --var_invoketime is 0.031342316297866546
====================================
Input text is : a photo of cat
Image similarity is :
cat_1                     87.1933%
cat_2                     5.6707%
cat_dog_1                 5.9176%
cat_dog_2                 1.2054%
chicken                   0.0000%
horse                     0.0000%
monkey                    0.0129%
sport3                    0.0000%

The output indicates that cat_1.png has the highest similarity to the query "a photo of cat".
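The timing block above reports the mean, max, min, and variance of the per-run invoke time over `--invoke_nums` loops. A minimal sketch of how such statistics can be computed from a list of measured latencies (made-up milliseconds here; the script's exact variance convention is not shown, so population variance is assumed):

```python
# Sketch: latency statistics over repeated timed inference runs.
# invoke_times_ms holds made-up per-run latencies in milliseconds.
import statistics

invoke_times_ms = [5.94, 6.33, 5.86, 5.90, 5.88, 5.95, 5.92, 5.89, 5.91, 5.93]

mean_invoke_time = statistics.mean(invoke_times_ms)
max_invoke_time = max(invoke_times_ms)
min_invoke_time = min(invoke_times_ms)
# Population variance; the demo script might use sample variance instead.
var_invoke_time = statistics.pvariance(invoke_times_ms)
```

Running the real benchmark multiple times and comparing these statistics is a quick way to confirm the NPU latency is stable (a small variance, as in the output above).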


Scenario 2: Searching for a photo of two cats

bash
python3 python/run_test.py --imgs_path python/imgs --text "a photo of two cats" --invoke_nums 10

Terminal output:

plain
====================================
Text model invoke time:
 --mean_invoke_time is 5.883049964904785 
 --max_invoke_time is 6.177663803100586 
 --min_invoke_time is 5.804538726806641 
 --var_invoketime is 0.01604852343461971
====================================
====================================
Image model invoke time:
 --mean_invoke_time is 10.379016399383545 
 --max_invoke_time is 10.837554931640625 
 --min_invoke_time is 10.234355926513672 
 --var_invoketime is 0.03658466241063252
====================================
Input text is : a photo of two cats
Image similarity is :
cat_1                     0.0491%
cat_2                     90.2158%
cat_dog_1                 3.9922%
cat_dog_2                 5.7429%
chicken                   0.0000%
horse                     0.0000%
monkey                    0.0000%
sport3                    0.0000%

The output indicates that cat_2.png has the highest similarity to the query "a photo of two cats".