MobileCLIP2-S3 Deployment
Introduction
MobileCLIP2 is an upgraded version of the MobileCLIP efficient image-text pre-training model, specifically designed for mobile devices and low-latency scenarios. With parameter counts of 50–150M and inference latencies of only 3–15 ms, it achieves industry-leading performance on zero-shot tasks. Compared to its predecessor, MobileCLIP2 introduces three major improvements in multi-modal reinforced training: first, it uses a high-quality ensemble of CLIP teachers trained on the DFN dataset to strengthen distillation; second, it improves the captioner teacher and fine-tunes it on multiple high-quality image-text datasets to increase caption diversity and coverage; third, it incorporates synthetic captions from multiple generative models to further improve model robustness.
Experimental results show that MobileCLIP2-B improves ImageNet-1k zero-shot classification accuracy by 2.2% over MobileCLIP-B. MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 while being 2x smaller with lower latency, and runs 2.5x faster than DFN ViT-L/14. The pre-trained models and data generation tools have been open-sourced to facilitate community extension and reproduction.
This chapter demonstrates how to complete the deployment, loading, and inference workflow for MobileCLIP2-S3 on edge devices using the AidLite Python API.
In this case, model inference runs on the device-side NPU computing unit. The code calls the relevant interfaces to receive user input and return results. This example matches a single text query against multiple images, making it well suited to text-to-image search and automated image tagging scenarios.
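As a conceptual sketch (not the AidLite API): CLIP-style models embed the query text and each image into a shared vector space, score each pair by cosine similarity, and pass the scaled scores through a softmax to produce percentages like those printed by the demo below. The embeddings here are made-up toy values for illustration only.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def softmax_percent(scores, scale=100.0):
    # CLIP multiplies cosine scores by a learned logit scale (~100) before softmax
    exps = [math.exp(scale * s) for s in scores]
    total = sum(exps)
    return [100.0 * e / total for e in exps]

# Hypothetical embeddings: one text query vs. three images
text_emb = [0.6, 0.8]
image_embs = {
    "cat_1": [0.59, 0.81],
    "cat_dog_1": [0.4, 0.9],
    "horse": [-0.7, 0.7],
}

scores = {name: cosine(text_emb, emb) for name, emb in image_embs.items()}
probs = dict(zip(scores, softmax_percent(list(scores.values()))))
best = max(probs, key=probs.get)  # the image most similar to the text
```

The real model produces much higher-dimensional embeddings, but the ranking logic is the same.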
- Device: Rhino Pi-X1
- System: Ubuntu 22.04
- Source Model: MobileCLIP2-S3
- Precision: FP16
- Model Farm Reference: MobileCLIP2-S3-FP16
Supported Platforms
| Platform | Execution Method |
|---|---|
| Rhino Pi-X1 | Ubuntu 22.04, AidLux |
Prerequisites
- Rhino Pi-X1 hardware.
- Ubuntu 22.04 system or AidLux system.
Download MobileCLIP2-S3-FP16 Model Resources
mms list MobileClip2
#------------------------ MobileClip2-S3 model available ------------------------
Model Precision Chipset Backend
----- --------- ------- -------
MobileClip2-S3 FP16 Qualcomm QCS8550 QNN2.36
# Download MobileCLIP2-S3-FP16
mms get -m MobileClip2-S3 -p fp16 -c qcs8550 -b qnn2.36 -d /home/aidlux/mobileclip2-s3
cd /home/aidlux/mobileclip2-s3
# Unzip
unzip mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite.zip
💡Note
This model is in the Preview section of the Model Farm and can only be obtained via the mms command on AidLux-supported boards.
AidLite SDK Installation
- Ensure the QNN backend version is ≥ 2.36.
- Ensure the versions of aidlite-sdk and aidlite-qnnxxx are 2.3.x.
# Check AidLite & QNN versions
dpkg -l | grep aidlite
#------------------------ You should see output similar to the following ------------------------
ii aidlite-qnn236 2.3.0.230 arm64 aidlux aidlite qnn236 backend plugin
ii aidlite-sdk 2.3.0.230 arm64 aidlux inference module sdk
Update and install dependencies:
# Install AidLite SDK
sudo aid-pkg update
sudo aid-pkg install aidlite-sdk
sudo aid-pkg install aidlite-qnn236
AidLite Python API Deployment
Scenario 1: Searching for a photo of a cat
# Install preprocessing dependencies
cd /home/aidlux/mobileclip2-s3/model_farm_mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite/python/open_clip
pip install -e .
pip install timm
pip install torch torchvision torchaudio
cd /home/aidlux/mobileclip2-s3/model_farm_mobileclip2_s3_qcs8550_qnn2.36_fp16_aidlite
# --imgs_path: Input image directory
# --text: Search query
# --invoke_nums: Loop count
python3 python/run_test.py --imgs_path python/imgs --text "a photo of cat" --invoke_nums 10
Terminal output (inference time in ms and results):
model load success!
====================================
Text model invoke time:
--mean_invoke_time is 5.9433698654174805
--max_invoke_time is 6.332874298095703
--min_invoke_time is 5.860805511474609
--var_invoketime is 0.020272773895158025
====================================
====================================
Image model invoke time:
--mean_invoke_time is 10.306566953659058
--max_invoke_time is 10.692834854125977
--min_invoke_time is 10.177850723266602
--var_invoketime is 0.031342316297866546
====================================
Input text is : a photo of cat
Image similarity is :
cat_1 87.1933%
cat_2 5.6707%
cat_dog_1 5.9176%
cat_dog_2 1.2054%
chicken 0.0000%
horse 0.0000%
monkey 0.0129%
sport3 0.0000%
The output indicates that cat_1.png has the highest similarity to "a photo of cat".
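The mean/max/min/var figures printed above are ordinary statistics over the per-invoke latencies. A minimal sketch of how such a summary can be computed from a list of timings (the latency values below are made up):

```python
def invoke_stats(times_ms):
    # Summarize per-invoke latencies: mean, max, min, and population variance
    mean = sum(times_ms) / len(times_ms)
    var = sum((t - mean) ** 2 for t in times_ms) / len(times_ms)
    return {"mean": mean, "max": max(times_ms), "min": min(times_ms), "var": var}

# Hypothetical latencies (ms) for 5 invokes
stats = invoke_stats([5.9, 6.1, 5.8, 6.0, 5.9])
```

Averaging over several invokes (here via --invoke_nums) smooths out one-off scheduling jitter, which is why the demo reports all four statistics rather than a single timing.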
Scenario 2: Searching for a photo of two cats
python3 python/run_test.py --imgs_path python/imgs --text "a photo of two cats" --invoke_nums 10
Terminal output:
====================================
Text model invoke time:
--mean_invoke_time is 5.883049964904785
--max_invoke_time is 6.177663803100586
--min_invoke_time is 5.804538726806641
--var_invoketime is 0.01604852343461971
====================================
====================================
Image model invoke time:
--mean_invoke_time is 10.379016399383545
--max_invoke_time is 10.837554931640625
--min_invoke_time is 10.234355926513672
--var_invoketime is 0.03658466241063252
====================================
Input text is : a photo of two cats
Image similarity is :
cat_1 0.0491%
cat_2 90.2158%
cat_dog_1 3.9922%
cat_dog_2 5.7429%
chicken 0.0000%
horse 0.0000%
monkey 0.0000%
sport3 0.0000%
The output indicates that cat_2.png has the highest similarity to "a photo of two cats".
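To build a simple text-to-image search on top of output like the above, one would sort the images by similarity and keep the top match (or top-k). A sketch, using the percentages printed for "a photo of two cats":

```python
def top_k(similarity, k=3):
    # Rank image names by similarity score, highest first
    return sorted(similarity, key=similarity.get, reverse=True)[:k]

# Similarity percentages as printed by the demo for "a photo of two cats"
sims = {
    "cat_1": 0.0491, "cat_2": 90.2158,
    "cat_dog_1": 3.9922, "cat_dog_2": 5.7429,
    "chicken": 0.0, "horse": 0.0,
    "monkey": 0.0, "sport3": 0.0,
}
best_three = top_k(sims)  # best match first
```

For automated tagging, the same ranking can be run per image across a list of candidate captions instead of per caption across images.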