---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---
<p align="center">
<a href="https://arxiv.org/abs/2512.02924"><img src="https://img.shields.io/badge/📄%20arXiv-2512.02924-b31b1b?style=for-the-badge" alt="arXiv"></a>
<a href="https://discord.com/invite/nexa-ai"><img src="https://img.shields.io/badge/💬%20Discord-Nexa%20AI-5865F2?style=for-the-badge" alt="Discord"></a>
<a href="https://x.com/nexa_ai"><img src="https://img.shields.io/badge/𝕏%20Twitter-nexa__ai-000000?style=for-the-badge" alt="Twitter"></a>
</p>
<p align="center">
<a href="https://github.com/NexaAI/nexa-sdk/edit/main/solutions/autoneural/README.md"><b>🌟 Github</b></a> |
<a href="https://nexa.ai/solution/intelligent-cockpit"><b>📄 Webpage</b></a>
</p>
# AutoNeural-VL-1.5B
## **Introduction**
**AutoNeural** is an NPU-native vision–language model for in-car assistants, co-designed with a MobileNetV5 encoder and a hybrid Liquid AI 1.2B backbone to deliver **real-time multimodal understanding on the Qualcomm SA8295P NPU**. It processes 768×768 images, cuts end-to-end latency by up to **14×**, and reduces quantization error by **7×** compared with ViT–Transformer baselines on the same hardware.
Key Features:
- **NPU-native co-design** – MobileNet-based vision encoder + hybrid Transformer–SSM backbone, built for INT4/8/16 and NPU operator sets.
- **Real-time cockpit performance** – Up to **14× lower TTFT**, ~3× faster decode, and 4× longer context (4096 vs 1024) on Qualcomm SA8295P NPU.
- **High-resolution multimodal perception** – Supports **768×768** images with ~45 dB SQNR under mixed-precision quantization (W8A16 vision, W4A16 language).
- **Automotive-tuned dataset** – Trained with **200k** proprietary cockpit samples (AI Sentinel, Greeter, Car Finder, Safety) plus large-scale Infinity-MM instruction data.
- **Production-focused** – Designed for always-on, low-power, privacy-preserving deployment in real vehicles.
## Use Cases
AutoNeural powers real-time cockpit intelligence including **in-cabin detection**, **out-cabin awareness**, **HMI understanding**, and a combined **visual + conversational agent**.
<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/a-Rd-eFETHPgf82wOPr4S.png" alt="Use Case" style="width:700px;"/>
---
## ⚡ **Benchmarks**
<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>
| Metric | InternVL 2B (baseline) | AutoNeural-VL |
| :--------------------- | :--------------------: | :-----------: |
| TTFT (one 512×512 image) | ~1.4 s | **~100 ms** |
| Max image size | 448×448 | **768×768** |
| SQNR | 28 dB | **45 dB** |
| RMS quantization error | 3.98% | **0.562%** |
| Decode throughput | ~15 tok/s | **~44 tok/s** |
| Context length | 1024 | **4096** |
> 📝 These numbers are measured on-device with mixed precision (vision: W8A16; language: W4A16), not in simulation.
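For readers reproducing the quality metrics, here is a minimal NumPy sketch of the standard definitions of SQNR and relative RMS error; the exact measurement harness used for the table above may normalize differently.

```python
import numpy as np

def sqnr_db(reference: np.ndarray, quantized: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in dB: signal power / error power."""
    noise = reference - quantized
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def rms_error_pct(reference: np.ndarray, quantized: np.ndarray) -> float:
    """RMS quantization error as a percentage of the reference RMS."""
    err = np.sqrt(np.mean((reference - quantized) ** 2))
    return 100.0 * err / np.sqrt(np.mean(reference ** 2))

# Toy example: symmetric per-tensor INT8 quantization of random activations.
x = np.random.randn(4096).astype(np.float32)
scale = np.abs(x).max() / 127.0
x_q = (np.round(x / scale).clip(-127, 127) * scale).astype(np.float32)
print(f"SQNR = {sqnr_db(x, x_q):.1f} dB, RMS error = {rms_error_pct(x, x_q):.3f}%")
```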
---
## **How to Use**
> ⚠️ **Hardware requirement:** AutoNeural is currently available only for **Qualcomm NPUs**.
### 1) Install Nexa-SDK
Download the SDK and follow the installation steps provided on the model page.
### 2) Configure authentication
Create an access token in the Model Hub, then run:
```bash
nexa config set license '<access_token>'
```
### 3) Run the model
```bash
nexa infer NexaAI/AutoNeural
```
### 4) Image input
Drag and drop one or more image files into the terminal window.
Multiple images can be processed with a single query.
---
## Model architecture
<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>
AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices (e.g. Qualcomm SA8295P).
- **Vision encoder.** A MobileNetV5-style CNN initialized from Gemma 3n-E4B, taking 768×768 images and producing a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/16 quantization.
- **Vision–language connector.** A lightweight 2-layer MLP projects visual tokens into the language embedding space. We deliberately remove normalization from the projector to make activation ranges easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM (“Liquid AI”)** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
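Below is a minimal PyTorch sketch of how the connector and hybrid backbone described above fit together. It is illustrative only: the hidden sizes, block internals, and attention/SSM placement pattern are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the released checkpoint defines its own widths.
VISION_DIM, TEXT_DIM, NUM_VISUAL_TOKENS = 2048, 1536, 256

class ProjectorMLP(nn.Module):
    """2-layer MLP connector. Normalization is deliberately omitted so that
    activation ranges stay easy to calibrate for static NPU quantization."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, text_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(text_dim, text_dim)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, 256, vision_dim) visual tokens from the MSFA
        return self.fc2(self.act(self.fc1(v)))

class GatedConvSSMBlock(nn.Module):
    """Stand-in for a gated-convolution SSM layer: linear-time, with a small
    fixed-size state instead of a growing KV cache."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal depthwise convolution over the sequence dimension.
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + h * torch.sigmoid(self.gate(x))

class AttentionBlock(nn.Module):
    """Standard self-attention block, kept for in-context reasoning."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.attn(x, x, x, need_weights=False)[0]

class HybridBackbone(nn.Module):
    """16 layers total: 10 SSM-style blocks interleaved with 6 attention
    blocks. The exact placement below is an assumption for illustration."""
    def __init__(self, dim: int, n_layers: int = 16):
        super().__init__()
        attn_positions = {2, 5, 8, 11, 13, 15}  # 6 of 16 positions
        self.layers = nn.ModuleList(
            [AttentionBlock(dim) if i in attn_positions else GatedConvSSMBlock(dim)
             for i in range(n_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

# Wiring: project 256 visual tokens into the text space, prepend them to the
# text embeddings, and run the hybrid stack.
projector = ProjectorMLP(VISION_DIM, TEXT_DIM)
backbone = HybridBackbone(TEXT_DIM)
visual = torch.randn(1, NUM_VISUAL_TOKENS, VISION_DIM)  # MSFA output (dummy)
text = torch.randn(1, 32, TEXT_DIM)                      # text embeddings (dummy)
out = backbone(torch.cat([projector(visual), text], dim=1))
print(out.shape)  # torch.Size([1, 288, 1536])
```

The point of the interleaving is that most layers carry a small recurrent/convolutional state rather than a growing KV cache, which is what cuts memory I/O at longer context lengths while the remaining attention layers preserve in-context reasoning.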
---
## Training
<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>
AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset.
1. **Image–text alignment.** Freeze vision and language backbones; train only the projector on image–caption pairs to learn basic visual grounding.
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) using a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, and safety when entering/exiting the vehicle) plus high-quality synthetic data, with an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
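As a hedged sketch of the stage-wise freezing schedule implied by stages 1 and 2 (the submodule names `vision_encoder`, `projector`, and `language_model` are placeholders, not the actual training code):

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parameters per curriculum stage (illustrative only).
    Assumes the model exposes vision_encoder / projector / language_model."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage 1: image-text alignment -- only the projector learns.
        trainable = [model.projector]
    else:
        # Stages 2-4: full-model training (general VQA, instruction tuning,
        # automotive finetuning with the NPU-aware / QAT recipe).
        trainable = [model.vision_encoder, model.projector, model.language_model]

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
```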
---
## **License**
This model is licensed under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.
All NPU-related models, runtimes, and code in this project are protected under this non-commercial license and cannot be used in any commercial or revenue-generating applications.
## **Enterprise Deployment**
For enterprise deployment, custom integrations, or licensing inquiries:
📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**