---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---
<p align="center">
  <a href="https://arxiv.org/abs/2512.02924"><img src="https://img.shields.io/badge/📄%20arXiv-2512.02924-b31b1b?style=for-the-badge" alt="arXiv"></a>
  <a href="https://discord.com/invite/nexa-ai"><img src="https://img.shields.io/badge/💬%20Discord-Nexa%20AI-5865F2?style=for-the-badge" alt="Discord"></a>
  <a href="https://x.com/nexa_ai"><img src="https://img.shields.io/badge/𝕏%20Twitter-nexa__ai-000000?style=for-the-badge" alt="Twitter"></a>
</p>

<p align="center">
  <a href="https://github.com/NexaAI/nexa-sdk/edit/main/solutions/autoneural/README.md"><b>🌟 Github</b></a> |
  <a href="https://nexa.ai/solution/intelligent-cockpit"><b>📄 Webpage</b></a> 
</p>

# AutoNeural-VL-1.5B

## **Introduction**

**AutoNeural** is an NPU-native vision–language model for in-car assistants, co-designed with a MobileNetV5 encoder and a hybrid Liquid AI 1.2B backbone to deliver **real-time multimodal understanding on the Qualcomm SA8295P NPU**. It processes 768×768 images, cuts end-to-end latency by up to **14×**, and reduces quantization error by **7×** compared to ViT–Transformer baselines on the same hardware.

Key Features:
- **NPU-native co-design** – MobileNet-based vision encoder + hybrid Transformer–SSM backbone, built for INT4/8/16 and NPU operator sets.
- **Real-time cockpit performance** – Up to **14× lower TTFT**, ~3× faster decode, and 4× longer context (4096 vs 1024) on Qualcomm SA8295P NPU.
- **High-resolution multimodal perception** – Supports **768×768** images with ~45 dB SQNR under mixed-precision quantization (W8A16 vision, W4A16 language); a toy SQNR calculation is sketched after this list.
- **Automotive-tuned dataset** – Trained with **200k** proprietary cockpit samples (AI Sentinel, Greeter, Car Finder, Safety) plus large-scale Infinity-MM instruction data.
- **Production-focused** – Designed for always-on, low-power, privacy-preserving deployment in real vehicles.
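
Since SQNR comes up throughout this card, here is a toy NumPy sketch of how a signal-to-quantization-noise ratio is computed for a tensor under simple uniform (min-max) quantization. It is illustrative only and does not reproduce the on-device measurement; the tensor shape and the quantizer are assumptions.

```python
import numpy as np

def sqnr_db(x: np.ndarray, n_bits: int) -> float:
    """SQNR = 10 * log10(signal power / quantization noise power)
    for simple uniform (min-max) quantization to n_bits."""
    scale = (x.max() - x.min()) / (2**n_bits - 1)
    x_q = np.round((x - x.min()) / scale) * scale + x.min()
    noise = x - x_q
    return 10 * np.log10(np.mean(x**2) / np.mean(noise**2))

acts = np.random.randn(1024, 2048)               # stand-in activation tensor
print(f"8-bit SQNR: {sqnr_db(acts, 8):.1f} dB")  # more bits -> markedly higher SQNR
print(f"4-bit SQNR: {sqnr_db(acts, 4):.1f} dB")
```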


## Use Cases

AutoNeural powers real-time cockpit intelligence, including **in-cabin detection**, **out-cabin awareness**, **HMI understanding**, and a combined **visual + conversational agent**.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/a-Rd-eFETHPgf82wOPr4S.png" alt="Use Case" style="width:700px;"/>

---

## ⚡ **Benchmarks**

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/_zzNwQpFsGehf-_ASRupM.png" alt="Benchmark" style="width:700px;"/>

| Metric                 | InternVL 2B (baseline) | AutoNeural-VL |
| :--------------------- | :--------------------: | :-----------: |
| TTFT (1× 512² image)   |         ~1.4 s         |  **~100 ms**  |
| Max image size         |        448×448         |  **768×768**  |
| SQNR                   |         28 dB          |   **45 dB**   |
| RMS quantization error |         3.98%          |  **0.562%**   |
| Decode throughput      |       ~15 tok/s        | **~44 tok/s** |
| Context length         |          1024          |   **4096**    |

> 📝 These numbers are measured on-device with mixed precision (vision: W8A16; language: W4A16), not in simulation.
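
To make the table concrete, a rough back-of-the-envelope end-to-end estimate combines TTFT with decode throughput. The 100-token response length below is an assumption chosen for illustration, not part of the benchmark.

```python
# Rough end-to-end latency: total ≈ TTFT + generated_tokens / decode_throughput
n_tokens = 100  # assumed response length (illustrative)

baseline_s   = 1.4 + n_tokens / 15   # InternVL 2B:   ~8.1 s
autoneural_s = 0.1 + n_tokens / 44   # AutoNeural-VL: ~2.4 s

print(f"InternVL 2B:   {baseline_s:.1f} s")
print(f"AutoNeural-VL: {autoneural_s:.1f} s")
```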



---

# **How to Use**

> ⚠️ **Hardware requirement:** AutoNeural is only available for **Qualcomm NPUs**.

### 1) Install Nexa-SDK

Download the SDK and follow the installation steps provided on the model page.


### 2) Configure authentication

Create an access token in the Model Hub, then run:

```bash
nexa config set license '<access_token>'
```

### 3) Run the model

```bash
nexa infer NexaAI/AutoNeural
```

### 4) Image input

Drag and drop one or more image files into the terminal window.
Multiple images can be processed with a single query.

---

## Model architecture

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/eHNdopWWaoir2IP3Cu_AF.png" alt="Model Architecture" style="width:700px;"/>

AutoNeural is an NPU-native vision–language model co-designed for integer-only inference on edge devices (e.g., the Qualcomm SA8295P); a minimal code sketch of the connector and layer schedule follows the component list below.

- **Vision encoder.** A MobileNetV5-style CNN initialized from Gemma 3n-E4B, taking 768×768 images and producing a 16×16×2048 feature map. A Multi-Scale Fusion Adapter (MSFA) fuses the last stages and flattens them into **256 visual tokens**, giving strong inductive bias and stable INT8/16 quantization.
- **Vision–language connector.** A lightweight 2-layer MLP projects visual tokens into the language embedding space. We deliberately remove normalization from the projector to make activation ranges easier to calibrate for static NPU quantization.
- **Language backbone.** A 1.2B-parameter **hybrid Transformer–SSM (“Liquid AI”)** model with 16 layers, interleaving 10 gated-convolution SSM layers with 6 self-attention layers. The SSM layers provide linear-time inference and a compact state instead of a full KV cache, cutting memory I/O while the attention layers preserve strong reasoning and in-context learning.
- **Quantization.** The deployed model uses mixed precision (e.g. W8A16 for vision, W4A16 for language) and NPU-aware graph partitioning to meet tight latency and memory budgets without sacrificing accuracy.
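
As a minimal PyTorch sketch of the pieces described above: a normalization-free 2-layer MLP connector mapping 256 visual tokens into the language embedding space, and a hypothetical 16-layer schedule mixing 10 SSM and 6 attention layers. The language hidden size, the GELU activation, and the exact interleaving order are assumptions, not published details.

```python
import torch
import torch.nn as nn

NUM_VISUAL_TOKENS = 256   # 16x16 feature map flattened by the MSFA
VISION_DIM = 2048         # encoder output channels
LM_DIM = 2048             # hypothetical language embedding size

class VisionLanguageConnector(nn.Module):
    """2-layer MLP projector with no normalization, keeping activation
    ranges easy to calibrate for static NPU quantization."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, lm_dim)
        self.act = nn.GELU()          # activation choice is an assumption
        self.fc2 = nn.Linear(lm_dim, lm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, 256, vision_dim) -> (batch, 256, lm_dim)
        return self.fc2(self.act(self.fc1(visual_tokens)))

# Hypothetical interleaving of 10 gated-convolution SSM layers and
# 6 self-attention layers (the true ordering is not public).
layer_schedule = ["ssm"] * 2 + ["attn", "ssm", "ssm"] * 4 + ["attn"] * 2
assert layer_schedule.count("ssm") == 10 and layer_schedule.count("attn") == 6

connector = VisionLanguageConnector(VISION_DIM, LM_DIM)
tokens = torch.randn(1, NUM_VISUAL_TOKENS, VISION_DIM)
print(connector(tokens).shape)  # torch.Size([1, 256, 2048])
```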

---

## Training

<img src="https://cdn-uploads.huggingface.co/production/uploads/6851901ea43b4824f79e27a9/GPFXmoOXaF-4M-nne6GPJ.png" alt="Training" style="width:700px;"/>

AutoNeural follows a four-stage curriculum on large-scale multimodal data plus a proprietary automotive dataset; a minimal sketch of the stage-wise freeze/unfreeze logic follows the list.

1. **Image–text alignment.** Freeze vision and language backbones; train only the projector on image–caption pairs to learn basic visual grounding.
2. **General visual understanding.** Unfreeze the full model and train on broad VQA-style tasks (object/scene understanding, basic reasoning) from the Infinity-MM dataset to build strong general multimodal capability.
3. **Instruction tuning.** Continue training on diverse instruction-following data (documents, charts, OCR, multi-turn dialogue, specialized domains) using a mixture of task weights for balanced performance.
4. **Automotive domain finetuning.** Finetune on ~200k curated cockpit samples (AI Sentinel, Greeter, Car Finder, Safety when getting on/off) plus high-quality synthetic data, with an NPU-aware recipe that combines quantization-aware training, mixed-precision constraints, and calibration to keep post-quantization drift low on real hardware.
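
A minimal sketch of the freeze/unfreeze logic behind this curriculum, assuming a model object with `vision_encoder`, `projector`, and `language_model` submodules (those names are hypothetical; the actual training code is not public):

```python
def set_trainable(module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: int) -> None:
    """Stage 1 trains only the projector; stages 2-4 train the full model."""
    set_trainable(model.vision_encoder, stage > 1)
    set_trainable(model.language_model, stage > 1)
    set_trainable(model.projector, True)
    # Stage 4 additionally applies quantization-aware training,
    # mixed-precision constraints, and calibration (not shown here).
```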

---

## **License**

This model is licensed under the **Creative Commons Attribution–NonCommercial 4.0 (CC BY-NC 4.0)** license, which allows use, sharing, and modification only for non-commercial purposes with proper attribution.

All NPU-related models, runtimes, and code in this project are protected under this non-commercial license and cannot be used in any commercial or revenue-generating applications.

## **Enterprise Deployment**

For enterprise deployment, custom integrations, or licensing inquiries:

📅 **[Book a Call with Us](https://nexa.ai/book-a-call)**