Model Card: BLIP-2 (OPT-2.7B) Fine-Tuned with QLoRA on Flickr8k

This is a fine-tuned version of Salesforce's BLIP-2 model, adapted for the task of image captioning using the QLoRA methodology for parameter-efficient fine-tuning. It was fine-tuned on the Flickr8k dataset to generate descriptive, human-like captions for a wide variety of images.
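A minimal inference sketch for generating a caption with this checkpoint is shown below. The repo id "your-username/blip2-flickr8k-qlora" is a placeholder for the actual Hub id of this model, and the float16/device-map settings are illustrative defaults, not a prescribed configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "your-username/blip2-flickr8k-qlora",  # placeholder: substitute this model's Hub id
    torch_dtype=torch.float16,
    device_map="auto",
)

# Caption a single image
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```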

Model Details

Model Description

This model is an adaptation of the powerful BLIP-2 vision-language architecture, specifically the Salesforce/blip2-opt-2.7b variant. It has been fine-tuned to specialize in generating accurate and contextually relevant captions for images.  

The fine-tuning was performed using QLoRA (Quantized Low-Rank Adaptation), a highly efficient technique that significantly reduces the computational and memory requirements of training. This is achieved by quantizing the base model to 4-bit precision and then training small, low-rank adapter matrices, leaving the vast majority of the original model's parameters frozen. This approach makes it possible to adapt large-scale models on consumer-grade hardware while preserving high performance.
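A sketch of this QLoRA setup using the Hugging Face peft and bitsandbytes stack follows. The hyperparameters (rank, alpha, dropout) and the target modules are illustrative assumptions, not the exact configuration used for this model.

```python
import torch
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit (NF4) to cut memory requirements
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small low-rank adapters; only these are trained, the 4-bit base stays frozen
lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj"],   # illustrative: attention projections of the OPT language model
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```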

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

  • Developed by: Salesforce
  • Model type: Vision-Language Model (VLM) based on BLIP-2
  • Language(s) (NLP): English (en)
  • License: Apache 2.0
  • Finetuned from model: Salesforce/blip2-opt-2.7b
  • Model size: 2B params (Safetensors format)
  • Tensor types: F32, F16, U8
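If this checkpoint is published as LoRA adapter weights rather than a merged model, the adapters can be attached to the 4-bit base as sketched below; the adapter repo id is again a placeholder.

```python
import torch
from transformers import Blip2ForConditionalGeneration, BitsAndBytesConfig
from peft import PeftModel

# Load the frozen base in 4-bit, then attach the trained adapters on top
base = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "your-username/blip2-flickr8k-qlora")  # placeholder repo id
model.eval()
```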