RexGemma-2048

License: MIT

TL;DR: A Gemma3-270M decoder converted into an encoder with a 2048-token sequence length and 100M non-embedding parameters, built to power product search, attribute extraction, classification, and embedding use cases. The model has been trained on 350B+ e-commerce-specific tokens.


Quick Start

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, pipeline

MODEL_ID = "thebajajra/RexGemma-2048"

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# 1) Fill-Mask (if MLM head is present)
mlm = pipeline("fill-mask", model=MODEL_ID, tokenizer=tok)
print(mlm("These running shoes are great for [MASK] training."))

# 2) Feature extraction (CLS or mean-pooled embeddings)
enc = AutoModel.from_pretrained(MODEL_ID)
inputs = tok(["wireless mouse", "ergonomic mouse pad"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs, output_hidden_states=True)
# Mean-pool last hidden state for sentence embeddings
emb = (out.last_hidden_state * inputs.attention_mask.unsqueeze(-1)).sum(dim=1) / inputs.attention_mask.sum(dim=1, keepdim=True)

Sentence-Transformers

from sentence_transformers import SentenceTransformer
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "thebajajra/RexGemma-2048"

# Load the MLM checkpoint and keep only its encoder for embedding use
model_mlm = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)
encoder = model_mlm.encoder
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

ENCODER_DIR = "rexgemma-2048-encoder-only"
encoder.save_pretrained(ENCODER_DIR)
tokenizer.save_pretrained(ENCODER_DIR)

model = SentenceTransformer(ENCODER_DIR)
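
Once the encoder is saved, it can be used directly for embeddings and similarity. A minimal usage sketch (the example texts are illustrative):

from sentence_transformers import util

# Encode product texts and compute cosine similarities
sentences = ["wireless noise-cancelling headphones", "bluetooth over-ear headset", "stainless steel water bottle"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(util.cos_sim(embeddings, embeddings))  # 3x3 cosine-similarity matrix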

Intended Uses & Limitations

Use cases

  • Product & query retrieval/semantic search (titles, descriptions, attributes)
  • Attribute extraction / slot filling (brand, color, size, material)
  • Classification (category assignment, unsafe/regulated item filtering, review sentiment)
  • Reranking and query understanding (spelling/ASR normalization, acronym expansion)

Out of scope

  • Long-form generation (use a decoder/seq-to-seq LM instead)
  • High-stakes decisions without human review (pricing, compliance, safety flags)

Target users

  • Search/recs engineers, e-commerce data teams, ML researchers working on domain-specific encoders

Model Description

RexGemma-2048 is an encoder-only transformer with 100M non-embedding parameters, trained with a masked-language-modeling objective and optimized for e-commerce-related text.


Training Recipe


Data Overview

We identified 9 overlapping e-commerce domains that contain a significant amount of relevant tokens but required filtering. Below is the list of domains and their filtered sizes.

Domain           Size (GB)
Hobby            114
News             66
Health           66
Entertainment    64
Travel           52
Food             22
Automotive       19
Sports           12
Music and Dance  7

Additionally, 6 more domains had almost complete overlap with e-commerce and were taken directly from FineFineWeb.

Domain     Size (GB)
Fashion    37
Beauty     37
Celebrity  28
Movie      26
Photo      15
Painting   2

By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. Thus, more fine-grained filtering within each domain is required to extract only the e-commerce-specific lines. We accomplish this by training lightweight classifiers per domain to distinguish e-commerce content from non-e-commerce content.
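
The exact per-domain classifiers are not specified here; as a rough illustration, a lightweight filter could be as simple as TF-IDF features with logistic regression. A minimal sketch with placeholder data and an assumed 0.5 decision threshold:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled sample for one domain (placeholder data; 1 = e-commerce, 0 = other)
train_texts = [
    "Buy the 2024 trail running shoes, now 30% off with free shipping",
    "Add to cart: waterproof hiking backpack, 45L, $89.99",
    "The marathon was won in a record time of 2:01:09",
    "The local club announced its training schedule for the spring season",
]
train_labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

# Keep only lines the classifier scores as e-commerce
candidates = [
    "Limited-time deal: carbon road bike frame, free returns",
    "Injury report ahead of this weekend's fixtures",
]
probs = clf.predict_proba(candidates)[:, 1]
kept = [text for text, p in zip(candidates, probs) if p > 0.5]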


Evaluation

Semantic Similarity

[Figure: semantic-similarity benchmark results by model size; RexGemma-2048 is plotted using its non-embedding parameter count.]

RexGemma models outperform all models in their parameter/size category, including the RexBERT family of models.


Usage Examples

1) Masked language modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexGemma-2048")
t = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
fill = pipeline("fill-mask", model=m, tokenizer=t)

fill("Best [MASK] headphones under $100.")

2) Embeddings / feature extraction

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
enc = AutoModel.from_pretrained("thebajajra/RexGemma-2048")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)
# Mean-pool last hidden state
attn = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * attn).sum(1) / attn.sum(1)
# Normalize for cosine similarity (recommended for retrieval)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
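
Because the embeddings above are L2-normalized, cosine similarity reduces to a matrix product:

# Pairwise cosine similarities between the two texts
sims = emb @ emb.T
print(sims)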

3) Text classification fine-tune

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

NUM_LABELS = 2  # set to the number of classes in your task

tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")
model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexGemma-2048", num_labels=NUM_LABELS)

# Prepare your Dataset objects: train_ds, val_ds (text→label)
args = TrainingArguments(
    output_dir="rexgemma-2048-cls",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tok)
trainer.train()
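
After training, the fine-tuned model can be served with the standard text-classification pipeline (the query below is illustrative; labels appear as LABEL_0, LABEL_1, ... unless id2label is set in the config):

from transformers import pipeline

# Run inference with the fine-tuned classifier
clf = pipeline("text-classification", model=trainer.model, tokenizer=tok)
print(clf("leather crossbody bag with adjustable strap"))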

Model Architecture & Compatibility

  • Architecture: Encoder-only transformer with a Gemma3-270M backbone.
  • Libraries: Works with 🤗 Transformers; supports fill-mask and feature-extraction pipelines.
  • Context length: Increased during the Context Extension phase—ensure max_position_embeddings in config.json matches your desired max length.
  • Files: config.json, tokenizer files, and (optionally) heads for MLM or classification.
  • Export: Standard PyTorch weights; you can export ONNX / TorchScript for production if needed.
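
For the ONNX export mentioned above, one option is the 🤗 Optimum library. A sketch under the assumption that optimum[onnxruntime] is installed and that the exporter supports this architecture (custom/remote-code models may need extra configuration):

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# Export the encoder to ONNX and run it with ONNX Runtime
ort_model = ORTModelForFeatureExtraction.from_pretrained("thebajajra/RexGemma-2048", export=True)
tok = AutoTokenizer.from_pretrained("thebajajra/RexGemma-2048")

inputs = tok("wireless mouse", return_tensors="pt")
outputs = ort_model(**inputs)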

Responsible & Safe Use

  • Biases: Commerce data can encode brand, price, and region biases; audit downstream classifiers/retrievers for disparate error rates across categories/regions.
  • Sensitive content: Add filters for adult/regulated items; document moderation thresholds if you release classifiers.
  • Privacy: Do not expose PII; ensure training data complies with terms and applicable laws.
  • Misuse: This model is not a substitute for legal/compliance review for listings.

License

  • License: MIT.

Maintainers & Contact

