# EmbeddingGemma-300M-LawVault (Chinese Legal RAG)
## 📖 Model Introduction
EmbeddingGemma-300M-LawVault is a high-performance embedding model fine-tuned specifically for Chinese Legal RAG (Retrieval-Augmented Generation) scenarios.
Fine-tuned on Google's embeddinggemma-300m, this model employs a rigorous contrastive learning approach using MultipleNegativesRankingLoss and MatryoshkaLoss. It was trained on a high-quality dataset of over 60,000 (Query, Positive, Hard Negative) triplets to significantly improve retrieval accuracy for legal statutes, colloquial legal inquiries, and noise resistance compared to the base model.
Note: This model is fine-tuned exclusively on Chinese laws and regulations. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.
### Key Highlights
- Domain Specialization: Specifically addresses the pain point where general models fail to distinguish between "National Laws" and "Local Regulations/Administrative Rules" with similar wording.
- Anti-Interference: Trained with "Source-aware Hard Negatives"—using the base model's incorrect top retrievals for the same query as hard negatives—enabling the model to precisely filter out confusingly similar but incorrect clauses.
- Colloquial Understanding: The training set includes queries generated by LLMs to simulate real-world user questions, bridging the semantic gap between formal legal terminology and everyday language.
- Matryoshka Embeddings: Supports flexible output vector dimensions (768, 512, 256, 128), allowing significantly reduced storage costs without major performance loss (see the snippet below).
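As a quick illustration of the Matryoshka support, the model can be loaded with a truncated output dimension via the standard Sentence Transformers `truncate_dim` option (256 below is just one of the supported sizes, and the query is an illustrative example):

```python
from sentence_transformers import SentenceTransformer

# Load with a truncated embedding size; any of 768 / 512 / 256 / 128 works
model = SentenceTransformer(
    "ByronLeeee/EmbeddingGemma-300M-LawVault",
    truncate_dim=256,
)

emb = model.encode("保险合同什么时候生效?")  # "When does an insurance contract take effect?"
print(emb.shape)  # (256,)
```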
## 📊 Evaluation Performance
The model was evaluated on a held-out test set built from real legal scenarios: 120 unseen colloquial legal queries generated by Deepseek V3.2. The end-to-end RAG retrieval results are as follows:
| Metric | Base Model | Finetuned Model (Ours) | Improvement |
|---|---|---|---|
| Hit Rate @ 10 | 85.0% | 98.0% | +13 pp; far fewer "answer not found" cases |
| Top-1 Accuracy | 58.0% | 92.0% | +34 pp; the correct clause is ranked 1st for the vast majority of queries |
| MRR @ 10 | 0.78 | 0.96 | +0.18; strong ranking quality |
Note: Evaluation was run against a LanceDB vector database indexing the full Chinese laws and regulations corpus.
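For reference, the three metrics above are standard ranked-retrieval measures. A minimal sketch of how they can be computed from the 1-based rank of the correct clause for each query (the function below is illustrative, not the actual evaluation script):

```python
from typing import List, Optional

def retrieval_metrics(gold_ranks: List[Optional[int]], k: int = 10) -> dict:
    """gold_ranks[i] is the 1-based rank of the correct clause for query i,
    or None if it was not retrieved at all."""
    n = len(gold_ranks)
    hit_at_k = sum(1 for r in gold_ranks if r is not None and r <= k) / n
    top1 = sum(1 for r in gold_ranks if r == 1) / n
    mrr_at_k = sum(1.0 / r for r in gold_ranks if r is not None and r <= k) / n
    return {f"hit_rate@{k}": hit_at_k, "top1_accuracy": top1, f"mrr@{k}": mrr_at_k}

# Example: three queries whose correct clause is ranked 1st, 4th, and not retrieved
print(retrieval_metrics([1, 4, None], k=10))
```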
### Case Study
| User Query | Base Model Rank | Finetuned Rank |
|---|---|---|
| "Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?" | ❌ Not Retrieved (10+) | ✅ 1st |
| "What are the legal requirements for merchants when setting product prices?" | ❌ Not Retrieved (10+) | ✅ 1st |
| "If land is requisitioned for a large hydropower station, how is compensation calculated?" | 2nd | ✅ 1st |
| "How does the government financially support rural revitalization?" | 6th | ✅ 1st |
## 🚀 Usage
### Install Dependencies
```bash
pip install -U sentence-transformers
```
### Load Model
```python
from sentence_transformers import SentenceTransformer

# Load model
model_path = "ByronLeeee/EmbeddingGemma-300M-LawVault"
model = SentenceTransformer(model_path, trust_remote_code=True)

# 1. Define Query (the model is trained on Chinese legal text, so query in Chinese)
query = "抢劫罪一般判几年?"  # "What is the typical sentence for robbery?"

# 2. Define Documents - Recommended format: title: {Law Name} | text: {Content}
documents = [
    "title: 中华人民共和国刑法 第二百六十三条 | text: 以暴力、胁迫或者其他方法抢劫公私财物的,处三年以上十年以下有期徒刑,并处罚金...",
    "title: 中华人民共和国刑法 第二百六十七条 | text: 抢夺公私财物,数额较大的,或者多次抢夺的,处三年以下有期徒刑、拘役或者管制...",
    "title: 陕西省专利条例 第二十四条 | text: 负责专利执法的部门...可以查封或者扣押。",
]

# 3. Encode
query_vec = model.encode(query)
doc_vecs = model.encode(documents)

# 4. Compute Similarity
similarities = model.similarity(query_vec, doc_vecs)
print(similarities)
```
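To turn the similarity scores into a retrieval result, a small follow-up (not part of the original snippet) is to take the highest-scoring document:

```python
# 5. Pick the best-matching clause; `similarities` has shape [1, len(documents)]
best_idx = int(similarities.argmax())
print("Best match:", documents[best_idx])
```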
## Training Details
### Dataset
- Size: 65,783 training triplets (Anchor, Positive, Hard Negative)
- Source: Chinese Laws & Regulations (Civil, Criminal, Administrative, etc.)
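The "Source-aware Hard Negatives" described in the highlights can be reproduced with a simple mining loop: retrieve with the base model and keep clauses it ranks highly but that are not the gold answer. A minimal sketch under that assumption (field names and `top_k` are illustrative, not the actual data pipeline):

```python
from sentence_transformers import SentenceTransformer

base = SentenceTransformer("google/embeddinggemma-300m")

def mine_hard_negatives(queries, positives, corpus, top_k=10):
    """For each (query, positive) pair, pick a clause the base model ranks
    highly but that is not the correct one, and use it as the hard negative."""
    q_emb = base.encode(queries)
    c_emb = base.encode(corpus)
    sims = base.similarity(q_emb, c_emb)  # shape [num_queries, num_corpus]
    triplets = []
    for i, (query, positive) in enumerate(zip(queries, positives)):
        for j in sims[i].argsort(descending=True)[:top_k]:
            candidate = corpus[int(j)]
            if candidate != positive:  # highly ranked but wrong -> hard negative
                triplets.append({"anchor": query, "positive": positive, "negative": candidate})
                break
    return triplets
```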
### Training Hyperparameters
- Batch Size: 24 (Effective Batch Size = 144 with Gradient Accumulation)
- Learning Rate: 2e-05
- Epochs: 3
- Precision: bf16 (BFloat16)
- Gradient Accumulation: 6 steps
- Max Sequence Length: 1024 tokens
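In Sentence Transformers v3+, these settings correspond roughly to the following training arguments (the output directory is illustrative; arguments not listed above are assumed to be defaults):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-300m-lawvault",  # illustrative path
    per_device_train_batch_size=24,
    gradient_accumulation_steps=6,   # effective batch size: 24 * 6 = 144
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)
```

The 1024-token limit is set on the model itself (`model.max_seq_length = 1024`) rather than through the training arguments.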
### Loss Function
MatryoshkaLoss wrapping MultipleNegativesRankingLoss:

```json
{
  "matryoshka_dims": [768, 512, 256, 128],
  "matryoshka_weights": [1, 1, 1, 1]
}
```
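In code, this configuration corresponds to roughly the following (assuming `model` is the loaded SentenceTransformer):

```python
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128],
    matryoshka_weights=[1, 1, 1, 1],
)
```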
### Training Logs
| Epoch | Step | Training Loss |
|---|---|---|
| 0.0022 | 1 | 3.5148 |
| ... | ... | ... |
| 1.0 | 457 | 0.2123 |
| 2.0 | 914 | 0.0749 |
| 3.0 | 1371 | 0.0369 |
### Framework Versions
- Python: 3.13.1
- Sentence Transformers: 5.1.2
- Transformers: 4.57.1
- PyTorch: 2.9.1+cu130
- Accelerate: 1.12.0
- Datasets: 4.4.1
- Tokenizers: 0.22.1
## Citation
If you use this model, please cite the following:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```