EmbeddingGemma-300M-LawVault (Chinese Legal RAG)

📖 Model Introduction

EmbeddingGemma-300M-LawVault is a high-performance embedding model fine-tuned specifically for Chinese Legal RAG (Retrieval-Augmented Generation) scenarios.

Fine-tuned from Google's embeddinggemma-300m, this model uses a contrastive learning setup with MultipleNegativesRankingLoss and MatryoshkaLoss. It was trained on a high-quality dataset of over 60,000 (Query, Positive, Hard Negative) triplets, substantially improving retrieval accuracy on legal statutes, understanding of colloquial legal inquiries, and robustness to distractor clauses compared to the base model.

Note: This model is fine-tuned exclusively on Chinese laws and regulations. Its performance on other languages or non-legal domains has not been evaluated and is not guaranteed.

Key Highlights

  • Domain Specialization: Addresses the common failure of general-purpose models to distinguish "National Laws" from similarly worded "Local Regulations/Administrative Rules".
  • Anti-Interference: Trained with "Source-aware Hard Negatives" (the base model's incorrect top retrievals for each query, reused as hard negatives), enabling the model to filter out confusingly similar but incorrect clauses.
  • Colloquial Understanding: The training set includes LLM-generated queries that simulate real-world user questions, bridging the semantic gap between formal legal terminology and everyday language.
  • Matryoshka Embeddings: Supports flexible output vector dimensions (768, 512, 256, 128), significantly reducing storage costs with little performance loss; a loading sketch follows below.
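
A minimal sketch of using the Matryoshka property via Sentence Transformers' truncate_dim option; the choice of 256 dimensions below is illustrative, and the repository id is taken from the Usage section:

from sentence_transformers import SentenceTransformer

# Ask the model for 256-dimensional embeddings instead of the full 768
# (any of 768 / 512 / 256 / 128 works for this model).
model = SentenceTransformer(
    "ByronLeeee/EmbeddingGemma-300M-LawVault",
    truncate_dim=256,
    trust_remote_code=True,
)

vec = model.encode("以暴力、胁迫或者其他方法抢劫公私财物的")
print(vec.shape)  # (256,)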

📊 Evaluation Performance

The model was evaluated on a held-out test set built from real legal scenarios: 120 unseen colloquial legal queries generated by DeepSeek V3.2. The end-to-end RAG retrieval results are as follows:

| Metric | Base Model | Finetuned Model (Ours) | Improvement |
|---|---|---|---|
| Hit Rate @ 10 | 85.0% | 98.0% | Significant reduction in "answer not found" cases |
| Top-1 Accuracy | 58.0% | 92.0% | +34 percentage points; the correct answer ranks 1st for the vast majority of queries |
| MRR @ 10 | 0.78 | 0.96 | Very high ranking quality |

Note: The test environment used a LanceDB vector database covering a complete snapshot of the Chinese laws and regulations database.
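
For reference, here is a sketch of how these metrics can be computed from the 1-based rank of the gold clause for each query (None when it falls outside the candidate list). This is an illustrative computation, not the exact evaluation script:

def rag_metrics(ranks, k=10):
    """Hit Rate@k, Top-1 Accuracy and MRR@k from 1-based gold ranks (None = miss)."""
    n = len(ranks)
    hit_at_k = sum(r is not None and r <= k for r in ranks) / n
    top1 = sum(r == 1 for r in ranks) / n
    mrr_at_k = sum(1.0 / r for r in ranks if r is not None and r <= k) / n
    return hit_at_k, top1, mrr_at_k

# e.g. gold clause ranked 1st and 3rd for two queries, missing for a third
print(rag_metrics([1, 3, None]))  # approx (0.667, 0.333, 0.444)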

Case Study

| User Query | Base Model Rank | Finetuned Rank |
|---|---|---|
| "Can the provincial cultural relics bureau directly transfer artifacts unearthed in our area?" | ❌ Not retrieved (10+) | 1st |
| "What are the legal requirements for merchants when setting product prices?" | ❌ Not retrieved (10+) | 1st |
| "If land is requisitioned for a large hydropower station, how is compensation calculated?" | 2nd | 1st |
| "How does the government financially support rural revitalization?" | 6th | 1st |

🚀 Usage

Install Dependencies

pip install -U sentence-transformers

Load Model

from sentence_transformers import SentenceTransformer

# Load model
model_path = "ByronLeeee/EmbeddingGemma-300M-LawVault"
model = SentenceTransformer(model_path, trust_remote_code=True)

# 1. Define query in Chinese (the model is fine-tuned for Chinese legal text)
query = "抢劫罪一般判几年?"  # "What is the typical sentence for robbery?"

# 2. Define Documents - Recommended Format: title: {Law Name} | text: {Content}
documents = [
    "title: 中华人民共和国刑法 第二百六十三条 | text: 以暴力、胁迫或者其他方法抢劫公私财物的,处三年以上十年以下有期徒刑,并处罚金...",
    "title: 中华人民共和国刑法 第二百六十七条 | text: 抢夺公私财物,数额较大的,或者多次抢夺的,处三年以下有期徒刑、拘役或者管制...",
    "title: 陕西省专利条例 第二十四条 | text: 负责专利执法的部门...可以查封或者扣押。"
]

# 3. Encode
query_vec = model.encode(query)
doc_vecs = model.encode(documents)

# 4. Compute similarity (cosine similarity by default)
similarities = model.similarity(query_vec, doc_vecs)
print(similarities)
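
To rank a larger corpus with the same embeddings, semantic_search from sentence_transformers.util can replace the manual similarity step. A sketch continuing the snippet above, not part of the original pipeline:

from sentence_transformers import util

# Returns one result list per query; each hit is a dict with
# 'corpus_id' and 'score', sorted by descending similarity.
hits = util.semantic_search(query_vec, doc_vecs, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {documents[hit['corpus_id']][:40]}...")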

Training Details

Dataset

  • Size: 65,783 training triplets (Anchor, Positive, Hard Negative); see the mining sketch after this list
  • Source: Chinese Laws & Regulations (Civil, Criminal, Administrative, etc.)
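
The "Source-aware Hard Negatives" strategy from the highlights can be reproduced in spirit as follows. This is a sketch, not the actual mining script: corpus and pairs are hypothetical stand-ins for the real statute corpus and the (query, gold clause) training pairs.

from sentence_transformers import SentenceTransformer, util

corpus = ["title: ... | text: ...", "title: ... | text: ..."]  # full statute list (placeholder)
pairs = [("抢劫罪一般判几年?", corpus[0])]                      # (query, gold clause) pairs (placeholder)

base = SentenceTransformer("google/embeddinggemma-300m")
corpus_vecs = base.encode(corpus)

triplets = []
for query, positive in pairs:
    q_vec = base.encode(query)
    # Keep the base model's highest-ranked *incorrect* hit as the hard negative.
    for hit in util.semantic_search(q_vec, corpus_vecs, top_k=5)[0]:
        candidate = corpus[hit["corpus_id"]]
        if candidate != positive:
            triplets.append((query, positive, candidate))
            break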

Training Hyperparameters

  • Batch Size: 24 (Effective Batch Size = 144 with Gradient Accumulation)
  • Learning Rate: 2e-05
  • Epochs: 3
  • Precision: bf16 (BFloat16)
  • Gradient Accumulation: 6 steps
  • Max Sequence Length: 1024 tokens (a configuration sketch follows below)
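
These settings map onto SentenceTransformerTrainingArguments roughly as follows. A sketch only: output_dir is a placeholder, and the maximum sequence length is set on the model rather than in the arguments:

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-300m-lawvault",  # placeholder
    per_device_train_batch_size=24,
    gradient_accumulation_steps=6,              # effective batch size: 24 * 6 = 144
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
)
# model.max_seq_length = 1024  # sequence length lives on the model, not the args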

Loss Function

MatryoshkaLoss wrapping MultipleNegativesRankingLoss:

{
    "matryoshka_dims": [768, 512, 256, 128],
    "matryoshka_weights": [1, 1, 1, 1]
}
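
In Sentence Transformers this corresponds to wrapping the ranking loss, reconstructed here from the configuration above (the base checkpoint name is taken from the introduction):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128],
    matryoshka_weights=[1, 1, 1, 1],
)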

Training Logs

| Epoch | Step | Training Loss |
|---|---|---|
| 0.0022 | 1 | 3.5148 |
| ... | ... | ... |
| 1.0 | 457 | 0.2123 |
| 2.0 | 914 | 0.0749 |
| 3.0 | 1371 | 0.0369 |

Framework Versions

  • Python: 3.13.1
  • Sentence Transformers: 5.1.2
  • Transformers: 4.57.1
  • PyTorch: 2.9.1+cu130
  • Accelerate: 1.12.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Citation

If you use this model, please cite the following:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}