# Hebrew Manuscript Joint NER Model v2

## Model Description

This model performs joint entity recognition and role classification for Hebrew historical manuscripts. It handles two tasks simultaneously:
- Named Entity Recognition (NER): Identifies person names in Hebrew text
- Role Classification: Classifies each person as Author, Copyist, or Other
**Key Features:**
- Multi-task learning architecture
- Trained on Hebrew manuscript catalog data (MARC records)
- Uses distant supervision from structured metadata
- Optimized for historical Hebrew text
- Robust k-fold cross-validation
## K-Fold Cross-Validation Results

This model was trained using 5-fold cross-validation for robust evaluation.

### Aggregate Performance
| Metric | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| NER F1 | 0.9080 | ±0.0035 | 0.9011 | 0.9102 |
| Classification Accuracy | 1.0000 | ±0.0000 | 1.0000 | 1.0000 |
### Per-Fold Results
- Fold 1: NER F1 = 0.9011, Class Acc = 1.0000
- Fold 2: NER F1 = 0.9089, Class Acc = 1.0000
- Fold 3: NER F1 = 0.9096, Class Acc = 1.0000
- Fold 4: NER F1 = 0.9102, Class Acc = 1.0000
- Fold 5: NER F1 = 0.9100, Class Acc = 1.0000
**Best Model:** Fold 4 (NER F1: 0.9102)
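For reference, the aggregate row above can be reproduced from the per-fold scores (NumPy's default population standard deviation matches the reported ±0.0035):

```python
import numpy as np

fold_f1 = np.array([0.9011, 0.9089, 0.9096, 0.9102, 0.9100])
print(f"mean={fold_f1.mean():.4f} std={fold_f1.std():.4f} "
      f"min={fold_f1.min():.4f} max={fold_f1.max():.4f}")
# mean=0.9080 std=0.0035 min=0.9011 max=0.9102
```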
## Model Architecture

- Base Model: dicta-il/dictabert
- Architecture: Joint multi-task learning (see the sketch below)
  - NER head: Token classification (B-PERSON, I-PERSON, O)
  - Role classification head: Sequence classification (AUTHOR, COPYIST, OTHER)
- Training: 5-fold cross-validation with early stopping
- Regularization: Dropout (0.3), Weight decay (0.01)
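For concreteness, here is a minimal sketch of one way to realize such a joint architecture: a shared DictaBERT encoder feeding a token-level NER head and a [CLS]-pooled role head. The class counts, pooling strategy, and naming are illustrative assumptions, not the released implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class JointNERRoleModel(nn.Module):
    """Illustrative joint model: shared encoder, two task-specific heads."""

    def __init__(self, base_model="dicta-il/dictabert",
                 num_ner_labels=3, num_role_labels=3, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.ner_head = nn.Linear(hidden, num_ner_labels)    # B-PERSON, I-PERSON, O
        self.role_head = nn.Linear(hidden, num_role_labels)  # AUTHOR, COPYIST, OTHER

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = self.dropout(out.last_hidden_state)
        ner_logits = self.ner_head(hidden_states)            # (batch, seq_len, 3)
        role_logits = self.role_head(hidden_states[:, 0])    # [CLS] pooling
        return ner_logits, role_logits
```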
## Intended Use

### Primary Use Cases
- Extracting person names from Hebrew manuscript descriptions
- Identifying roles of people mentioned in manuscripts
- Building knowledge graphs of Hebrew manuscript creators
- Digital humanities research on Hebrew manuscripts
### Example Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer (the joint architecture is custom, so loading may
# require trust_remote_code=True if the repository ships its own modeling code)
tokenizer = AutoTokenizer.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")
model = AutoModel.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")

# Example text (Hebrew): "The book was written by Rabbi Moshe ben Maimon"
text = "הספר נכתב על ידי רבי משה בן מימון"
tokens = tokenizer(text, return_tensors="pt")

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**tokens)

# outputs contains both NER and role classification logits
```
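Continuing the example, the logits can be decoded into labels. This assumes the forward pass returns `(ner_logits, role_logits)` as in the architecture sketch above and that the label orders below match the model's config; both are assumptions about the repository's actual output schema:

```python
ner_labels = ["O", "B-PERSON", "I-PERSON"]    # assumed label order
role_labels = ["AUTHOR", "COPYIST", "OTHER"]  # assumed label order

ner_logits, role_logits = outputs
token_strs = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
for token, label_id in zip(token_strs, ner_logits.argmax(dim=-1)[0]):
    print(token, ner_labels[int(label_id)])

print("Role:", role_labels[role_logits.argmax(dim=-1).item()])
```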
## Training Data
- Source: Hebrew manuscript catalog records (MARC format)
- Size: ~10,000 samples
- Annotation: Distant supervision from structured metadata fields (see the sketch below)
- Languages: Hebrew (historical and modern)
- Domain: Manuscript descriptions, colophons, catalog records
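To illustrate the distant-supervision step, person names drawn from structured catalog fields can be projected onto the free text as BIO tags by string matching. A naive sketch (the field extraction, tokenization, and exact-match strategy are all simplifications):

```python
def distant_labels(text_tokens, metadata_persons):
    """Project person names from catalog metadata onto the text as BIO tags."""
    labels = ["O"] * len(text_tokens)
    for name in metadata_persons:
        name_tokens = name.split()
        n = len(name_tokens)
        for i in range(len(text_tokens) - n + 1):
            if text_tokens[i:i + n] == name_tokens:
                labels[i] = "B-PERSON"
                labels[i + 1:i + n] = ["I-PERSON"] * (n - 1)
    return labels

tokens = "הספר נכתב על ידי רבי משה בן מימון".split()
print(distant_labels(tokens, ["משה בן מימון"]))
# ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON']
```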
## Training Procedure

### Hyperparameters
- Epochs: 10 (with early stopping, patience=3)
- Batch Size: 4
- Learning Rate: 2e-5
- Optimizer: AdamW
- Dropout: 0.3
- Weight Decay: 0.01
- Lambda Weight: 0.5 (multi-task loss balancing; see the sketch below)
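The lambda weight presumably interpolates between the two task losses. A minimal sketch of one common formulation (the released training code may combine the losses differently):

```python
import torch.nn.functional as F

def joint_loss(ner_logits, ner_labels, role_logits, role_labels, lam=0.5):
    # Token-level cross-entropy over BIO tags, flattened over batch and sequence
    ner_loss = F.cross_entropy(
        ner_logits.view(-1, ner_logits.size(-1)),
        ner_labels.view(-1),
        ignore_index=-100,  # skip special and padded tokens
    )
    # Sequence-level cross-entropy over role labels
    role_loss = F.cross_entropy(role_logits, role_labels)
    # lam = 0.5 weights both tasks equally
    return lam * ner_loss + (1.0 - lam) * role_loss
```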
### Data Split
- K-Fold: 5-fold stratified cross-validation
- Stratification: By number of persons per sample (see the sketch below)
- Train/Val per fold: 90/10 split
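A sketch of the fold construction with scikit-learn's StratifiedKFold, using the per-sample person count as the stratification key (the counts and seed here are toy values, not the actual training setup):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in: one stratification label (person count) per sample
person_counts = np.array([1, 2, 1, 3, 2, 1, 2, 3, 1, 2] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(person_counts, person_counts), 1):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")
```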
## Evaluation

### Metrics
- NER: Precision, Recall, F1 (seqeval)
- Classification: Accuracy
- Combined: Geometric mean of NER F1 and Classification Accuracy (see the worked check below)
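As a worked check of the combined score, the geometric mean for the best fold is sqrt(0.9102 × 1.0000) ≈ 0.9540; the NER side comes from seqeval's entity-level F1:

```python
from math import sqrt
from seqeval.metrics import f1_score

# Combined score for fold 4 (best model)
ner_f1, cls_acc = 0.9102, 1.0000
print(f"combined = {sqrt(ner_f1 * cls_acc):.4f}")  # 0.9540

# seqeval computes entity-level F1 over BIO-tagged sequences
y_true = [["O", "B-PERSON", "I-PERSON", "O"]]
y_pred = [["O", "B-PERSON", "I-PERSON", "O"]]
print(f1_score(y_true, y_pred))  # 1.0
```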
### Validation Strategy
5-fold cross-validation yields more robust performance estimates than a single train/test split and reduces the risk of overfitting the evaluation to one partition.
## Limitations
- Optimized for Hebrew manuscript descriptions (may not generalize to other Hebrew text types)
- Person names are assumed to follow historical Hebrew naming conventions
- Limited to three role categories (Author, Copyist, Other)
- Trained on catalog data (may not work well on manuscript images/OCR)
## Ethical Considerations
- Model trained on historical cultural heritage data
- Should be used to assist, not replace, expert manuscript catalogers
- Potential biases from historical naming conventions and catalog practices
## Citation
If you use this model, please cite:
```bibtex
@misc{hebrew-manuscript-joint-ner-v2,
  author       = {Goldberg, Alexander},
  title        = {Hebrew Manuscript Joint NER Model v2},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner-v2}}
}
```
## Model Card Authors

Alexander Goldberg

## Model Card Contact
For questions or issues, please open an issue on the model repository.