# Hebrew Manuscript Joint NER Model v2

## Model Description

This model performs joint entity recognition and role classification for Hebrew historical manuscripts. It handles two tasks simultaneously:
- Named Entity Recognition (NER): Identifies person names in Hebrew text
- Role Classification: Classifies each person as Author, Copyist, or Other
**Key Features:**
- Multi-task learning architecture
- Trained on Hebrew manuscript catalog data (MARC records)
- Uses distant supervision from structured metadata
- Optimized for historical Hebrew text
- Robust k-fold cross-validation
## K-Fold Cross-Validation Results

This model was trained using 5-fold cross-validation for robust evaluation.

### Aggregate Performance
| Metric | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| NER F1 | 0.9080 | ±0.0035 | 0.9011 | 0.9102 |
| Classification Accuracy | 1.0000 | ±0.0000 | 1.0000 | 1.0000 |
### Per-Fold Results
- Fold 1: NER F1 = 0.9011, Class Acc = 1.0000
- Fold 2: NER F1 = 0.9089, Class Acc = 1.0000
- Fold 3: NER F1 = 0.9096, Class Acc = 1.0000
- Fold 4: NER F1 = 0.9102, Class Acc = 1.0000
- Fold 5: NER F1 = 0.9100, Class Acc = 1.0000
**Best Model:** Fold 4 (NER F1: 0.9102)
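For reference, the aggregate row above can be reproduced from the per-fold scores (NumPy's default population standard deviation matches the reported ±0.0035):

```python
import numpy as np

fold_f1 = np.array([0.9011, 0.9089, 0.9096, 0.9102, 0.9100])
print(f"mean={fold_f1.mean():.4f} std={fold_f1.std():.4f} "
      f"min={fold_f1.min():.4f} max={fold_f1.max():.4f}")
# mean=0.9080 std=0.0035 min=0.9011 max=0.9102
```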
## Model Architecture

- Base Model: dicta-il/dictabert
- Architecture: Joint multi-task learning (see the sketch below)
  - NER head: Token classification (B-PERSON, I-PERSON, O)
  - Role classification head: Sequence classification (AUTHOR, COPYIST, OTHER)
- Training: 5-fold cross-validation with early stopping
- Regularization: Dropout (0.3), Weight decay (0.01)
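For concreteness, here is a minimal sketch of one way to realize such a joint architecture: a shared DictaBERT encoder feeding a token-level NER head and a [CLS]-pooled role head. The class counts, pooling strategy, and naming are illustrative assumptions, not the released implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class JointNERRoleModel(nn.Module):
    """Illustrative joint model: shared encoder, two task-specific heads."""

    def __init__(self, base_model="dicta-il/dictabert",
                 num_ner_labels=3, num_role_labels=3, dropout=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.ner_head = nn.Linear(hidden, num_ner_labels)    # B-PERSON, I-PERSON, O
        self.role_head = nn.Linear(hidden, num_role_labels)  # AUTHOR, COPYIST, OTHER

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = self.dropout(out.last_hidden_state)
        ner_logits = self.ner_head(hidden_states)            # (batch, seq_len, 3)
        role_logits = self.role_head(hidden_states[:, 0])    # [CLS] pooling
        return ner_logits, role_logits
```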
## Intended Use

### Primary Use Cases
- Extracting person names from Hebrew manuscript descriptions
- Identifying roles of people mentioned in manuscripts
- Building knowledge graphs of Hebrew manuscript creators
- Digital humanities research on Hebrew manuscripts
### Example Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer (the joint architecture is custom, so loading may
# require trust_remote_code=True if the repository ships its own modeling code)
tokenizer = AutoTokenizer.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")
model = AutoModel.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")

# Example text (Hebrew): "The book was written by Rabbi Moshe ben Maimon"
text = "הספר נכתב על ידי רבי משה בן מימון"
tokens = tokenizer(text, return_tensors="pt")

# Get predictions
model.eval()
with torch.no_grad():
    outputs = model(**tokens)

# outputs contains both NER and role classification logits
```
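Continuing the example, the logits can be decoded into labels. This assumes the forward pass returns `(ner_logits, role_logits)` as in the architecture sketch above and that the label orders below match the model's config; both are assumptions about the repository's actual output schema:

```python
ner_labels = ["O", "B-PERSON", "I-PERSON"]    # assumed label order
role_labels = ["AUTHOR", "COPYIST", "OTHER"]  # assumed label order

ner_logits, role_logits = outputs
token_strs = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
for token, label_id in zip(token_strs, ner_logits.argmax(dim=-1)[0]):
    print(token, ner_labels[int(label_id)])

print("Role:", role_labels[role_logits.argmax(dim=-1).item()])
```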
## Training Data
- Source: Hebrew manuscript catalog records (MARC format)
- Size: ~10,000 samples
- Annotation: Distant supervision from structured metadata fields (see the sketch below)
- Languages: Hebrew (historical and modern)
- Domain: Manuscript descriptions, colophons, catalog records
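To illustrate the distant-supervision step, person names drawn from structured catalog fields can be projected onto the free text as BIO tags by string matching. A naive sketch (the field extraction, tokenization, and exact-match strategy are all simplifications):

```python
def distant_labels(text_tokens, metadata_persons):
    """Project person names from catalog metadata onto the text as BIO tags."""
    labels = ["O"] * len(text_tokens)
    for name in metadata_persons:
        name_tokens = name.split()
        n = len(name_tokens)
        for i in range(len(text_tokens) - n + 1):
            if text_tokens[i:i + n] == name_tokens:
                labels[i] = "B-PERSON"
                labels[i + 1:i + n] = ["I-PERSON"] * (n - 1)
    return labels

tokens = "הספר נכתב על ידי רבי משה בן מימון".split()
print(distant_labels(tokens, ["משה בן מימון"]))
# ['O', 'O', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON']
```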
## Training Procedure

### Hyperparameters
- Epochs: 10 (with early stopping, patience=3)
- Batch Size: 4
- Learning Rate: 2e-5
- Optimizer: AdamW
- Dropout: 0.3
- Weight Decay: 0.01
- Lambda Weight: 0.5 (multi-task loss balancing; see the sketch below)
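The lambda weight presumably interpolates between the two task losses. A minimal sketch of one common formulation (the released training code may combine the losses differently):

```python
import torch.nn.functional as F

def joint_loss(ner_logits, ner_labels, role_logits, role_labels, lam=0.5):
    # Token-level cross-entropy over BIO tags, flattened over batch and sequence
    ner_loss = F.cross_entropy(
        ner_logits.view(-1, ner_logits.size(-1)),
        ner_labels.view(-1),
        ignore_index=-100,  # skip special and padded tokens
    )
    # Sequence-level cross-entropy over role labels
    role_loss = F.cross_entropy(role_logits, role_labels)
    # lam = 0.5 weights both tasks equally
    return lam * ner_loss + (1.0 - lam) * role_loss
```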
### Data Split
- K-Fold: 5-fold stratified cross-validation
- Stratification: By number of persons per sample (see the sketch below)
- Train/Val per fold: 90/10 split
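A sketch of the fold construction with scikit-learn's StratifiedKFold, using the per-sample person count as the stratification key (the counts and seed here are toy values, not the actual training setup):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-in: one stratification label (person count) per sample
person_counts = np.array([1, 2, 1, 3, 2, 1, 2, 3, 1, 2] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(person_counts, person_counts), 1):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")
```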
## Evaluation

### Metrics
- NER: Precision, Recall, F1 (seqeval)
- Classification: Accuracy
- Combined: Geometric mean of NER F1 and Classification Accuracy (see the worked check below)
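As a worked check of the combined score, the geometric mean for the best fold is sqrt(0.9102 × 1.0000) ≈ 0.9540; the NER side comes from seqeval's entity-level F1:

```python
from math import sqrt
from seqeval.metrics import f1_score

# Combined score for fold 4 (best model)
ner_f1, cls_acc = 0.9102, 1.0000
print(f"combined = {sqrt(ner_f1 * cls_acc):.4f}")  # 0.9540

# seqeval computes entity-level F1 over BIO-tagged sequences
y_true = [["O", "B-PERSON", "I-PERSON", "O"]]
y_pred = [["O", "B-PERSON", "I-PERSON", "O"]]
print(f1_score(y_true, y_pred))  # 1.0
```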
### Validation Strategy
5-fold cross-validation yields more robust performance estimates than a single train/test split and reduces the risk of overfitting the evaluation to one partition.
## Limitations
- Optimized for Hebrew manuscript descriptions (may not generalize to other Hebrew text types)
- Person names are assumed to follow historical Hebrew naming conventions
- Limited to three role categories (Author, Copyist, Other)
- Trained on catalog data (may not work well on manuscript images/OCR)
## Ethical Considerations
- Model trained on historical cultural heritage data
- Should be used to assist, not replace, expert manuscript catalogers
- Potential biases from historical naming conventions and catalog practices
## Citation
If you use this model, please cite:
```bibtex
@misc{hebrew-manuscript-joint-ner-v2,
  author       = {Goldberg, Alexander},
  title        = {Hebrew Manuscript Joint NER Model v2},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner-v2}}
}
```
## Model Card Authors

Alexander Goldberg

## Model Card Contact
For questions or issues, please open an issue on the model repository.