Hebrew Manuscript Joint NER Model v2

Model Description

This model performs joint entity recognition and role classification for Hebrew historical manuscripts. It simultaneously performs:

  1. Named Entity Recognition (NER): identifying person names in Hebrew text
  2. Role Classification: classifying each identified person as Author, Copyist, or Other

Key Features:

  • Multi-task learning architecture
  • Trained on Hebrew manuscript catalog data (MARC records)
  • Uses distant supervision from structured metadata
  • Optimized for historical Hebrew text
  • Robust k-fold cross-validation

K-Fold Cross-Validation Results

This model was trained using 5-fold cross-validation for robust evaluation.

Aggregate Performance

Metric                     Mean     Std Dev   Min      Max
NER F1                     0.9080   ±0.0035   0.9011   0.9102
Classification Accuracy    1.0000   ±0.0000   1.0000   1.0000

Per-Fold Results

  • Fold 1: NER F1 = 0.9011, Class Acc = 1.0000
  • Fold 2: NER F1 = 0.9089, Class Acc = 1.0000
  • Fold 3: NER F1 = 0.9096, Class Acc = 1.0000
  • Fold 4: NER F1 = 0.9102, Class Acc = 1.0000
  • Fold 5: NER F1 = 0.9100, Class Acc = 1.0000

Best Model: Fold 4 (NER F1: 0.9102)
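The aggregate row can be recomputed from the per-fold scores above; the reported ±0.0035 is consistent with the population (not sample) standard deviation:

```python
import statistics

fold_f1 = [0.9011, 0.9089, 0.9096, 0.9102, 0.9100]

mean_f1 = statistics.mean(fold_f1)    # 0.90796 -> reported as 0.9080
std_f1 = statistics.pstdev(fold_f1)   # ~0.00346 -> reported as 0.0035
print(round(mean_f1, 4), round(std_f1, 4), min(fold_f1), max(fold_f1))
```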

Model Architecture

  • Base Model: dicta-il/dictabert
  • Architecture: Joint multi-task learning
    • NER head: Token classification (B-PERSON, I-PERSON, O)
    • Role classification head: Sequence classification (AUTHOR, COPYIST, OTHER)
  • Training: 5-fold cross-validation with early stopping
  • Regularization: Dropout (0.3), Weight decay (0.01)
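The shared-encoder, two-head design above can be sketched as follows. This is an illustrative reimplementation, not the repository's actual modeling code; the hidden size, first-token pooling for the role head, and the dummy encoder are assumptions made so the sketch runs standalone:

```python
import torch
import torch.nn as nn

class JointNERRoleModel(nn.Module):
    """Sketch of a shared encoder with a token-level NER head and a
    sequence-level role head (hypothetical; real repo code may differ)."""
    def __init__(self, encoder, hidden_size=768,
                 num_ner_labels=3, num_roles=3, dropout=0.3):
        super().__init__()
        self.encoder = encoder                  # e.g. DictaBERT (assumed interface)
        self.dropout = nn.Dropout(dropout)      # matches the card's dropout=0.3
        self.ner_head = nn.Linear(hidden_size, num_ner_labels)   # B-PERSON, I-PERSON, O
        self.role_head = nn.Linear(hidden_size, num_roles)       # AUTHOR, COPYIST, OTHER

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)  # (B, T, H)
        hidden = self.dropout(hidden)
        ner_logits = self.ner_head(hidden)           # per-token label scores
        role_logits = self.role_head(hidden[:, 0])   # first-token ("[CLS]") pooling
        return ner_logits, role_logits

# Tiny stand-in encoder so the sketch runs without downloading weights
class DummyEncoder(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden_size)
    def forward(self, input_ids, attention_mask=None):
        return self.emb(input_ids)

model = JointNERRoleModel(DummyEncoder())
ids = torch.randint(0, 1000, (2, 8))        # batch of 2 sequences, 8 tokens each
ner_logits, role_logits = model(ids)
print(ner_logits.shape, role_logits.shape)  # (2, 8, 3) and (2, 3)
```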

Intended Use

Primary Use Cases

  • Extracting person names from Hebrew manuscript descriptions
  • Identifying roles of people mentioned in manuscripts
  • Building knowledge graphs of Hebrew manuscript creators
  • Digital humanities research on Hebrew manuscripts

Example Usage

from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer (a joint-head model may additionally require
# trust_remote_code=True if the repo ships custom modeling code)
tokenizer = AutoTokenizer.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")
model = AutoModel.from_pretrained("alexgoldberg/hebrew-manuscript-joint-ner-v2")

# Example text (Hebrew): "The book was written by Rabbi Moshe ben Maimon"
text = "הספר נכתב על ידי רבי משה בן מימון"
tokens = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**tokens)
    # outputs contains both NER and role classification logits

Training Data

  • Source: Hebrew manuscript catalog records (MARC format)
  • Size: ~10,000 samples
  • Annotation: Distant supervision from structured metadata fields
  • Languages: Hebrew (historical and modern)
  • Domain: Manuscript descriptions, colophons, catalog records

Training Procedure

Hyperparameters

  • Epochs: 10 (with early stopping, patience=3)
  • Batch Size: 4
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Dropout: 0.3
  • Weight Decay: 0.01
  • Lambda Weight: 0.5 (for multi-task loss balancing)
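The card lists a lambda weight of 0.5 for loss balancing but does not state the exact combination. A common convention is a convex blend of the two task losses; the form below is a sketch, not necessarily the formula used in training:

```python
def joint_loss(ner_loss, role_loss, lam=0.5):
    # lam = 1.0 trains NER only; lam = 0.0 trains role classification only
    return lam * ner_loss + (1.0 - lam) * role_loss

print(joint_loss(0.8, 0.2))  # 0.5
```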

Data Split

  • K-Fold: 5-fold stratified cross-validation
  • Stratification: By number of persons per sample
  • Train/Val per fold: 90/10 split
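A stratified 5-fold split of this kind can be reproduced with scikit-learn. The strata values below (persons per sample) are made-up illustration data, and the `shuffle`/`random_state` settings are assumptions:

```python
from sklearn.model_selection import StratifiedKFold

# Hypothetical strata: number of persons mentioned in each of 20 samples
n_persons = [1, 2, 1, 3, 2, 1, 3, 2, 3, 1, 2, 1, 3, 2, 1, 2, 1, 3, 2, 1]
samples = list(range(len(n_persons)))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(samples, n_persons), start=1):
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)}")
```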

Evaluation

Metrics

  • NER: Precision, Recall, F1 (seqeval)
  • Classification: Accuracy
  • Combined: Geometric mean of NER F1 and Classification Accuracy
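For the best fold, the combined metric works out as below. The helper function is a simple check of the stated definition; the repository's actual aggregation code is not shown here:

```python
import math

def combined_score(ner_f1, cls_acc):
    # Geometric mean rewards balanced performance on both tasks
    return math.sqrt(ner_f1 * cls_acc)

print(round(combined_score(0.9102, 1.0), 4))  # 0.954
```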

Validation Strategy

5-fold cross-validation ensures robust performance estimates and reduces overfitting to a single train/test split.

Limitations

  • Optimized for Hebrew manuscript descriptions (may not generalize to other Hebrew text types)
  • Person names must follow historical Hebrew naming conventions
  • Limited to three role categories (Author, Copyist, Other)
  • Trained on catalog data (may not work well on manuscript images/OCR)

Ethical Considerations

  • Model trained on historical cultural heritage data
  • Should be used to assist, not replace, expert manuscript catalogers
  • Potential biases from historical naming conventions and catalog practices

Citation

If you use this model, please cite:

@misc{hebrew-manuscript-joint-ner-v2,
  author = {Goldberg, Alexander},
  title = {Hebrew Manuscript Joint NER Model v2},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/alexgoldberg/hebrew-manuscript-joint-ner-v2}}
}

Model Card Authors

Alexander Goldberg

Model Card Contact

For questions or issues, please open an issue on the model repository.
