Update README.md

README.md +2 -132

README.md CHANGED

@@ -97,6 +97,8 @@ Users should be aware that ZentryPII-278M is optimized for ASR-style conversatio
 ---
 ## How to Get Started with the Model
 Use the code snippet below to run the model using 🤗 Transformers:
@@ -113,144 +115,12 @@ output = ner("i met rohit near connaught place at three thirty")
 for ent in output:
     print(f"{ent['word']} → {ent['entity_group']}")
-## Training Details
-### Training Data
-The training data for ZentryPII-278M consists of a synthetic dataset of ~1,000 conversational, ASR-style utterances constructed using name, location, and time expression templates. These examples were generated to reflect realistic speech patterns, disfluencies (e.g., "um", "haan", "like"), and English-Hindi code-switching.
-Each sentence was tokenized and annotated with BIO-style labels:
-- `B-NAME`: Names (e.g., Ramesh, Neha)
-- `B-LOC`: Locations (e.g., Mumbai, Connaught Place)
-- `B-TIME`: Time references (e.g., three thirty, Sunday morning)
-- `O`: Non-PII tokens
-The dataset is not publicly released as a standalone resource, but was generated specifically to fine-tune ZentryPII on redaction-style PII tagging in noisy, multilingual text.
-### Training Procedure
-- **Model used:** `xlm-roberta-base`
-- **Objective:** Token classification using cross-entropy loss
-- **Training Framework:** Hugging Face Transformers (Trainer API)
-- **Epochs:** 5
-- **Train/Test Split:** 90/10 stratified at sentence level
-- **Batch size:** 8
-- **Optimizer:** AdamW
-- **Hardware:** Google Colab T4 (free tier)
-- **Eval Metric:** Token-level accuracy and `seqeval` loss
-- **Final Eval Loss:** ~3.26e-05
-#### Training Hyperparameters
-- **Training regime:** fp32
-- **Learning rate scheduler:** linear with warmup
-- **Weight decay:** 0.01
-- **Warmup steps:** 0
-- **Gradient clipping:** None
-- **Evaluation strategy:** after each epoch
-- **Save strategy:** every 500 steps
-- **Logging:** every 10 steps
-- **Seed:** 42
----
-#### Speeds, Sizes, Times
-- **Total training time:** ~20 minutes on Google Colab (T4 GPU)
-- **Final checkpoint size:** ~1.06 GB
-- **Throughput:** ~39 samples/sec on evaluation
-- **Evaluation runtime:** ~2.55 seconds
----
-## Evaluation
-### Testing Data, Factors & Metrics
-#### Testing Data
-The held-out test set (~10% of training data) was sampled from the synthetic BIO-tagged dataset used for training. It includes ASR-style sentences with varied disfluencies and code-switching examples in Hindi-English.
-Evaluation was performed using Hugging Face’s `Trainer.evaluate()` API on token-level classification.
-**Entity types evaluated:**
-- `B-NAME`
-- `B-LOC`
-- `B-TIME`
-**Metrics used:**
-- `eval_loss` (cross-entropy)
-- [planned: `seqeval` F1-score in future update]
-**Final evaluation result:**
-- `eval_loss`: `3.26e-05`
-#### Factors
-The evaluation did not explicitly disaggregate results by subpopulations or domains. However, the synthetic test set includes:
-- Code-switched utterances (Hindi-English)
-- Disfluent speech (fillers, hesitations)
-- Mixed-case and punctuation-stripped phrases to simulate Whisper-style ASR output
-Future iterations may include evaluations across real-world datasets and dialectal variation.
----
-#### Metrics
-- **Eval Loss** (cross-entropy): Measures token-level classification confidence.
-- Intended metrics such as **precision**, **recall**, and **F1-score** (via `seqeval`) were not computed in this release but will be included in a future version for fine-grained NER-style performance analysis.
----
-### Results
-- **Eval Loss (final checkpoint):** `3.26e-05`
-- **Evaluation runtime:** `2.55 seconds`
-- **Samples/sec during evaluation:** ~39
-The extremely low evaluation loss indicates strong learning stability and high token-level accuracy within the test domain.
----
-#### Summary
-ZentryPII-278M demonstrates excellent performance on synthetic ASR-style transcripts, achieving near-zero loss over a multilingual, code-switched BIO-tagged benchmark. Its architecture (XLM-RoBERTa) allows for robust generalization across English and Hindi tokens with informal structure. While tested only on synthetic data, it sets the foundation for more rigorous real-world deployment within LexGuard’s privacy-preserving stack.
-### Model Architecture and Objective
-ZentryPII-278M is based on the `xlm-roberta-base` transformer architecture — a multilingual masked language model pretrained on 100+ languages. It has been adapted for the **token classification** objective using a linear classifier head on top of the contextual embeddings. The model is fine-tuned using a BIO tagging scheme to detect and label PII entities such as names, locations, and temporal references.
----
-### Compute Infrastructure
-#### Hardware
-- Training was performed on a **Google Colab T4 GPU instance**
-- 16 GB system RAM
-- GPU Memory: ~15 GB
-#### Software
-- Python 3.10
-- Hugging Face Transformers 4.38+
-- Datasets 2.x
-- PyTorch 2.x
-- Accelerate and Tokenizers libraries
-- Environment: Google Colab (Free Tier)
----
-## Model Card Contact
-For questions, usage inquiries, or integration support:
-**📧 Email:** [email protected]
-**👤 Maintainer:** Sanskar Pandey
-**🏢 Organization:** LexGuard

 ---
+# ZentryPII-278M
 ## How to Get Started with the Model
 Use the code snippet below to run the model using 🤗 Transformers:
 for ent in output:
     print(f"{ent['word']} → {ent['entity_group']}")
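Reviewer note: the retained snippet prints each entity's `word` and `entity_group`. Since the model card describes redaction-style PII tagging, a minimal sketch of turning that output into a redacted string may help readers. It assumes `output` is the aggregated result of the 🤗 Transformers token-classification pipeline, where each entry also carries `start` and `end` character offsets. The `redact` and `span` helpers and the hand-built `sample_output` are hypothetical and stand in for a real model call, so the sketch runs without downloading the model.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is assumed to look like aggregated pipeline output:
    dicts with "entity_group", "start", and "end" keys.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "i met rohit near connaught place at three thirty"


def span(label, phrase):
    """Build a hypothetical entity dict by locating `phrase` in `text`."""
    start = text.index(phrase)
    return {
        "entity_group": label,
        "word": phrase,
        "start": start,
        "end": start + len(phrase),
    }


# Hand-written stand-in for `output = ner(text)`; not real model predictions.
sample_output = [
    span("NAME", "rohit"),
    span("LOC", "connaught place"),
    span("TIME", "three thirty"),
]

redacted = redact(text, sample_output)
print(redacted)
```

With a real pipeline, `sample_output` would be replaced by the `output` list from the snippet above, provided the pipeline was created with an aggregation strategy so that `entity_group`, `start`, and `end` are present.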