---
library_name: transformers
tags:
- text-generation-inference
- spam-detection
- nlp
- binary-classification
license: apache-2.0
datasets:
- bvk/SMS-spam
- SetFit/enron_spam
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---
# Model Card for Email_Spam_Classifier
<!-- Provide a quick summary of what the model is/does. -->
A DistilBERT-based binary classifier that labels short English text messages as HAM (not spam) or SPAM.
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model fine-tunes `distilbert-base-uncased` for binary spam detection, classifying short English text messages as HAM (not spam) or SPAM.
- **Developed by:** Ainebyona Abubaker
- **Funded by:** This model was developed independently by Ainebyona Abubaker with no external funding.
- **Shared by:** Ainebyona Abubaker
- **Model type:** DistilBERT
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model:** distilbert-base-uncased
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier
## Uses
This model can be used for:
- Detecting spam in SMS and other short text messages
- Educational purposes in NLP and machine learning
- Research and development of spam detection systems
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "kenbaker-gif/Email-Spam-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 gift card.")
print(result)
# Output: [{'label': 'SPAM', 'score': 0.99}]
```
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
### Downstream Use
- Email spam detection – fine-tune on email datasets for spam classification (see the sketch below)
- Chat moderation – detecting unwanted or spammy messages in chat apps
- SMS analytics – analyzing messaging patterns for marketing or user studies
- Text classification pipelines – can be incorporated into larger NLP workflows
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
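For example, the checkpoint can be further fine-tuned on an email corpus. The sketch below is a minimal illustration, not the original training script; it assumes the `SetFit/enron_spam` dataset listed in this card's metadata, which typically exposes `text` and `label` columns:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from this checkpoint rather than the base model.
model_name = "kenbaker-gif/Email-Spam-Classifier"  # as used in the examples above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Email corpus from this card's metadata; SetFit-formatted datasets
# usually expose "text" and "label" columns (an assumption here).
emails = load_dataset("SetFit/enron_spam")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

emails = emails.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="email-spam-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=16,
                           learning_rate=2e-5),
    train_dataset=emails["train"],
    eval_dataset=emails["test"],
)
trainer.train()
```

The resulting checkpoint in `email-spam-finetuned/` can then be loaded with the same pipeline code shown under Direct Use.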
### Out-of-Scope Use
- Not recommended for high-stakes decisions (legal, financial, or medical) without further validation
- Performance on languages other than English is not guaranteed
- Not tested on long-form text, such as posts from social media platforms
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
## Bias, Risks, and Limitations
**Biases:**
- The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.
- It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.
- Rare or unusual types of spam may not be well recognized.

**Risks:**
- Misclassifying messages could lead to important messages being ignored or spam being delivered.
- Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.

**Limitations:**
- Only trained for binary classification: HAM (not spam) vs SPAM.
- Performance may degrade on longer texts, such as social media messages.
- The model may need fine-tuning to maintain accuracy on datasets outside SMS messages.
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
### Recommendations
- This model is recommended for detecting spam in short English text messages (SMS).
- Suitable for educational, research, and prototype applications in NLP and text classification.
- Not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.
- Users are encouraged to fine-tune the model if applying it to new datasets, different languages, or longer text formats.
- Always review model predictions before acting on them, especially in critical applications.
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
💡 Tip: Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "kenbaker-gif/Email-Spam-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 Amazon gift card.")
print(result)
# Output: [{'label': 'SPAM', 'score': 0.99}]
```
## Training Details
- Base Model: distilbert-base-uncased (DistilBERT)
- Task: Binary SMS spam classification (HAM / SPAM)
- Dataset: SMS Spam Collection (80% train, 20% eval)
- Preprocessing: Tokenized with padding & truncation
- Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer
- Metrics: Accuracy, Weighted F1-score
- Trained for short English SMS messages; fine-tuning may be needed for other text types or languages.
### Training Data
- Primary Dataset: SMS Spam Collection Dataset
- Content: English SMS messages labeled as HAM (not spam) or SPAM
- Size: ~5,500 messages
- Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM) (see the sketch below)
- Additional Datasets: optional; other SMS/spam datasets can be combined to improve generalization
- The model is optimized for short English SMS messages; performance on other text types or languages may vary.
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
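A minimal sketch of the preprocessing described above. The column names in the toy example are illustrative, not the dataset's actual schema:

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Toy stand-in for the SMS Spam Collection; the real corpus uses the same HAM/SPAM labels.
raw = Dataset.from_dict({
    "text": ["Ok lar... Joking wif u oni", "WINNER!! Claim your prize now"],
    "label_text": ["ham", "spam"],
})

# Map string labels to integers: 0 = HAM, 1 = SPAM.
label2id = {"ham": 0, "spam": 1}
raw = raw.map(lambda ex: {"label": label2id[ex["label_text"]]})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad and truncate so every example fits the model's input size.
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = raw.map(tokenize, batched=True)

# 80% train / 20% evaluation split, as used during training.
splits = tokenized.train_test_split(test_size=0.2, seed=42)
```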
### Training Procedure
1. Data Preparation:
   - Loaded the SMS Spam Collection dataset
   - Tokenized messages using AutoTokenizer with padding and truncation
   - Split dataset: 80% train, 20% evaluation
2. Model Setup:
   - Base model: distilbert-base-uncased
   - Task: Binary classification (HAM vs SPAM)
3. Training:
   - Optimizer: AdamW
   - Learning rate: 2e-5
   - Batch size: 16 (train & eval)
   - Number of epochs: 3
   - Evaluation and checkpointing performed at each epoch
4. Metrics Monitored:
   - Accuracy
   - Weighted F1-score

Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types. A runnable sketch of this procedure is shown below.
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
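A sketch of the training loop above using the 🤗 `Trainer`. Hyperparameters follow the list (3 epochs, batch size 16, learning rate 2e-5; AdamW is the `Trainer` default), and `splits` is assumed to come from the preprocessing sketch in the Training Data section:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def compute_metrics(eval_pred):
    # Accuracy and weighted F1-score, as monitored during training.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="weighted")}

args = TrainingArguments(
    output_dir="sms-spam-distilbert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,      # AdamW is the default optimizer
    eval_strategy="epoch",   # evaluate each epoch ("evaluation_strategy" on older transformers)
    save_strategy="epoch",   # checkpoint each epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],  # from the preprocessing sketch above
    eval_dataset=splits["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```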
## Model Card Authors
Ainebyona Abubaker
## Model Card Contact
- Name: Ainebyona Abubaker
- Email: [email protected]
- GitHub: https://github.com/kenbaker-gif