---
library_name: transformers
tags:
- text-generation-inference
- spam-detection
- nlp
- binary-classification
license: apache-2.0
datasets:
- bvk/SMS-spam
- SetFit/enron_spam
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Model Card for Email_Spam_Classifier

A DistilBERT-based binary classifier that labels short English text messages as HAM (not spam) or SPAM.

## Model Details

### Model Description

This model is a fine-tuned version of distilbert/distilbert-base-uncased for binary spam detection. Given a short English message, it predicts whether the message is HAM (not spam) or SPAM.

- **Developed by:** Ainebyona Abubaker
- **Funded by:** Developed independently by Ainebyona Abubaker with no external funding.
- **Shared by:** Ainebyona Abubaker
- **Model type:** DistilBERT
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** distilbert/distilbert-base-uncased

### Model Sources

- **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier

## Uses

This model can be used for:

- Detecting spam in SMS or other short text messages
- Educational purposes in NLP and machine learning
- Research and development of spam detection systems

### Direct Use

The model can be used out of the box with the 🤗 `pipeline` API:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```

### Downstream Use

- Email spam detection – fine-tune on email datasets for spam classification
- Chat moderation – detecting unwanted or spammy messages in chat apps (see the sketch after this list)
- SMS analytics – analyzing messaging patterns for marketing or user studies
- Text classification pipelines – can be incorporated into larger NLP workflows
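
For the chat-moderation case, a minimal sketch is to run the pipeline over a batch of messages and flag anything labeled SPAM. This reuses the `classifier` built under Direct Use; the message strings are invented for illustration:

```python
messages = [
    "Hey, are we still on for lunch tomorrow?",
    "URGENT! Claim your free prize now",
]

# The pipeline accepts a list of strings and returns one prediction per message.
for message, prediction in zip(messages, classifier(messages)):
    if prediction["label"] == "SPAM":
        print(f"Flagged as spam ({prediction['score']:.2f}): {message}")
```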

### Out-of-Scope Use

- Not recommended for high-stakes decisions (legal, financial, or medical) without further validation
- Performance on languages other than English is not guaranteed
- Not tested on long-form text, such as posts from social media platforms

## Bias, Risks, and Limitations

Biases:

- The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.
- It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.
- Minority or unusual types of spam may not be well recognized.

Risks:

- Misclassifying messages could lead to important messages being ignored or spam being delivered.
- Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.

Limitations:

- Only trained for binary classification: HAM (not spam) vs SPAM.
- Performance may degrade on longer texts, such as social media posts.
- The model may need fine-tuning on datasets outside SMS messages to maintain accuracy.

### Recommendations

- This model is recommended for detecting spam in short English text messages (SMS).
- Suitable for educational, research, and prototype applications in NLP and text classification.
- Not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.
- Users are encouraged to fine-tune the model when applying it to new datasets, different languages, or longer text formats.
- Always review model predictions before acting on them, especially in critical applications.

💡 Tip: Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 Amazon gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```
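
In keeping with the recommendation to review predictions before acting on them, one option is to auto-flag only high-confidence SPAM predictions and route the rest to a human. The 0.9 cutoff below is an illustrative assumption, not a tuned value:

```python
THRESHOLD = 0.9  # assumed starting point; tune on your own validation data

prediction = classifier("Win a free vacation! Reply YES to claim.")[0]
if prediction["label"] == "SPAM" and prediction["score"] >= THRESHOLD:
    print("Auto-flagged as spam")
else:
    print("Routed to human review")
```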

## Training Details

- Base Model: distilbert-base-uncased (DistilBERT)
- Task: Binary SMS spam classification (HAM / SPAM)
- Dataset: SMS Spam Collection (80% train, 20% eval)
- Preprocessing: Tokenized with padding and truncation
- Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer
- Metrics: Accuracy, weighted F1-score
- Trained on short English SMS messages; fine-tuning may be needed for other text types or languages.

### Training Data

- Primary Dataset: SMS Spam Collection Dataset
- Content: English SMS messages labeled as HAM (not spam) or SPAM
- Size: ~5,500 messages
- Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM); a preprocessing sketch follows this list
- Additional Datasets: Optional – can combine with other SMS/spam datasets to improve generalization
- The model is optimized for short English SMS messages; performance on other text types or languages may vary.
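
A minimal preprocessing sketch, assuming the 🤗 `datasets` library and that the spam dataset exposes `text` and `label` columns (adjust the names for your own copy of the SMS Spam Collection):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed column names: "text" (message body) and "label" (0 = HAM, 1 = SPAM).
dataset = load_dataset("bvk/SMS-spam")["train"].train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate every message to the model's maximum input length.
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
```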

### Training Procedure

1. Data Preparation:
   - Loaded the SMS Spam Collection dataset
   - Tokenized messages using AutoTokenizer with padding and truncation
   - Split the dataset: 80% train, 20% evaluation

2. Model Setup:
   - Base model: distilbert-base-uncased
   - Task: Binary classification (HAM vs SPAM)

3. Training:
   - Optimizer: AdamW
   - Learning rate: 2e-5
   - Batch size: 16 (train & eval)
   - Number of epochs: 3
   - Evaluation and checkpointing performed at each epoch

4. Metrics Monitored:
   - Accuracy
   - Weighted F1-score

Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types. A sketch of this setup with the 🤗 Trainer API follows.
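
The sketch below wires the hyperparameters above into the 🤗 Trainer API, reusing the `tokenized` splits from the preprocessing sketch. The output path and metric implementation are assumptions; AdamW is the Trainer default optimizer:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2  # 0 = HAM, 1 = SPAM
)

def compute_metrics(eval_pred):
    # Accuracy and weighted F1, the two metrics this card reports.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

args = TrainingArguments(
    output_dir="spam-classifier",  # assumed local path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",  # checkpoint at each epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```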

## Model Card Authors

Ainebyona Abubaker

## Model Card Contact

- Name: Ainebyona Abubaker
- Email: [email protected]
- GitHub: https://github.com/kenbaker-gif