---
library_name: transformers
tags:
- text-generation-inference
- spam-detection
- nlp
- binary-classification
license: apache-2.0
datasets:
- bvk/SMS-spam
- SetFit/enron_spam
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Model Card for Email_Spam_Classifier

<!-- Provide a quick summary of what the model is/does. -->


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model is a fine-tuned version of `distilbert-base-uncased` for binary spam classification. It labels short English text messages (SMS) as HAM (not spam) or SPAM.

- **Developed by:** Ainebyona Abubaker
- **Funded by:** Developed independently by Ainebyona Abubaker with no external funding.
- **Shared by:** Ainebyona Abubaker
- **Model type:** DistilBERT
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model:** distilbert-base-uncased

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier

## Uses

This model can be used for:

- Detecting spam messages in SMS or short text messages

- Educational purposes in NLP and machine learning

- Research and development of spam detection systems

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer.
# The repository ID follows the link given under Model Sources.
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```


<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->


### Downstream Use
- Email spam detection – fine-tune on email datasets for spam classification

- Chat moderation – detecting unwanted or spammy messages in chat apps

- SMS analytics – analyzing messaging patterns for marketing or user studies

- Text classification pipelines – can be incorporated into larger NLP workflows

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->


### Out-of-Scope Use
- Not recommended for high-stakes decisions (legal, financial, or medical) without further validation

- Performance on languages other than English is not guaranteed

- Not tested on longer-form text, such as social media messages

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->


## Bias, Risks, and Limitations
Biases:

- The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.

- It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.

- Minority or unusual types of spam may not be well recognized.

Risks:

- Misclassifying messages could lead to important messages being ignored or spam being delivered.

- Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.

Limitations:

- Only trained for binary classification: HAM (not spam) vs SPAM.

- Performance may degrade on longer texts like social media messages.

- The model may need fine-tuning for datasets outside SMS messages to maintain accuracy.

<!-- This section is meant to convey both technical and sociotechnical limitations. -->


### Recommendations
- This model is recommended for detecting spam in short English text messages (SMS).

- Suitable for educational, research, and prototype applications in NLP and text classification.

- Not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.

- Users are encouraged to fine-tune the model if applying it to new datasets, different languages, or longer text formats.

- Always review model predictions before acting on them, especially in critical applications.
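
One way to act on the last recommendation is to send low-confidence predictions to a human instead of trusting them automatically. The sketch below is only an illustration: the `triage` helper and the `REVIEW_THRESHOLD` value are hypothetical and not part of this model, and the `HAM`/`SPAM` label names are assumed to match what the pipeline returns.

```python
from typing import Dict, List

# Hypothetical cut-off, chosen for illustration; tune it on labelled validation data.
REVIEW_THRESHOLD = 0.90


def triage(predictions: List[Dict[str, object]],
           threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Map each pipeline-style prediction to 'spam', 'ham', or 'review'."""
    decisions = []
    for pred in predictions:
        label = str(pred["label"]).upper()
        score = float(pred["score"])
        if score < threshold:
            # Not confident enough to act automatically.
            decisions.append("review")
        elif label == "SPAM":
            decisions.append("spam")
        else:
            decisions.append("ham")
    return decisions


# Hand-written stand-ins for classifier output, not real model predictions.
sample = [
    {"label": "SPAM", "score": 0.99},
    {"label": "HAM", "score": 0.97},
    {"label": "SPAM", "score": 0.62},
]
print(triage(sample))  # ['spam', 'ham', 'review']
```

In practice the list passed to `triage` would be the output of the `text-classification` pipeline, and the threshold would be chosen by looking at errors on held-out data.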


<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer.
# The repository ID follows the link given under Model Sources.
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 Amazon gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```


## Training Details
- Base Model: distilbert-base-uncased (DistilBERT)

- Task: Binary SMS spam classification (HAM / SPAM)

- Dataset: SMS Spam Collection (80% train, 20% eval)

- Preprocessing: Tokenized with padding & truncation

- Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer

- Metrics: Accuracy, Weighted F1-score

- Trained for short English SMS messages; fine-tuning may be needed for other text types or languages.
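
As a rough sanity check on these settings, the number of optimizer steps can be estimated from the reported figures. The sketch below assumes about 5,500 messages in total, a single device, and no gradient accumulation. None of these three is stated exactly in this card, so the result is an estimate only.

```python
import math
from typing import Tuple


def training_steps(num_examples: int, train_fraction: float,
                   batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps) for a simple training loop."""
    train_examples = round(num_examples * train_fraction)
    # The last, partially filled batch still counts as one step.
    steps_per_epoch = math.ceil(train_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# ~5,500 messages, 80% train split, batch size 16, 3 epochs
print(training_steps(5500, 0.8, 16, 3))  # (275, 825)
```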

### Training Data
- Primary Dataset: SMS Spam Collection Dataset

- Content: English SMS messages labeled as HAM (not spam) or SPAM

- Size: ~5,500 messages

- Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM)

- Additional Datasets: Optional. Other SMS or spam datasets can be combined with the primary dataset to improve generalization

- The model is optimized for short English SMS messages; performance on other text types or languages may vary.
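
The label mapping and the 80/20 split described in this card can be sketched in plain Python. The snippet below is not the original preprocessing code. The `encode_and_split` helper, the toy rows, and the fixed seed are all hypothetical and only show the idea.

```python
import random
from typing import List, Tuple

# Label mapping as described above: 0 = HAM, 1 = SPAM.
LABEL2ID = {"HAM": 0, "SPAM": 1}


def encode_and_split(rows: List[Tuple[str, str]],
                     eval_fraction: float = 0.2,
                     seed: int = 42):
    """Map string labels to integers, shuffle, and split into (train, eval)."""
    encoded = [(text, LABEL2ID[label.upper()]) for text, label in rows]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(encoded)
    n_eval = round(len(encoded) * eval_fraction)
    return encoded[n_eval:], encoded[:n_eval]


# Toy rows written for this example; they are not taken from the dataset.
rows = [
    ("Are we still meeting at 5?", "ham"),
    ("WIN a free prize now, reply YES", "spam"),
    ("Call me when you get home", "ham"),
    ("URGENT: claim your reward today", "spam"),
    ("Lunch tomorrow?", "ham"),
    ("See you at the gym", "ham"),
    ("Free entry in a weekly draw", "spam"),
    ("Running late, sorry", "ham"),
    ("Can you send the notes?", "ham"),
    ("Happy birthday!", "ham"),
]

train_rows, eval_rows = encode_and_split(rows)
print(len(train_rows), len(eval_rows))  # 8 2
```

A plain shuffled split like this can leave very few SPAM examples in the evaluation set when the data is imbalanced. A stratified split is a common alternative, though this card does not say which was used.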

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

### Training Procedure
1. Data Preparation:
     - Loaded the SMS Spam Collection dataset
     - Tokenized messages using AutoTokenizer with padding and truncation
     - Split dataset: 80% train, 20% evaluation

2. Model Setup:
     - Base model: distilbert-base-uncased
     - Task: Binary classification (HAM vs SPAM)

3. Training:
     - Optimizer: AdamW
     - Learning rate: 2e-5
     - Batch size: 16 (train & eval)
     - Number of epochs: 3

4. Evaluation and checkpointing: performed at each epoch

5. Metrics Monitored:
     - Accuracy
     - Weighted F1-score

Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types.
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
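
The two monitored metrics can be written out in a few lines of plain Python. This is a minimal sketch for illustration, not the evaluation code used in training. In practice `accuracy_score` and `f1_score(average="weighted")` from scikit-learn are the usual choice. The label lists are made up for the example.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Per-class F1, averaged with each class weighted by its true count."""
    total = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support
    return total / len(y_true)


# Made-up labels: 0 = HAM, 1 = SPAM
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
print(round(accuracy(y_true, y_pred), 3))     # 0.833
print(round(weighted_f1(y_true, y_pred), 3))  # 0.838
```

Weighting by class support matters here because spam datasets are usually imbalanced, so the weighted F1 score is dominated by the HAM class. Precision and recall for the SPAM class are worth checking separately.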

## Model Card Authors 

Ainebyona Abubaker

## Model Card Contact
- Name: Ainebyona Abubaker
- Email: ainebyonabubaker@proton.me
- GitHub: https://github.com/kenbaker-gif