---
library_name: transformers
tags:
- text-generation-inference
- spam-detection
- nlp
- binary-classification
license: apache-2.0
datasets:
- bvk/SMS-spam
- SetFit/enron_spam
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Model Card for Email_Spam_Classifier

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub: a DistilBERT model fine-tuned to classify short English text messages as HAM (not spam) or SPAM.

- **Developed by:** Ainebyona Abubaker
- **Funded by:** This model was developed independently by Ainebyona Abubaker with no external funding.
- **Shared by:** Ainebyona Abubaker
- **Model type:** DistilBERT
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier

## Uses

This model can be used for:

- Detecting spam messages in SMS or short text messages
- Educational purposes in NLP and machine learning
- Research and development of spam detection systems

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer.
# The ID follows the repository URL under Model Sources; adjust it if the Hub repo name differs.
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```

### Downstream Use
- Email spam detection: fine-tune on email datasets for spam classification
- Chat moderation: detect unwanted or spammy messages in chat apps
- SMS analytics: analyze messaging patterns for marketing or user studies
- Text classification pipelines: incorporate the model into larger NLP workflows

### Out-of-Scope Use

- Not recommended for high-stakes decisions (legal, financial, or medical) without further validation
- Performance on languages other than English is not guaranteed
- Not tested on long-form text such as social media posts

## Bias, Risks, and Limitations

Biases:

- The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.
- It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.
- Minority or unusual types of spam may not be well recognized.

Risks:

- Misclassified messages could cause important messages to be ignored or spam to be delivered.
- Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.

Limitations:

- Trained only for binary classification: HAM (not spam) vs. SPAM.
- Performance may degrade on longer texts such as social media posts.
- The model may need fine-tuning on datasets other than SMS messages to maintain accuracy.

### Recommendations

- This model is recommended for detecting spam in short English text messages (SMS).
- It is suitable for educational, research, and prototype applications in NLP and text classification.
- It is not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.
- Users are encouraged to fine-tune the model before applying it to new datasets, different languages, or longer text formats.
- Always review model predictions before acting on them, especially in critical applications.
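The recommendation to review predictions before acting on them can be sketched as a simple confidence gate. This is a minimal illustration and not part of the released model: the `triage` helper and the 0.90 threshold are hypothetical, and the sample input only mimics the list-of-dicts format returned by a `text-classification` pipeline.

```python
from typing import Dict, List

# Hypothetical cut-off; choose a value based on your own validation data.
REVIEW_THRESHOLD = 0.90


def triage(predictions: List[Dict], threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Map each prediction to 'spam', 'ham', or 'needs_review'."""
    decisions = []
    for pred in predictions:
        if pred["score"] < threshold:
            # Low-confidence predictions are routed to a human reviewer.
            decisions.append("needs_review")
        else:
            decisions.append(pred["label"].lower())
    return decisions


# Hand-written sample in the pipeline's output format (not real model output).
sample = [
    {"label": "SPAM", "score": 0.99},
    {"label": "HAM", "score": 0.97},
    {"label": "SPAM", "score": 0.62},
]
print(triage(sample))  # ['spam', 'ham', 'needs_review']
```

A gate like this only makes uncertain cases visible; it does not make the underlying predictions more accurate.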
💡 Tip: Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer.
# The ID follows the repository URL under Model Sources; adjust it if the Hub repo name differs.
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 Amazon gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```

## Training Details

- Base Model: distilbert-base-uncased (DistilBERT)
- Task: Binary SMS spam classification (HAM / SPAM)
- Dataset: SMS Spam Collection (80% train, 20% eval)
- Preprocessing: Tokenized with padding and truncation
- Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer
- Metrics: Accuracy, Weighted F1-score
- Trained for short English SMS messages; fine-tuning may be needed for other text types or languages.

### Training Data

- Primary Dataset: SMS Spam Collection Dataset
- Content: English SMS messages labeled as HAM (not spam) or SPAM
- Size: ~5,500 messages
- Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM)
- Additional Datasets (optional): can be combined with other SMS/spam datasets to improve generalization
- The model is optimized for short English SMS messages; performance on other text types or languages may vary.

### Training Procedure

1. Data Preparation:
   - Loaded the SMS Spam Collection dataset
   - Tokenized messages using AutoTokenizer with padding and truncation
   - Split dataset: 80% train, 20% evaluation
2. Model Setup:
   - Base model: distilbert-base-uncased
   - Task: Binary classification (HAM vs. SPAM)
3. Training:
   - Optimizer: AdamW
   - Learning rate: 2e-5
   - Batch size: 16 (train and eval)
   - Number of epochs: 3
   - Evaluation and checkpointing performed at each epoch
4. Metrics Monitored:
   - Accuracy
   - Weighted F1-score

Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types.

## Model Card Authors

Ainebyona Abubaker

## Model Card Contact

- Name: Ainebyona Abubaker
- Email: ainebyonabubaker@proton.me
- GitHub: https://github.com/kenbaker-gif
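
For readers who want to check the two metrics named under Training Procedure, accuracy and weighted F1, the sketch below computes both in plain Python. It is an illustration only, not the author's evaluation code: the function names and toy labels are hypothetical, and the values 0 and 1 follow the HAM/SPAM label mapping described under Training Data.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = sum(t == cls for t in y_true)
        total += f1 * support
    return total / len(y_true)


# Toy labels (hypothetical): 0 = HAM, 1 = SPAM
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Weighting by support matters for spam data because HAM messages usually outnumber SPAM, so an unweighted average would give the smaller class the same influence as the larger one.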