---
library_name: transformers
tags:
- text-generation-inference
- spam-detection
- nlp
- binary-classification
license: apache-2.0
datasets:
- bvk/SMS-spam
- SetFit/enron_spam
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Model Card for Email_Spam_Classifier

A DistilBERT-based binary classifier that labels short English text messages as HAM (not spam) or SPAM.

## Model Details

### Model Description

This model is a fine-tuned version of distilbert/distilbert-base-uncased for binary spam detection. Given a short English message, it predicts whether the message is HAM (not spam) or SPAM.

- **Developed by:** Ainebyona Abubaker
- **Funded by:** Developed independently by Ainebyona Abubaker with no external funding.
- **Shared by:** Ainebyona Abubaker
- **Model type:** DistilBERT
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** distilbert/distilbert-base-uncased

### Model Sources

- **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier

## Uses

This model can be used for:

- Detecting spam in SMS or other short text messages
- Educational purposes in NLP and machine learning
- Research and development of spam detection systems

### Direct Use

The model can be used out of the box with the 🤗 `pipeline` API:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a text-classification pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```

### Downstream Use

- Email spam detection – fine-tune on email datasets for spam classification
- Chat moderation – detecting unwanted or spammy messages in chat apps (see the sketch after this list)
- SMS analytics – analyzing messaging patterns for marketing or user studies
- Text classification pipelines – can be incorporated into larger NLP workflows
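
For the chat-moderation case, a minimal sketch is to run the pipeline over a batch of messages and flag anything labeled SPAM. This reuses the `classifier` built under Direct Use; the message strings are invented for illustration:

```python
messages = [
    "Hey, are we still on for lunch tomorrow?",
    "URGENT! Claim your free prize now",
]

# The pipeline accepts a list of strings and returns one prediction per message.
for message, prediction in zip(messages, classifier(messages)):
    if prediction["label"] == "SPAM":
        print(f"Flagged as spam ({prediction['score']:.2f}): {message}")
```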

### Out-of-Scope Use

- Not recommended for high-stakes decisions (legal, financial, or medical) without further validation
- Performance on languages other than English is not guaranteed
- Not tested on long-form text, such as posts from social media platforms

## Bias, Risks, and Limitations

Biases:

- The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.
- It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.
- Minority or unusual types of spam may not be well recognized.

Risks:

- Misclassifying messages could lead to important messages being ignored or spam being delivered.
- Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.

Limitations:

- Only trained for binary classification: HAM (not spam) vs SPAM.
- Performance may degrade on longer texts, such as social media posts.
- The model may need fine-tuning on datasets outside SMS messages to maintain accuracy.

### Recommendations

- This model is recommended for detecting spam in short English text messages (SMS).
- Suitable for educational, research, and prototype applications in NLP and text classification.
- Not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.
- Users are encouraged to fine-tune the model when applying it to new datasets, different languages, or longer text formats.
- Always review model predictions before acting on them, especially in critical applications.

💡 Tip: Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model and tokenizer
model_name = "kenbaker-gif/Email_Spam_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Example usage
result = classifier("Congratulations! You've won a $500 Amazon gift card.")
print(result)
# Example output: [{'label': 'SPAM', 'score': 0.99}]
```
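
In keeping with the recommendation to review predictions before acting on them, one option is to auto-flag only high-confidence SPAM predictions and route the rest to a human. The 0.9 cutoff below is an illustrative assumption, not a tuned value:

```python
THRESHOLD = 0.9  # assumed starting point; tune on your own validation data

prediction = classifier("Win a free vacation! Reply YES to claim.")[0]
if prediction["label"] == "SPAM" and prediction["score"] >= THRESHOLD:
    print("Auto-flagged as spam")
else:
    print("Routed to human review")
```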

## Training Details

- Base Model: distilbert-base-uncased (DistilBERT)
- Task: Binary SMS spam classification (HAM / SPAM)
- Dataset: SMS Spam Collection (80% train, 20% eval)
- Preprocessing: Tokenized with padding and truncation
- Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer
- Metrics: Accuracy, weighted F1-score
- Trained on short English SMS messages; fine-tuning may be needed for other text types or languages.

### Training Data

- Primary Dataset: SMS Spam Collection Dataset
- Content: English SMS messages labeled as HAM (not spam) or SPAM
- Size: ~5,500 messages
- Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM); a preprocessing sketch follows this list
- Additional Datasets: Optional – can combine with other SMS/spam datasets to improve generalization
- The model is optimized for short English SMS messages; performance on other text types or languages may vary.
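
A minimal preprocessing sketch, assuming the 🤗 `datasets` library and that the spam dataset exposes `text` and `label` columns (adjust the names for your own copy of the SMS Spam Collection):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed column names: "text" (message body) and "label" (0 = HAM, 1 = SPAM).
dataset = load_dataset("bvk/SMS-spam")["train"].train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate every message to the model's maximum input length.
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
```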

### Training Procedure

1. Data Preparation:
   - Loaded the SMS Spam Collection dataset
   - Tokenized messages using AutoTokenizer with padding and truncation
   - Split the dataset: 80% train, 20% evaluation

2. Model Setup:
   - Base model: distilbert-base-uncased
   - Task: Binary classification (HAM vs SPAM)

3. Training:
   - Optimizer: AdamW
   - Learning rate: 2e-5
   - Batch size: 16 (train & eval)
   - Number of epochs: 3
   - Evaluation and checkpointing performed at each epoch

4. Metrics Monitored:
   - Accuracy
   - Weighted F1-score

Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types. A sketch of this setup with the 🤗 Trainer API follows.
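
The sketch below wires the hyperparameters above into the 🤗 Trainer API, reusing the `tokenized` splits from the preprocessing sketch. The output path and metric implementation are assumptions; AdamW is the Trainer default optimizer:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2  # 0 = HAM, 1 = SPAM
)

def compute_metrics(eval_pred):
    # Accuracy and weighted F1, the two metrics this card reports.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }

args = TrainingArguments(
    output_dir="spam-classifier",  # assumed local path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",  # checkpoint at each epoch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
```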

## Model Card Authors

Ainebyona Abubaker

## Model Card Contact

- Name: Ainebyona Abubaker
- Email: [email protected]
- GitHub: https://github.com/kenbaker-gif