Spam detection is a common classification problem: given an email, decide whether it is spam. This guide outlines the steps for building such a model in Python, using Scikit-learn for the machine learning parts.
First, load your email dataset. Assume the data is in CSV format and use Pandas to read the file. The dataset should have one column with the email text and one with the label; the code in this guide calls them text and label. Each label is either “spam” or “ham” (not spam).
    import pandas as pd

    data = pd.read_csv('spam_dataset.csv')
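Next, preprocess the email text: convert it to lowercase, remove punctuation and stop words, and tokenize it into words. NLTK can handle this kind of text processing, but Scikit-learn also offers tools. CountVectorizer is useful for feature extraction: by default it lowercases the text and splits it into word tokens, ignoring punctuation, and then counts how often each word occurs in each email. Stop words are not removed unless you set its stop_words parameter.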
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(data['text'])
    y = data['label']
Split the data into training and testing sets with train_test_split from Scikit-learn. Holding out a test set lets you measure how the model performs on emails it has not seen during training. A common split ratio is 80% training, 20% testing.
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
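Spam datasets are often imbalanced, so it can help to keep the spam-to-ham ratio the same in both splits and to fix the random seed so the split is reproducible. The sketch below assumes a made-up set of 20 messages, of which 5 are spam; the stratify and random_state arguments are optional additions, not part of the original example.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical data: 20 placeholder messages, 5 spam and 15 ham.
texts = [f"message {i}" for i in range(20)]
labels = ["spam"] * 5 + ["ham"] * 15

# stratify=labels preserves the class proportions in both splits;
# random_state makes the shuffle repeatable between runs.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

With an 80/20 split of these 20 messages, the test set holds 4 messages and each split keeps one spam message for every three ham messages.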
Choose a classification model. Naive Bayes is effective for text classification. Scikit-learn provides MultinomialNB. Train the model using the training data.
    from sklearn.naive_bayes import MultinomialNB

    model = MultinomialNB()
    model.fit(X_train, y_train)
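Once a model is trained, classifying a new email means converting it with the same fitted vectorizer (transform, not fit_transform) and then calling predict. The following is a small self-contained sketch; the four training emails and the two new emails are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data, far too small for real use.
train_texts = [
    "win free prize money now",
    "free cash prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# Learn the vocabulary and train the classifier on the training emails.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB()
model.fit(X_train, train_labels)

# New emails must be mapped onto the vocabulary learned above,
# so use transform() here rather than fit_transform().
new_emails = ["claim your free prize", "agenda for the team meeting"]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
print(predictions)
```

Words that never appeared in training, such as "your" here, are simply ignored by the vectorizer.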
Evaluate the model on the test set by calculating accuracy and other metrics. Scikit-learn's accuracy_score returns the fraction of test emails that were classified correctly, which shows how the model performs on unseen data.
    from sklearn.metrics import accuracy_score

    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")
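Accuracy alone can hide how the model treats each class, so it is worth looking at precision and recall for the spam class as well. The sketch below uses hand-written true and predicted labels (hypothetical values, not real results) so the numbers can be checked by hand.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for six test emails.
y_test = ["spam", "ham", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "ham", "spam"]

# Precision: of the emails flagged as spam, how many really were spam?
precision = precision_score(y_test, y_pred, pos_label="spam")
# Recall: of the real spam emails, how many were caught?
recall = recall_score(y_test, y_pred, pos_label="spam")

# Rows are true labels, columns are predicted labels, in the order given.
matrix = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(matrix)
```

Here two of the three emails flagged as spam are truly spam, and two of the three real spam emails are caught, so precision and recall are both 2/3.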
This outlines how to build a basic spam detector. Further improvements are possible: experiment with different models and features, consider TF-IDF vectorization, and evaluate precision and recall in addition to accuracy. Spam detection is a practical ML application.
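As one possible starting point for the TF-IDF suggestion, the sketch below swaps CountVectorizer for TfidfVectorizer and chains it with the classifier using make_pipeline. The training emails are invented for illustration, and wrapping the steps in a pipeline is a design choice made here, not something the steps above require.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data, far too small for real use.
train_texts = [
    "win free prize money now",
    "free cash prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes raw text and then classifies it, so the
# vectorizer is only ever fitted on the data passed to fit().
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Raw strings go straight in; the pipeline applies the TF-IDF step itself.
prediction = pipeline.predict(["free prize money"])
print(prediction)
```

A pipeline also makes it harder to fit the vectorizer on test emails by accident, because fitting and transforming always happen together with the model.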