Spam Filtering Emails using NLP

Spam filters detect unsolicited and potentially harmful emails sent by attackers or marketers. Attackers frequently send emails that claim to offer a valuable service or to protect you from a potential threat, but these are really just clickbait, meant to get you to click on a link that downloads harmful software or redirects you to a dangerous website.

SPAM FILTERING: WHY IS IT IMPORTANT?

You become a target once spam reaches your inbox. Humans are the weakest link in most IT security incidents. Attackers constantly try to deceive users, using a variety of techniques to get them to click on things they shouldn't. These tricks are frequently carried out via email because email can reach a vast number of people and is a budget-friendly way to attack. If a user accidentally clicks the wrong link in a spam email, internal data can be exposed.
Spam filtering has become increasingly important as email is increasingly used to exploit people and their data. Organizations must implement a spam filter to limit the danger of people accidentally clicking on something they shouldn't, thereby protecting their internal data from a cyber-attack.

HERE'S HOW IT WORKS
Spam filtering relies on a filtering solution within your email system that uses a set of methods to decide which incoming emails are spam and which are safe to open. According to Spamhaus, the United States is the country with the most active spam problems. Spam is sent to users in large quantities.
The major screening methods look at the email's source (whether there have been any complaints about it or whether it has ever been banned), the email's content, and subscriber interaction. All of this is tracked and sorted before a message reaches a user's inbox. Spam filtering solutions for enterprises can be hosted in a variety of ways, such as a cloud service, on-premises technology, or software installed on organisational PCs that works with the email platform.

WHY DOES IT MATTER?
Implementing spam filtering is critical for every business. Spam filtering not only keeps junk out of inboxes, it also keeps business email working efficiently and used only for its intended purpose. Spam filtering is essentially an anti-malware tool, since many email-based attacks try to deceive users into opening a dangerous file, giving away their passwords, and more.
Email spam costs businesses up to $20.5 billion every year, according to Radicati Research Group Inc., and the sum is expected to climb. Spam filtering keeps spam from ever reaching an inbox in the first place, so a business is less likely to add to that lost income.

NLP approach towards filtering spam emails

Spam in your inbox is annoying because it disrupts your routine. Building a simple spam filter for emails is a commonly recommended NLP application.
N-grams are used for language modeling based on word prediction: an N-gram model predicts the next word of a sentence from the previous N-1 words. A bigram is a 2-word sequence, and a bigram model predicts the next word of a sentence using only the one previous word. Instead of considering the whole history of a sentence or word sequence, a model like the bigram model approximates that history with a limited one.
Identifying a message as ‘ham’ or ‘spam’ is a classification task, since the target variable takes discrete values: ‘ham’ or ‘spam’. This article uses a bigram model, although many more advanced techniques could be used for the purpose. Using a bigram model to label a given message as ‘spam’ or ‘ham’ involves several steps:
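The limited-history idea can be written compactly. As a sketch in standard notation (the symbols are assumptions, not taken from the original post), with $w_i$ denoting the $i$-th word of a sentence, the bigram model replaces the full history with just the previous word:

```latex
P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})
```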

1. Inspecting and separating the given messages into ‘ham’ and ‘spam’ categories
The data set should first be inspected in order to choose an approach to the task. This inspection covers the format of the data, the amount of data, and the nature of the data, with the aim of identifying the most feasible approach.
Each message in the given corpus is flagged as either ham or spam. There are 5568 non-null messages, written in English. The tsv file can therefore be read into a Data Frame in Python and the messages separated according to their flag.
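As a minimal sketch of this step, assuming the pandas library, a two-column tab-separated layout without a header row, and the column names `label` and `message` (all of which are assumptions, since the post does not show the file), the messages could be loaded and split like this. A small in-memory sample stands in for the real file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real TSV file: one "label<TAB>message" row per line.
sample_tsv = (
    "ham\tSee you at lunch\n"
    "spam\tWIN a FREE prize now!!\n"
    "ham\tAre you free at noon?\n"
)

# For a real file, pass its path instead of the StringIO object.
df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["label", "message"],  # assumed column names
)

# Separate the messages according to their flag.
ham_messages = df.loc[df["label"] == "ham", "message"].tolist()
spam_messages = df.loc[df["label"] == "spam", "message"].tolist()

print(len(ham_messages), len(spam_messages))
```

Calling `df.info()` on the loaded frame is a quick way to confirm the row count and check for null entries.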

2. Preprocessing text
Preprocessing prepares the raw text corpus so that a text-mining, Natural Language Processing, or other text-based task can be completed efficiently. Text preprocessing consists of several steps, although some of them may not apply to a particular task, depending on the nature of the available data set.

In this task, preprocessing of text includes the following steps in accordance with the data set:

Removal of Punctuation Marks: Punctuation marks are stripped from each message so that they do not end up attached to the words.

Converting to Lowercase: Converting all characters to a common case such as lowercase prevents two occurrences of the same word from being treated as different words when only one of them is capitalized. For instance, ‘First’ and ‘first’ should be identified as the same word, so lowercasing all characters makes the task easier. Moreover, the stop words are also in lowercase, which makes removing stop words later straightforward.

Tokenizing: Tokenization is the task of breaking text up into meaningful pieces called tokens, such as sentences and words. A token is an instance of a sequence of characters in a particular text that are grouped together as a useful semantic unit for further processing. In this task, words are tokenized by using the whitespace between them as the delimiter. This is done in Python with a regular expression and the split() function, which acts as a basic tokenizer.

Lemmatizing Words: Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain its word stem. Lemmatization is related to stemming, but it differs in that it captures canonical forms based on the lemma of a word. Lemmatization uses a vocabulary and morphological analysis of words, which makes it more accurate than stemming, although usually not faster. Here, lemmatizing is done with WordNetLemmatizer in Python.
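The preprocessing steps above can be sketched with the standard library alone. The function name `preprocess` and its optional `lemmatize` argument are hypothetical, introduced only for illustration. Lemmatization is left pluggable so the example runs without NLTK's WordNet data:

```python
import re
import string


def preprocess(message, lemmatize=None):
    """Remove punctuation, lowercase, tokenize on whitespace, optionally lemmatize."""
    # Remove ASCII punctuation marks.
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase so 'First' and 'first' become the same token.
    lowered = no_punct.lower()
    # Tokenize: split on runs of whitespace and drop empty strings.
    tokens = [t for t in re.split(r"\s+", lowered) if t]
    # Lemmatize each token if a lemmatizer function was supplied.
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return tokens


print(preprocess("WINNER!! Claim your FREE prize, now."))
# ['winner', 'claim', 'your', 'free', 'prize', 'now']
```

With NLTK installed and its WordNet data downloaded, `WordNetLemmatizer().lemmatize` from `nltk.stem` could be passed as the `lemmatize` argument.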

3. Feature extraction
After the preprocessing stage, features are extracted from the text. Features are the units that support the classification task, and in this message classification task the features are bigrams. The bigrams are extracted from the preprocessed text. First the unigrams are acquired, and then those unigrams are used to obtain the bigrams in each corpus (‘ham’ and ‘spam’).
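Bigram extraction from a token list can be sketched in a few lines. The helper name `extract_bigrams` is hypothetical. NLTK offers `nltk.bigrams` for the same purpose, but the standard-library version below avoids the dependency:

```python
def extract_bigrams(tokens):
    """Return the list of adjacent word pairs (bigrams) in a token list."""
    # Pair each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))


print(extract_bigrams(["claim", "your", "free", "prize"]))
# [('claim', 'your'), ('your', 'free'), ('free', 'prize')]
```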

4. Removing stop words
Every language (here English) has certain words that are needed to form a sentence or a sequence of words but contribute little to the meaning of the phrase. The Natural Language Toolkit (NLTK) library in Python provides lists of common stop words for several languages.
Stop words are removed after the features are extracted from the corpus rather than in the preprocessing step. This avoids losing bigrams that contain one stop word, such as (‘use’, ‘your’) or (‘to’, ‘win’), since these bigrams have an impact on the final outcome of the application. Stop words can be ignored in keyword-oriented information retrieval of this kind because they have little effect on retrieval accuracy.
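One possible reading of this step, offered as an assumption rather than the author's exact procedure, is to drop only those bigrams that consist entirely of stop words, so that bigrams with a single stop word survive. The tiny stop-word set below is a hypothetical stand-in. With NLTK's stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` would supply the full list:

```python
# Hypothetical, deliberately tiny stop-word set used only for illustration.
STOP_WORDS = {"to", "your", "the", "a", "is", "of", "and", "in"}


def remove_stopword_bigrams(bigrams, stop_words=STOP_WORDS):
    """Keep a bigram unless every word in it is a stop word (assumed rule)."""
    return [bg for bg in bigrams if not all(word in stop_words for word in bg)]


bigrams = [("use", "your"), ("to", "win"), ("of", "the")]
print(remove_stopword_bigrams(bigrams))
# [('use', 'your'), ('to', 'win')]
```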

5. Get frequency distribution of features
A frequency distribution gives the number of occurrences of each vocabulary item in a given text.

Frequency Calculation
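As a small sketch of the idea, `collections.Counter` from the standard library can count bigram occurrences. NLTK's `FreqDist`, which is itself a `Counter` subclass, could be used the same way. The sample bigrams are made up for illustration:

```python
from collections import Counter

# Made-up bigrams standing in for the features of one corpus.
bigrams = [("free", "prize"), ("claim", "your"), ("free", "prize")]

freq = Counter(bigrams)

print(freq[("free", "prize")])  # 2
print(freq[("no", "such")])     # 0, unseen items count as zero
print(freq.most_common(1))      # [(('free', 'prize'), 2)]
```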

6. Building a model for prediction
The model classifies a given message as ‘ham’ or ‘spam’ by calculating bigram probabilities within each corpus. The message is first preprocessed in the same way as the training data: punctuation is removed, all characters are changed to lowercase, and the text is tokenized and lemmatized. The bigrams are then extracted from the preprocessed text, and finally the probability of the text under each corpus, ‘ham’ and ‘spam’, is calculated.
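One common way to write the probability of a message under a corpus, given here as a sketch in assumed notation, is as a product of bigram probabilities. For a message $m = w_1, \ldots, w_k$ and a corpus $c \in \{\text{ham}, \text{spam}\}$, with $P_c$ estimated from the bigram counts of corpus $c$ and the probability of the first word ignored for simplicity:

```latex
P(m \mid c) \approx \prod_{i=2}^{k} P_c(w_i \mid w_{i-1})
```

In practice the logarithms of the factors are usually summed instead, which avoids numerical underflow on long messages.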

7. Smoothing
Smoothing algorithms are used to mitigate the zero-probability problem in language modeling applications. Here, Laplace (Add-1) smoothing is used, which overcomes the problem by pretending that every bigram, including those that never appear in the corpus, has been seen once more than it actually was.

Maximum Likelihood Estimation
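For reference, the standard maximum likelihood estimate of a bigram probability is usually written as follows, where $C(\cdot)$ denotes a count in the training corpus. The notation is an assumption, not taken from the original figure:

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```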

In Laplace smoothing, the above equation is modified into the following one, so that unseen bigrams no longer receive a probability of zero and unseen words no longer cause a division by zero.

MLE with Laplace Smoothing
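For reference, the standard Laplace-smoothed bigram estimate adds one to every bigram count and adds the vocabulary size $V$ to the denominator. Here $C(\cdot)$ denotes a count in the training corpus, and the notation is an assumption, not taken from the original figure:

```latex
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + 1}{C(w_{i-1}) + V}
```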

The message is then labeled ‘ham’ if the probability calculated with the ‘ham’ corpus is greater than the probability calculated with the ‘spam’ corpus, and ‘spam’ otherwise.
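The whole decision rule can be sketched end to end on toy data. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the toy messages, the shared vocabulary size, and the use of log probabilities are all choices made for this example.

```python
import math
from collections import Counter


def train(messages):
    """Count unigrams and bigrams over a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in messages:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams, vocab_size):
    """Sum of log Laplace-smoothed bigram probabilities for a message."""
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-1 smoothing: (C(prev word) + 1) / (C(prev) + V)
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def classify(tokens, ham_model, spam_model, vocab_size):
    """Label a message 'ham' if its ham score is strictly greater, else 'spam'."""
    ham_score = log_prob(tokens, *ham_model, vocab_size)
    spam_score = log_prob(tokens, *spam_model, vocab_size)
    return "ham" if ham_score > spam_score else "spam"


# Toy, already-preprocessed training messages (made up for illustration).
ham = [["see", "you", "at", "lunch"], ["are", "you", "free", "at", "noon"]]
spam = [["win", "a", "free", "prize"], ["claim", "your", "free", "prize", "now"]]

ham_model, spam_model = train(ham), train(spam)
# Assumption: one vocabulary shared by both corpora.
vocab_size = len(set(ham_model[0]) | set(spam_model[0]))

print(classify(["free", "prize", "now"], ham_model, spam_model, vocab_size))       # spam
print(classify(["see", "you", "at", "noon"], ham_model, spam_model, vocab_size))  # ham
```

Note that this rule compares only the two corpus probabilities, as the text describes. It does not weight them by how common ham and spam are overall, and ties (including one-word messages, which contain no bigrams) fall to ‘spam’.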

What Should You Do About Spam?

Don't open it. When spam arrives in your inbox, the best thing you can do is not open it or respond in any way. Even clicking the unsubscribe link at the bottom of an email can be regarded favorably by the sender, because it shows that you read and interacted with the message. It also confirms that your email address is valid.

Don't give out any personal details. Never reply with personal information to an email that asks for your login, account number, or other personal details. Always be wary. If you receive an email from your bank and are unsure whether it is authentic, phone the bank instead of responding via email.

Track down the sender. The message may appear to be from Apple or your credit card provider, but if the sender address belongs to someone from a country where you have no contacts, it is most likely spam.

In your inbox, mark it as spam. Use your mail interface's spam or junk mail feature to report an email as spam. The email service uses your spam reports to improve the filtering it does for you.
