As discussed in our last post, we will be focusing on Lexical Processing. It generally means taking the raw text and identifying and analyzing the structure of its words. Put simply, lexical analysis breaks a whole chunk of text into tokens, or smaller units: the document into paragraphs, paragraphs into sentences, and sentences into words.
The lexicon of a language is the collection of words and phrases in that language, and lexical processing works at this word level.
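To make the idea of breaking text into smaller units concrete, here is a minimal sketch in plain Python. The function names (`split_paragraphs`, `split_sentences`, `split_words`) and the splitting rules are our own simplifying assumptions for illustration; real text with abbreviations, decimals, or unusual punctuation needs a proper tokeniser.

```python
import re

def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    # Keep runs of letters, digits and apostrophes; drop punctuation.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Dogs bark. Cats meow!\n\nBirds sing."

for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(split_words(sentence))
```

Each level of the loop corresponds to one level of the breakdown: document to paragraphs, paragraphs to sentences, sentences to words.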
So why is lexical processing required?
- Let us say an email contains words such as "lottery" or "bumper bonanza!". From words like these you can easily identify that the email is spam, and that is the reason you have to break the text into tokens.
- In general, it is the group of words that gives an idea of what a sentence is about, and different forms of a word usually carry the same meaning. For example, "dogs" and "dog" mean the same thing. To treat such forms as one word, for instance by reducing plurals to the singular form, we use stemming, which is part of lexical processing.
- Many of the words in a text, by some rough estimates as much as 80%, carry little meaning on their own, such as "the", "is", and "that". These are generally termed stop words, and they are usually removed for tasks like Spam/Ham mail classification. They may still be required when we are dealing with language understanding, as in recommendation systems, machine translation, or a generic chat bot such as Alexa. Let us say you ask Alexa "Who is Shahrukh Khan?" and "What is Shahrukh's first movie?"; here we may need the stop words.
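The three points above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `SPAM_WORDS` and `STOP_WORDS` sets are tiny hand-picked examples, and `naive_stem` is a hypothetical plural-stripping rule of our own, not a real stemming algorithm.

```python
import re

# Assumption: tiny hand-picked word lists, for illustration only.
SPAM_WORDS = {"lottery", "bonanza"}
STOP_WORDS = {"the", "is", "that", "a", "an", "to"}

def tokenize(text):
    # Lowercase the text and keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    # Drop tokens that appear in the stop word list.
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    # Hypothetical rule: strip one trailing 's' from longer words,
    # so "dogs" becomes "dog". Real stemmers handle many more cases.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def looks_like_spam(tokens):
    # Flag the text if any token is a known spam keyword.
    return any(t in SPAM_WORDS for t in tokens)

tokens = tokenize("You won the lottery, bumper bonanza!")
print(remove_stop_words(tokens))
print(looks_like_spam(tokens))
print(naive_stem("dogs"), naive_stem("dog"))
```

Notice that the spam check only works once the text has been broken into tokens, which is why tokenisation comes first.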
So let us move step by step through Lexical Processing. I will be creating a GitHub repository from where you can download the code. We will cover:
- How to preprocess text using tokenisation, stop word removal, stemming, and lemmatization
- How to build a Spam/Ham mail detector model using the Bag of Words and TF-IDF models
We will cover the first part in the next session. Do leave your comments and feedback below. Happy Learning. Regards, Chetan/Kamal