In this part of the Lexical Processing series we will first focus on word frequency and stop words, and then work through a practical demonstration.
Whether you work with structured or unstructured data, you need a proper understanding of it, and that starts with some pre-processing steps. As we know, text is made up of characters, words, sentences and paragraphs. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e. to visualize how often each word occurs in a given text corpus. It turns out that a common pattern appears when you plot word frequencies in a fairly large corpus of text, such as news articles, product reviews, viral tweets or Wikipedia articles. Along the way we will learn why stop words are the less important words in a text.
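As a first practical step, here is a minimal sketch of counting word frequencies in Python. The function name `word_frequencies`, the sample sentence and the very crude regular-expression tokenizer are all assumptions made for illustration; proper tokenization is the topic of the next session.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # Crude tokenizer (an assumption for this sketch): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up sample text, not taken from a real corpus.
sample = "The cat sat on the mat. The dog sat on the log."
freq = word_frequencies(sample)

# Print the three most frequent words together with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

On a real corpus you would read the text from files and plot the sorted counts, but the counting step stays the same.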
Word frequencies and their significance
In the first half of the 20th century, the linguist George Kingsley Zipf studied how often different words occur. After reading and analyzing many documents (i.e. counting the number of times each word appears in a particular document), he ranked the words by their frequency of occurrence. The most frequent word is given rank 1, the second most frequent rank 2, and so on, so the least frequent words get the last ranks. He observed that frequency and rank are roughly inversely proportional. In mathematical terms, f(word) × r(word) = constant, at least approximately. This is known as Zipf's law, and it is an example of a power-law distribution. It is close in spirit to Pareto analysis (the 80:20 rule): a small share of the distinct words accounts for a large share of all word occurrences. In other words, there are some words with very high frequency, generally known as language builder words (is, the, then, that etc.). In the image below, the region marked by the upper cutoff corresponds to these words. These words cannot tell you what the document is about, and if someone has written a review of a product, they cannot tell you the context of that review.
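To make the relation f(word) × r(word) = constant concrete, the sketch below ranks words by frequency and prints the product of rank and frequency for each. The function name `rank_frequency_table` is hypothetical, and the counts are made up so that they follow the law exactly; real corpora only follow it approximately.

```python
from collections import Counter


def rank_frequency_table(counts):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent word first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, word, frequency, rank * frequency)
        for rank, (word, frequency) in enumerate(ordered, start=1)
    ]


# Hypothetical counts chosen to follow Zipf's law exactly (not real corpus data).
toy_counts = Counter({"the": 1200, "of": 600, "and": 400, "to": 300})

for row in rank_frequency_table(toy_counts):
    print(row)
```

With these idealized counts the last column is the same for every word. On real text you would expect it to stay in the same ballpark rather than be exactly constant.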
The same graph is also used to show one more pattern: a bell-shaped (Gaussian-like) curve for the relevance of a word to a particular document. This idea is usually credited to Hans Peter Luhn, who built on Zipf's work. As the image below shows, the words with the highest frequency are less relevant to a particular document, and the most relevant words lie in the middle of the frequency range. Please refer to the image below.
This is the reason we sometimes have to remove stop words: they are high in frequency but less relevant to the document. Let us take an example: “You have won Rs 100000 in monthly Jackpot held by Mahalaxmi Lottery!!” Here you can easily identify from words like Jackpot and Mahalaxmi that this is a spam mail. In the same way, take “Virat Kohli hit his 31st century in an ODI yesterday.” Here you can easily identify from words like Virat and ODI that it is ham, i.e. a legitimate mail. So Zipf's law helps us form the basic intuition for stop words: these are the words with the highest frequency (or lowest rank) in a text, and they are typically of limited ‘importance’.
Broadly speaking, there are three kinds of words present in a text corpus:
- Highly frequent words, called stop words, such as is, as, this etc.
- Significant words, which are more important for understanding the text
- Rarely occurring words, which are again less important than significant words
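The three groups above can be sketched as a simple split on raw counts. The function name `split_by_frequency`, the cutoff values and the example counts are all hypothetical; in practice the cutoffs depend on the corpus and the task.

```python
from collections import Counter


def split_by_frequency(counts, upper_cutoff, lower_cutoff):
    """Group words into frequent, significant and rare words by their raw counts."""
    groups = {"frequent": [], "significant": [], "rare": []}
    for word, count in counts.items():
        if count > upper_cutoff:
            groups["frequent"].append(word)      # above the upper cutoff: stop-word candidates
        elif count < lower_cutoff:
            groups["rare"].append(word)          # below the lower cutoff: rarely occurring words
        else:
            groups["significant"].append(word)   # between the cutoffs: significant words
    return groups


# Made-up counts for illustration only.
example_counts = Counter({"the": 50, "is": 40, "lottery": 6, "jackpot": 5, "zyzzyva": 1})
groups = split_by_frequency(example_counts, upper_cutoff=20, lower_cutoff=2)
print(groups)
```

This is only a frequency-based approximation. Real stop-word lists are usually curated by hand rather than derived from a single cutoff.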
Generally speaking, stop words are removed from text for two reasons:
- They provide no useful information in applications such as spam/ham mail classification or search engines
- Because their frequency is very high, removing them reduces the size of the data
However, there are some exceptions where we have to keep stop words. We will study them later in the series.
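As a small demonstration of stop-word removal, the sketch below filters the spam example from earlier against a tiny stop-word list. The list `STOP_WORDS` and the function `remove_stop_words` are hand-written assumptions for illustration, not a standard list or a library API.

```python
import re

# A tiny hand-picked stop-word list, for illustration only (an assumption).
STOP_WORDS = {"you", "have", "in", "by", "is", "the", "this", "a", "as", "that"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into crude tokens and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


message = "You have won Rs 100000 in monthly Jackpot held by Mahalaxmi Lottery"
print(remove_stop_words(message))
```

The words that survive, such as jackpot and lottery, are the ones that hint at spam. Libraries such as NLTK ship much larger stop-word lists that can be used in place of the small set here.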
In the next session we will discuss tokenization. Please share your comments and feedback about the blogs. Regards, Chetan/Kamal