Tokenisation

As we are in fourth blog of this Lexical Processing series. Please watch the first three video as we will be considering a Spam/Ham detector model by end of this module which will use all the algorithms that we are considering one by one. https://kite4sky.wordpress.com/2019/09/16/introduction-to-nlp/ , https://kite4sky.wordpress.com/2019/09/18/lexical-processing-part-i/ , https://kite4sky.wordpress.com/2019/09/24/lexical-processing-part-ii/

In the spam detector model, which we will be working at last of this module, we will be using word tokenisation, i.e. break the chunk of text in words or tokens. When we generally deal with large amount, there is a lot of noise in data.Noise in form of non-uniform cases, punctuation, spelling errors. These are exactly the things that make hard for anyone to work on text data.

There is another thing to think about -how to extract features from messages or large chunk of text so that you can build a classifier. When you create any machine learning model for text you have to feed features related to each messages, that machine learning algorithm can take and build the model. So how does machine learning algorithm read text. As we all know machine learning works on numeric data, not text. With Predictive model or classification algorithm such as logistic regression or SVM etc. when we worked with text you treat them as categorical variables and further you convert them in numerical values for each category or create dummy variable type stuff for them. Here you can do neither of them as message column in Spam/Ham example is unique, it’s not categorical variable .In case you will treat them as categorical your model will fail miserably.

To deal with this you will extract features from this messages(We are considering all mails as messages). For each message you’ll extract each word by breaking each messages into separate words known as ‘token’. This technique is consider as tokenisation – a technique that’s used to split the chunk of text in smaller units or tokens. These elements or tokens can be characters, words, sentence tokenisation etc.

Tokenisation

The notebook contains three types of tokenisation techniques:

  1. Word tokenisation
  2. Sentence tokenisation
  3. Tweet tokenisation
  4. Custom tokenisation using regular expressions

1. Word tokenisation: When we want to break the text in words token we import word_tokenize library from nltk.tokenize . Same can be done using the spacy package in python. Most of the people will say that split() can work in same way, but with split we generally break the texts as per white spaces, in case if after some word , let us say “It look too good.”, split will generally break the last word with full stop as “good.” , which is wrong

2. Sentence tokeniser : Tokenising based on sentence requires you to split on the period (‘.’). Let’s use nltk sentence tokeniser. Let us say we have text ” I am learning NLP as it is most widely used automated technique for understanding unstructured data. Giving response to user query in form of chat bots. It is be for capturing sentiments.” , It will break the sentence as per full stop

3. Tweet Tokeniser. A problem with word tokeniser is that it fails to tokeniser emojis and other complex special characters such as word with hashtags. Emojis are common these days and people use them all the time. That is why we have to use Tweet tokeniser. For example consider a message = “i watched the movie limitless:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎”. The word tokeniser breaks the emoji ‘<3’ into ‘<‘ and ‘3’ which is something that we don’t want. Emojis have their own significance in areas like sentiment analysis where a happy face and sad face can salone prove to be a really good predictor of the sentiment. Similarly, the hashtags are broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and facebook. So there, you want to use the hashtag as is.

As you can see, it handles all the emojis and the hashtags pretty well.

4. Regular Expression Tokeniser. Now, there is a tokeniser that takes a regular expression and tokenises and returns result based on the pattern of regular expression. Let’s look at how you can use regular expression tokeniser. Regular expression are very efficient when you have to extract particular pattern from text, such as phone number, particular ID pattern(Employee ID type) etc. Let’s look at how you can use regular expression tokeniser.

You were able to extract hashtag related information from text

This might be very important as per tweeter trends of hashtags i.e which type of hashtags are most commonly used or trending in twitter

Do leave us with valuable feedback. In the next part of this module series we will consider Bag -of -Words representation. Regards Kamal/Chetan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: