Covid-19 Analysis

Continuing from our last blog, we want to look at the most painful figures once again and ask how much longer the world will have to suffer under pandemic conditions. So far none of us is in a good position in this fight. Giants like the USA, France, Germany and many other countries look very small in front of this crisis. Even so, united in ISOLATION mode, we keep standing up again and again to fight it. We have done both good and bad things; let us analyze them through data, picking up where our last blog left off.


Topics we will be covering:

  • Source for Data
  • Validating the effect in figures and numbers
  • Testing: Without testing results for Covid-19 we can never be sure how badly it has affected the world
  • Cases: The exact number of Covid-19 cases is still unknown, because of limited testing and people with no symptoms
  • Mortality: CFR (case fatality rate) study

Source for Data: We have gathered data from Johns Hopkins University for our detailed analysis.

Figures and numbers: While I am writing this blog, case counts for different countries are doubling every 2nd day, every 4th day and so on. We have all failed so far to stop the daily growth of the virus. As you can see in the trend below, the counts for most countries are doubling every 2 to 4 days. Japan is the only country with a doubling time of about a week. A week ago the USA was doubling its count every 4 days. As of 8th April, the number of confirmed coronavirus cases in the U.S. surpassed 426,000, according to figures provided by NBC, with 12,864 fatalities nationwide. Just one week earlier it had around 213,000 cases. India is also doubling its figures every 4 to 5 days now: it had 3082 cases on 4th April, and by 8th April that had nearly doubled to 5916. You can refer to the image below for the doubling rate of different countries. The graph shows that growth in Spain, Italy and Germany has started to slow down. We will discuss the reason for this further below.
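As a rough check on the doubling figures quoted above, here is a minimal Python sketch. It assumes steady exponential growth between the two dates, which real case counts only approximate, and the function name doubling_time is our own illustration rather than something from the notebook.

```python
import math


def doubling_time(count_start, count_end, days_elapsed):
    """Days needed for a count to double, assuming steady exponential growth."""
    if count_start <= 0 or count_end <= count_start:
        raise ValueError("counts must be positive and increasing")
    return days_elapsed * math.log(2) / math.log(count_end / count_start)


# Figures quoted in this post: India went from 3082 to 5916 cases in 4 days.
india = doubling_time(3082, 5916, 4)
# The US went from about 213,000 to 426,000 cases in one week.
usa = doubling_time(213_000, 426_000, 7)
print(f"India: {india:.1f} days, US: {usa:.1f} days")
```

An exact doubling over the period returns the period itself, so the US figures above correspond to a doubling time of one week.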

Impact of Testing: Different countries have followed different tactics to fight Covid-19, and the most important lesson so far is that testing and self-isolation are the two mechanisms that can contain this pandemic. Let us see how the world is doing in terms of testing.

As you can see, the world has now increased the number of tests done. On Mar-26 the USA had done 427 tests per million population, which had increased to 7088 tests per million by Apr-09, and this is the most important reason for the sudden rise in the number of cases in the USA. We expect a steep decrease in the number of new cases in the USA in the coming weeks. Germany has done testing most effectively: 15730 tests per million population (1317887 tests in total). The UAE has the highest rate, 59967 tests per million, and apart from testing it also has strict lockdown rules, so it has cases but not too many. A really good job by the UAE government. Italy has increased its number of tests, and that was the main reason for the rise in cases there; Italy's curve has also started to slope downwards. India has the second largest population, and with cases now doubling every 4th day its testing rate is low for its population size: 129 tests per million, 177584 in total.

One of the most important points about testing is the share of tests that come back positive: 19.71% in the USA, 43% in Spain, and a very low 3.7% in India as of now.

The figures below give the number of tests done by different countries in decreasing order, along with their mortality rate.

Countries with most number of Test
Countries as per Test/Million

Critical Cases: This is the most important part, and it is directly correlated with the number of tests being done and with other factors; refer to the correlation matrix chart after the Serious Cases vs Recovered graph. The USA has the largest number of serious cases, which generally means more fatalities in the coming days, after which its curve should turn down steeply. What is most surprising about the USA is that, apart from having the most cases, the most deaths and the most critical cases, it has a lower recovery rate than other countries. The reason is that, as cases have increased, its health system has started to collapse.

Germany is doing really well, whether we talk about testing or recovered cases. COVID-19, the disease caused by the novel coronavirus, has been much deadlier in older people, but more anecdotes are popping up of young, healthy people getting critically ill. Among the first reported cases in the US, around 40 percent of the patients who required hospitalization, and intensive treatment as well, were between the ages of 20 and 54. This is the main reason the USA has so many people in severe condition: both young and old people are being affected and need medical aid. We expect the number of serious cases to decrease in the coming days, as the USA is now trying to track down each and every Covid-19 case, and within 2-3 weeks we hope to see good news from the USA.


Case Fatality Rate: In an outbreak of an infectious disease it is important to study not only the number of deaths, but also the rate at which the number of deaths is increasing. In opening remarks at the March 3 media briefing on Covid-19, WHO Director-General Dr Tedros Adhanom Ghebreyesus stated: “Globally, about 3.4% of reported COVID-19 cases have died. By comparison, seasonal flu generally kills far fewer than 1% of those infected.”

To report the rate of change we focus on the question: how long did it take for the number of confirmed deaths to double? Let’s take an example: if three days ago there had been 500 confirmed deaths in total, and today we have reached 1,000, then the doubling time is three days. The doubling time of deaths has changed and it will change in the future; we should not naively extrapolate the current doubling time to conclude how many people will die.

If deaths go up by a fixed number over a fixed period, say by 500 every two days, then we call that “linear” growth. But if they keep on doubling within a fixed time period, say every three days, then we call that “exponential” growth. Let us try to understand the doubling logic with a graph. The CFR has increased from 1.5% to 4% in less than one month, so you can imagine how fast this rate is rising as well.
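To make the difference between linear and exponential growth concrete, here is a small Python sketch using the example numbers from the paragraph above (500 more every two days versus doubling every three days). The starting count of 500 and the function names are our own assumptions for illustration.

```python
def linear_growth(start, step, period_days, day):
    """Total after `day` days when a fixed `step` is added every `period_days`."""
    return start + step * (day // period_days)


def exponential_growth(start, doubling_days, day):
    """Total after `day` days when the count doubles every `doubling_days`."""
    return start * 2 ** (day / doubling_days)


# Both series start at 500 (an assumed starting point for illustration).
for day in (0, 6, 12, 18):
    lin = linear_growth(500, 500, 2, day)
    exp = exponential_growth(500, 3, day)
    print(f"day {day:2d}: linear {lin:6d}, exponential {exp:8.0f}")
```

The two series are level after six days, but by day 18 the doubling series is several times larger, which is why the doubling time matters so much.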

  • We don’t know how many were infected: when you look at how many people have died, you need to look at how many people were infected, and right now we don’t know that number. So it is too early to put a percentage on it.
  • The only number currently known is how many people have died out of those who have been reported to the WHO.
  • It is therefore very early to make any conclusive statements about what the overall mortality rate will be for the novel coronavirus, according to the World Health Organization.
  • Elderly and unwell people are more likely to die. The death rate for people over 70 is 10 times that of people under 40.
  • The death rate for Covid-19 patients depends on the following health conditions, in decreasing order: cardiovascular disease, diabetes, respiratory disease, high blood pressure.
  • Most Covid-19 cases go uncounted because people with mild symptoms tend not to visit a doctor. That is why the true number of cases is much higher than the reported figure, possibly 7-8 times higher, and older people are most at risk from such carriers. I shared my calculation for this in my last blog.
  • Reason for different CFRs in different countries: it depends on how many cases are being tested. Germany is doing more than 30000 tests daily and has been testing carriers with mild symptoms too, who might otherwise be missed as per the point above.
  • Age-wise comparison: Death Rate = (number of deaths / number of cases) = probability of dying if infected by the virus (%). This probability differs depending on the age group. The percentages shown below do not have to add up to 100%, as they do NOT represent the share of deaths by age group. Rather, they represent, for a person in a given age group, the risk of dying if infected with COVID-19.
  • Doubling Rate: This chart does not show good results for humanity. You can see how the figures are doubling every four days, or every week, depending on the country.

Total Cases: The number of cases is increasing by about 140,000 every 2nd day. As of today, Apr-10th, the figures are as per the table below. With the increase in testing, cases are now growing in large numbers.

Prediction Using ARIMA Model: ARIMA stands for Auto Regressive Integrated Moving Average. There are seasonal and non-seasonal ARIMA models that can be used for forecasting. We will cover the details of the ARIMA model in coming blogs.

As per our model, with the prediction made on 9th April, India will have the number of cases shown below by the end of April.

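We are also trying to get time series data for the number of tests to make the forecast more accurate. The forecast is evaluated on the basis of RMSLE (root mean squared logarithmic error). The RMSLE for a single column is calculated as

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

where

n is the total number of observations
p_i is your prediction
a_i is the actual value
log(x) is the natural logarithm of x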
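The metric defined by these symbols can be sketched in a few lines of Python. This is a minimal illustration, not the code from our notebook; it assumes the standard RMSLE definition (log of value plus one), and the sample numbers are made up.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) computes log(x + 1) with the natural logarithm
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical forecast vs. actual case counts, for illustration only
forecast = [5200, 6100, 7300]
observed = [5000, 6400, 7000]
print(f"RMSLE: {rmsle(forecast, observed):.4f}")
```

Because the error is taken on the log scale, the metric penalises relative rather than absolute differences, which suits counts that grow exponentially.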

Correlation matrix as per different attribute

You can refer to the given notebook on Kaggle for the visualization and forecasting model: Notebook Link

Link for the Tableau dashboard, which is refreshed once a day: Tableau Dashboard.

While writing this blog I was feeling very sad, as every data point in our viz or forecast is a person fighting Covid-19.

So please be safe, take care of your loved ones and help those in need!


Kamal Naithani



You have no idea What’s Coming

‘Be prepared’: something worse is coming. Unfortunately most governments and most people, from China to the US, ignored this warning, and their people are suffering the tremendous onslaught of the Coronavirus. Most countries, from Europe to the US and from India to China, are in a state of quarantine because of this outbreak. It is time for all of us to come together in isolation mode: being isolated, yet together in supporting each other, is the only way to fight this disaster.

The Coronavirus outbreak passes through the following stages.

  • Stage 1: Most of the world took this stage very lightly: ‘We are safe as it has only spread in China’, ‘My immune system is good, so no need to worry’, ‘I am safe as I am less than 70’. No country in the world took this stage seriously, when there were only a small number of Coronavirus cases. Had governments, or even we the people, taken strict action then, the situation would have been different. Governments should have restricted all flights from countries affected by Coronavirus. But now we are in stage 2.
  • Stage 2: The number of cases begins to increase. The government declares only a few states or areas as red zones. Cases increase at a slow pace and there are some deaths. People’s mentality at stage 2: ‘Media is creating hype’, ‘People are dying because they are over 70 and already have some disease’, ‘I’m not going to stop and will party with my friends’. Offices don’t have any backup plan, and people are still enjoying their “social” life.
  • Stage 3: The number of cases starts increasing exponentially, doubling every 2 or 3 days. Now it is the time of fear. The people who were saying ‘I’m not going to stop and will party with my friends’ are now blaming each other and the government, but remember that you too are responsible for Stage 3 and for the worse stages that follow. Steps have been taken, but the government still has no foolproof action plan, and some people are still not taking it seriously. Out of fear, people are now carrying the Coronavirus within the country: someone who works in a metro city for a living runs back to his or her home area, where the Coronavirus has not yet spread. So after arriving with international carriers, the Coronavirus has become a traveller within your country too. Isn’t that even worse 😦 . Governments should have banned travel within the country itself. Governments are spreading messages about washing hands, using hand sanitizer, not going outside and isolating, but there are still some people who have learned nothing.
  • Stage 4: The number of cases has grown to a figure you never imagined. Now everything is closed and people have locked themselves inside their homes. It is a national health emergency. The economy is shrinking, and hospital staff are working day and night on Coronavirus cases and tests. Cases will become so numerous that there will not be enough medical staff. This means patients with other diseases, or someone who suffers a heart attack, may not be treated, because corona cases have high priority and medical staff are short, so the overall death rate is affected. In the USA, before the Coronavirus, the average time for an ambulance to arrive was 8-10 minutes, and it has increased a lot. Put simply, each and every system will start collapsing. Now it is not the virus alone that is dangerous; many other things have become deadly weapons too. Please refer to the views below showing how the virus has grown all over the world.

As you can see from the above trend, Italy had 237 cases on Feb-24, 2741 cases on Mar-3 and 55,493 cases on Mar-20. The first case in Italy was reported on Jan-31, so you can easily map these trends to the stages above. As of now Italy has 4032 deaths and 4440 recovered cases; on Mar-3 it had 79 deaths and 160 recovered cases. The mortality rate increased from 2.8 percent on Mar-3 to 7.2 percent on Mar-20.
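The mortality rate quoted here is simply deaths divided by confirmed cases. A minimal Python sketch with the Italy figures from this post (the function name is ours, chosen for illustration):

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Deaths as a percentage of confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return 100.0 * deaths / confirmed_cases


# Italy, using the counts quoted in this post
italy_mar_3 = case_fatality_rate(79, 2741)
italy_mar_20 = case_fatality_rate(4032, 55493)
print(f"Mar-3: {italy_mar_3:.2f}%  Mar-20: {italy_mar_20:.2f}%")
```

Keep in mind that this ratio uses confirmed cases only, so it moves with how widely a country tests as well as with how deadly the outbreak is.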

As you can see from the above trend, the US had 56 cases on Feb-24, 546 cases on Mar-3, 3574 cases on Mar-15 and 19344 cases on Mar-20. The first case in the US was reported on Jan-21, so you can easily map these trends to the stages above. As of now the US has 244 deaths; on Mar-3 it had 7 deaths and 7 recovered cases. On these figures the mortality rate was about 1.3 percent on both Mar-3 (7 of 546) and Mar-20 (244 of 19344), so it has not yet risen the way Italy's has. Italy shows how the mortality rate climbs with the number of cases once medical aid runs short.

Let us analyze the countries that have more than 10000 Covid-19 cases. We can see that Korea has done really well in reducing the number of cases. China is also doing well, and the situation looks to be under its control. All the other countries at the top are in the worst scenario. One important thing that needs attention is that the mortality rate rises with the number of cases. For Italy the mortality rate has increased to 9%. One factor affecting a country’s death rate may be the age of its population: Italy has the oldest population in Europe, with about 23% of residents 65 or older, and many of Italy’s deaths have been among people in their 80s and 90s. On the other hand Japan, with the oldest population in the world, has kept its mortality rate steady while Covid-19 cases declined. This is important.

The other reason for a rising mortality rate is facilities and resources. As the number of cases increases, more ICU beds and ventilators are required. This is why people died in droves in Hubei and are now dying in droves in Italy and Iran. The Hubei fatality rate ended up better than it could have been because they built 2 hospitals nearly overnight, but few other countries will be able to do that. If 5% of your cases require intensive care and you can’t provide it, most of those people die. As simple as that.

This condition also gives rise to collateral damage. The numbers above only show people dying from coronavirus. But what happens if your whole healthcare system is overwhelmed by coronavirus patients? Others also die from other ailments. What happens if you have a heart attack but the ambulance takes 50 minutes to come instead of 8 (too many coronavirus cases) and once you arrive, there’s no ICU and no doctor available? You die. Take the US as an example: there are 4 million admissions to the ICU in the US every year, and 500k (~13%) of them die. Without ICU beds, that share would likely go much closer to 80%. Even if only 50% died, in a year-long epidemic you go from 500k deaths a year to 2M, so you’re adding 1.5M deaths, just with collateral damage.

If the coronavirus is left to spread, the US healthcare system will collapse, and the deaths will be in the millions. The same thinking is true for most countries. The number of ICU beds, ventilators and healthcare workers is usually similar to the US or lower in most countries. Unbridled coronavirus means healthcare system collapse, and that means mass death.
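The collateral-damage arithmetic for the US ICU example can be written out as a short Python sketch. The admission and death counts come from the text above; the function name and the idea of trying several fatality shares are our own illustration, not a forecast.

```python
def extra_icu_deaths(admissions, baseline_deaths, fatality_share_without_icu):
    """Additional yearly deaths if ICU-level care is unavailable."""
    return admissions * fatality_share_without_icu - baseline_deaths


admissions = 4_000_000      # yearly US ICU admissions quoted above
baseline_deaths = 500_000   # yearly deaths among those admissions

baseline_share = baseline_deaths / admissions
extra_at_50 = extra_icu_deaths(admissions, baseline_deaths, 0.50)
extra_at_80 = extra_icu_deaths(admissions, baseline_deaths, 0.80)

print(f"baseline share: {baseline_share:.1%}")
print(f"extra deaths at 50%: {extra_at_50:,.0f}")
print(f"extra deaths at 80%: {extra_at_80:,.0f}")
```

Changing the assumed fatality share shows how sensitive the result is to that single number.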

Just imagine the condition for India: we lack most healthcare facilities, and with the second largest population it will become a disaster. We can’t live in denial, thinking that quarantining the people known to be affected will stop it, because the true number of cases is much larger than what we see. India today has 300 active cases, but by our calculation this number should be somewhere around 4000. This is because of asymptomatic adults who have the infection but no symptoms; it also takes 5-14 days for symptoms to show, and a person who dies from the disease does so after about 17 days on average. So think how big a chain reaction this can be. In Korea, people aged 20-29, many of whom feel absolutely fine but are contagious, make up 29% of total cases. Thanks to strong immunity and their youth, they do not feel anything.

As per the above report, the recovery rate in China has increased a lot. As of Mar-21 China has 81305 confirmed cases and 71857 recovered cases, a recovery rate of about 88%. This is a big increase compared with Feb-25, when it was 35% (27676 recovered out of 77754 total cases), and Feb-12, when it was 1.3%. It reflects the Chinese government's increased seriousness in understanding this disaster (building hospitals and adding hospital staff), though they were already late. Italy's recovery rate on Mar-21 is 11%. India's is less than 7%, even though it does not have many cases, which is a big point of concern for India.

Refer to Dashboard Link:

We have to save ourselves from a situation like Italy's or China's. Italy's 4825 fatalities account for 38.3 percent of the world’s total of 12700 deaths, with 1420 deaths since Friday.

Increase in no of case as per last Day

What to do

India really has a tough time ahead, with too few hospital beds and facilities, poverty and, most importantly, a large population. We generally believe rumors too quickly, and political parties are thinking about politics even in this phase. Locking down a country of a billion people from the outside will not be enough now. Neither will tracking each case as it comes, as we lack that expertise. A lockdown is a very tough call for the government, as it can create panic and even a civil-war-like situation.

Our government was late in acting, but it is working hard now. We have to help and support it: just isolate yourself and keep your distance from the older generation, as the virus is much worse for them.

Social distancing, or isolation mode, is the only solution; both the young and the elderly need to practise it to stop the spread. All state borders should be closed by the government for the next few days. Soon the government will run out of options, and we will blame it without realizing that it didn’t bring the virus. We know our healthcare system is short of capacity, so try not to become the next link in this chain, or the consequences will be at their worst.

We should support our think tanks and leaders, respect their decisions and give them the courage and strength to fight this situation. This is the most contagious disease we face right now, and if the government takes strict action we should follow it and not create panic.

Finally, please be unsocial: isolation is the only cure for this virus. Wash your hands, and clean metal and other surfaces. Please maintain social distancing.

For the above images: Dashboard Link (the dashboard is updated twice a day):

Team Kite4Sky

Topic Modelling – Part I

Topic Modelling is an unsupervised machine learning technique for identifying the abstract “topics” that are talked about across multiple documents. Let us say the iPhone or Galaxy Note series has launched a newer version of its phone and the company wants to understand which features customers are talking about in n reviews (each review is treated as one document). Let us say 50% of customers are talking about hardware, 20% about camera quality and features, 10% about music quality and 20% about the packaging of the product.

Similarly, say you have a large corpus of scientific documents and you want to build a search engine for it. Imagine you have n documents that talk about diseases of the heart and lungs, diabetes and so on; applying topic modelling to these documents will identify what each document is mainly about and the key terms that are most talked about in connection with these diseases.

We will cover the understanding of Topic Modelling along with a practical demonstration using LDA (Latent Dirichlet Allocation).

Defining a Topic: As in other semantic analytics techniques, a topic is a distribution over terms, i.e. each term has a certain “weight” in each topic; the terms here are the distinct words that occur across the n documents. But is this the only way to define a topic? In what other ways can topics be defined?

There are two major tasks in Topic Modelling:

  • Estimating the topic-term distribution: in the simplest case we define each topic as a single term (this will change when we come to LDA)
  • Estimating the coverage of topics in a document, i.e. the document-topic distribution: coverage of topic j in document i = (frequency of topic j in document i) / Σj (frequency of topic j in document i)
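The coverage formula in the second task is a simple normalisation. Here is a minimal Python sketch, assuming we already have a count of how often each topic appears in one document; the counts and topic names are hypothetical, loosely following the phone-review example.

```python
def topic_coverage(topic_counts):
    """Turn per-topic frequencies for one document into coverage shares."""
    total = sum(topic_counts.values())
    if total == 0:
        raise ValueError("document has no topic occurrences")
    return {topic: count / total for topic, count in topic_counts.items()}


# Hypothetical counts of topic mentions in a single review
review_counts = {"hardware": 5, "camera": 2, "music": 1, "packaging": 2}
coverage = topic_coverage(review_counts)
for topic, share in coverage.items():
    print(f"{topic}: {share:.0%}")
```

The shares for one document always add up to 1, which is what lets us read them as a distribution over topics.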

Some problems with defining a topic as a single term are:

  • Synonymy: if a document has several words with the same meaning (lunch, food, cuisine, meal etc.), the model would choose only one word (say food) as the topic and ignore all the others.
  • Polysemy: a word with multiple meanings, such as ‘stars’, would be incorrectly inferred as representing one topic, though the document could contain both topics (movie star and astronomical star). Telling these senses apart is the task of word sense disambiguation.

As per the points above, we need a more complex definition of a topic to solve the problems of synonymy and polysemy.

To summarize, there are multiple advantages of defining a topic as a distribution over terms.

Consider two topics – ‘magic‘ and ‘science‘. The term ‘magic’ would have a very high weight in the topic ‘magic’ and a very low weight in the topic ‘science’. That is, a word can now have different weights in different topics. You can also represent more complex topics which are hard to define via a single term.

There are multiple models through which you can model topics in this manner. We will look at two techniques: Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).

Probabilistic Model

Let us study the basic idea of generative probabilistic models in the context of topic modelling.  

Informally speaking, in generative probabilistic modelling, we assume that the data (which we observe, such as a set of documents) is ‘generated’ through a probabilistic model. Then, using the observed data points, we try to infer the parameters of the model which maximise the probability of observing the data.

For example, you can see POS tagged words as data generated from a probabilistic model such as an HMM. You then try to infer the HMM model parameters which maximize the probability of the observed data.

The following assumptions are made:

  • The words in documents are assumed to come from an imaginary generative probabilistic process (we assume that a probabilistic model generates the data, i.e. the words in a document).
  • The order of words in a document is assumed to be unimportant (if you change the order of the words, the topics stay the same).
  • The process to generate a document is to draw words one at a time (the generative process is assumed such that words are generated one after another, one at a time).

Let’s now learn about plate notation, which will be used later to understand PLSA and LDA.

  • Shaded node: an observed variable
  • Plate: a repetitive structure (the plate signifies repetition)
  • The number ‘N’ written at the bottom-right of a plate: the number of repetitions, i.e. a total of N X’s are present

Probabilistic Latent Semantic Analysis (PLSA) : Say there are M documents (represented by the outer plate in the figure below), and for simplicity, assume that there are N words in each document (the inner plate). Also, let’s assume you have k topics (k is not represented in the figure).

Each document contains each topic with some probability (document-topic distribution), and each topic contains the N words with some probability (topic-term distribution). The inference task is to figure out the M x k document-topic probabilities and the k x N topic-term probabilities. In other words, you want to infer Mk + kN parameters.

The basic idea used to infer the parameters, i.e. the optimisation objective, is to maximise the joint probability p(w, d) of observing the documents and the words (since those two are the only observed variables). Notice that you are doing something very clever (and difficult) here: using the observed random variables (d, w) to infer the unobserved random variable (c).

Using the product rule of probability, you can write p(w, d) as: p(w,d) = p(d) x p(w|d)

  1. M: the number of documents
  2. N: the number of terms or words in a document
  3. w (shaded): observed variables are shaded, according to the plate notation
  4. c: the topic (there are k topics in total)
  5. Unobserved variable: the topic is a latent variable, which is not observed but inferred
  6. The number of parameters in PLSA depends on the number of topics, the number of terms and the number of documents
  7. The total number of parameters in PLSA equals number of documents x number of topics + number of topics x number of terms

The term p(w|d) represents the probability of a word w being generated from a document d. But our model assumes that words are generated from topics, which in turn are generated from documents, so we can write p(w|d) as p(c|d) x p(w|c) summed over all k topics:

p(w|d) = ∑_c p(c|d) x p(w|c)

So, we have

p(w,d) = p(d) x ∑_c [p(c|d) x p(w|c)]
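The two formulas above can be sketched in plain Python. This is only an illustration of the sums involved, not a PLSA implementation (nothing is estimated here); all the probability values and the ‘magic’/‘science’ topic names are made-up examples.

```python
def word_given_doc(p_topic_given_doc, p_word_given_topic, word):
    """p(w|d): sum over topics c of p(c|d) * p(w|c)."""
    return sum(
        p_c * p_word_given_topic[topic].get(word, 0.0)
        for topic, p_c in p_topic_given_doc.items()
    )


def joint_word_doc(p_doc, p_topic_given_doc, p_word_given_topic, word):
    """p(w, d) = p(d) * p(w|d)."""
    return p_doc * word_given_doc(p_topic_given_doc, p_word_given_topic, word)


# Hypothetical distributions for one document and two topics
p_topic_given_doc = {"magic": 0.7, "science": 0.3}
p_word_given_topic = {
    "magic": {"wand": 0.6, "spell": 0.3, "atom": 0.1},
    "science": {"wand": 0.05, "spell": 0.05, "atom": 0.9},
}

p_atom = word_given_doc(p_topic_given_doc, p_word_given_topic, "atom")
p_joint = joint_word_doc(0.5, p_topic_given_doc, p_word_given_topic, "atom")
print(f"p(atom|d) = {p_atom:.2f}, p(atom, d) = {p_joint:.2f}")
```

In real PLSA these probabilities are unknown and are estimated from the observed words and documents; the sketch only shows how they combine once known.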

To summarise, PLSA models documents as a distribution over topics and topics as a distribution over terms. The parameters of PLSA are all the document-topic and topic-term probabilities, which are estimated using the expectation maximisation algorithm.

Drawbacks of PLSA: PLSA has lots of parameters (Mk + kN), and the count grows linearly with the number of documents M. Although estimating these parameters is not impossible, it is computationally very expensive.

For example, if you have 10,000 documents (say Wikipedia articles), 20 topics, and each document has an average of 1500 words, the number of parameters you want to estimate is 1500*20 + 20*10,000 = 230,000.
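The parameter count is easy to check with a couple of lines of Python. The function name is ours; the formula is the Mk + kN count given above.

```python
def plsa_parameter_count(num_documents, num_topics, num_terms):
    """Document-topic parameters (M*k) plus topic-term parameters (k*N)."""
    return num_documents * num_topics + num_topics * num_terms


small = plsa_parameter_count(10_000, 20, 1_500)
# Ten times as many documents, same topics and terms
large = plsa_parameter_count(100_000, 20, 1_500)
print(small, large)
```

Only the first term depends on the number of documents, so adding documents keeps adding parameters even though the topics and terms stay the same.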

We will cover the next part of Topic Modelling, LDA and a practical session on using it, in a coming blog. Please share your comments and feedback.

Team, Kite4Sky

How to Use PySpark with Jupyter Notebook in Remote Environment

Apache Spark is one of the most common tools for people working with Big Data. In a few words, Spark is a fast and powerful framework that provides an API to perform massive distributed processing over resilient sets of data.

Jupyter Notebook is a popular application that lets you edit, run and share Python code in a web view. You can easily run your code and see the output cell by cell. This is why it is the most common platform for programmers in the data science field, including in Kaggle competitions.

Installation: I am assuming that you have already installed Spark and Hadoop; if you want a blog on that, please mention it in the comment section so we can cover it in coming blogs. Before installing PySpark, you must have Python and Spark installed. We already have this whole environment installed.

Check whether PySpark is installed correctly:

  • Log in to the remote environment using Bitvise SSH Client or any other remote connectivity client
  • Let us check which databases we have access to when logged in with the SSH client

PySpark Installation: Let us check whether PySpark is installed correctly or not.

  • Set the Anaconda path with the command: PATH=/cloudera/parcels/Anaconda/bin:$PATH; export PATH. The path may be different for your environment setup
  • Type pyspark
  • We can see that PySpark is installed in our environment

Working with Jupyter Notebook integration with PySpark: Before moving to Jupyter Notebook there are a few steps for environment setup. Run all the commands below in the remote environment's command line.

a) Path Setup

1. locate

2. PATH=/cloudera/parcels/Anaconda/bin:$PATH; export PATH

3. export PYSPARK_DRIVER_PYTHON=jupyter

4. export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

5. env

6. pyspark

b) Then, at your local command prompt, forward port 8888 (or any other port you choose) so that the Jupyter Notebook running on the remote machine can be opened in your local environment: ssh -N -L localhost:8888:localhost:8888 id@remoteservername. This step is known as “setting up an SSH tunnel to your remote machine”. The command opens a new SSH session in the terminal. I’ve added the option -N to tell SSH that I’m not going to execute any remote commands. This ensures that the connection cannot be used in that way; see it as an added security measure.

I’ve also added the -L option, which tells SSH to forward port 8888 on my local machine to port 8888 on the remote machine.

ssh -N -L localhost:8888:localhost:8888 id@remoteservername

If you are using a service account, use the service account name in place of your own id.

c) Jupyter Notebook will open like this:

d) Sometimes you may get an error about multiple SparkContexts. This happens because when you type “pyspark” in the terminal, the system automatically initializes a SparkContext, so you should either reuse it or stop it before creating a new one.

Use sc = SparkContext.getOrCreate() to reuse the existing context, or call sc.stop() before creating a new one.

Now you can run your code in Jupyter Notebook with PySpark. We used a Linux environment for all of the setup above.

Please provide us with your feedback.

Regards Team Kite4Sky

Google AutoML

Automated Machine Learning Model

Automated machine learning (AutoML) is the process of automating the building of a machine learning model. It generally covers the whole path from raw data set to deployable model, including data preprocessing, model building, training, testing and evaluation. These approaches greatly shorten the development life cycle of a machine learning model and do not require much expertise in the field. All you need is to know how to use the services offered by the cloud-based tool and a basic overview of how AI/ML models work.

Machine learning (ML) has achieved considerable successes in recent years and an ever-growing number of disciplines rely on it. However, this success crucially relies on human machine learning experts to perform the following tasks:

  • Preprocess and clean the data.
  • Select and construct appropriate features.
  • Select an appropriate model family.
  • Optimize model hyperparameters.
  • Postprocess machine learning models.
  • Critically analyze the results obtained.

With AutoML techniques we can save a lot of time on the steps listed above.

Some of the popular AutoML tools are: AutoKeras, H2O AutoML, Amazon Lex, AWS SageMaker, Google AutoML, IBM Watson, Microsoft Azure and many more. We will cover how to use Google AutoML and its services.

How To Use Google AutoML: We will use a Jupyter Notebook to work with Google AutoML, and show how you can call Google AutoML services from a Jupyter Notebook or plain Python.

  1. Create or select a GCP (Google Cloud Platform) project: Create your account in Google Cloud. You will be provided with some free credits; use them for your first working AutoML project in Google.
  2. Enable billing: Once billing is enabled you will be provided with some free credits.
  3. Enable the APIs for AutoML and Google Storage: Select the project that you created in step 1 and enable the APIs for it.
  4. We will be using Data from site: where we will be using Text Data from tweets: to identify Real or fake tweets.
  5. Create a storage bucket in the storage section of the Google Cloud console; it will be used for storing data in and reading data from GCS (Google Cloud Storage).
  6. Import the libraries needed for Google AutoML.
  7. GCS upload/download utility functions: These functions make it easier to upload and download files between the kernel and Google Cloud Storage, which AutoML needs. We took help from the Google AutoML documentation for this.
  8. Export your data to CSV and upload it to the GCS storage bucket.
  9. Create a class instance for using the Natural Language service.
  10. Create the data set.
  11. Use the created data set to train the model.
  12. Once the model is trained, make predictions with it.
  13. You can refer to my Kaggle Notebook for practical demo:
Storage Bucket Pic
Model Evaluation
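
The CSV export in step 8 can be sketched with the Python standard library alone. This is only an illustration: the two-column layout (tweet text, then label), the labels "real" and "fake" and the sample tweets are our assumptions, so please check the AutoML documentation for the exact import format. Uploading the file to the bucket is not shown here.

```python
import csv
import io


def tweets_to_csv(rows):
    """Write (text, label) pairs as CSV text, one tweet per row, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text, label in rows:
        # Collapse newlines and repeated spaces so each tweet stays on one row.
        writer.writerow([" ".join(text.split()), label])
    return buffer.getvalue()


# Hypothetical sample rows, not taken from the real data set.
sample_rows = [
    ("Forest fire near the lake, residents asked to leave", "real"),
    ("This new album is\nfire", "fake"),
]
csv_text = tweets_to_csv(sample_rows)
print(csv_text)
```

The csv module quotes any tweet that contains a comma, so the text column cannot spill into the label column.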

For code and how to make use of Python for using Google AutoML services, follow my Notebook:

Please provide your feedback


Team Kite4sky

Chat Bot Challenges(Azure Services)

Chatbots, recommendation engines and similar tools have become popular in a large number of business domains such as finance, banking, and tours and travel. Companies are building chatbots for booking hotels and movies, customer support, bus and flight enquiries, tax-saving advice, stock advice and much more.

They are in such demand because they reduce the time, effort and cost required to get a task done (and, if designed well, improve the user experience as well).

Chatbots can be divided into two broad categories:

  1. Generic chatbots: also known as virtual assistants, such as Google Assistant, Amazon Alexa, Cortana and Siri. They can be used for calling, typing messages, booking calendar slots and fetching results from the web. These systems have been trained on massive amounts of user data, encyclopedias, conversational dialogues, stories and so on.
  2. Domain-specific chatbots: These are built for one particular domain. Think of a bank bot that answers questions about its own bank, say ICICI Bank, or a weather bot that can only give weather-related information. Such a bot cannot book a table in a restaurant or set a morning alarm for you.

Popular paid platforms such as IBM Watson and Microsoft Azure let you build chatbots using NodeJS, C# or Python together with the services they provide. For example, Microsoft Azure has two important services for building a bot: QnA Maker and LUIS.

We will look at some of the challenges you face when working with a Microsoft Azure service like LUIS (Language Understanding Intelligent Service), a machine learning-based service for building natural language into apps, bots and IoT devices, and discuss step by step how to solve them. LUIS applies custom machine-learning intelligence to a user’s conversational, natural language text to predict the overall meaning and to extract relevant details such as entities.

LUIS offers two types of model, as listed below. We are using a model similar to the prebuilt model.

  • Prebuilt models include prebuilt entities and can be used without any ML/AI-specific work on your part.
  • Custom models have custom entities: LUIS lets you define your own intents and entities, including ML-driven entities or a combination of both.

Some of the challenges that arise with either of the two models are:

Too many intents can confuse the LUIS model

In order to get the same top intent across all the apps, make sure the gap in prediction score between the first and second intent is wide enough that LUIS is not confused and does not give different results between apps for minor variations in utterances.
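
To make the "wide enough" check concrete, here is a minimal sketch in plain Python. It assumes you already have the intent scores for an utterance as a dictionary; the intent names and the 0.15 threshold are hypothetical values chosen for illustration, not something LUIS prescribes.

```python
def is_clear_prediction(intent_scores, min_margin=0.15):
    """Return (top_intent, margin, is_clear) for a dict of intent -> score.

    min_margin is an assumed threshold; tune it for your own app.
    """
    ranked = sorted(intent_scores.items(), key=lambda item: item[1], reverse=True)
    top_intent, top_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    margin = top_score - second_score
    return top_intent, margin, margin >= min_margin


# Hypothetical scores for one utterance.
scores = {"BookFlight": 0.52, "CancelFlight": 0.47, "None": 0.03}
top, margin, clear = is_clear_prediction(scores)
print(top, round(margin, 2), clear)
```

When the margin is small, as in this example, the utterance is a good candidate for review: the two intents probably share vocabulary or should be merged.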

Entities are optional but highly recommended:

You do not need to create entities for every concept in your app, but only for those required for the app to act. Entities represent parameters or data for an intent: they are the data you want to pull from the utterance. This can be a name, date, product name, or any group of words. For example, in the utterance “Buy a ticket from New York to London on March 5”, three entities are used: Location.Origin, Location.Destination and the prebuilt datetimeV2, which would contain the values “New York”, “London” and “March 5” respectively.

Entities are shared across intents:

Entities are shared among intents. They don’t belong to any single intent. Intents and entities can be semantically associated but it is not an exclusive relationship. In the utterance “Book a meeting with Kamal”, “Kamal” is an entity of type Person. By recognizing the entities that are mentioned in the user’s input, LUIS helps you choose the specific actions to take to fulfill an intent.

But sometimes, when your model has too many intents, similar utterances under two intents or entities, or an intent containing more than one entity, your LUIS model can make wrong predictions. For example:

Solutions for the above challenges:

  • Composite entity: A composite entity is made up of other entities that form parts of a whole.
  • Pattern.any: An entity whose end is difficult to determine. It also uses ML capability.
  • Regular expression: Use a particular pattern from the text.
  • Machine-learned (ML) entities work best when tested via endpoint queries and by reviewing endpoint utterances.

In our current scenario we are facing wrong intent identification, particularly when more than one entity is used.

  1. It’s all about finding a balance between the number of intents and the number of options or actions within an intent. To explain, let’s take an example where a user wants to ask questions about a company and questions about an aggregation. Initially the questions are simple, like “Contact No of Kamal” or “Who is Chetan”. But when you start expanding or allowing more options, such as “Who is Kamal Naithani and what are his skills set”, the model may either respond with the wrong intent or give no answer.
  2. We can use patterns in such cases: Patterns are tools you can use to generalize common utterances, wording, or ordering that signals a particular intent. LUIS first recognizes entities, and can then use matching to identify patterns within the rest of the utterance. Patterns improve the prediction accuracy of utterances by using entities and their roles to extract data with a specific pattern. This reduces the number of example utterances you need to provide to teach LUIS about the common utterances for an intent, saving training time while improving accuracy.
  3. Similarly, phrase lists can be used to provide hints that certain words and phrases are part of a category. If LUIS learns how to recognize one member of the category, it can treat the others similarly. Adding an interchangeable phrase list improves the accuracy of intent scores and helps identify entities for words that have the same meaning (synonyms).
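
To show the idea behind patterns, here is a toy matcher in plain Python. It is not how LUIS works internally and it does not call any LUIS API: it simply turns a template with entity slots in curly braces into a regular expression. The intent names, entity names and templates are hypothetical, built from the example questions above.

```python
import re


def template_to_regex(template):
    """Turn 'who is {Person}' into a regex with one named group per slot."""
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal text of the template
        else:
            pattern += f"(?P<{part}>.+?)"    # entity slot
    return re.compile(rf"^{pattern}$", re.IGNORECASE)


# More specific templates come first so they win over the general ones.
PATTERNS = [
    ("GetContact", template_to_regex("contact no of {Person}")),
    ("GetProfileAndSkills",
     template_to_regex("who is {Person} and what are his {Attribute}")),
    ("GetProfile", template_to_regex("who is {Person}")),
]


def match_pattern(utterance):
    """Return (intent, entities) for the first matching template."""
    for intent, regex in PATTERNS:
        found = regex.match(utterance.strip())
        if found:
            return intent, found.groupdict()
    return None, {}


print(match_pattern("Who is Chetan"))
print(match_pattern("Who is Kamal Naithani and what are his skills set"))
```

One template covers every person name, which is why patterns reduce the number of example utterances you have to write by hand.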

When LUIS makes a prediction, it gets results from all models and picks the top scoring model. Example utterances assigned to one intent act as negative examples for all other intents.

Intents with significantly more positive examples are more likely to receive positive predictions. This is called data imbalance, and we have to address it with proper analysis.

Make sure the vocabulary for each intent is used only in that intent.
Intents with overlapping vocabulary can confuse LUIS and cause the scores for the top intents to be very close. This is called unclear prediction. We have to check whether any such cases exist.

Apply active learning to add user utterances.

LUIS is meant to learn quickly with few examples. User utterances will help you find examples in a variety of formats, so add phrase lists and patterns based on them.

Use the Dispatch service to increase accuracy when you have a large number of domains or intents.

Dispatch service: If your app is meant to predict a wide variety of user utterances, consider implementing the dispatcher model. If a bot uses multiple LUIS models and QnA Maker knowledge bases, you can use the Dispatch tool to determine which LUIS model or QnA Maker knowledge base best matches the user input.

  1. The parent app indicates top-level categories of questions.
  2. Create a child app for each subcategory.
  3. The child app breaks up the subcategory into relevant intents. Breaking up a monolithic app allows LUIS to focus detection between intents successfully instead of getting confused between intents across the top level and intents between the top level and sub levels.
  4. Schedule a periodic review of endpoint utterances for active learning, such as every two weeks, then retrain and republish.

Batch testing: You can use batch testing to understand the problem and improve results. Batch testing validates your active trained model to measure its prediction accuracy. A batch test helps you view the accuracy of each intent and entity in your current trained model, displaying the results in a chart.

  1. Review the batch test results and take appropriate action to improve accuracy, such as adding more example utterances to an intent if your app frequently fails to identify the correct intent.
  2. The chart divides the utterance/intent or utterance/entity results into four quadrants: True Positive, False Positive, False Negative and True Negative. Data points in the False Positive and False Negative sections indicate errors, which should be investigated. If all data points are in the True Positive and True Negative sections, then your app’s accuracy is perfect on this data set.
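
The four quadrants can be counted by hand once you have the expected and predicted intent for each test utterance. The sketch below is plain Python and does not use the LUIS batch-testing feature itself; the intent names and the result pairs are hypothetical.

```python
from collections import Counter


def quadrants_for_intent(intent, labelled_results):
    """Count TP/FP/FN/TN for one intent from (expected, predicted) pairs."""
    counts = Counter()
    for expected, predicted in labelled_results:
        if predicted == intent and expected == intent:
            counts["TP"] += 1   # predicted the intent, and it was right
        elif predicted == intent:
            counts["FP"] += 1   # predicted the intent, but it was wrong
        elif expected == intent:
            counts["FN"] += 1   # missed an utterance that belongs to the intent
        else:
            counts["TN"] += 1   # correctly left the intent out
    return {name: counts[name] for name in ("TP", "FP", "FN", "TN")}


# Hypothetical (expected, predicted) pairs from a small batch test.
results = [
    ("GetContact", "GetContact"),
    ("GetContact", "GetProfile"),
    ("GetProfile", "GetProfile"),
    ("None", "GetContact"),
    ("None", "None"),
]
print(quadrants_for_intent("GetContact", results))
```

Any utterance that lands in FP or FN is worth opening and reading: it usually points to overlapping vocabulary or a missing example.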

Some of the best practices while working with LUIS (we will also validate the current LUIS model design against them):

  1. Intents: Create an intent when it would trigger an action.
  2. Intents should be specific.
  3. If intents are semantically close, consider merging them.
  4. Entities: Create an entity when the bot needs some parameters or data from the utterance.
  5. Use ML-related entities.
  6. Utterances: Begin with 10-15 utterances per intent.
  7. Each utterance should be contextually different.
  8. The None intent should have between 10% and 20% of the total utterances.
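
The utterance guidelines in this list are easy to check automatically. Below is a minimal sketch, assuming you can export the number of example utterances per intent as a dictionary; the function name, intent names and counts are hypothetical, and the thresholds simply mirror the numbers in the list.

```python
def review_utterance_counts(counts, min_per_intent=10, none_share=(0.10, 0.20)):
    """Return a list of warnings about the utterance distribution."""
    total = sum(counts.values())
    notes = []
    for intent, n in counts.items():
        if intent != "None" and n < min_per_intent:
            notes.append(f"{intent}: only {n} utterances, fewer than {min_per_intent}")
    share = counts.get("None", 0) / total if total else 0.0
    low, high = none_share
    if not low <= share <= high:
        notes.append(
            f"None intent holds {share:.0%} of utterances, outside {low:.0%}-{high:.0%}"
        )
    return notes


# Hypothetical utterance counts per intent.
for note in review_utterance_counts({"GetContact": 15, "GetProfile": 8, "None": 2}):
    print(note)
```

The same counts also reveal data imbalance: an intent with many more examples than the others deserves a closer look.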

Please provide your comments and suggestions.


Team Kite4Sky


This is the fourth blog in our Lexical Processing series. Please watch the first three videos, as by the end of this module we will build a Spam/Ham detector model that uses all the techniques we are covering one by one.

In the spam detector model, which we will build at the end of this module, we will use word tokenisation, i.e. breaking a chunk of text into words or tokens. When we deal with a large amount of text, there is a lot of noise in the data: non-uniform case, punctuation, spelling errors. These are exactly the things that make text data hard to work with.

There is another thing to think about: how to extract features from messages or large chunks of text so that you can build a classifier. When you create a machine learning model for text, you have to feed it features for each message that the learning algorithm can use to build the model. So how does a machine learning algorithm read text? As we know, machine learning works on numeric data, not text. With predictive models or classification algorithms such as logistic regression or SVM, text columns are usually treated as categorical variables and then converted into numerical values for each category, or into dummy variables. Here you can do neither, because the message column in the Spam/Ham example is unique for every row; it is not a categorical variable. If you treat it as categorical, your model will fail miserably.

To deal with this, you extract features from the messages (we treat all mails as messages). You break each message into separate words known as ‘tokens’. This technique is called tokenisation: splitting a chunk of text into smaller units or tokens. These tokens can be characters, words, sentences, etc.


The notebook contains four types of tokenisation techniques:

  1. Word tokenisation
  2. Sentence tokenisation
  3. Tweet tokenisation
  4. Custom tokenisation using regular expressions

1. Word tokenisation: To break text into word tokens we import the word_tokenize function from nltk.tokenize. The same can be done using the spacy package in Python. Many people will say that split() works the same way, but split() breaks the text on white space only. For a sentence such as “It looks too good.”, split() returns the last word together with the full stop as “good.”, which is wrong.
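
The difference is easy to see in a few lines. The sketch below uses only the standard library: the regular expression is our own simple stand-in for a word tokeniser, not the nltk implementation, so treat it as an illustration of the idea.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


sentence = "It looks too good."
print(sentence.split())                # whitespace split keeps "good." together
print(simple_word_tokenize(sentence))  # punctuation becomes its own token
```

With split() the classifier would see “good” and “good.” as two different features, which is exactly the kind of noise we want to avoid.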

2. Sentence tokeniser: Tokenising into sentences requires you to split on the period (‘.’). Let’s use the nltk sentence tokeniser. Say we have the text “I am learning NLP as it is most widely used automated technique for understanding unstructured data. Giving response to user query in form of chat bots. It is used for capturing sentiments.” The tokeniser will break the text into sentences at each full stop.
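
Here is a naive version of the same idea using only the standard library. It is a rough sketch and not the nltk tokeniser: it splits after ‘.’, ‘!’ or ‘?’ followed by white space, so it will wrongly split on abbreviations such as “Dr.”, which a proper sentence tokeniser handles better.

```python
import re


def simple_sentence_tokenize(text):
    """Split after '.', '!' or '?' when followed by white space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


text = (
    "I am learning NLP as it is most widely used automated technique for "
    "understanding unstructured data. Giving response to user query in form "
    "of chat bots. It is used for capturing sentiments."
)
for sentence in simple_sentence_tokenize(text):
    print(sentence)
```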

3. Tweet tokeniser: A problem with the word tokeniser is that it fails to tokenise emojis and other complex special characters, such as words with hashtags. Emojis are common these days and people use them all the time. That is why we have to use the tweet tokeniser. For example, consider a message = “i watched the movie limitless:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎”. The word tokeniser breaks the emoticon ‘<3’ into ‘<’ and ‘3’, which is something we don’t want. Emojis have their own significance in areas like sentiment analysis, where a happy face or a sad face can alone prove to be a really good predictor of the sentiment. Similarly, hashtags are broken into two tokens. A hashtag is used for searching specific topics or photos in social media apps such as Instagram and Facebook, so you want to keep the hashtag as is.
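
To see what a tweet-aware tokeniser has to do, here is a small standard-library sketch. It is our own simplified stand-in and not the nltk tweet tokeniser: the pattern only knows hashtags, ‘<3’ and a handful of emoticons, and emojis made of several code points would be split.

```python
import re

TWEET_TOKEN = re.compile(
    r"#\w+"            # hashtags kept whole
    r"|<3"             # heart emoticon
    r"|[:;]-?[)(DP]"   # a few common emoticons such as :) ;-) :D
    r"|\w+"            # words and numbers
    r"|[^\w\s]"        # any other single symbol, including simple emojis
)


def tweet_tokenize(text):
    """Return tokens, keeping hashtags and known emoticons in one piece."""
    return TWEET_TOKEN.findall(text)


message = ("i watched the movie limitless:). i totally loved it 😍. "
           "it was gr8 <3. #bingewatching #nothingtodo 😎")
print(tweet_tokenize(message))
```

The order of the alternatives matters: hashtags and emoticons are tried before plain words and single symbols, otherwise ‘<3’ would again fall apart into ‘<’ and ‘3’.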

As you can see, it handles all the emojis and the hashtags pretty well.

4. Regular expression tokeniser: There is also a tokeniser that takes a regular expression and returns tokens based on that pattern. Regular expressions are very efficient when you have to extract a particular pattern from text, such as a phone number or a particular ID pattern (an employee ID, for example). Let’s look at how you can use the regular expression tokeniser.

You were able to extract hashtag-related information from the text.

This can be very important for tracking Twitter hashtag trends, i.e. which hashtags are most commonly used or trending on Twitter.
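
Hashtag extraction and a simple trend count can be sketched with the standard library as follows. We use re.findall here instead of the nltk regular expression tokeniser, and the tweets are made-up examples.

```python
import re
from collections import Counter


def extract_hashtags(text):
    """Return every hashtag (a '#' followed by word characters) in the text."""
    return re.findall(r"#\w+", text)


# Hypothetical tweets for illustration.
tweets = [
    "it was gr8 <3. #bingewatching #nothingtodo",
    "weekend plans? #bingewatching",
    "rainy day #nothingtodo #bingewatching",
]

trend = Counter(tag for tweet in tweets for tag in extract_hashtags(tweet))
print(extract_hashtags(tweets[0]))
print(trend.most_common(1))
```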

Do leave us your valuable feedback. In the next part of this module series we will consider the Bag-of-Words representation. Regards Kamal/Chetan

Lexical Processing Part II

In this part of Lexical Processing we will first focus on word frequency and stop words, and then move on to a practical demonstration.

While working with any kind of data, structured or unstructured, we should have a proper understanding of it, and so we have to do some pre-processing steps. As we know, text is made of characters, words, sentences, paragraphs, etc. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e. to visualize the word frequencies of a given text corpus. It turns out that there is a common pattern when you plot word frequencies in a fairly large corpus of text, such as news articles, product reviews, viral tweets or Wikipedia articles. We will see why stop words are the less important words in a text.

Word frequency and their significance

In the 20th century the linguist George Zipf studied how often different terms occur. After reading many documents and counting the number of times each word appeared in them, he ranked the words by their frequency of occurrence. The most frequent word is given rank 1, the next most frequent rank 2, and so on, so the least frequent words get the last ranks. In mathematical terms, f(word) * r(word) ≈ constant, where f is the frequency of a word and r is its rank. This is known as Zipf’s law, a power-law distribution, and it is often compared with Pareto analysis (the 80:20 rule), i.e. about 20% of the words contribute about 80% of the occurrences. In other words, a few words have very high frequency; these are generally known as language-builder words (is, the, then, that, etc.). In the image below they fall in the area above the upper cutoff. These words cannot tell you what the document is about, or, if someone has written a product review, they cannot tell you its context.
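
You can build the rank and frequency table yourself with a few lines of standard-library Python. The sample sentence is made up and far too small for the product of frequency and rank to settle near a constant; the pattern only shows up on a fairly large corpus, so try the function on a long article.

```python
import re
from collections import Counter


def rank_frequency_table(text):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


sample_text = "the cat sat on the mat and the dog sat on the rug"
for rank, word, freq, product in rank_frequency_table(sample_text)[:5]:
    print(rank, word, freq, product)
```

Rank 1 goes to the word with the highest count, which in almost any English text will be a stop word such as “the”.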

The graph above also shows a second pattern in the form of a bell curve (Gaussian distribution), which describes how relevant a word is to a particular document: words with very high frequency are less relevant to the document, just as very rare words are. This relevance curve is usually attributed to H. P. Luhn, who built on Zipf’s law. Please refer to the image below.

This is why we sometimes have to remove stop words: they are high in frequency but carry little relevance for the document. Let us take an example: “You have won Rs 100000 in monthly Jackpot held by Mahalaxmi Lottery !!”. Here you can easily identify from the words Jackpot and Mahalaxmi that this is a spam mail. In the same way, take “Virat Kohli hit his 31st century at ODI yesterday”. Here you can easily identify from words like Virat and ODI that it is ham. So Zipf’s law helps us form the basic intuition for stop words: these are the words with the highest frequency (or lowest rank) in a text, and they are typically of limited ‘importance’.

Broadly speaking, there are three kinds of words present in a text corpus:

  • Highly frequent words, called stop words, such as is, as, this, etc.
  • Significant words, which are more important for understanding the text
  • Rarely occurring words, which are again less important than significant words

Generally speaking, stop words are removed from text for two reasons:

  • They provide no useful information in applications such as Spam/Ham mail detection or search engines
  • As their frequency is very high, removing them reduces the data size
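
Stop word removal itself is a simple filter. The sketch below uses a tiny hand-written stop word list that we made up for this example; in practice you would use a fuller list from an NLP library.

```python
import re

# A tiny illustrative stop word list, not a complete one.
STOP_WORDS = {
    "a", "an", "the", "is", "as", "this", "that", "then", "in", "by",
    "have", "you", "at", "his", "of", "to", "and",
}


def remove_stop_words(text):
    """Lower-case the text, tokenise on word characters, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


message = "You have won Rs 100000 in monthly Jackpot held by Mahalaxmi Lottery !!"
print(remove_stop_words(message))
```

The words that survive the filter (jackpot, lottery and so on) are the ones that actually hint at spam, and the message is shorter too.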

However, there are some exceptions where we have to keep stop words. We will study them later in the series.

Please refer to

In the next session we will discuss tokenisation. Please share your comments and feedback about the blogs. Regards Chetan/Kamal

Lexical Processing Part I

As discussed in our last post, we will now focus on Lexical Processing. It generally means extracting the raw text and identifying and analyzing the structure of words. Lexical analysis breaks a whole document into sentences and sentences into words; put simply, it breaks a chunk of text into tokens or smaller units.

The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.

So why is lexical processing required?

  • If an email contains words such as lottery or bumper bonanza, you can easily identify that the email is spam. That is why you have to break text into tokens.
  • In general, the words in a sentence taken together tell you what it is about. To treat different forms of a word, such as plural and singular, as the same word we use stemming, which is part of lexical processing. For example, dogs and dog have the same meaning.
  • A large share of the words in a text (we can say around 80%) are not important, such as the, is, that, so we sometimes remove them. These are generally termed stop words and are usually removed for Spam/Ham mail classification. They may, however, be required when we are dealing with language understanding, as in recommendation systems, machine translation or generic chatbots such as Alexa. If you ask Alexa “Who is Shahrukh Khan” and “What is Shahrukh’s first movie”, the stop words are needed.

So let us move step by step through Lexical Processing. I will be creating a GitHub repository from where you can download the code.

  • How to preprocess text using tokenisation, stop word removal, stemming and lemmatization
  • How to build a Spam/Ham mail detector model using Bag of Words and TF-IDF models

We will cover the first part in the next session. Do leave your comments and feedback below. Happy Learning. Regards Chetan/Kamal

Introduction To NLP

Natural Language Processing

  • What is NLP and why is text analytics required?

NLP is a technology that allows machines to understand human language or unstructured data in the form of text. As humans we can all understand tenses (past, present, future, etc.), understand the meaning of a sentence (clearly differentiating between words when they appear together in a sentence), and recognize entities (for now we can relate an entity to a noun; we will study this in more detail in coming topics). NLP lets individuals use their normal speech and writing patterns to communicate with computer systems in a more convenient way, and helps extract meaningful facts from textual data.

Why text analytics is required: In this era of technology that changes day by day, we have gathered a huge amount of unstructured data. Why are we all so greedy for this data? Because data in the form of text may provide meaningful information or important KPIs, which can be very beneficial. Think of it this way: tweets about government policies, say demonetization in India, can reveal how the people of the country are taking that government step. Customer reviews on an Amazon page can give feedback about the product, the seller and in fact Amazon itself. So you can see how important and crucial this data is in each and every field.

Text Analytics: Areas of Application

  1. Social Media Analytics
  2. Banking and Loan Processing
  3. Insurance Claim Processing
  4. Help Desk or Ticketing System/Call Centers
  5. E-Commerce
  6. Psychology
  7. Cognitive Science
  8. Security and Counter Terrorism
  9. Government and Government Policies
  10. Computational Social Science
  11. and many more

So this is an overview of NLP. In the coming blogs we will focus on the three main pillars of text analytics, the “steps” generally undertaken on the journey from data to meaning. They can be divided into three parts, which are further divided into sub-parts. We will cover industry examples with proper explanation of the algorithms used.

  1. Lexical Processing: Extracting the raw form of the text and then applying techniques like stop word removal, Bag of Words, stemming, etc.
  2. Syntactic Processing: Understanding the language of the text: its grammar and parsing techniques such as POS tagging.
  3. Semantic Processing: Understanding the meaning of the text, using techniques such as LDA, LSA and PLSA.

So stay tuned, we will be covering all this in the coming weeks. Regards Chetan/Kamal