Topic modelling is an unsupervised machine learning technique for identifying the abstract “topics” that are discussed across multiple documents. Say a newer version of the iPhone or the Galaxy Note series has launched, and the company wants to understand which features customers are talking about across n reviews (each review is treated as one document). The analysis might show that 50% of customers talk about the hardware, 20% about camera quality and features, 10% about music quality, and 20% about the packaging of the product.
Similarly, say you have a large corpus of scientific documents and you want to build a search engine for this corpus. Imagine n documents discussing diseases of the heart, the lungs, diabetes, and so on; applying topic modelling to these documents helps identify the important themes in each document and the key terms that are most talked about or most strongly associated with each disease.
We will cover the intuition behind topic modelling along with a practical demonstration using LDA (Latent Dirichlet Allocation).
Defining a topic: As with other semantic analysis techniques, we define a topic as a distribution over terms, i.e. each term has a certain “weight” in each topic, where the terms are the k distinct words appearing across the n documents. But is this the only way to define a topic? What are the other ways in which topics can be defined?
There are two major tasks in topic modelling:
- Estimating the topic-term distribution: in the simplest case, we define each topic as a single term (a definition that will change when we get to LDA)
- Estimating the coverage of topics in a document, i.e. the document-topic distribution: coverage of topic j in document i = (frequency of topic j in document i) / Σj (frequency of topic j in document i)
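To make the coverage formula concrete, here is a minimal Python sketch, assuming we already have a (hypothetical) topic label for each word in a document — the labels below are illustrative, not the output of any real model:

```python
from collections import Counter

# Hypothetical per-word topic assignments for one review (one document).
assignments = ["hardware", "camera", "hardware", "music", "hardware", "camera"]

# Coverage of topic j = frequency of j in the document / total topic frequency.
freq = Counter(assignments)
total = sum(freq.values())
coverage = {topic: n / total for topic, n in freq.items()}
print(coverage)  # hardware → 0.5, camera → ~0.33, music → ~0.17
```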
Some problems with defining a topic as a single term are:
- Synonymy: if a document has words with the same meaning (lunch, food, cuisine, meal, etc.), the model would choose only one word (say ‘food’) as the topic and ignore all the others.
- Word sense disambiguation: words with multiple meanings (polysemy), such as ‘star’, would be incorrectly inferred as representing a single topic, even though the document could contain both senses (movie star and astronomical star)
As the points above show, we need a more complex definition of a topic to handle both words with shared meanings and words with multiple meanings.
To summarize, there are multiple advantages of defining a topic as a distribution over terms.
Consider two topics – ‘magic‘ and ‘science‘. The term ‘magic’ would have a very high weight in the topic ‘magic’ and a very low weight in the topic ‘science’. That is, a word can now have different weights in different topics. You can also represent more complex topics which are hard to define via a single term.
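A minimal sketch of this idea, with made-up term weights (only a few high-weight terms of each truncated distribution are shown):

```python
# Made-up topic-term weights for illustration only; a real model would
# assign a weight to every vocabulary term under every topic.
topics = {
    "magic":   {"magic": 0.40, "wand": 0.25, "spell": 0.20, "experiment": 0.01},
    "science": {"magic": 0.01, "experiment": 0.35, "theory": 0.30, "lab": 0.20},
}

# The same term carries different weights under different topics.
print(topics["magic"]["magic"], topics["science"]["magic"])  # 0.4 0.01
```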
There are multiple models through which you can model the topics in this manner. You will study two techniques in the following lectures – Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).
Let us study the basic idea of generative probabilistic models in the context of topic modelling.
Informally speaking, in generative probabilistic modelling, we assume that the data (which we observe, such as a set of documents) is ‘generated’ through a probabilistic model. Then, using the observed data points, we try to infer the parameters of the model which maximise the probability of observing the data.
For example, you can see POS tagged words as data generated from a probabilistic model such as an HMM. You then try to infer the HMM model parameters which maximize the probability of the observed data.
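To see the inference idea in the simplest possible setting, here is a toy sketch with a biased coin rather than an HMM: the parameter value that maximises the probability of the observed data is just the relative frequency in the observations:

```python
# A trivially simple generative model: each observation is drawn i.i.d.
# from a Bernoulli distribution with unknown parameter p.
observed = ["heads", "heads", "tails", "heads"]

# The maximum-likelihood estimate of p is the fraction of 'heads' observed.
p_mle = observed.count("heads") / len(observed)
print(p_mle)  # 0.75
```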
The following assumptions are made:
- The words in documents are assumed to come from an imaginary generative probabilistic process(We assume that a probabilistic model generates the data, i.e. the words in a document).
- The order of words in a topic is assumed to be unimportant(If you change the order of words in a topic, the topic will stay the same)
- The process to generate a document is to draw words one at a time(The generative process is assumed such that words are generated one after another, one at a time)
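The order-invariance assumption above gives the familiar bag-of-words view; a quick sketch:

```python
from collections import Counter

# Two word sequences that differ only in order produce identical
# bag-of-words counts, so the model treats them as the same document.
doc_a = "the cat sat on the mat".split()
doc_b = "on the mat the cat sat".split()
print(Counter(doc_a) == Counter(doc_b))  # True
```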
Let’s now learn about plate notation, which will be used later to understand PLSA and LDA.
- Shaded Node: These are observed variables
- Plate: A repetitive structure (The plate signifies repetition)
- What does the number ‘N’ written on the bottom-right of the plate represent? The number of X’s (a total of N X’s are present)
Probabilistic Latent Semantic Analysis (PLSA) : Say there are M documents (represented by the outer plate in the figure below), and for simplicity, assume that there are N words in each document (the inner plate). Also, let’s assume you have k topics (k is not represented in the figure).
Each document contains each topic with some probability (document-topic distribution), and each topic contains the N words with some probability (topic-term distribution). The inference task is to figure out the M x k document-topic probabilities and the k x N topic-term probabilities. In other words, you want to infer Mk + kN parameters.
The basic idea used to infer the parameters, i.e. the optimisation objective, is to maximise the joint probability p(w, d) of observing the documents and the words (since those two are the only observed variables). Notice that you are doing something very clever (and difficult) here – using the observed random variables (d, w) to infer the unobserved random variable (c).
Using the product rule of probability, you can write p(w, d) as: p(w,d) = p(d) x p(w|d)
- M represents: the number of documents
- N represents: the number of terms or words in a document
- W is shaded: observed variables are shaded, according to the plate notation
- C represents: the topic variable (one of k topics)
- What is the unobserved variable: the topic is a latent variable, which is not observed but inferred
- The number of parameters in PLSA depends upon: the number of topics, the number of terms and the number of documents
- The total number of parameters in PLSA is equal to: the number of documents x the number of topics + the number of topics x the number of terms (the parameters depend upon the number of documents, the number of topics and the number of terms)
The term p(w|d) represents the probability of a word w being generated from a document d. But our model assumes that words are generated from topics, which in turn are generated from documents, so we can write p(w|d) as p(w|c) x p(c|d) summed over all k topics:

p(w|d) = Σc [p(c|d) x p(w|c)]

So, we have:

p(w,d) = p(d) x Σc [p(c|d) x p(w|c)]
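Plugging assumed toy numbers into this decomposition (all values below are illustrative, not from any real corpus):

```python
# Toy setting: 2 equally likely documents and 2 topics.
p_d = 0.5                    # p(d): each document equally likely
p_c_given_d = [0.7, 0.3]     # p(c|d) for topics c = 0, 1 in document d
p_w_given_c = [0.10, 0.40]   # p(w|c) for one particular word w

# p(w|d) = sum over topics of p(c|d) * p(w|c), then p(w,d) = p(d) * p(w|d).
p_w_given_d = sum(pc * pw for pc, pw in zip(p_c_given_d, p_w_given_c))
p_w_d = p_d * p_w_given_d
print(p_w_d)  # 0.5 * (0.7*0.10 + 0.3*0.40) ≈ 0.095
```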
To summarise, PLSA models documents as a distribution over topics and topics as a distribution over terms. The parameters of PLSA are all the probabilities of association between documents and topics and between topics and terms, which are estimated using the expectation-maximisation (EM) algorithm.
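As a rough illustration of that EM procedure, here is a toy pure-Python sketch; the function name, the tiny corpus and all numbers are our own assumptions, not from the lecture:

```python
import random

def plsa_em(counts, k, iters=50, seed=0):
    """Toy PLSA fit via EM. counts[d][w] = frequency of word w in document d."""
    rng = random.Random(seed)
    M = len(counts)        # number of documents
    V = len(counts[0])     # vocabulary size

    def rand_dist(n):
        xs = [rng.random() + 1e-3 for _ in range(n)]
        s = sum(xs)
        return [x / s for x in xs]

    p_c_d = [rand_dist(k) for _ in range(M)]   # p(c|d): M x k
    p_w_c = [rand_dist(V) for _ in range(k)]   # p(w|c): k x V

    for _ in range(iters):
        new_cd = [[0.0] * k for _ in range(M)]
        new_wc = [[0.0] * V for _ in range(k)]
        for d in range(M):
            for w in range(V):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior p(c|d,w) proportional to p(c|d) * p(w|c).
                post = [p_c_d[d][c] * p_w_c[c][w] for c in range(k)]
                z = sum(post) or 1.0
                for c in range(k):
                    r = n_dw * post[c] / z
                    new_cd[d][c] += r
                    new_wc[c][w] += r
        # M-step: renormalise the expected counts into distributions.
        for d in range(M):
            s = sum(new_cd[d]) or 1.0
            p_c_d[d] = [x / s for x in new_cd[d]]
        for c in range(k):
            s = sum(new_wc[c]) or 1.0
            p_w_c[c] = [x / s for x in new_wc[c]]
    return p_c_d, p_w_c

# Tiny made-up corpus: 3 documents over a 4-word vocabulary.
counts = [[4, 2, 0, 0],
          [0, 0, 3, 5],
          [2, 2, 2, 2]]
p_c_d, p_w_c = plsa_em(counts, k=2)
```

After fitting, each row of p_c_d is a document-topic distribution and each row of p_w_c is a topic-term distribution, so every row sums to 1.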
Drawbacks of PLSA: PLSA has a large number of parameters (Mk + kN), which grows linearly with the number of documents M. Although estimating these parameters is not impossible, it is computationally very expensive.
For example, if you have 10,000 documents (say Wikipedia articles), 20 topics, and each document has on average 1,500 words, the number of parameters you need to estimate is 10,000 x 20 + 20 x 1,500 = 230,000.
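The same arithmetic in code:

```python
# Parameter count for the example above: M documents, k topics, N terms.
M, k, N = 10_000, 20, 1_500
params = M * k + k * N   # document-topic plus topic-term parameters
print(params)  # 230000
```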
We will cover the next part of topic modelling, LDA, along with a practical session on using it. Please leave your comments and feedback.