Natural Language Processing (NLP) has gained wide acceptance in recent years: it is no longer limited to academia, and many live projects now rely on it. Use cases span industries and include understanding customers’ issues, predicting the next word a user is about to type on a keyboard, and automatic text summarization. Researchers around the world have trained NLP models in many human languages, such as English, Spanish, French, and Mandarin, so that the benefits of NLP can reach every society. In this post we will talk about one of the most useful NLP metrics, pointwise mutual information (PMI), which identifies words that tend to go together, along with its implementation in Python and R.
What is Pointwise mutual information?
PMI helps us find related words. In other words, it measures how much more likely two words are to co-occur than we would expect by chance. For example, the phrase “Data Science” has a specific meaning when the two words “Data” and “Science” appear together; otherwise, the meanings of the two words are independent. Similarly, “Great Britain” is a meaningful phrase, whereas “Great” combined with other place names, as in “Great UK”, “Great London”, or “Great Dubai”, is not.
When words ‘w1’ and ‘w2’ are independent, their joint probability is equal to the product of their individual probabilities. PMI is defined as PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ). When this formula returns 0, the numerator and denominator are equal, and the log of 1 is 0. Put simply, the two words together have no specific meaning or relevance. So what are we trying to achieve here? We are looking for words that have a high joint probability with another word but a relatively low probability of occurrence when the words are considered separately. Such a word pair has a specific meaning.
Our objective is to find pairs of words that have high pointwise mutual information.
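To make the independence case concrete, here is a minimal Python sketch of the PMI calculation. The function name `pmi`, the choice of log base 2, and the probability values are assumptions made for illustration; they do not come from any real corpus.

```python
import math

def pmi(p_joint, p_w1, p_w2):
    """Pointwise mutual information: log2(P(w1, w2) / (P(w1) * P(w2))).

    Base 2 is an assumption here; another base only rescales the scores.
    """
    return math.log2(p_joint / (p_w1 * p_w2))

# Hypothetical probabilities, chosen only to illustrate the two cases.
p_w1, p_w2 = 0.25, 0.5

# Independent words: the joint probability equals the product, so PMI is 0.
independent = pmi(p_w1 * p_w2, p_w1, p_w2)

# Associated words: they co-occur twice as often as chance would predict.
associated = pmi(2 * p_w1 * p_w2, p_w1, p_w2)

print(independent, associated)
```

A score above 0 means the pair appears together more often than independence would predict, and the larger the score, the stronger the association.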
Steps to compute PMI
Let’s understand this with an example. Suppose you have the following text and you are asked to calculate PMI scores based on it:
this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence
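One possible way to score every adjacent word pair in this text is sketched below in Python. The estimation choices are assumptions rather than the only option: tokens are split on whitespace, P(w) is estimated as the word count divided by the number of tokens, P(w1, w2) as the bigram count divided by the number of bigrams, and the log is taken in base 2. The helper name `bigram_pmi` is hypothetical.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")

# Assumption: simple whitespace tokenization, no lowercasing or cleanup.
tokens = text.split()
bigrams = list(zip(tokens, tokens[1:]))  # adjacent word pairs

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)
n_tokens = len(tokens)
n_bigrams = len(bigrams)

def bigram_pmi(w1, w2):
    """PMI of an observed bigram, using relative frequencies as probabilities."""
    p_joint = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_tokens
    p_w2 = unigram_counts[w2] / n_tokens
    return math.log2(p_joint / (p_w1 * p_w2))

# Score only the bigrams that actually occur (unseen pairs would give log(0)).
scores = {pair: bigram_pmi(*pair) for pair in bigram_counts}

# Print pairs from highest to lowest PMI, with their raw counts.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, bigram_counts[pair], round(score, 3))
```

With these estimates, pairs made of words that each occur only once (such as "this is") end up at the top of the list. PMI is known to favor rare words in this way, so in practice a minimum frequency threshold is often applied before ranking.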