
Unsupervised Sentiment Analysis Using Python



Recently, the company I work for saw a huge churn in customers due to some user experience issues.
We therefore decided to collect feedback from all our customers and analyse their sentiments. The problem is, we do not have any past labelled data to train a model on and predict the sentiment of the current feedback.

Most of the analysis available over the web concentrates on supervised sentiment analysis. Today we will check out unsupervised sentiment analysis using Python.


As we all know, supervised analysis involves building a trained model and then predicting the sentiments. This needs a considerable amount of data to cover all the possible customer sentiments.

In the real corporate world, most sentiment analysis will be unsupervised. Today we shall discuss one module named VADER (Valence Aware Dictionary and sEntiment Reasoner) which helps us achieve exactly this.


1. Let's Begin With Action

VADER is an NLTK module that provides sentiment scores based on the words used. VADER is intelligent enough to understand negation, as in "I love you" vs "I don't love you", and it also picks up the extra emotion in "wow!!!!" vs "wow", where the "!" adds to the intensity.
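As a quick illustration, here is a minimal sketch of both effects (it assumes the VADER lexicon has already been downloaded, as shown in Step 0 below):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# Negation flips the sign of the compound score
print(sid.polarity_scores("I love you"))        # positive compound
print(sid.polarity_scores("I don't love you"))  # negative compound

# Exclamation marks amplify the intensity
print(sid.polarity_scores("wow"))      # positive compound
print(sid.polarity_scores("wow!!!!"))  # noticeably higher positive compound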


Step 0 : Before we begin, let's download the dataset to be used from HERE. Also, besides NLTK we need to download the VADER lexicon files as shown below.

import nltk  # if not present: pip install nltk

'''
This is how we download the VADER lexicon files.
'''
nltk.download('vader_lexicon')


Step 1 : Next we shall read the file into a pandas DataFrame.

import numpy as np
import pandas as pd

df = pd.read_csv('moviereviews.tsv', sep='\t')
df.head()

[Sample output of df.head()]


Step 2 : Next we handle null values & empty strings.

# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)

blanks = []  # start with an empty list

for i, lb, rv in df.itertuples():    # each row unpacks as (index, label, review)
    if type(rv) == str:              # make sure the review is a string
        if rv.isspace():             # test 'review' for whitespace-only content
            blanks.append(i)         # add matching index numbers to the list

df.drop(blanks, inplace=True)
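As a side note, the same filtering can be done in a single vectorized line; a minimal sketch, assuming the 'review' column holds only strings after dropna:

# keep rows whose review is not whitespace-only
df = df[~df['review'].str.isspace()]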


Step 3 : Import SentimentIntensityAnalyzer and create an object for future use.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()


Step 4 : Let's get into the real action.

# use the inbuilt sid.polarity_scores to extract scores; read on for the results
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

# break the dict generated above and pull only the 'compound' key
df['compound'] = df['scores'].apply(lambda s: s['compound'])

# the compound score ranges from -1 to 1
df['comp_score'] = df['compound'].apply(lambda n: 'pos' if n >= 0 else 'neg')

df.head()

[Sample output with the scores, compound and comp_score columns]

Note 1 : The function sid.polarity_scores returns 4 elements :

  • neg : negative sentiment score.
  • neu : neutral sentiment score.
  • pos : positive sentiment score.
  • compound : computed by normalising the three scores above.

Note 2 :

  • A negative compound value signifies negative sentiment.
  • A positive compound value signifies positive sentiment.
  • A compound value around zero signifies neutral sentiment.
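If you also want to surface neutral reviews, the VADER documentation suggests a small dead zone of ±0.05 around zero. Here is a minimal sketch of a three-way split; note that the movie review dataset here only has pos/neg labels, so the two-way split above is what we score against:

def label_sentiment(compound, neutral_band=0.05):
    # thresholds suggested in the VADER documentation
    if compound >= neutral_band:
        return 'pos'
    if compound <= -neutral_band:
        return 'neg'
    return 'neu'

df['comp_score_3way'] = df['compound'].apply(label_sentiment)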


Step 5 : Verify the accuracy using a confusion matrix & classification report.


from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy_score(df['label'], df['comp_score'])
# 0.6367389060887513

print(classification_report(df['label'], df['comp_score']))

print(confusion_matrix(df['label'], df['comp_score']))

[Classification report and confusion matrix output]


2. Scope For Improvement

We see the results aren't very impressive yet. A few of the workarounds we can try to get better results are listed below (see the sketch after this list):

  • Try to cleanse the text better.
  • Try using stemming and lemmatization and see if it makes a difference.
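Here is a minimal sketch of both ideas; the cleanse helper and its regexes are my own illustration, not part of VADER. Note that VADER reads capitalisation and punctuation as intensity cues, so over-aggressive cleaning can actually hurt:

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemmatizer data; newer NLTK versions may also need 'omw-1.4'

lemmatizer = WordNetLemmatizer()

def cleanse(text):
    # hypothetical helper: strip stray HTML tags, collapse whitespace,
    # then lemmatize each token (case and punctuation are kept for VADER)
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())

df['review'] = df['review'].apply(cleanse)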


3. Complete Code

import numpy as np
import pandas as pd
import nltk  # if not present: pip install nltk

nltk.download('vader_lexicon')  # needed on the first run

df = pd.read_csv('moviereviews.tsv', sep='\t')
df.head()

# REMOVE NaN VALUES AND EMPTY STRINGS:
df.dropna(inplace=True)


blanks = []

for i, lb, rv in df.itertuples():    # each row unpacks as (index, label, review)
    if type(rv) == str:              # make sure the review is a string
        if rv.isspace():             # test 'review' for whitespace-only content
            blanks.append(i)         # add matching index numbers to the list

df.drop(blanks, inplace=True)


from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()


# use the inbuilt sid.polarity_scores to extract scores; read on for the results
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

# break the dict generated above and pull only the 'compound' key
df['compound'] = df['scores'].apply(lambda s: s['compound'])

# the compound score ranges from -1 to 1
df['comp_score'] = df['compound'].apply(lambda n: 'pos' if n >= 0 else 'neg')


from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy_score(df['label'], df['comp_score'])
# 0.6367389060887513

print(classification_report(df['label'], df['comp_score']))

print(confusion_matrix(df['label'], df['comp_score']))


4. Final Say


The module VADER produces some amazing results if the data is clean enough.
Besides that, the main limitation I observed is that VADER is very poor at identifying sentences with a mix of positive and negative sentiments. VADER is also bad at identifying sarcasm 🙂
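A quick illustration of the mixed-sentiment weakness (the example sentence is my own; exact scores depend on the lexicon version):

print(sid.polarity_scores("The cast was brilliant but the plot was a complete mess"))
# the positive and negative cues largely cancel out in the compound score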

Thanks for reading!
