Are you happy or sad?

Sentiment analysis using Natural Language Processing in Python.

Hussain Burhani
2 min read · Mar 21, 2021

TL;DR: the GitHub repo for this project is here :-)

Background
The other day my friend Sandy from class complimented me on handling DSI with calm and composure. Though I took a moment to pause before thanking her, it did make me wonder whether I have just been good at feigning it.

In reality, my week is a roller-coaster ride. It starts on a high note when I see all the lovely faces in class, a mid-week panic attack slings me into a trough, an early-morning swim kicks in a healthy dose of endorphins, and then I slump at the end of the week as I am wrangled away from my happy place.

I would much rather modulate my week so the troughs aren’t so low and perhaps the peaks not so high. Serendipitously, last week I discovered the happy and sad subreddits, which are well suited to tempering my mood through the week. It had been working well, but it got me thinking about how I could extend this modulation to other aspects of my life.

If I am sending off a text, writing a note to my long-lost love, or drafting an important email at the end of a tiring day, it would be healthy to have what I have written cross-checked through some sentiment analysis. So, with that in mind, I went about building a natural language processing classifier trained on the rather rich corpus of verbiage from the happy and sad subreddits.

Specifically, I am building a Natural Language Processing-based model which learns from information contained within the ‘happy’ and ‘sad’ subreddits, so that when unseen text is entered, the tool can ascertain whether that text has a happy or sad sentiment to it.

Data science problem
Given some unseen text, classify its sentiment as either ‘happy’ or ‘sad’.
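
To make the problem statement concrete, here is a minimal sketch of a two-class text classifier. It is not the model from this project: it is a small bag-of-words naive Bayes written with the standard library only, and the class name `NaiveBayesSentiment`, the `tokenize` helper, and the four toy training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # documents seen per label
        self.word_counts = {}               # label -> Counter of word frequencies
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(tokenize(text))
        # Vocabulary is every word seen under any label.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log prior of the label.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log(
                        (counts[word] + 1) / (total_words + len(self.vocab))
                    )
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy data standing in for the subreddit corpus.
train_texts = [
    "I got the job and I am so happy today",
    "what a wonderful sunny morning with friends",
    "I feel so alone and sad tonight",
    "I miss her and cannot stop crying",
]
train_labels = ["happy", "happy", "sad", "sad"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("such a happy morning"))   # happy
print(model.predict("I feel sad and alone"))   # sad
```

With a corpus of real size, a library pipeline would be the more usual choice; the sketch only shows the shape of the task, with labelled text going in and a ‘happy’ or ‘sad’ label coming out for unseen text.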

Gathering the data
I collated the Reddit posts through an existing API called Pushshift, using a wrapper on top of it called psaw. Essentially, the API adds structure to gathering data from Reddit, and the wrapper further streamlines the process.
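
A collection step of this kind might look roughly like the sketch below. It assumes psaw is installed and the Pushshift service is reachable; the function names `epoch_for_year` and `fetch_texts`, the row layout, and the `limit` default are illustrative choices, not the code from the repo.

```python
import datetime as dt


def epoch_for_year(year):
    """Unix timestamp (UTC) for midnight on 1 January of the given year."""
    return int(dt.datetime(year, 1, 1, tzinfo=dt.timezone.utc).timestamp())


def fetch_texts(subreddit, start_year=2010, limit=1000):
    """Collect submission titles and comment bodies from one subreddit.

    Hypothetical helper: returns a list of dicts with the keys
    'subreddit', 'kind' and 'text'.
    """
    # Imported inside the function so the date helper works without psaw.
    from psaw import PushshiftAPI

    api = PushshiftAPI()
    after = epoch_for_year(start_year)
    rows = []

    # Submission titles posted after the start date.
    for post in api.search_submissions(
        subreddit=subreddit, after=after, filter=["title"], limit=limit
    ):
        rows.append(
            {"subreddit": subreddit, "kind": "submission",
             "text": post.d_.get("title", "")}
        )

    # Comment bodies posted after the start date.
    for comment in api.search_comments(
        subreddit=subreddit, after=after, filter=["body"], limit=limit
    ):
        rows.append(
            {"subreddit": subreddit, "kind": "comment",
             "text": comment.d_.get("body", "")}
        )

    return rows
```

Calling `fetch_texts("happy")` and `fetch_texts("sad")` would then give two labelled lists of text, one per subreddit.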

I rather arbitrarily chose data from 2010 onward for each of the ‘happy’ and ‘sad’ subreddits, and collected both submission titles and comments. The corpus comprised more than 300,000 observations. The data was stored in CSV files and then read back into a dataframe as input for training the classification algorithm.
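
The store-then-reload round trip can be sketched as follows. This version sticks to the standard library `csv` module and reads the rows back as a list of dicts; `pandas.read_csv(path)` would load the same file into a dataframe. The column names in `FIELDS` and the helper names are assumptions for illustration.

```python
import csv

# Hypothetical column layout for one observation.
FIELDS = ["subreddit", "kind", "text"]


def save_rows(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV file back into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Opening the file with `newline=""` lets the `csv` module handle commas and line breaks inside post text, which Reddit comments contain often.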

Exploring and transforming the data
