project / March 8th, 2024
This is an individual final project for the course ENEL 645 (Data Mining and Deep Learning), in which students perform a sentiment analysis on IMDB movie reviews to figure out if a comment is positive or negative. Students can create a model using either traditional classification models or deep learning algorithms to predict the number of positive and negative reviews.
The final results must meet the following requirements:
The training accuracy should not be less than 87%.
To prevent overfitting the model, the difference between training and test accuracy should not exceed 2.5%.
The model input cannot contain more than 10000 words or features.
Students are required to download and use the dataset of 50K movie reviews for binary sentiment classification from https://ai.stanford.edu/~amaas/data/sentiment. A report is written to explain the reasoning behind the choice of model, the code used, and the results.
Sentiment analysis is the process of classifying the sentiment expressed in text as positive or negative. It is useful to businesses because it helps them understand the opinions and behaviors of their customers. By analyzing the sentiment behind reviews or even social media conversations, a business can make quicker, more accurate decisions with less uncertainty.
Through this project, I wanted to apply what I learned about deep learning and neural networks using the Keras Python interface for the TensorFlow library.
Python
TensorFlow
Keras
Deep Learning and Neural Networks
Matplotlib
Seaborn
Because I wanted to build more experience with deep learning, I decided to build a neural network model for classifying the IMDB reviews.
To start, I download the dataset of 50000 IMDB movie reviews from the link given in the assignment document (https://ai.stanford.edu/~amaas/data/sentiment/). The dataset contains a set of 25000 movie reviews for testing and 25000 reviews for training. The folder that contains this data is called ‘aclImdb’, and I place this folder next to the Python notebook that will contain my code.
Inside the ‘aclImdb’ folder, there are ‘train’ and ‘test’ folders which contain 25000 reviews each.
In the Python notebook, I import the libraries I feel that I may need:
To start, I create a dictionary called data_dict which contains two keys ‘Review’ and ‘Sentiment’ that have empty arrays as values. These two keys will be the column names of the DataFrame I want to create.
Next, I create a variable ‘positive_train_dir’ that stores the path of the folder where the positive reviews for training are stored. Then, I use the listdir method from the os module to create a list of all the text files that are in this folder. I loop through each text file in the list, create a file object, and read the file object in order to get the text for that review. I then append the review text to the array in the ‘Review’ key and append a ‘Positive’ string to the array in the ‘Sentiment’ key to indicate that the review was positive. At this point, there are 12500 reviews stored in my data_dict.
The next step is to repeat the same process for the negative reviews for training and add those reviews and sentiments to the data_dict. After this step, there are 25000 reviews stored in my data_dict, all of it training data so far.
After all the training reviews have been added to data_dict, I repeat the same process for the testing reviews. I start by adding the positive reviews for testing to data_dict, followed by the negative reviews for testing. After this process is complete, I have all 50000 reviews in my data_dict – the first 25000 are training reviews, the latter 25000 are testing reviews.
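The loading steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's original code: the helper name add_reviews is hypothetical, the files are sorted for a reproducible order (an addition here), and a tiny throwaway folder tree stands in for the real dataset folder so the example runs on its own.

```python
import os
import tempfile


def add_reviews(data_dict, folder, sentiment):
    """Append every .txt review in `folder` to data_dict with the given label."""
    # sorted() is an addition for reproducibility; os.listdir alone has no fixed order.
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith(".txt"):
            continue
        with open(os.path.join(folder, file_name), encoding="utf-8") as f:
            data_dict["Review"].append(f.read())
        data_dict["Sentiment"].append(sentiment)


# Stand-in folder tree that mimics the dataset layout (train/pos, train/neg,
# test/pos, test/neg). Replace `root` with the real dataset folder.
root = tempfile.mkdtemp()
samples = {"pos": "A wonderful film.", "neg": "A dull film."}
for split in ("train", "test"):
    for label, text in samples.items():
        os.makedirs(os.path.join(root, split, label))
        path = os.path.join(root, split, label, "0_1.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

data_dict = {"Review": [], "Sentiment": []}
# Same order as in the write-up: training reviews first, then testing reviews,
# with positive reviews before negative ones in each split.
for split in ("train", "test"):
    add_reviews(data_dict, os.path.join(root, split, "pos"), "Positive")
    add_reviews(data_dict, os.path.join(root, split, "neg"), "Negative")
```

Keeping the training rows first matters later, because the feature matrix is split back into training and testing parts purely by row position.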
Next, I convert data_dict to a pandas DataFrame.
The next step is to perform some exploratory data analysis using the describe DataFrame method. As expected, there are 50000 reviews and sentiments in my DataFrame.
I also check the value counts of each sentiment. As expected, there are 25000 positive sentiments and 25000 negative sentiments in the DataFrame.
I also confirm that there are no null values in the Review or Sentiment columns, and I check the shape of the DataFrame.
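The checks described above could look roughly like this. It is a sketch on a four-row stand-in DataFrame, not the real 50000 reviews, so the counts in the comments refer to the toy data only.

```python
import pandas as pd

# Toy stand-in for data_dict; the real one holds 50000 reviews.
data_dict = {
    "Review": ["Great movie", "Terrible plot", "Loved it", "Boring"],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative"],
}
df = pd.DataFrame(data_dict)

summary = df.describe()                  # count, unique, top, freq per column
counts = df["Sentiment"].value_counts()  # reviews per sentiment
null_counts = df.isnull().sum()          # nulls per column
shape = df.shape                         # (rows, columns)

print(summary)
print(counts)
print(null_counts)
print(shape)
```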
The next step is to preprocess each review, starting by removing special characters. This will help us vectorize the text inputs later without having to worry about special characters.
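One way to strip special characters is a regular expression. The exact pattern used in the notebook is not shown here, so the pattern and the function name below are assumptions for illustration: everything that is not a letter, digit, or whitespace is replaced with a space.

```python
import re


def remove_special_characters(text):
    # Assumed pattern: replace anything that is not a letter, digit,
    # or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the runs of whitespace that the replacement leaves behind.
    return re.sub(r"\s+", " ", text).strip()


cleaned = remove_special_characters("Great movie!!! 10/10 -- would watch again.")
print(cleaned)  # Great movie 10 10 would watch again
```

Replacing with a space rather than an empty string keeps words on either side of a symbol (such as "10/10") from being glued together.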
Next, we use the WordNetLemmatizer object from the Natural Language Toolkit (NLTK) to lemmatize each word in each review. This transforms the inflected forms of a word to its base version; for example, the words ‘walk’, ‘walked’, ‘walks’, and ‘walking’ can all be reduced to the base form (or lemma) ‘walk’. Note that WordNetLemmatizer treats each word as a noun by default, so verb forms such as ‘walked’ and ‘walking’ are only reduced when the verb part-of-speech tag (pos='v') is passed.
Once this is completed, we convert each word in each review to lower case. This is done so that the upper- and lower-case versions of the same word are treated as one feature during vectorization, and no duplicate words appear in the feature matrix.
The next step is to instantiate a CountVectorizer object that will convert each word or text feature to a numerical feature in our features matrix. Because there is a project requirement that we can only use 10000 words in our model input, I also add the parameters min_df=0.00102 and max_df=0.7. This means that CountVectorizer will ignore rare words that appear in fewer than 0.102% of reviews, and also ignore common words that appear in more than 70% of reviews. Applying these parameters results in a features matrix X_count_vectorized that has 9889 words/features, a little less than the 10000-word limit.
Once this is complete, I split X_count_vectorized back into X_train_count_vectorized which has the first 25000 rows of training data and X_test_count_vectorized which has the last 25000 rows of testing data. I check the shape of each of these DataFrames to ensure the size of each is 25000 rows and 9889 columns.
Next, I apply the same logic to the Sentiment column of the original DataFrame to create y_train and y_test. I then use LabelEncoder on both y_train and y_test to encode the positive and negative sentiments to numerical values, which creates y_train_enc and y_test_enc.
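A minimal sketch of the positional split and label encoding, using four stand-in rows instead of 50000. One assumption to flag: here the encoder is fitted on the training labels and reused for the testing labels, which guarantees the same mapping for both; the notebook may have applied LabelEncoder to each set separately.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in data: 4 rows, with the first half treated as training data.
X = np.arange(8).reshape(4, 2)
sentiments = np.array(["Positive", "Negative", "Positive", "Negative"])
n_train = 2  # 25000 on the real dataset

# Split by row position, relying on training rows having been loaded first.
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = sentiments[:n_train], sentiments[n_train:]

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)  # learn the label-to-integer mapping
y_test_enc = encoder.transform(y_test)        # reuse the same mapping

print(list(encoder.classes_))  # ['Negative', 'Positive'], so Negative=0, Positive=1
```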
Next, I decide to use a Neural Network model over a traditional classification model because I wanted more practice with Neural Networks and ENEL 645 emphasizes deep learning.
I instantiate a neural network model with 6 layers using the Keras Python interface for the TensorFlow library. Because this is a binary classification problem in which Sentiment can take only two possible values, I specify the last layer as a softmax layer with 2 neurons. For all other layers, I use the standard ReLU activation because I wanted my hidden layers to extract as many features as possible and did not want to limit their output too much. Because this is a classification problem with integer-encoded labels, I chose Sparse Categorical Crossentropy as the loss function. Eventually I arrive at a model with 6 layers that starts with 512 neurons and narrows down to 2 neurons for the output, trained with a learning rate of 0.001. I also use Adam as the optimizer because it is a widely used default for neural networks, is computationally efficient, has modest memory requirements, and is well suited to problems with large amounts of data.
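A model along these lines could be defined as shown below. Only the first layer (512 neurons), the output layer (2 neurons, softmax), the layer count, the loss, the optimizer, and the learning rate come from the description above. The intermediate layer sizes (256, 128, 64, 32) and the helper name build_model are assumptions made for this sketch.

```python
from tensorflow import keras


def build_model(n_features, hidden_units=(512, 256, 128, 64, 32), learning_rate=0.001):
    """Dense network that narrows from the first hidden layer to 2 outputs."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    # Hidden layers with ReLU; sizes after the first one are assumed.
    for units in hidden_units:
        model.add(keras.layers.Dense(units, activation="relu"))
    # Two output neurons with softmax: one probability per sentiment class.
    model.add(keras.layers.Dense(2, activation="softmax"))
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",  # labels are integers 0 and 1
        metrics=["accuracy"],
    )
    return model


model = build_model(n_features=9889)  # 9889 words kept by the vectorizer
model.summary()
```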
Next, I fit the neural network to the training data using X_train_count_vectorized and y_train_enc. I limit the number of epochs to 30 to start.
Here, we have a training accuracy score of 0.9990, or 99.9%, which is far above the required training accuracy of 87%. I also found that training for more than 5 epochs did not substantially improve the accuracy score.
Once the model is trained, I use it to evaluate the testing accuracy score on the testing data X_test_count_vectorized and y_test_enc. However, we have a problem: the testing accuracy score is 86.16%, which is 13.74 percentage points below the training accuracy score and far beyond the allowed difference of 2.5%. My model is overfitting the data!
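The fit-then-evaluate workflow can be sketched as follows. To keep the example small and self-contained it uses random stand-in data and a two-layer model, so the accuracies it prints are meaningless; only the sequence of calls mirrors the steps described above.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
# Stand-in data (NOT the IMDB features): 200 training and 100 testing rows
# with 50 count-like features each and random 0/1 labels.
X_train = rng.integers(0, 3, size=(200, 50)).astype("float32")
y_train = rng.integers(0, 2, size=200)
X_test = rng.integers(0, 3, size=(100, 50)).astype("float32")
y_test = rng.integers(0, 2, size=100)

model = keras.Sequential([
    keras.Input(shape=(50,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train, then read the training accuracy reported for the last epoch.
history = model.fit(X_train, y_train, epochs=3, verbose=0)
train_accuracy = history.history["accuracy"][-1]

# Evaluate on the held-out rows and compare the two scores.
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
gap = train_accuracy - test_accuracy
print(f"train={train_accuracy:.4f} test={test_accuracy:.4f} gap={gap:.4f}")
```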
To fix this, I do a few things to reduce the complexity of my model:
I reduce the number of layers from 6 to 5.
I reduce the number of neurons in my first layer from 512 to 256.
I apply L2 regularization with a regularization factor of 0.05 to the first layer. This shrinks the weights in that layer by adding the sum of their squared magnitudes, scaled by the regularization factor, as a penalty term to the loss function.
I reduce the number of epochs from 30 to 15.
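The changes listed above could translate into a model like the following sketch. The first layer (256 neurons, L2 factor 0.05), the layer count, and the epoch count come from the list; the sizes of the middle layers (128, 64, 32) are assumptions, as is keeping the optimizer and loss from the first model.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(9889,)),
    # First layer: 256 neurons with an L2 penalty of 0.05 on its weights.
    keras.layers.Dense(
        256,
        activation="relu",
        kernel_regularizer=keras.regularizers.L2(0.05),
    ),
    # Middle layer sizes are assumed for illustration.
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

EPOCHS = 15  # down from 30; pass as epochs=EPOCHS when calling model.fit
```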
Fitting this model to the training data now gives a training accuracy of 88.08%, which is still above the specified requirement of 87%.
Once the model is trained, I use it to evaluate the testing accuracy score again. This time, the testing accuracy score is still 86.16%, but the difference between the training and testing accuracy scores is now only 1.92 percentage points. We are now within the 2.5% limit and have fulfilled the project requirements!
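The comparison against the project requirements is simple arithmetic, which the small helper below spells out for the two models trained so far. The function name check_requirements is hypothetical; the accuracy figures and the feature count are the ones reported above.

```python
def check_requirements(train_acc, test_acc, n_features,
                       min_train_acc=87.0, max_gap=2.5, max_features=10000):
    """Compare accuracy scores (in percent) against the three project limits."""
    gap = round(train_acc - test_acc, 2)  # difference in percentage points
    return {
        "train_accuracy_ok": train_acc >= min_train_acc,
        "gap": gap,
        "gap_ok": gap <= max_gap,
        "features_ok": n_features <= max_features,
    }


first_model = check_requirements(99.90, 86.16, 9889)
second_model = check_requirements(88.08, 86.16, 9889)

print(first_model)   # gap of 13.74 points: fails the 2.5% limit
print(second_model)  # gap of 1.92 points: all three checks pass
```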
Just for fun, I continue to experiment by using both L1 and L2 regularization instead of just L2 regularization to see if I can further reduce the difference between the testing and training accuracy scores:
This leads us to our final result.
Applying an L1 regularization factor of 0.001 and an L2 regularization factor of 0.025 further reduces the difference between the training accuracy score (88.42%) and the testing accuracy score (87.04%). Both scores increase slightly over those of the previous model, while the difference between them drops from 1.92 to 1.38 percentage points. I am extremely satisfied with these final results, since the training accuracy score of 88.42% is above the required 87%, and the difference between the training and testing scores is below the allowed 2.5%. By the project's criteria the model is well fitted, and I am pleasantly surprised at the results because we are using a real-world dataset from IMDB. I did not expect my neural network model to predict the sentiments of movie reviews so accurately, since this is the first time I have performed sentiment analysis.
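In Keras, a combined L1 and L2 penalty can be expressed with a single regularizer object. The sketch below uses the factors quoted above and, as an illustration, evaluates the penalty on two made-up weights; attaching it to a 256-neuron first layer is an assumption carried over from the previous model.

```python
import tensorflow as tf
from tensorflow import keras

# Combined penalty: 0.001 * sum(|w|) + 0.025 * sum(w^2)
regularizer = keras.regularizers.L1L2(l1=0.001, l2=0.025)

# Assumed placement: the first layer, as with the earlier L2-only model.
first_layer = keras.layers.Dense(
    256, activation="relu", kernel_regularizer=regularizer
)

# Penalty for two illustrative weights, 1.0 and -2.0:
# 0.001 * (1 + 2) + 0.025 * (1 + 4) = 0.003 + 0.125 = 0.128
weights = tf.constant([1.0, -2.0])
penalty = float(regularizer(weights))
print(round(penalty, 3))
```

The L1 term pushes small weights toward exactly zero, while the L2 term discourages any single weight from growing large, which is why combining them can tighten the gap a little further.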
I also plot the testing and training accuracy scores in a line plot.
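One way to draw such a line plot with Matplotlib is sketched below. It plots the final training and testing scores of the three models reported in this write-up, which is an assumption about what to show; the original figure may instead have plotted accuracy per epoch.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt

# Final scores reported above for the three models, in percent.
models = ["Initial model", "L2 regularization", "L1 + L2 regularization"]
train_acc = [99.90, 88.08, 88.42]
test_acc = [86.16, 86.16, 87.04]

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(models, train_acc, marker="o", label="Training accuracy")
ax.plot(models, test_acc, marker="o", label="Testing accuracy")
ax.axhline(87, linestyle="--", color="gray", label="Required training accuracy")
ax.set_ylabel("Accuracy (%)")
ax.set_title("Training vs. testing accuracy")
ax.legend()
fig.tight_layout()
# plt.show() displays the figure when a GUI or notebook backend is in use.
```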
I am very happy with the final results, as I was able to comply with the requirements of the project:
The training accuracy should not be less than 87%.
To prevent overfitting the model, the difference between training and test accuracy should not exceed 2.5%.
The model input cannot contain more than 10000 words or features.
This project was the first time I applied my knowledge of deep learning and neural networks to a real problem, and it has given me confidence that I can use a similar approach on other problems while satisfying client requirements.
To improve my model, I could continue to tinker with the hyperparameters - for example, I could use an even smaller learning rate and see how it affects the difference between training and testing scores. I would also like to use this model on a different set of review data to see if the results are similar.