project / March 9th, 2024
This is an individual final project for the course ENSF 611 (Machine Learning for Software Engineers), in which students are given the freedom to choose a topic to investigate using the concepts learned in the course.
I decided to explore the following topic: "Using the clustering models learned in ENSF 611, experiment and explore how effective each of them is in clustering clues from the game show Jeopardy! based on the keywords used in each clue. If the models learned in class are ineffective, see if there are any models that do work."
Motivation: Whenever I watch the game show Jeopardy!, I have always been curious as to which categories and topics appear the most frequently in clues that appear on the show.
Looking at the fan-created archive website J! Archive, which contains a database of all the clues aired since the debut of the show in 1984, I noticed a definite shift in the writing style of the clues and category names from then to now.
For example, it would be easy to know what the category of an Opera clue is from an episode in 1984, since the category at the top is shown to be “Opera”. However, in a 2021 Opera clue example this is not as clear, since the category is shown as “Aria Grande” – this is a play on the name of the pop singer Ariana Grande and also references the opera term “Aria”, which is a self-contained opera piece for one voice. The category name hints at multiple possibilities for what the clue may be about. The clue content in the 2021 example is also less straightforward and contains trickier writing, making the answer harder to find.
The main point is that the category and content of the clues that appear on the current episodes of Jeopardy! are less obvious and more subtle. Therefore, choosing to separate clues based solely on category name may not be the most reliable way to find out which topics appear the most frequently in clues.
Instead, what if we clustered clues based on the keywords used in each clue? Would this improve the chances of discovering distinct topics for the clues?
In terms of expectations as someone who is new to the topic of Machine Learning, I definitely do not expect the clustering models I choose to produce worthwhile results, especially since the clue language is so complex and tricky. This is more of an experimental project for my own curiosity to see first-hand how effective models can be with real-world data.
Python
Scikit-learn
Pandas
Numpy
Seaborn
Plotly
To start, I acquired a dataset of all Jeopardy! clues from September 1984 to June 2021 from J! Archive. Since many people have already web-scraped the clue data and converted it to CSV files, there was no need to write any web-scraping code. September 1984 is when the current iteration of Jeopardy! started, and the most recent dataset I could find contained clues up to June 2021.
Next, I imported the libraries I would need to perform an exploratory data analysis:
Pandas allows the data in CSV files to be converted to a DataFrame that can be manipulated in Python. Numpy provides mathematical functions that can be operated on arrays and matrices. Matplotlib, Seaborn, and Plotly are all data visualization libraries.
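As a minimal sketch, the imports looked something like this (the aliases are just the conventional ones):

```python
import pandas as pd                # DataFrame handling
import numpy as np                 # numerical operations on arrays/matrices
import matplotlib.pyplot as plt    # basic plotting
import seaborn as sns              # statistical visualizations
import plotly.express as px        # interactive visualizations
```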
I performed an exploratory data analysis after converting the clue data in the CSV to a pandas DataFrame. Here are the first five rows of the clue DataFrame:
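Loading the CSV was a one-liner with pandas (the filename below is a placeholder for whichever scraped file is used):

```python
# Convert the scraped clue data into a DataFrame; the filename is a placeholder.
clues_df = pd.read_csv("jeopardy_clues_1984_2021.csv")

# Show the first five rows.
clues_df.head()
```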
Next, I drop any columns or features from the dataset that should not affect the clustering of the clues.
I also inspect the shape of the DataFrame (almost 390,000 clues!), the number of unique categories (over 47,000!), and the data types of each DataFrame column, and I visualize the 100 most common answers using Plotly.
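A rough sketch of these steps is below - the exact column names ("Show Number", "Air Date", "Round", "Value", "Answer", "Category") are assumptions based on typical J! Archive scrapes, so adjust them to match your CSV:

```python
# Drop columns that shouldn't influence clustering (column names are assumptions).
clues_df = clues_df.drop(columns=["Show Number", "Air Date", "Round", "Value"], errors="ignore")

# Shape, unique categories, and data types.
print(clues_df.shape)                   # ~390,000 clues
print(clues_df["Category"].nunique())   # 47,000+ unique categories
print(clues_df.dtypes)

# Visualize the 100 most common answers with Plotly.
top_answers = clues_df["Answer"].value_counts().head(100).reset_index()
top_answers.columns = ["Answer", "Count"]
px.bar(top_answers, x="Answer", y="Count", title="100 Most Common Jeopardy! Answers").show()
```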
From the list of the top 100 most common answers, a lot of clues appear to be heavily based around world and U.S. geography (all answers in the top 10 most common answers are about geography) and U.S. presidents, as well as famous historical figures, space, and Shakespeare. Let's see if this correlates with the most common categories.
It is interesting that Science, Literature, Word Origins, Before & After, and Sports are among the top categories, as I probably would not have guessed this based off the earlier results for the most common answers. I expected World and U.S. Geography and World History to be at the very top of this list, based off the most common answers. This also hints at a discrepancy between the clue content and the clue category that I talked about at the start of this project.
Just for fun: let's look at the most common questions to see if there are any repeated clues.
If we look at the most common questions, we can see questions like [flag], [State outline], [theme music], [sports logo], [instrumental], or [music plays] - these are most likely audio or video clues instead of written clues.
We should drop all rows with these audio or video clues. This leads us to the next step: Data Cleaning.
First, we check for null values in our dataset:
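This check is a single line:

```python
# Count missing values in every column.
clues_df.isnull().sum()
```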
There are no null values in our dataset!
Next, we drop all pure audio or video clues in our dataset that have questions like [flag], [State outline], [theme music], [sports logo], [instrumental], or [music plays].
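Something along these lines works (the "Question" column name is an assumption):

```python
# Bracketed placeholders like [flag] or [theme music] mark pure audio/video clues.
av_markers = ["[flag]", "[State outline]", "[theme music]", "[sports logo]",
              "[instrumental]", "[music plays]"]

# Keep only the rows whose question text is not one of these placeholders.
clues_df = clues_df[~clues_df["Question"].isin(av_markers)].reset_index(drop=True)
print(clues_df.shape)
```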
Compared to the 389445 rows found earlier, there are now 389416 rows in clues_df.
Dropping all rows with pure audio/video clues cut 29 clues from our dataset.
With 389416 questions and 47712 'unique' categories (although some of these categories like U.S. History, History, and World History are essentially the same), let's see how effective the clustering models learned in ENSF 611 are at grouping the clues based off their keywords.
First, let's add another column in our clues_df called 'Processed Question' that will be in lower case letters and have special characters removed, which will help us implement a bag-of-words representation later.
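A minimal sketch of the preprocessing, using a simple regular expression (the exact cleaning rules I used may differ slightly):

```python
import re

def preprocess(text):
    """Lower-case the clue and strip out anything that isn't a letter, digit, or space."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

clues_df["Processed Question"] = clues_df["Question"].apply(preprocess)
```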
Next, let's check how some of the processed questions look compared to the original version:
I'm pretty satisfied with how the questions have been cleaned. Special characters like quotation marks, commas, and backslashes have been removed, and all words have been converted to lower case.
Next, let's convert the text features in the "Processed Question" column to numerical features based on how often they appear in the clue set using CountVectorizer. Let's ignore words that appear in fewer than 10% of the clues, since these words appear too infrequently and are unlikely to help our models.
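In scikit-learn terms, the 10% threshold translates to min_df=0.1 (my interpretation of the cutoff):

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=0.1 ignores any word that appears in fewer than 10% of the clues.
count_vect = CountVectorizer(min_df=0.1)
count_matrix = count_vect.fit_transform(clues_df["Processed Question"])

print(count_matrix.shape)
print(count_vect.get_feature_names_out())
```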
Before I can consider using this dataset, I should reduce its dimensions and keep only the most important features. How many components should I use? This can be determined by looking at the cumulative explained variance ratio as a function of the number of components:
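A sketch of that curve, fitting PCA on the (small) dense version of the count matrix:

```python
from sklearn.decomposition import PCA

# With min_df=0.1 the vocabulary is tiny, so converting to a dense array is cheap.
pca_full = PCA().fit(count_matrix.toarray())

plt.plot(np.cumsum(pca_full.explained_variance_ratio_), marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```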
How many components are needed to describe the data? From the curve, approximately 90% of the total variance is contained within the first 6 components. The first 4 components contain approximately 75% of the variance, while 9 components are needed to describe close to 100% of the variance.
For now, I will choose 2 components to avoid overfitting the data and to avoid long computational times. Let's visualize the CountVectorized data after running it through PCA:
Let's put this 2D representation into a DataFrame:
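Roughly, with the "PC1"/"PC2" column names being my own choice:

```python
# Project the count matrix onto its first two principal components.
pca_2d = PCA(n_components=2)
count_vectorized_2D = pca_2d.fit_transform(count_matrix.toarray())

# Put the 2D representation into a DataFrame.
count_vectorized_2D_df = pd.DataFrame(count_vectorized_2D, columns=["PC1", "PC2"])

# Quick scatter plot of the projected clues.
plt.scatter(count_vectorized_2D_df["PC1"], count_vectorized_2D_df["PC2"], s=1, alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```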
Before I consider running my CountVectorized 2D dataset through the different clustering models, let's also consider an alternative way of re-scaling features: Term Frequency–Inverse Document Frequency (TF-IDF). This method gives a high weight to a term that appears often in a particular document but rarely across the rest of the corpus - such a term is likely to be very descriptive of the content of that document.
Let's see how different my dataset would look if I used TfidfVectorizer and PCA. Because my computer cannot use all 389416 samples from the original dataset in the TfidfVectorizer, I randomly sample 5000 samples before fitting - even 10000 or 20000 samples were too large!
Again, I should reduce the dimensions of this matrix and keep only the most important features. Let's see how the 2D representation of this TFIDFVectorized data looks in comparison to the 2D representation of the CountVectorized data:
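A sketch of the comparison, reusing the same min_df cutoff (an assumption) and a fixed random seed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Random 5000-clue sample to stay within memory limits.
tfidf_sample = clues_df["Processed Question"].sample(n=5000, random_state=42)

tfidf_vect = TfidfVectorizer(min_df=0.1)
tfidf_matrix = tfidf_vect.fit_transform(tfidf_sample)

# 2D projection for a visual comparison with the CountVectorized data.
tfidf_2D = PCA(n_components=2).fit_transform(tfidf_matrix.toarray())
plt.scatter(tfidf_2D[:, 0], tfidf_2D[:, 1], s=2, alpha=0.4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```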
Compared to the 2D representation of my CountVectorized data, this doesn't look good - most of the points appear to be concentrated in one spot, and there isn't a wide variance being shown. I would love to try this with more data samples, but I do not have enough memory to even do 10000 samples.
Therefore, I will use the 2D CountVectorized data for my clustering. This is probably the better approach anyway, since I think the common words matter in this case and I am looking for words that are common in many documents, not just one as in TF-IDF.
Let's start with KMeans clustering. How many clusters should I use? Let's try the Elbow Method with distortion/inertia values:
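A sketch of the elbow computation (the range of cluster counts tried here is an assumption):

```python
from sklearn.cluster import KMeans

# Fit K-Means for a range of cluster counts and record the inertia (distortion).
inertias = []
cluster_range = range(2, 61, 2)
for k in cluster_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(count_vectorized_2D_df[["PC1", "PC2"]])
    inertias.append(km.inertia_)

plt.plot(list(cluster_range), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
```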
Using the Elbow method means finding the point with the maximum curvature in the above plot. Since the curve decreases at a steadier pace from 30 clusters onwards, I will choose this number for my model.
Let's grab the coordinates of our 30 cluster centers and put them into a DataFrame:
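A sketch, reusing the K-Means setup from above:

```python
# Fit K-Means with the chosen 30 clusters and collect the cluster centers.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=42).fit(count_vectorized_2D_df[["PC1", "PC2"]])
centers_df = pd.DataFrame(kmeans.cluster_centers_, columns=["PC1", "PC2"])
centers_df.head()
```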
Next, let's take the coordinates of each data sample point in count_vectorized_2D_df and find the cluster center (out of the 30 clusters created) that is closest to it.
The segment of code below is credited to https://medium.com/@williamsuh/unsupervised-learning-based-on-jeopardy-questions-part-2of-3-68c18c3490bd, whose author found a way to compare each centroid center to each coordinate using the following formula:
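I won't reproduce that formula exactly here; the sketch below captures the same idea with plain NumPy, assigning each point to its nearest centroid by Euclidean distance (in practice, kmeans.labels_ gives the same assignments directly):

```python
points = count_vectorized_2D_df[["PC1", "PC2"]].to_numpy()
centers = centers_df.to_numpy()

# Pairwise Euclidean distances, shape (n_samples, 30); this broadcast needs a few hundred MB.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# The closest centroid for each sample becomes its cluster label.
count_vectorized_2D_df["Cluster"] = distances.argmin(axis=1)
```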
Let's now visualize all of the data samples in our count_vectorized_2D_df. The different colours correspond to each of the 30 different centroids (numbered from 0-29). The red 'X' markers correspond to the center of each cluster.
Visually, it doesn't look like the model clustered the data too well - especially the outliers. K-means assumes the clusters are convex, and the data samples aren't necessarily arranged that way.
How effective was my model in clustering the dataset? Let's now look at some Validation Metrics - namely the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index:
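These three metrics are all available in scikit-learn; a sketch of how they would be computed for the K-Means labels:

```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X = count_vectorized_2D_df[["PC1", "PC2"]]
labels = count_vectorized_2D_df["Cluster"]

# silhouette_score also accepts a sample_size argument if the full pairwise computation is too heavy.
print("Silhouette score:       ", silhouette_score(X, labels))
print("Calinski-Harabasz score:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin index:   ", davies_bouldin_score(X, labels))
```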
A silhouette score of 0.779 is actually a pretty good score. A score of +1 indicates a highly dense clustering, so being over three-quarters of the way to that value is a positive sign that my model performed better than I thought.
The Calinski-Harabasz score is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters. A higher Calinski-Harabasz index relates to a model with better defined clusters - I will be able to gain more insight on whether this value of 747317.65 is good or bad once I get more Calinski-Harabasz scores from the other clustering models.
The Davies-Bouldin index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters.
0 is the lowest possible score, and values closer to zero indicate a better partition.
Based off that, our value of 0.52 is fairly close to 0 - according to this, the clusters are not very similar to one another and are well partitioned.
Let's randomly pick a cluster and manually check the questions in that cluster to see if there is any similarity between them:
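For example (the cluster number here is arbitrary, and the "Question" column name is an assumption):

```python
# Inspect one cluster and print a handful of its questions.
mask = (count_vectorized_2D_df["Cluster"] == 7).to_numpy()
clues_df.loc[mask, "Question"].sample(10, random_state=42)
```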
Upon first glance, the questions from this cluster don't look similar - the topics and official categories are quite varied. Any similarities found seem to be small ones. From this small sample, the question content might be geared more towards Sports, TV, and Movies.
Let's look at the categories of all the questions in the cluster - I'm interested to see if most of the clues in this cluster have a common theme.
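Counting the categories within the cluster is a one-liner (again assuming a "Category" column):

```python
# How often each official category appears in this cluster.
clues_df.loc[mask, "Category"].value_counts().head(20)
```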
Out of the top categories in this cluster, history seems to be the most common - with History, World History, European History, American History, and Museums all appearing in the top 20. Literature is also present in this cluster with Literature, Shakespeare, and Nonfiction appearing in the top 20.
It is encouraging to see that the questions in this cluster appear to be slightly slanted towards a couple of categories - even if it is not a significant amount of clues. Keep in mind however that there are 10993 categories, so the low numbers in the table above indicate that the categories are quite varied and there isn't a category or topic that the cluster questions are heavily geared to.
To demonstrate this point, let's visualize the top 40 categories in the cluster in a pie/donut chart below - you'll see that it's quite varied. You can hover your mouse over each section in the pie/donut to get more details.
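The donut chart can be built with Plotly Express; a sketch:

```python
# Donut chart of the 40 most common categories in the cluster.
top_cats = clues_df.loc[mask, "Category"].value_counts().head(40).reset_index()
top_cats.columns = ["Category", "Count"]
px.pie(top_cats, names="Category", values="Count", hole=0.4,
       title="Top 40 Categories in the Cluster").show()
```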
Let's see if using a GaussianMixture Model is any better.
First, let's see how many components we'll need for our Gaussian Mixture model. Let's create an array of Gaussian Mixture models, each with a different number of components from 5 to 60:
Next, let's use some techniques in the Data Science Handbook to plot the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for each Gaussian Mixture model in our array.
From Data Science Handbook Chapter 5: https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html
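A rough sketch in the spirit of that approach (the step size of 5 between component counts is my assumption):

```python
from sklearn.mixture import GaussianMixture

X = count_vectorized_2D_df[["PC1", "PC2"]].to_numpy()

# One Gaussian Mixture model per candidate component count.
n_components_range = range(5, 61, 5)
models = [GaussianMixture(n_components=n, random_state=42).fit(X) for n in n_components_range]

# AIC and BIC as a function of the number of components.
plt.plot(list(n_components_range), [m.aic(X) for m in models], label="AIC")
plt.plot(list(n_components_range), [m.bic(X) for m in models], label="BIC")
plt.xlabel("Number of components")
plt.legend()
plt.show()
```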
Here, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) plots for all the GaussianMixture models with components from 5 to 60 are the same.
I will choose the number of components to be 45, since the optimal number of clusters is the value that minimizes the AIC or BIC line plots and the line plot seems to level out from this point on.
Let's add a cluster label column next to each data sample coordinate and call it labels_df:
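Something like this (the column names again being my own):

```python
# Fit the chosen 45-component model and attach a cluster label to each 2D coordinate.
gmm = GaussianMixture(n_components=45, random_state=42).fit(X)

labels_df = count_vectorized_2D_df[["PC1", "PC2"]].copy()
labels_df["Cluster"] = gmm.predict(X)
labels_df.head()
```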
Next, we draw the Gaussian Mixture model.
Again, it doesn't look like the model clustered the data too well - some clusters are overlapping with others.
How effective was my model in clustering the dataset compared to K-means? Let's now look at some Validation Metrics - namely the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index:
A silhouette score of 0.636 is not as good as the K-Means score of 0.779, but is still a high-quality cluster since it is over 0.5.
The Calinski-Harabasz score is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters. A higher Calinski-Harabasz index relates to a model with better defined clusters - compared to the value of 747317.65 for the K-Means model, this value of 301526.41 is much smaller. This makes sense, as the K-Means clusters do seem more well defined and spaced apart compared to the GMM clusters.
The Davies-Bouldin index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters.
0 is the lowest possible score, and values closer to zero indicate a better partition.
Based off that, our value of 1.48 is not as close to 0 as the previous K-Means value of 0.52 - therefore, our clusters from Gaussian Mixture are more similar and are not partitioned as well as K-Means clusters.
Let's manually check the clues in a random cluster and see if this is true:
Upon first glance, the questions from this cluster don't look similar - the topics and official categories are quite varied. Unlike the K-Means cluster I manually checked, the questions don't seem to have an obvious common theme.
Let's look at the categories of all the questions in the cluster - I'm interested to see if most of the clues in this cluster have a common theme.
Out of the top categories in this cluster, anatomy-based categories like Anatomy, The Human Body, and The Body Human are all among the top 20 categories. 'Body' or 'Bodies' seems to be the main theme for the cluster, as the category 'Bodies of Water' is the leading category. There is also a small number of name-based categories (Explorers, Authors, U.S. Presidents) and animal-based categories (Animal, Birds) in this cluster. However, the clue count for each category is quite low overall, which indicates a very diverse set of topics for this cluster.
Let's visualize the top 40 categories in the cluster in a pie/donut chart below - you'll still see that it's quite varied, but there seems to be more of a common theme present in this cluster compared to the K-Means cluster I looked at earlier. In the notebook, you can hover your mouse over each section in the pie/donut to get more details.
Let's look at DBSCAN now.
For DBSCAN, let's arbitrarily choose an eps value of 0.1 and a min_samples value of 10 for this model. Because of memory and time constraints, let's choose 50000 random samples from our dataset for the model.
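A sketch of that setup:

```python
from sklearn.cluster import DBSCAN

# Random 50000-sample subset to keep memory and runtime manageable.
dbscan_sample = count_vectorized_2D_df[["PC1", "PC2"]].sample(n=50000, random_state=42)

dbscan = DBSCAN(eps=0.1, min_samples=10)
dbscan_labels = dbscan.fit_predict(dbscan_sample)

# Label -1 marks noise points (outliers) rather than a cluster.
print("Clusters found:", len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0))
```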
Let's now visualize this dataset:
It looks like the clusters are spaced apart nicely with minimal overlap, apart from the outliers. This probably has to do with the idea behind DBSCAN - that clusters form dense regions of data, separated by regions that are relatively empty.
How effective was my model in clustering the dataset compared to K-means and GMM? Let's now look at some Validation Metrics - namely the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index:
A silhouette score of 0.802 is a good score - slightly better than the K-Means score of 0.779 and significantly better than the GMM score of 0.636. It is important to keep in mind that this score was computed with only 50000 samples, as opposed to K-Means and GMM using the entire dataset. A score of +1 indicates a highly dense clustering, so being about four-fifths of the way to that value is a positive sign that my model performed better than I thought.
The Calinski-Harabasz score is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters. A higher Calinski-Harabasz index relates to a model with better defined clusters - compared to the value of 747317.65 for the K-Means model and a value of 301526.41 for GMM, the DBSCAN value of 30609.66 is significantly smaller. This is slightly surprising, as the clusters in the visualization above seem better defined and more spaced apart, with less overlap, compared to the visualization for GMM.
The Davies-Bouldin index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters.
0 is the lowest possible score, and values closer to zero indicate a better partition.
Based off that, our value of 1.43 is about the same as the GMM value of 1.48. Both are not as close to 0 as the previous K-Means value of 0.52 - therefore, our clusters from DBSCAN are more similar and are not partitioned as well as K-Means clusters.
Let's manually check the clues in a random cluster and see if this is true:
The questions from this cluster don't look too similar - the topics and official categories are quite varied, with bits of Music, Movies, and Politics. The only other similarity I can point to is the length of the questions - they seem shorter, more direct, and to the point.
Let's look at the categories of all the questions in the cluster - Again, I'm interested to see if most of the clues in this cluster have a common theme.
Science is the top category in this cluster, but with only 9 clues out of 1155. The categories are pretty varied here, and do not lean to a specific category.
Let's visualize the top 40 categories in the cluster in a pie/donut chart below - as expected, it's quite varied, and doesn't seem to lean to a specific category. You can hover your mouse over each section in the pie/donut to get more details.
Overall, the results for the DBSCAN model are worse than those of K-Means and Gaussian Mixture Model.
Given extra time and resources, I would perform a grid search to determine the best parameters to use, and use parallel computing or a more powerful computer in order to fit a DBSCAN model to the entire dataset.
Let's move on to Agglomerative Clustering.
For Agglomerative Clustering, let's use 10000 samples for time and memory purposes. First, we'll have to use a dendrogram to help decide how many clusters to use.
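The dendrogram can be built with SciPy; a sketch, assuming Ward linkage (the default used by scikit-learn's AgglomerativeClustering) and a placeholder height for the dashed cut line:

```python
from scipy.cluster.hierarchy import dendrogram, linkage

# Random 10000-sample subset for the hierarchical clustering.
agg_sample = count_vectorized_2D_df[["PC1", "PC2"]].sample(n=10000, random_state=42)

# Ward linkage matrix and a truncated dendrogram (only the last 30 merges are drawn).
Z = linkage(agg_sample.to_numpy(), method="ward")

plt.figure(figsize=(12, 6))
dendrogram(Z, truncate_mode="lastp", p=30)
plt.axhline(y=200, linestyle="--")  # placeholder height for the horizontal cut
plt.show()
```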
To determine the number of clusters to use, we imagine a horizontal line (the dashed line below) cutting across the section with the largest vertical lines in the dendrogram above - that is the section with the blue lines. The number of vertical lines that this dashed line intersects is the number of clusters.
The horizontal dashed line intersects 3 vertical lines, so we'll use 3 clusters.
With only three clusters, it will be interesting to see the contents of each one.
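Fitting the three-cluster model on the same subset looks something like this:

```python
from sklearn.cluster import AgglomerativeClustering

# Fit the three-cluster model on the 10000-sample subset and check the cluster sizes.
agg = AgglomerativeClustering(n_clusters=3)
agg_labels = agg.fit_predict(agg_sample)
pd.Series(agg_labels).value_counts()
```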
How effective was my model in clustering the dataset compared to K-means, GMM, and DBSCAN? Let's look at some Validation Metrics - namely the Silhouette Score, Calinski-Harabasz Score, and Davies-Bouldin Index:
A silhouette score of 0.546 is significantly less than the K-Means score of 0.779 and the DBSCAN score of 0.802, while being closer to the GMM score of 0.636. It is important to keep in mind that this score was computed with only 10000 samples, as opposed to K-Means and GMM using the entire dataset. A score of +1 indicates a highly dense clustering, so being only slightly more than halfway to that value indicates to me that the model produced slightly above-average results.
The Calinski-Harabasz score is the ratio of the sum of between-clusters dispersion and of within-cluster dispersion for all clusters. A higher Calinski-Harabasz index relates to a model with better defined clusters - compared to the value of 747317.65 for the K-Means model, the value of 301526.41 for GMM, and the DBSCAN value of 30609.66, the score of 10498.91 is significantly lower than all of these. This is not surprising, as building only three overlapping clusters means these clusters likely won't be particularly well defined.
The Davies-Bouldin index signifies the average ‘similarity’ between clusters, where the similarity is a measure that compares the distance between clusters with the size of the clusters.
0 is the lowest possible score, and values closer to zero indicate a better partition.
Based off that, our value of 0.67 falls between the DBSCAN value of 1.43, the GMM value of 1.48, and the K-Means value of 0.52 - therefore, our clusters from Agglomerative clustering are more similar than the K-Means clusters, but are still better partitioned than the DBSCAN and GMM clusters.
Manually checking questions in each cluster seems pointless when there are only three clusters, so let's go straight to visualizing the top 40 categories for two of the clusters in pie/donut charts below. You can hover your mouse over each section in the pie/donut to get more details.
The two clusters appear to be fairly similar in how varied their categories are - Sports and Science are the leading categories for both clusters, but not by much. There seems to be a little bit of everything in each cluster, and there isn't much distinguishing the two.
To recap, we have the following validation metrics for each model:
Our K-Means model produced a silhouette score of 0.779, a Calinski-Harabasz score of 747317.65, and a Davies-Bouldin index of 0.52.
Our Gaussian Mixture model produced a silhouette score of 0.636, a Calinski-Harabasz score of 301526.41, and a Davies-Bouldin index of 1.48.
Our DBSCAN model produced a silhouette score of 0.802, a Calinski-Harabasz score of 30609.66, and a Davies-Bouldin index of 1.43.
Our Agglomerative model produced a silhouette score of 0.546, a Calinski-Harabasz score of 10498.91, and a Davies-Bouldin index of 0.67.
Overall, our K-Means model has the best validation metrics - the silhouette score of 0.779 indicates highly dense clustering and is excellent, especially considering that I am using the entire dataset. The Calinski-Harabasz score of 747317.65 is by far the highest, and K-Means also has the best Davies-Bouldin index at 0.52 (closest to 0), which indicates a better partition and less similarity compared to the other models. From manually checking a cluster, it is encouraging that the cluster had questions on various types of history and literature.
Our Gaussian Mixture model produced one of the lowest silhouette scores, but at 0.636 it is still a quality result since it is over 0.5, especially considering that the entire dataset is being used. The model had the second-best Calinski-Harabasz score at 301526.41, indicating that the model has better defined clusters compared to DBSCAN and Agglomerative. The Davies-Bouldin index of 1.48 was the highest (i.e., the worst) of the four models, though, indicating that the clusters are somewhat similar to one another. From manually checking a cluster, I was glad to see that a decent number of clues were based off questions with "body" or "bodies" as a keyword, which resulted in anatomy or bodies of water categories being prominent.
The DBSCAN model produced the best silhouette score at 0.802, the third-best Calinski-Harabasz score at 30609.66, and the second-worst Davies-Bouldin index at 1.43. Keep in mind that these results were produced with a sample of 50000 questions instead of the full dataset, due to time and memory constraints. Manually, the questions from the cluster I checked did not look similar - the topics and official categories are quite varied. The only similarity I can point to is the length of the questions - they seem shorter, more direct, and to the point. The manual check of the DBSCAN cluster did not really correlate with the silhouette score results that were produced.
Finally, the Agglomerative model produced the worst results overall (despite using a sample of only 10000), as it had the worst silhouette score at 0.546 (but still a decent number since it is over 0.5), by far the worst Calinski-Harabasz score at only 10498.91, and a decent Davies-Bouldin index at 0.67. Also, the two clusters I checked looked quite similar - both were quite varied, with Sports and Science as the leading categories for both clusters, but not by much. There seems to be a little bit of everything in each cluster, and there wasn't much distinguishing the two.
Using these results, let's revisit my original question from my project proposal: Using the clustering models learned in ENSF 611, experiment and explore how effective each of them is in clustering clues from the game show Jeopardy! based on the keywords used in each clue.
Answer: Although none of the models produced particularly amazing results, I would say the K-Means and Gaussian Mixture models produced the best results overall, based off their validation metrics and the manual checks I talked about above. I am curious whether DBSCAN and Agglomerative clustering could produce better results given a larger sample.
I wanted to explore if I could cluster clues from the game show Jeopardy! based off their keywords, instead of relying solely on the clue category to determine which topics appear the most frequently in clues. I wondered if separating clues based on the words would be more reliable for finding distinct clue topics than relying on just category names, which have become more cryptic as time has passed. This was also an experimental project for my own curiosity, to see first-hand how effective the clustering models I learned in class can be with clustering real-world data.
Instead of using three versions of the clues CSV file (all clues from 1984-2021, clues from 1984-1995, and clues from 2010-2021) I just used the original version with all clues from 1984-2021 due to time constraints. The code I wrote in this file had become quite long and I didn't want to repeat it for two other datasets.
As I mentioned, the code in this notebook file had become quite long - much longer than I expected. For these reasons I did not explore other approaches like spectral clustering or Doc2Vec embeddings as I had planned - these are bonus tasks I can revisit in the future.
I had not originally planned to manually check questions and categories from random clusters for each model and visualize them in a pie/donut chart, but I thought it was a great idea instead of just looking at validation metrics.
The most difficult parts of this project included finding the appropriate parameters to use for each model - I learned about using the Elbow method to find the appropriate number of clusters for the K-Means model, plotting the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) to find the appropriate number of components for the Gaussian Mixture model, and using a dendrogram to find the appropriate number of clusters for the Agglomerative model. It was also difficult figuring out a way to compare each centroid center from the K-Means model to each data sample. A lack of memory was also an issue, as training models and computing silhouette scores took a long time. For DBSCAN and Agglomerative clustering, I had to use a small sample size because there wasn't enough memory to fit the entire dataset.
This is my first time doing a machine learning project, and I really enjoyed being able to apply the concepts I learned in class to real-world data, regardless of the results. The easiest and most enjoyable parts of this project were the data exploration, data cleaning, and data visualization aspects.
Given extra time and resources, I would perform a grid search to determine the best parameters to use, especially for DBSCAN. I would also use a more powerful computer or parallel computing in order to fit the DBSCAN and Agglomerative models to the entire dataset. I would also explore other approaches like Doc2Vec embeddings or spectral clustering.