project / March 7th, 2024
This is a collection of three assignments for the course ENEL 645 (Data Mining and Deep Learning), in which students are required to train a linear regression model, a logistic regression model, and two neural network models using the Community Crime and Disorder Statistics dataset from Open Calgary.
A central aim of ENEL 645 is to work on real-world machine learning problems, and the Community Crime and Disorder Statistics dataset from Open Calgary gives us a way to practice this.
Through these assignments, I wanted to apply what I learned about traditional machine learning models for regression and classification using Scikit-Learn, as well as deep learning and neural networks using the Keras Python interface for the TensorFlow library.
Python
Scikit-Learn
TensorFlow
Keras
Linear Regression
Logistic Regression
Deep Learning and Neural Networks
Matplotlib
Seaborn
Our goal is to train a linear regression model that predicts the number of crimes in each community center, based on different input features. We train the model using a reduced version of the Community Crime and Disorder Statistics dataset from Open Calgary that is given to us.
To start, I used the pandas read_csv method to convert the CSV containing the community crime and disorder dataset into a pandas DataFrame with the ‘Community Name’ column set as the index column of the dataset. I then used the DataFrame head() method to show the first five rows of the DataFrame, to ensure the name of the columns matches those in the CSV and that ‘Community Name’ is the index.
After confirming that the DataFrame displays the correct column names, I check the dimensions of the DataFrame to confirm the number of rows and columns. I also check if there are any null values in the DataFrame – fortunately, there are no null values.
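As a rough sketch of the loading and sanity checks described above, the steps could look like the following. The three-row CSV text and its values are made up for illustration (the real file has 100000 rows and its filename is not given here); only the column names come from the write-up.

```python
import io
import pandas as pd

# Hypothetical stand-in for the course CSV; values are invented for illustration.
csv_text = """Community Name,Sector,Group Category,Category,Month,Year,Resident Count,Crime Count
ACADIA,SOUTH,Crime,Theft From Vehicle,JAN,2019,10520,4
BELTLINE,CENTRE,Disorder,Social Disorder,FEB,2019,25129,61
BOWNESS,NORTHWEST,Disorder,Physical Disorder,MAR,2018,11150,3
"""

# index_col makes 'Community Name' the row index instead of a feature column.
df = pd.read_csv(io.StringIO(csv_text), index_col="Community Name")

print(df.head())                 # first rows, to compare column names with the CSV
print(df.shape)                  # (number of rows, number of columns)
null_counts = df.isnull().sum()  # null values per column
print(null_counts)
```

With the real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.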
Next, I have decided to keep all features (except for Community Name which is now the index) and apply one hot encoding with the get_dummies() method from pandas for feature engineering, in order to convert categorical features like Sector, Group Category, Category, and Month to numerical values so they can be used in my model to improve predictions. The new DataFrame with these converted categorical features is called df_dummies, as seen below. It has a dimension of 100000 rows and 36 columns.
Next, I print a list of all columns inside the newly created df_dummies to have a clearer idea of columns names in this new DataFrame.
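The one-hot encoding step can be sketched as follows, assuming a tiny hand-made DataFrame in place of the real data (the row values are hypothetical; the column names are the ones mentioned above).

```python
import pandas as pd

# Tiny hypothetical frame reusing the column names mentioned in the text.
df = pd.DataFrame(
    {
        "Sector": ["SOUTH", "CENTRE", "SOUTH"],
        "Group Category": ["Crime", "Disorder", "Disorder"],
        "Category": ["Theft From Vehicle", "Social Disorder", "Physical Disorder"],
        "Month": ["JAN", "FEB", "JAN"],
        "Year": [2019, 2019, 2018],
        "Resident Count": [10520, 25129, 11150],
        "Crime Count": [4, 61, 3],
    },
    index=pd.Index(["ACADIA", "BELTLINE", "BOWNESS"], name="Community Name"),
)

# Each categorical column is replaced by one indicator column per distinct value.
categorical = ["Sector", "Group Category", "Category", "Month"]
df_dummies = pd.get_dummies(df, columns=categorical)

print(df_dummies.shape)
print(list(df_dummies.columns))
```

The number of resulting columns depends on how many distinct values each categorical feature has, which is why the real dataset ends up much wider than this toy example.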
Next, I create a features matrix called X from all the columns in df_dummies except for ‘Crime Count’, which will be the output vector y. With X and y created, I use the Scikit-Learn method train_test_split to randomly split the data so that the training data is comprised of 70% of the given data and the test data is comprised of the remaining 30%, as specified in the assignment requirements. A random_state parameter of 956 is arbitrarily chosen to ensure the same random split occurs every time the code is run.
The train_test_split method will create a features matrix X_train that contains features of the training data, a features matrix X_test that contains features of the test data, an output vector y_train that contains the ‘Crime Count’ of the training data, and an output vector y_test that contains the ‘Crime Count’ of the test data.
I also preview the original features matrix X to ensure that the output column ‘Crime Count’ is not included in this DataFrame.
Next, I print the dimensions of the DataFrames formed by train_test_split to ensure that the training data contains 70% of the original 100000 rows (70000 rows) and the test data contains the remaining 30% (30000 rows). The feature matrices X_train and X_test should each hold 35 of the 36 columns in df_dummies, while the output vectors y_train and y_test should hold only one column (the output column).
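A minimal sketch of the split and the shape check, assuming a small synthetic stand-in for df_dummies (100 random rows and only three feature columns, not the real data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_dummies: 100 rows, 3 feature columns plus the target.
rng = np.random.default_rng(0)
df_dummies = pd.DataFrame(
    {
        "Resident Count": rng.integers(0, 30000, size=100),
        "Year": rng.integers(2018, 2024, size=100),
        "Sector_SOUTH": rng.integers(0, 2, size=100),
        "Crime Count": rng.integers(0, 50, size=100),
    }
)

X = df_dummies.drop(columns="Crime Count")  # every column except the target
y = df_dummies["Crime Count"]               # the output vector

# test_size=0.3 gives the 70/30 split; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=956
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The feature matrices keep all feature columns, while the output vectors are one-dimensional.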
In the code block below, I am instantiating a Linear Regression model. I fit the Linear Regression model to the training data using X_train and y_train. Once the model is trained, I use the newly-trained model to predict Crime Count values of the remaining testing data X_test. This predictions vector is called y_test_pred.
Once the prediction is calculated, I print the coefficients of the model. Finally, I also calculate and print the mean-squared error using the actual crime count values in the testing set y_test, compared to the predicted values in y_test_pred.
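The fit, predict, and evaluate steps can be sketched like this. The data here is synthetic and deliberately linear, so the error is far lower than on the crime dataset; only the sequence of calls is meant to mirror the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (not the Calgary dataset): y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=956
)

model = LinearRegression()
model.fit(X_train, y_train)           # learn coefficients from the training data
y_test_pred = model.predict(X_test)   # predictions for the held-out rows

print("Coefficients:", model.coef_)
mse = mean_squared_error(y_test, y_test_pred)
print("Test MSE:", mse)
```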
The final results are shown in the screenshot below.
The performance of my model is evaluated using the mean-squared-error cost function, which is equal to 453.71635510105625. An error of this size is expected because we are using a real-world dataset from Open Calgary. The data in the Community Crime and Disorder Statistics dataset is realistic, and a linear regression model applied to it will not accurately capture and predict the number of crimes in each community center.
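To put the mean-squared error into the same units as Crime Count, it can help to take its square root (the root-mean-squared error). This is a small worked calculation on the value reported above, not something that was part of the original assignment:

```python
import math

mse = 453.71635510105625  # test MSE reported above
rmse = math.sqrt(mse)     # back in units of crimes rather than squared crimes
print(round(rmse, 2))
```

The result is a little over 21, which gives a more intuitive sense of the typical size of the prediction error.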
Plotting the actual Crime Count output vector data points (y_test) and the predicted values (y_test_pred) for comparison further demonstrates this point, as the output vector data points do not follow the predicted values presented by the line.
In this assignment, the goal is to now train a logistic regression model that predicts the category of each crime. We also train this model using the same Community Crime and Disorder Statistics dataset from Open Calgary that was given to us in assignment 1.
To start, the pandas read_csv method is used to convert the CSV containing the community crime and disorder dataset into a pandas DataFrame with the ‘Community Name’ column set as the index column of the dataset. The DataFrame head() method is then used to show the first five rows of the DataFrame, to ensure the name of the columns matches those in the CSV and that ‘Community Name’ is the index.
After confirming that the DataFrame displays the correct column names, I check the dimensions of the DataFrame to confirm the number of rows and columns. I also check if there are any null values in the DataFrame – fortunately, there are no null values. Additionally, I also check all the unique labels in the ‘Category’ column in order to have an idea of what the output values could be.
Next, I have decided to keep all features (except for Community Name which is now the index) and apply one hot encoding with the get_dummies() method from pandas for feature engineering, in order to convert categorical features like Sector, Group Category, and Month to numerical values so they can be used in my model to improve predictions. The new DataFrame with these converted categorical features is called df_dummies, as seen below. It has a dimension of 100000 rows and 26 columns.
Next, I print a list of all columns inside the newly created df_dummies to have a clearer idea of columns names in this new DataFrame.
Next, I create a features matrix called X from all the columns in df_dummies except for ‘Category’, which will be the output vector y. With X and y created, I use the Scikit-Learn method train_test_split to randomly split the data so that the training data is comprised of 70% of the given data and the test data is comprised of the remaining 30%, as specified in the assignment requirements. A random_state parameter of 39 is arbitrarily chosen to ensure the same random split occurs every time the code is run.
The train_test_split method will create a features matrix X_train that contains features of the training data, a features matrix X_test that contains features of the test data, an output vector y_train that contains the ‘Category’ of the training data, and an output vector y_test that contains the ‘Category’ of the test data.
I also preview the original features matrix X to ensure that the output column ‘Category’ is not included in this DataFrame.
Next, I print the dimensions of the DataFrames formed by train_test_split to ensure that the training data contains 70% of the original 100000 rows (70000 rows) and the test data contains the remaining 30% (30000 rows). The feature matrices X_train and X_test should each hold 25 of the 26 columns in df_dummies, while the output vectors y_train and y_test should hold only one column (the output column).
In the code block below, I am instantiating a Logistic Regression model. Because of time and computational constraints, I try only three values for the max_iter parameter (2000, 3000, and 4000) to see how many iterations the solver needs to converge. I fit the Logistic Regression model to the training data using X_train and y_train. Once the model is trained, I use the newly-trained model to predict Category values of the remaining testing data X_test. This predictions vector is called y_test_pred.
Once the prediction is calculated, I calculate and print the accuracy score using the actual category values in the testing set y_test, compared to the predicted values in y_test_pred.
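A minimal sketch of this experiment, assuming synthetic three-class data from make_classification instead of the encoded crime features. On such a small, easy problem the solver converges long before the cap, so the three scores will not differ the way they did on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the encoded crime features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=39,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=39
)

scores = {}
for max_iter in (2000, 3000, 4000):
    model = LogisticRegression(max_iter=max_iter)  # cap on solver iterations
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)
    scores[max_iter] = accuracy_score(y_test, y_test_pred)
    # n_iter_ reports how many iterations the solver actually used.
    print(max_iter, scores[max_iter], model.n_iter_)
```

Checking `n_iter_` against `max_iter` is one way to tell whether the solver stopped because it converged or because it hit the cap.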
The accuracy score final results for each max_iter value are shown in the screenshots below.
The performance of my model is evaluated based on the accuracy score, which is equal to 0.4777, or 47.77%, with 2000 iterations. I increase the iterations to 3000 to see if the accuracy score rises, and it increases slightly to 48.54%. With 4000 iterations, the accuracy stays essentially the same at 48.58%. At this point, I decide to stop increasing the iterations and use a max_iter of 3000 because there is very little gain in accuracy beyond it. This accuracy score is expected because we are using a real-world dataset, the Open Calgary Community Crime and Disorder Statistics dataset, and it is difficult to accurately capture and predict the Category of each crime based on these real-world features.
Plotting the actual Category counts (y_test) and the predicted values (y_test_pred) in a confusion matrix further demonstrates this point: there is a considerable difference between the actual and predicted values of each category.
The confusion matrix shows the actual counts of each label on the y-axis, while the predicted counts of each label are on the x-axis. For example, the Physical Disorder label appears 4364 times in y_test, with 3799 being correctly predicted as Physical Disorder (3799 true positives), 564 being incorrectly predicted as Social Disorder, and 1 being incorrectly predicted as Theft From Vehicle (564 + 1 = 565 false negatives). On the other hand, Physical Disorder had 783 false positives (with all of those 783 actually being Social Disorder) in y_test_pred. For most labels, there appears to be a significant number of false negatives and false positives which further confirms my accuracy score of 48.54%.
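The way true positives, false negatives, and false positives are read off the matrix can be sketched with a toy example. The twelve labels below are invented and far smaller than the real test set; only the label names come from the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Physical Disorder", "Social Disorder", "Theft From Vehicle"]

# Hypothetical actual and predicted labels, grouped by actual label.
y_test = ["Physical Disorder"] * 5 + ["Social Disorder"] * 4 + ["Theft From Vehicle"] * 3
y_test_pred = (
    ["Physical Disorder"] * 4 + ["Social Disorder"] * 1      # actual Physical Disorder
    + ["Social Disorder"] * 2 + ["Physical Disorder"] * 2    # actual Social Disorder
    + ["Theft From Vehicle"] * 2 + ["Social Disorder"] * 1   # actual Theft From Vehicle
)

# Rows are actual labels (y-axis), columns are predicted labels (x-axis).
cm = confusion_matrix(y_test, y_test_pred, labels=labels)

i = labels.index("Physical Disorder")
true_pos = cm[i, i]                     # diagonal entry: correctly predicted
false_neg = cm[i, :].sum() - true_pos   # rest of the row: actual label, predicted as another
false_pos = cm[:, i].sum() - true_pos   # rest of the column: other labels predicted as this one
accuracy = np.trace(cm) / cm.sum()      # correct predictions over all predictions

print(cm)
print(true_pos, false_neg, false_pos, accuracy)
```

The same row and column sums applied to the real matrix give the 3799, 565, and 783 figures quoted above.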
I also print the actual and predicted counts of each label, just to make sure the numbers match those displayed in the confusion matrix above.
It is notable that Commercial Break & Enter, Assault (Non-domestic), Street Robbery, Commercial Robbery, and 1320.131 did not appear in the predicted labels, even though they all have values in the actual test labels.
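One way to find labels that are never predicted is to compare the label counts of the two vectors. This is a sketch with four made-up labels per vector, not the actual test results:

```python
import pandas as pd

# Hypothetical actual and predicted labels.
y_test = pd.Series(
    ["Social Disorder", "Street Robbery", "Physical Disorder", "Social Disorder"]
)
y_test_pred = pd.Series(
    ["Social Disorder", "Social Disorder", "Physical Disorder", "Physical Disorder"]
)

actual_counts = y_test.value_counts()
predicted_counts = y_test_pred.value_counts()

# Labels present in the test set that the model never predicted.
never_predicted = sorted(set(actual_counts.index) - set(predicted_counts.index))

print(actual_counts)
print(predicted_counts)
print(never_predicted)
```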
For assignment 3, we consider the same Community Crime and Disorder Statistics dataset from Open Calgary, like in the previous two assignments. The goal is to train two Neural Network models, with the first model predicting crime count and the second model predicting category.
To start, I used the pandas read_csv method to convert the CSV containing the community crime and disorder dataset into a pandas DataFrame with the ‘Community Name’ column set as the index column of the dataset. I then used the DataFrame head() method to show the first five rows of the DataFrame, to ensure the name of the columns matches those in the CSV and that ‘Community Name’ is the index.
After confirming that the DataFrame displays the correct column names, I check the dimensions of the DataFrame to confirm the number of rows and columns. I also check if there are any null values in the DataFrame – fortunately, there are no null values. Once all these tasks are completed and I am satisfied with the DataFrame, I use the DataFrame pop method to remove the ‘Crime Count’ column and assign it to a variable ‘y’ to represent the output vector.
Next, I have decided to keep all features (except for Community Name which is now the index) and apply one hot encoding with the get_dummies() method from pandas for feature engineering, in order to convert categorical features like Sector, Group Category, Category, and Month to numerical values so they can be used in my model to improve predictions. The new DataFrame with these converted categorical features is called categorical_columns_df_enc, as seen below. It has a dimension of 100000 rows and 33 columns.
Once this is completed, I concatenate this newly formed categorical_columns_df_enc with the ‘Resident Count’ and ‘Year’ columns from the original DataFrame to form the features matrix ‘X’.
Before proceeding, I use a MinMaxScaler to scale the data in X to form X_scaled. My reasoning is that there is a big discrepancy between the range of the Resident Count and Year values and the range of the other features. I use a MinMaxScaler because none of these values can be negative, so I scale them to values between 0 and 1.
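The pop, encode, concatenate, and scale steps above can be sketched as follows. The three rows are hypothetical, and passing dtype=int to get_dummies is my own choice here so that every column is numeric before scaling.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny hypothetical frame reusing the column names mentioned in the text.
df = pd.DataFrame(
    {
        "Sector": ["SOUTH", "CENTRE", "SOUTH"],
        "Group Category": ["Crime", "Disorder", "Disorder"],
        "Category": ["Theft From Vehicle", "Social Disorder", "Physical Disorder"],
        "Month": ["JAN", "FEB", "JAN"],
        "Year": [2019, 2019, 2018],
        "Resident Count": [10520, 25129, 11150],
        "Crime Count": [4, 61, 3],
    },
    index=pd.Index(["ACADIA", "BELTLINE", "BOWNESS"], name="Community Name"),
)

# pop removes the column from df and returns it as the output vector.
y = df.pop("Crime Count")

categorical = ["Sector", "Group Category", "Category", "Month"]
categorical_columns_df_enc = pd.get_dummies(df[categorical], dtype=int)

# Join the encoded columns with the remaining numeric features, column-wise.
X = pd.concat([categorical_columns_df_enc, df[["Resident Count", "Year"]]], axis=1)

# Rescale every column to the range [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X.shape, X_scaled.min(), X_scaled.max())
```

Note that fit_transform returns a NumPy array rather than a DataFrame, so the column names are no longer attached to X_scaled.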
With X_scaled and y created, I use the Scikit-Learn method train_test_split to randomly split the data so that the training data is comprised of 70% of the given data and the test data is comprised of the remaining 30%, as specified in the assignment requirements. A random_state parameter of 45 is arbitrarily chosen to ensure the same random split occurs every time the code is run.
The train_test_split method will create a features matrix X_train that contains features of the training data, a features matrix X_test that contains features of the test data, an output vector y_train that contains the ‘Crime Count’ of the training data, and an output vector y_test that contains the ‘Crime Count’ of the test data.
Next, I also print the dimensions of the arrays formed by train_test_split to ensure that the training data contains 70% of the original 100000 rows (70000 rows) and the test data contains the remaining 30% (30000 rows). The feature matrices X_train and X_test should each have 35 columns, while the output vectors y_train and y_test should have only one column (the output column).
In the code block below, I am instantiating a Neural Network model with 7 layers using the Keras Python interface for the TensorFlow library. Because Crime Count is a real number and can take any value above 0, I decided to use a ReLU function for the activation of all layers except the output layer. I also tinker with various hyperparameters, including the number of layers, number of neurons, learning rate, and the loss function. Because I trust the official crime data from the city, I don’t believe there are huge outliers in the crime data, and therefore I choose a loss function of Mean Absolute Error instead of Mean Squared Error, since I am not too focused on reducing the error on outliers.
Eventually I arrive at a model with 7 layers that starts with 1024 neurons and narrows down to 1 neuron for the output, since this is a regression problem, with a learning rate of 0.0001. I also use Adam as the gradient descent optimizer because it is a common default choice for neural networks, is computationally efficient, has minimal memory requirements, and is well suited for large data problems.
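A sketch of a model along these lines is shown below. The text only fixes the first layer (1024 neurons), the output layer (1 neuron), the ReLU activations, the Adam optimizer, the learning rate, and the loss; the intermediate layer widths, the batch size, and the random training data are assumptions made purely for illustration.

```python
import numpy as np
from tensorflow import keras

n_features = 35  # number of feature columns described above

model = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(1024, activation="relu"),
        keras.layers.Dense(512, activation="relu"),  # intermediate widths are assumed
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1),  # single linear output for regression
    ]
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0001),
    loss="mean_absolute_error",
)

# Random stand-in data; the real run used the scaled crime features and 30 epochs.
rng = np.random.default_rng(45)
X_train = rng.random((64, n_features)).astype("float32")
y_train = rng.integers(0, 50, size=64).astype("float32")

history = model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=0)
train_mae = model.evaluate(X_train, y_train, verbose=0)
print(history.history["loss"], train_mae)
```

Leaving the output layer without an activation lets the network produce any real value, which is the usual choice for a regression target.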
I fit the neural network to the training data using X_train and y_train. Due to time and computational constraints, I limit the number of epochs to 30. Using more than 30 epochs did not provide a substantial improvement in the mean absolute error.
Once the model is trained, I use the newly-trained model to evaluate the test mean absolute error of the remaining testing data X_test and y_test.
The mean absolute error final results for the training and testing data are shown below.
The performance of my model is evaluated based on the mean absolute error, which is equal to 3.7757 on my testing data and 3.6716 on my training data. After 30 epochs, I decide to stop training because there is too little improvement in the mean absolute error beyond this point to justify the extra time and computational resources. I am very happy with these results because there is only a difference of about 0.1 between the training and testing mean absolute errors, so the model is neither overfit nor underfit. The testing mean absolute error is only slightly worse than the training mean absolute error, which is what I wanted. As expected, the mean absolute error loss decreases as the number of epochs increases.
I am pleasantly surprised at the results of my model because we are using a real-world dataset from the Open Calgary Community Crime and Disorder Statistics dataset, and I saw how difficult it was to accurately capture and predict the Crime Count based on these real-world features back in Assignment 1.
As a bonus, I create a similar model, but adjust the learning rate to 0.0002 to see if the results are in a similar range to the model with learning rate 0.0001. In the screenshot below, there is a slight improvement in the mean absolute error on my training data, which is now 3.6153. However, the mean absolute error on my testing data increases to 3.8653, which indicates more overfitting compared to my previous model, since there is a bigger difference between the training and testing mean absolute errors. Therefore, this gives me more confidence that my original learning rate of 0.0001 was the correct choice.
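The comparison between the two learning rates comes down to the gap between training and testing error. As a small worked check using only the numbers reported above:

```python
# Training and testing MAE values reported in the text for each learning rate.
results = {
    0.0001: {"train": 3.6716, "test": 3.7757},
    0.0002: {"train": 3.6153, "test": 3.8653},
}

# A larger test-minus-train gap suggests more overfitting.
gaps = {lr: round(r["test"] - r["train"], 4) for lr, r in results.items()}
best_lr = min(gaps, key=gaps.get)

print(gaps, best_lr)
```

The gap roughly doubles at the higher learning rate, which is the basis for keeping 0.0001.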
For the second model, which predicts Category, I again used the pandas read_csv method to convert the CSV containing the community crime and disorder dataset into a pandas DataFrame with the ‘Community Name’ column set as the index column of the dataset. I then used the DataFrame head() method to show the first five rows of the DataFrame, to ensure the name of the columns matches those in the CSV and that ‘Community Name’ is the index.
After confirming that the DataFrame displays the correct column names, I check the dimensions of the DataFrame to confirm the number of rows and columns. I also check if there are any null values in the DataFrame – fortunately, there are no null values. Additionally, I also check all the unique labels in the ‘Category’ column in order to have an idea of what the output values could be. Once all these tasks are completed and I am satisfied with the DataFrame, I use the DataFrame pop method to remove the ‘Category’ column and assign it to a variable ‘y’ to represent the output vector.
Next, I have decided to keep all features (except for Community Name which is now the index) and apply one hot encoding with the get_dummies() method from pandas for feature engineering, in order to convert categorical features like Sector, Group Category, and Month to numerical values so they can be used in my model to improve predictions. The new DataFrame with these converted categorical features is called categorical_columns_df_enc, as seen below. It has a dimension of 100000 rows and 22 columns.
Once this is completed, I concatenate this newly formed categorical_columns_df_enc with the ‘Crime Count’, ‘Resident Count’, and ‘Year’ columns from the original DataFrame to form the features matrix ‘X’.
Before proceeding, I use a MinMaxScaler to scale the data in X to form X_scaled. My reasoning is that there is a big discrepancy between the range of the Crime Count, Resident Count, and Year values and the range of the other features. I use a MinMaxScaler because none of these values can be negative, so I scale them to values between 0 and 1.
With X_scaled and y created, I use the Scikit-Learn method train_test_split to randomly split the data so that the training data is comprised of 70% of the given data and the test data is comprised of the remaining 30%, as specified in the assignment requirements. A random_state parameter of 45 is arbitrarily chosen to ensure the same random split occurs every time the code is run.
The train_test_split method will create a features matrix X_train that contains features of the training data, a features matrix X_test that contains features of the test data, an output vector y_train that contains the ‘Category’ of the training data, and an output vector y_test that contains the ‘Category’ of the test data.
Next, I also print the dimensions of the arrays formed by train_test_split to ensure that the training data contains 70% of the original 100000 rows (70000 rows) and the test data contains the remaining 30% (30000 rows). The feature matrices X_train and X_test should each have 25 columns, while the output vectors y_train and y_test should have only one column (the output column).
In the code block below, I am instantiating a Neural Network model with 6 layers using the Keras Python interface for the TensorFlow library. Because Category is a discrete label that can take on 11 possible values, which makes this a multiclass classification problem, I decided to use a softmax function for the activation of the output layer. For all other layers, I use the standard ReLU activation because I wanted my hidden layers to extract as many features as possible and did not want to limit the output of my hidden layers too much. I also tinkered with various hyperparameters, including the number of layers, number of neurons, learning rate, and the loss function. Because this is a multiclass classification problem, I chose the loss function to be Sparse Categorical Crossentropy.
Eventually I arrive at a model with 6 layers that starts with 512 neurons and narrows down to 11 neurons for the output, which matches the number of possible labels for this multiclass classification problem, with a learning rate of 0.0001. I also use Adam as the gradient descent optimizer because it is a common default choice for neural networks, is computationally efficient, has minimal memory requirements, and is well suited for large data problems.
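A sketch of a classifier along these lines follows. As with the regression model, only the first layer (512 neurons), the 11-way softmax output, the ReLU hidden activations, the Adam optimizer, the learning rate, and the loss come from the text. The intermediate widths, the use of LabelEncoder to turn text labels into the integer classes that Sparse Categorical Crossentropy expects, the placeholder label names, and the random data are all assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras

n_features, n_classes = 25, 11  # feature columns and Category labels described above

model = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(512, activation="relu"),
        keras.layers.Dense(256, activation="relu"),  # intermediate widths are assumed
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(n_classes, activation="softmax"),  # one probability per label
    ]
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Random stand-in data with placeholder label names (not the real categories).
rng = np.random.default_rng(45)
X_train = rng.random((110, n_features)).astype("float32")
y_text = np.tile(np.array([f"category_{i}" for i in range(n_classes)]), 10)
y_train = LabelEncoder().fit_transform(y_text)  # text labels -> integers 0..10

history = model.fit(X_train, y_train, epochs=2, batch_size=32, verbose=0)
loss, accuracy = model.evaluate(X_train, y_train, verbose=0)
probs = model.predict(X_train[:3], verbose=0)
print(loss, accuracy, probs.shape)
```

Each row of `probs` sums to 1, and the predicted label is the class with the highest probability.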
I fit the neural network to the training data using X_train and y_train. Due to time and computational constraints, I limit the number of epochs to 15. I found that using more than 15 epochs did not provide a substantial improvement in the accuracy score.
Once the model is trained, I use the newly-trained model to evaluate the testing accuracy score of the remaining testing data X_test and y_test.
The accuracy score final results for the training and testing data are shown below.
The performance of my model is evaluated based on the accuracy score, which is equal to 0.4932 on my testing data and 0.4929 on my training data. After 15 epochs, I decide to stop training because there is too little gain in accuracy score beyond this point to justify the extra time and computational resources. I am very happy with these results because there is only a difference of 0.0003 between the training and testing accuracy scores! The model is neither overfit nor underfit. As expected, the loss function decreases as the number of epochs increases.
I am pleasantly surprised at the results of my model because we are using a real-world dataset from the Open Calgary Community Crime and Disorder Statistics dataset, and I saw how difficult it was to accurately capture and predict the Category based on these real-world features back in Assignment 2. Although the accuracy scores are not high, they are slightly higher than those from my logistic regression model and there is a smaller gap between the training and testing accuracy scores.
As a bonus, I create a similar model, but adjust the learning rate to 0.0002 to see if the results are in a similar range to the model with learning rate 0.0001. In the screenshot below, there is a slight improvement in the accuracy score on my training data, which is now 0.4971. However, the accuracy score on my testing data decreases to 0.4899, which indicates more overfitting compared to my previous model, since there is a bigger difference between the training and testing accuracy scores. Therefore, this gives me more confidence that my original learning rate of 0.0001 was the correct choice.
Although the final results were mixed, I found that the biggest value of completing these assignments was in learning how to properly explain my workflow and approach for three different processes. Since the dataset used was real-world data, as expected the traditional regression and classification models I used were less effective than the neural network models. I was pleasantly surprised with how well the neural networks fit the training and testing data, as there was very little difference between their validation metrics. Tackling this project from three different angles with the same dataset
To further improve the outcomes of this project, it would be nice to explore different ways to visualize the results so they can be more explainable to the client. For example, projecting the results onto a map of Calgary would help the audience understand which crimes are occurring the most frequently in each community and which communities are the most dangerous.