Hockey Analytics Web Scraper

project / November 8th, 2021

Hockey Analytics Web Scraper - Main Image

GitHub

Background

This is a two-part project created for my professional portfolio after completing the 100 Days of Code Python Bootcamp. In the first part, a custom web scraper script is written in Python to open the hockey analytics website Natural Stat Trick in the Google Chrome browser and collect individual Calgary Flames player data from the 2020-21 NHL season. A CSV file containing this data is automatically created. In the second part of the project, the CSV file is uploaded into a Google Colab notebook and its data is analyzed and visualized to answer various questions related to player performance.

Purpose

After completing the 100 Days of Code Python Bootcamp, the majority of the subsequent projects I had done for my professional portfolio were either web or desktop applications. With this project, I wanted to expand the variety of projects in my portfolio by using Selenium for automation, Beautiful Soup for web scraping, and pandas, Matplotlib, Plotly, Seaborn, and NumPy for data exploration, analysis, and visualization. Other than Selenium, I did not have any previous experience using any of these libraries for a project, and thought it was a good opportunity to showcase my versatility and willingness to learn more potential applications of Python. Additionally, this project also presented a chance to tie my knowledge of Python to my interest in hockey.

Concepts and Tools Used

  • Python

  • Selenium

  • Beautiful Soup

  • pandas

  • Matplotlib

  • Plotly

  • Seaborn

  • NumPy

  • Google Colab

  • Data exploration and visualization

Process

At the beginning of this project, I had to decide on its scope: specifically, which data set (e.g. analytics for players on one team, analytics for players on all teams, or team analytics) I wanted to scrape and analyze. Because this project was my first time collecting and analyzing data, I chose to collect player analytics from the Calgary Flames for the 2020-21 NHL season (the most recently completed regular season). By limiting the size of the data set to a single season for a single team, fewer complex analyses would be required (such as a year-to-year analysis or a comparison to league-wide averages), which I felt was less time-consuming and more suitable for a beginner.

Part 1: Automation, Web Scraping Data, and Creating a CSV

In part 1 of this project, the first step is to create a Python script that will automate the process of opening a browser, going to the Natural Stat Trick website, selecting the correct data filters to display the desired data table on the browser, and scraping the table data into a Python dictionary format that can be eventually converted to a CSV file.

To create chains of autonomous actions in a browser, Selenium is imported into the script. Inside the script, I create a class named NaturalStatBot, where a driver object is created from the Selenium Webdriver module to open the Natural Stat Trick page in Google Chrome. Inside the NaturalStatBot class will be two methods that I created, set_filters and create_csv.

The set_filters method is called first, which will open the Natural Stat Trick link URL inside the Google Chrome browser using the driver's get method.

Hockey Analytics Web Scraper - Natural Stat Trick Unfiltered Table

After the URL is opened in Chrome, the driver is used to select the HTML elements that filter the table data. The filters are changed to display player data for the Calgary Flames during the 2020-21 regular season for players who played a minimum of 100 even strength minutes (minutes where there are 5 skaters and 1 goalie on each team), as "rates" (percentages or average counts over 60 minutes, rather than just counted numbers). Once all of these filters are set, the "Submit" button is selected by the driver and clicked to display the table data I want. Finally, the set_filters method ends with the driver taking the current URL that the browser is on and storing it into a class attribute called "url".

Hockey Analytics Web Scraper - Natural Stat Trick Filtered Table

Once set_filters is finished, create_csv is called. Here, all the HTML elements in the Natural Stat Trick URL with the filtered table data are made accessible in Python by using the Beautiful Soup library. In order to web scrape an HTML table, the first step involves finding all the column names in the table (which are of the HTML element "th") by using the find_all method. This will store all the "th", or column names inside an array that can be read inside a FOR loop. Inside a FOR loop, the name of each column is taken as a Python string and stored with an empty array to form a tuple. The empty array inside each tuple will eventually contain the corresponding column data under each field. Once a tuple is formed, it is stored into an array of tuples which represent each column in the table.

After the array of tuples is created, the second step to web scraping an HTML table is to find the HTML element that contains the body (the data rows) of the table. This element was found by locating the id attribute of the table body. Once the table body is made accessible in Python, find_all is used to find all the data row elements (which are of the HTML element "tr") and place them into an array. A FOR loop is used to loop through each data row, and find_all is used again to find all the data values (which are of HTML element "td") inside each row. The number of data values inside each row should match the number of columns in the table, or number of tuples inside the array of tuples. Another FOR loop is then used to loop through the data values for each row, convert them to a Python float, integer, or string depending on the data type, and append them to the empty array inside the corresponding tuple in the array of tuples. Therefore, the result is an array of tuples where each tuple contains the column/field name and an array of data values in the column. Once it is verified that the arrays in all of the tuples contain the same number of data values, a dictionary comprehension is used to create a Python dictionary from the array of tuples. The dictionary keys are the column names, and each key maps to an array of data values for that respective column.
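
The two scraping steps described above could be sketched roughly as follows. This is a minimal illustration rather than the exact code from my script: the sample HTML, the table body id `players`, and the helper names `convert` and `table_to_dict` are all made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the filtered table's HTML.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Player</th><th>GP</th><th>CF%</th></tr></thead>
  <tbody id="players">
    <tr><td>Player A</td><td>56</td><td>54.21</td></tr>
    <tr><td>Player B</td><td>49</td><td>47.9</td></tr>
  </tbody>
</table>
"""


def convert(text):
    """Return an int or float when the cell text is numeric, else the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def table_to_dict(html, body_id):
    """Scrape an HTML table into a {column name: list of values} dictionary."""
    soup = BeautifulSoup(html, "html.parser")

    # Step 1: one (column name, empty list) tuple per <th> header cell.
    columns = [(th.get_text(strip=True), []) for th in soup.find_all("th")]

    # Step 2: locate the table body by its id, then walk each <tr> row
    # and append every <td> value to the list of the matching column.
    body = soup.find(id=body_id)
    for row in body.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != len(columns):
            continue  # skip rows that do not line up with the header
        for (_, values), cell in zip(columns, cells):
            values.append(convert(cell.get_text(strip=True)))

    # Every column should have ended up with the same number of values.
    if len({len(values) for _, values in columns}) > 1:
        raise ValueError("columns have different lengths")

    # Dictionary comprehension: column name -> list of column values.
    return {name: values for name, values in columns}


table = table_to_dict(SAMPLE_HTML, "players")
print(table)
```

Trying `int` before `float` keeps whole numbers such as games played as integers, while values like percentages fall through to `float` and names stay as strings.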

A pandas DataFrame is created from the completed dictionary, and the DataFrame's to_csv method is used to create a CSV file that has the same filtered table from Natural Stat Trick and can be viewed in Microsoft Excel. This concludes part 1 of this project - my custom web scraper was successful in collecting data from Natural Stat Trick and creating a working CSV file!
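
As a small sketch of this last step, assuming a scraped dictionary shaped like the one described above (the values here are made up), the conversion to CSV could look like this. The example writes to an in-memory buffer so it can run anywhere; passing a file path to to_csv instead would create the file on disk. Whether the original script passed `index=False` is an assumption on my part.

```python
import io

import pandas as pd

# Hypothetical scraped data: column names mapped to lists of values.
scraped = {
    "Player": ["Player A", "Player B"],
    "GP": [56, 49],
    "CF%": [54.21, 47.9],
}

df = pd.DataFrame(scraped)

# index=False leaves out the DataFrame's row index so the CSV
# contains only the scraped columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```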

Hockey Analytics Web Scraper - CSV File in Excel

Part 2: Data Exploration, Visualization, and Analysis

What insights can be gained from analyzing the table data in the newly created CSV file? Before starting part 2, I wrote a general list of questions about the Flames 2020-21 regular season that I wanted answers to:

  • Which Flames forwards and defensemen excelled or struggled the most at influencing team puck possession when on the ice during 5-on-5 play?

  • Who were the best or worst offensive performers? Which Flames forwards and defensemen excelled the most or least at contributing to high-danger scoring chances generated by the team when on the ice during 5-on-5 play? Was there a correlation between puck possession time and scoring chances generated?

  • Who were the best or worst defensive performers? Which Flames forwards and defensemen excelled the most or least at contributing to the team preventing high-danger scoring chances during 5-on-5 play?

  • Which players had the easiest and toughest assignments? How was each player strategically deployed and how did it affect team performance? Were players effective in the roles they were given? Was there a correlation between deployment and scoring chances generated?

To answer these questions, I uploaded the CSV in a Google Colab notebook. A notebook makes it easy to write Python code for the CSV data, while keeping all of my code segments organized and labeled.

After uploading the CSV, I imported the pandas library into the notebook for data analysis and manipulation. The pandas method read_csv is used to create a DataFrame from the CSV, seen in the image below.

Hockey Analytics Web Scraper - Full DataFrame

Preliminary Data Exploration

Compared to the PyCharm console, viewing table data in a notebook is easier and more visually appealing. The next step is preliminary data exploration. Using various pandas methods, I can easily check the dimensions of the DataFrame (22 rows, 54 columns), print a list of all the column names (which will be a handy reference when selecting columns later), check if there are missing or duplicate values in the DataFrame (there were none), and get more DataFrame information like the mean, standard deviation, and minimum and maximum values in each column:

Hockey Analytics Web Scraper - DataFrame Description
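
For reference, the exploration checks listed above map onto a handful of pandas calls. The snippet below is a sketch on a tiny made-up CSV (the column names and values are placeholders, not the real data set), read from a string so that it is self-contained.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the scraped file.
CSV_TEXT = """Player,Position,TOI,CF%
Player A,C,700.5,54.2
Player B,D,910.0,47.9
Player C,L,420.3,51.0
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

shape = df.shape                         # (number of rows, number of columns)
column_names = df.columns.tolist()       # handy reference for later selections
has_missing = df.isna().values.any()     # True if any cell is empty
has_duplicates = df.duplicated().any()   # True if any row is repeated
summary = df.describe()                  # count, mean, std, min, max, quartiles

print(shape, column_names, has_missing, has_duplicates)
print(summary)
```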

Possession Metrics

After the preliminary data exploration is finished, it's time to look at possession metrics. To get more insight into how each player influenced team puck possession when they were on the ice, a useful statistic to look at is Corsi. During 5-on-5 play, Corsi is the differential between shot attempts directed towards the net (either on goal, missed, or blocked) of the opposing team, and those directed towards their own net. Because shot attempts are usually made in the offensive zone, Corsi can be used as an indicator of how much time a player's team spends in the offensive zone and how much time their team possesses the puck when they are on the ice. Generally, teams who spend more time in the offensive zone than the defensive zone and have a positive Corsi have a greater chance of winning.
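
To make the arithmetic concrete, here is a small sketch of Corsi as a differential and as a percentage (Corsi For Percentage, CF%, is the team's share of all shot attempts while the player is on the ice). The function names and the shot attempt counts are made up for illustration.

```python
def corsi_differential(attempts_for, attempts_against):
    """Shot attempts for minus shot attempts against while a player is on the ice."""
    return attempts_for - attempts_against


def corsi_for_pct(attempts_for, attempts_against):
    """Share of all on-ice shot attempts that belonged to the player's team."""
    total = attempts_for + attempts_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * attempts_for / total


# Made-up example: 550 attempts for and 450 against at 5-on-5.
print(corsi_differential(550, 450))  # 100
print(corsi_for_pct(550, 450))       # 55.0
```

A CF% above 50 corresponds to a positive Corsi differential, meaning the team out-attempted its opponents with that player on the ice.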

For this analysis, I decided to look at the Corsi For Percentage (CF%) of each player, which is the share of all 5-on-5 shot attempts that belong to the player's team while he is on the ice. I also wanted to look at the forwards and defensemen separately and did so by creating separate DataFrames for each of them. First, I looked at the top 5 forwards with the highest CF%.
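
Splitting the skaters by position and ranking them can be sketched like this. The player names and numbers are made up, and the column names "Position" and "CF%" as well as the "D" code for defensemen are assumptions about how the CSV is labelled.

```python
import pandas as pd

# Made-up skaters; "D" marks defensemen, anything else is a forward.
df = pd.DataFrame({
    "Player": ["F1", "F2", "D1", "F3", "D2", "F4", "F5", "F6"],
    "Position": ["C", "L", "D", "R", "D", "C", "L", "C"],
    "CF%": [55.1, 48.3, 57.0, 52.4, 46.2, 50.9, 47.5, 53.8],
})

# Separate DataFrames for forwards and defensemen.
forwards = df[df["Position"] != "D"]
defensemen = df[df["Position"] == "D"]

# Five highest and five lowest CF% among forwards.
top_5_forwards = forwards.nlargest(5, "CF%")[["Player", "CF%"]]
bottom_5_forwards = forwards.nsmallest(5, "CF%")[["Player", "CF%"]]

print(top_5_forwards)
print(bottom_5_forwards)
```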

Hockey Analytics Web Scraper - Top 5 Forwards in CF%

Do these results match the eye test and are there any surprises? Because of an off year where his offensive production was underwhelming until the last quarter of the season, I was surprised to see that Matthew Tkachuk was still one of the best Flames forwards when it came to driving play and influencing team puck possession. Despite having what I thought were rough defensive moments where he was caught out of position or committed turnovers that resulted in goals against, his strong CF% suggests that his two-way game may be underrated as he spends more time in the offensive zone than the defensive zone. It was also interesting to see Derek Ryan and Josh Leivo at the top of the list. Both of them had lesser roles on the fourth line and produced minimal offense during the season, so the smaller ice time sample size may have contributed to their high CF%. It is probably safe to say that Ryan was the player who was driving these results because of his two-way play and his ability to carry the puck into the offensive zone, and that Leivo's numbers were the byproduct of playing as Ryan's winger.

Hockey Analytics Web Scraper - Bottom 5 Forwards in CF%

Looking at the 5 forwards with the lowest CF%, it is surprising that Johnny Gaudreau was near the bottom of the forward group, as he has always been amongst the team leaders in points over the last seven seasons in Calgary and is the best puck-handler on the team. While the same could be said for Monahan, it's less surprising in his case since his two-way game and puck possession game have always been weaknesses. It was also revealed recently that Monahan had been playing since the sixth game of the season with an injured hip that required surgery, which likely contributed to his reduced effectiveness, since it is difficult to battle for loose pucks and maintain possession along the boards with an injury that severe. It is also notable that Brett Ritchie and Dominik Simon are listed at the bottom, as both played as the third wheel with Gaudreau and Monahan at different points throughout the first two-thirds of the season. From what I have seen of Simon and Ritchie, both are ideally fourth line players due to their low skill level and neither should have seen any time with Gaudreau and Monahan. It's likely that Gaudreau and Monahan's numbers were negatively affected by the lack of a quality linemate to complete their line.

Finally, let's look at the CF% for all of the Flames defensemen.

Hockey Analytics Web Scraper - Defensemen CF%

Chris Tanev's results are outstanding, especially when you consider that he had more 5-on-5 ice time than any other defenseman! When he joined the team last season, I knew he was a solid defensive defenseman but I didn't expect him to stand out as much as he did in this table. This is a case where the analytics definitely match the eye test. Another interesting find is Michael Stone and Oliver Kylington having strong numbers in their limited minutes. I've always been an advocate for Kylington to get more ice time because his exceptional skating and puck-moving ability are skills that are lacking in the defense core, especially after the departure of T.J. Brodie in the 2020 offseason. Despite being 37 years old last season, Mark Giordano continued to be a steady two-way presence, playing heavy minutes while contributing at both ends of the ice. Perhaps the biggest takeaway from these results is how much Rasmus Andersson struggled. Playing tough minutes on the top defense pairing with Giordano for the first time in his career during the first half of the season as a 24-year-old, Andersson seemed overwhelmed with the increased responsibilities against top players. I would be curious to see how Andersson's numbers changed in the second half of the season, when Darryl Sutter was named head coach and moved Andersson to a pairing with Noah Hanifin where he seemed to play better. Conversely, it would also be interesting to see if Giordano's numbers improved after being moved away from Andersson to form a new pairing with Tanev, as the Giordano-Tanev pairing looked very good in the second half of the season. Did Andersson drag Giordano down? Or was it simply a case where the chemistry between the two was lacking and their playing styles didn't complement each other?

Scoring Metrics

Let's now look at the scoring metrics to see who excelled the most at influencing the generation of high danger scoring chances at 5-on-5. The statistic to use here is HDCF/60, which is a rate that represents the number of high danger scoring chances produced against the opposition when a player is on the ice for every 60 minutes of 5-on-5 play. To classify a scoring chance as "high danger", let's look at the chart below (provided by war-on-ice.com):

Hockey Analytics Web Scraper - Danger Area Chart

According to Natural Stat Trick, each shot attempt (or Corsi) made in the offensive zone is assigned an initial value based on the area in the zone where it was made. Shot attempts from anywhere outside the yellow or red areas are assigned an initial value of 1. Attempts from the yellow area are assigned an initial value of 2. Attempts from the red area between the circles and between the faceoff dots and the goal line are assigned an initial value of 3. Another score of 1 is added to the initial value if the attempt is considered a rush shot (any attempt made within 4 seconds of ANY event in the neutral or defensive zone without a stoppage in play) or a rebound (any attempt made within 3 seconds of another blocked, missed, or saved attempt without a stoppage in play). A score of 1 is subtracted from the initial value if the shot attempt is blocked.

Any scoring chance with a score of 1 or less is considered to be "low danger". A scoring chance with a score of exactly 2 is considered to be "medium danger", while a scoring chance with a score of 3 or more is considered to be "high danger".
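
The scoring rules above translate into a few lines of Python. This is only a sketch of my reading of the description: the zone labels and function names are made up, and I am treating the rush/rebound bonus as a single +1 whether an attempt is a rush shot, a rebound, or both.

```python
# Initial value of a shot attempt based on where it was taken from.
ZONE_VALUES = {"outside": 1, "yellow": 2, "red": 3}


def danger_score(zone, rush=False, rebound=False, blocked=False):
    """Score a shot attempt using the zone, rush/rebound and blocked rules."""
    score = ZONE_VALUES[zone]
    if rush or rebound:
        score += 1  # bonus for a rush shot or a rebound
    if blocked:
        score -= 1  # penalty for a blocked attempt
    return score


def danger_level(score):
    """Map a score to low (1 or less), medium (exactly 2) or high (3 or more)."""
    if score <= 1:
        return "low"
    if score == 2:
        return "medium"
    return "high"


# A rebound from the yellow area: 2 + 1 = 3, so high danger.
print(danger_level(danger_score("yellow", rebound=True)))
# A blocked attempt from the red area: 3 - 1 = 2, so medium danger.
print(danger_level(danger_score("red", blocked=True)))
```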

Keeping all that in mind, let's now look at the HDCF/60 for the Flames forwards. To give an idea of sample size, each player's ice time is also included.

Hockey Analytics Web Scraper - High Danger Scoring Chances for Forwards

Taking ice time and roles into consideration, Mangiapane, Backlund, Gaudreau, and Tkachuk were the Flames forwards who made the team the most dangerous offensively when they were on the ice, averaging at least 10 high danger chances for every 60 minutes of 5-on-5 ice time. This does match the eye test, and suggests that each of them has been effective in his role as a forward on the top two lines, given the rate of dangerous scoring chances they helped create over that large sample size. Ryan continues to be an analytics darling, as his strong HDCF/60 combined with his strong CF% suggests that he is a strong puck possession player who also helps create meaningful scoring chances with that possession time. It's important to take context into consideration: this suggests that Ryan is excellent offensively in his role as a fourth liner, and does not necessarily mean he would continue to produce the same results if he were used higher in the lineup.

Next, let's look at HDCF/60 for the Flames defensemen.

Hockey Analytics Web Scraper - High Danger Scoring Chances for Defensemen

Chris Tanev continues to stand out in a positive way. Although he is not known as an offensive defenseman or a point producer, his play does benefit the Flames offense. Not only did he play the most 5-on-5 minutes out of all defensemen, his on-ice presence assisted in the production of high danger scoring chances at a higher rate than any other Flames defenseman. This does match the eye test, as he is excellent at killing opposition plays and their puck possession time, which in turn leads to possession time and scoring chances for the Flames. Juuso Valimaki ranking second among defensemen is surprising, as I did not notice him being involved much in the creation of dangerous scoring chances last season. If anything, I thought he played too "safe", and did not make enough plays with the puck when he had it. If he was able to produce efficient results like this while playing "safe", I'm excited at what he can do when he develops and figures it out! Again, Rasmus Andersson was a disappointment. He is a player who is relied on for his offensive production, but did not help generate enough high danger chances during 5-on-5 play, especially considering the high number of minutes he logged. Hopefully, this season can be looked at as a learning experience for the 24-year-old, who I am still very high on.

Another offensive metric I can look at is expected goals per 60 (xGF/60). Expected goal models built by the analytics community use information from previous shot attempts to assess the quality of new shots. By using variables from previous shots like shot location, events leading up to the shot, and shooting percentage averages across the league in previous seasons, every new shot is assigned an expected goal value. Quite simply, xGF for a player denotes the expected number of goals scored by the team, based on the quality of shots that occur when that player is on the ice. xGF is also an independent value, unaffected by whether or not a shot resulted in a goal.

First, let's figure out what the average xGF/60 is and which players had an xGF/60 above the average:

Hockey Analytics Web Scraper - Skaters with an Expected Goals Rate Above the Mean

When on the ice during 5-on-5 play, Mangiapane led the team with the Flames generating expected goals at a clip of almost two and a half for every 60 minutes he is on the ice. This is very impressive for a third-year player and a former 6th round pick, and shows how his playmaking, goal-scoring ability, and tenacity have combined to make him one of the most dangerous Flames forwards. As you can see above the table, the average xGF/60 was 2.25. I was very glad to see that Gaudreau and Tkachuk were above the team average, especially since they are both relied on as top two line forwards. For that same reason, I was surprised that Lindholm was not on this list as I thought he would help the team generate more with the role and ice time he's been given as the team's number 1 centre.
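
Finding the players above the team average comes down to a mean and a boolean filter. Below is a minimal sketch with made-up players and values; the column name "xGF/60" is an assumption about how the CSV is labelled.

```python
import pandas as pd

# Made-up skaters and expected goals rates.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "xGF/60": [2.4, 1.9, 2.2, 2.1],
})

# Team average, then keep only the rows above it, highest rate first.
average_xgf = df["xGF/60"].mean()
above_average = (
    df[df["xGF/60"] > average_xgf]
    .sort_values("xGF/60", ascending=False)
)

print(f"Average xGF/60: {average_xgf:.2f}")
print(above_average)
```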

In which range did most of the Flames skaters fall in terms of expected goals? Using xGF%, I made a Plotly donut chart containing the number of players that fall within different ranges of xGF%:

Hockey Analytics Web Scraper - Expected Goals Percentage Donut Chart

None of the counts inside each range are particularly surprising, but the most interesting takeaway from this chart is how Plotly defined the ranges. An xGF% of 47.0 to 49.9 is considered the worst range on the team and an xGF% of 49.9 to 52.7 is considered below average. Even if a player breaks even on xGF% or is expected to slightly outscore opponents when they are on the ice, it's still considered a below average result on this team. It's not enough to just play safe, low-event hockey; the aim should be to create dangerous chances at a high rate.

An alternative way of visualizing this is to use a histogram. Using the Seaborn displot, a Kernel Density Estimate (KDE) can also be superimposed onto the histogram to provide a smooth distribution visual:

Hockey Analytics Web Scraper - Expected Goals Percentage Histogram and KDE

I also added the mean and the skew values in the title of this chart in order to give the viewer a better idea of what the average xGF% is and how asymmetrical the KDE distribution is.
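
The two numbers in the chart title can be computed directly from the column. The sketch below uses made-up xGF% values and the pandas mean and skew methods, which is one way to get them and not necessarily the exact calls from my notebook; the plotting call itself is left out so the snippet runs without a display.

```python
import pandas as pd

# Made-up xGF% values with one high outlier to pull the tail to the right.
xgf_pct = pd.Series([47.5, 49.0, 50.2, 51.1, 52.4, 53.0, 57.8], name="xGF%")

mean_value = xgf_pct.mean()
skew_value = xgf_pct.skew()  # positive means a longer tail on the high side

# Title text that could be passed to the chart.
title = f"xGF% distribution (mean = {mean_value:.2f}, skew = {skew_value:.2f})"
print(title)
```

A skew close to zero suggests a roughly symmetrical distribution, while a positive or negative value shows which side the longer tail is on.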

Defensive Metrics

What about the defensive side of the game? Let's look at the HDCA/60 rates to see which forwards excelled at helping the team prevent high danger scoring chances.

Hockey Analytics Web Scraper - High Danger Scoring Chances Against for Forwards

Similar to how they finished bottom 5 in CF% amongst forwards, the group of Gaudreau, Monahan, Simon, and Ritchie also appear in the bottom 5 in HDCA/60. They struggled with not only puck possession, but the team also bled high danger scoring chances at a higher rate when they were on the ice. Once again, Derek Ryan's results as the fourth line centre are outstanding. Not only are his possession and chance generation numbers excellent, his defensive numbers are strong and the Flames don't give up much when he's on the ice - just 5.7 high danger scoring chances every 60 minutes!

What about the HDCA/60 for defensemen?

Hockey Analytics Web Scraper - High Danger Scoring Chances Against for Defensemen

Although Giordano had a solid CF%, he finished second last in HDCA/60. This can be attributed to the struggles of Andersson, his pairing partner for the first half of the season. The disparity in Andersson's HDCA/60 compared to the other defensemen (with him on the ice, the team allowed nearly one and a half more chances per 60 minutes than it did with the second worst defenseman) is concerning, especially considering the number of minutes he played. On the positive side, Tanev was once again outstanding as the Flames allowed the lowest rate of high danger chances per 60 amongst defensemen with him on the ice. Noah Hanifin, who was Tanev's defense partner for the first half of the year, also showed impressive preventive numbers before an injury ended his season prematurely.

Earlier in our analysis, we looked at expected goals (xGF) as an additional metric when exploring how the Flames players performed offensively. Let's now look at the team average for expected goals allowed per 60 minutes (xGA/60) and see which players finished below the average. Who encountered the lowest amount of quality shots on their own net when they were on the ice during 5-on-5 play?

Hockey Analytics Web Scraper - Expected Goals Allowed for Skaters

The average xGA/60 for the Flames skaters was 1.94. When Tanev is on the ice at 5-on-5, the Flames clearly don't give up very much in terms of quality shots and scoring chances - his expected goals against is only 1.47 for every 60 minutes, which ranks first on the team! The Flames are lucky to have such an exceptional defender on their back end, as he makes anyone he plays with look better. Hanifin during the first half of last season and Quinn Hughes in his rookie season in Vancouver both looked like they took steps forward in their overall game when they were paired with Tanev. Giordano also saw improvement in his game when he played with Tanev in the second half of last season, although it didn't improve his xGA/60 enough to put him on this table. Again, Elias Lindholm wasn't on this list, which is a surprise as he has always been lauded for his two-way game. It makes me wonder - he may be an adequate number 1 centre, but would his numbers improve if he were slotted as a number 2 centre against lesser competition?

Deployment Metrics

How much does player deployment factor in the generation and prevention of quality shots and expected goals? Let's look at offensive zone starts, which is the number of shifts for a player that started with an offensive zone faceoff. Let's use the Plotly library to create a vertical bar chart to highlight which players saw the highest and lowest rates of offensive zone starts:

Hockey Analytics Web Scraper - Offensive Zone Starts Chart

Not surprisingly, Gaudreau saw a high number of his shifts start in the offensive zone as the team's offensive catalyst and their most skilled player. On the other hand, it is surprising that Hanifin saw the lowest rate of offensive zone starts when looking at his past production and his billing as a top offensive defenseman in his draft year.

Let's see if there is a relationship between offensive zone starts and expected goals by building a scatter plot with the Matplotlib library:

Hockey Analytics Web Scraper - OZS and Expected Goals Scatter Plot

Looking at this scatter plot, there isn't a visible correlation between deployment and expected goals. Depending on skill and compete level, players with a similar rate of offensive zone starts can vary from being effective in generating quality shots and making the most of their starts, to being very ineffective. To highlight this, I picked eight players and displayed their results in a table:

Hockey Analytics Web Scraper - OZS and Expected Goals Table

Here, we can see a substantial difference between the xGF/60 of Gaudreau and Monahan, despite both being relied on as key offensive players and being deployed similarly. The same goes for the pairs of Mangiapane and Ritchie, Nordstrom and Tkachuk, and Lucic and Backlund. It is very impressive that Backlund had the second highest xGF/60 on the team, especially considering he had the fourth lowest rate of offensive zone starts.

Is there a better way to visualize this? What if we added a third layer of metrics to xGF/60 and Offensive Zone Starts/60 by including HDCF/60? This can be easily done by creating a scatter plot with the Seaborn library:

Hockey Analytics Web Scraper - OZS, xGF, and HDCF Scatter Plot

From this chart, we can see a positive correlation between high danger scoring chances and expected goals. Offensive zone starts don't appear to be a significant factor, as the size of the bubbles varies across the chart. As I said earlier, the effectiveness of each player in creating offense from their offensive zone starts probably boils down to the skills and compete level of each player.

As a bonus, Seaborn makes it possible to create joint plots. Let's make a joint plot to provide an alternative look at the relationship between HDCF/60 and xGF/60:

Hockey Analytics Web Scraper - OZS and HDCF Joint Plot

The positive relationship between HDCF/60 and xGF/60 almost looks linear. To model this, let's use the Seaborn regplot to show the linear regression line on a Seaborn scatter plot:

Hockey Analytics Web Scraper - Linear Regression Plot

The data points actually align closely with the linear regression line, and show the positive relationship between HDCF/60 and xGF/60. For every 60 minutes of 5-on-5 playing time, the greater the rate of high danger scoring chances created by a team with a player on the ice, the greater the rate of expected goals.

Let's now use the linear model to make a prediction. For example, how much xGF/60 does the model estimate for a player with an HDCF/60 of 13? First, we must find the y-intercept (theta 0 in the equation pictured below) and the slope of the line (theta 1 in the equation pictured below) by using the LinearRegression class from the scikit-learn library:

Hockey Analytics Web Scraper - Linear Regression Calculation

According to the results, the y-intercept is 0.85 (the xGF/60 if HDCF/60 is zero) and the slope is 0.14, which is another indicator of a positive relationship between the two variables. The R-squared value is 0.55, which means that HDCF/60 explains roughly 55% of the variation in xGF/60, a moderate relationship. Using the linear equation in the previous image, let's now figure out what the model estimates for the xGF/60 of a player with an HDCF/60 of 13.

Hockey Analytics Web Scraper - Linear Regression Result

The linear model predicts that a player who helps the Flames produce at a rate of 13 high danger scoring chances per 60 minutes of 5-on-5 play will have an expected goals rate of 2.6 in that time frame. Neat!
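
The fit-and-predict steps described above can be sketched with scikit-learn as follows. The data points are made up and lie exactly on a straight line (so the R-squared here is 1, unlike with the real data); only the LinearRegression workflow is meant to carry over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line xGF/60 = 1.0 + 0.1 * HDCF/60.
hdcf_per_60 = np.array([6.0, 8.0, 10.0, 12.0, 14.0]).reshape(-1, 1)
xgf_per_60 = np.array([1.6, 1.8, 2.0, 2.2, 2.4])

model = LinearRegression().fit(hdcf_per_60, xgf_per_60)

theta_0 = model.intercept_  # y-intercept of the fitted line
theta_1 = model.coef_[0]    # slope of the fitted line
r_squared = model.score(hdcf_per_60, xgf_per_60)

# Prediction for an HDCF/60 of 13: theta_0 + theta_1 * 13.
prediction = model.predict(np.array([[13.0]]))[0]

print(f"intercept={theta_0:.2f}, slope={theta_1:.2f}, R^2={r_squared:.2f}")
print(f"predicted xGF/60 at HDCF/60 of 13: {prediction:.2f}")
```

scikit-learn expects the input feature as a two-dimensional array, which is why the HDCF/60 values are reshaped into a single column before fitting.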

How varied is the xGF/60 for players with an Offensive Zone Starts/60 above or below the average rate? To answer this, I added a new column named "Above Avg Off. Starts?" to the DataFrame. Thanks to the where method from NumPy, each player will have a value of either "No" or "Yes" for this new column, depending on whether the player's Offensive Zone Starts/60 is above or below the average rate:

Hockey Analytics Web Scraper - Adding New Column
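
In code form, adding the Yes/No column with NumPy could look like the sketch below. The players and rates are made up, and the source column name "Off. Zone Starts/60" is an assumption about how the CSV is labelled.

```python
import numpy as np
import pandas as pd

# Made-up players and offensive zone start rates.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Off. Zone Starts/60": [12.0, 8.0, 10.5, 9.5],
})

average_rate = df["Off. Zone Starts/60"].mean()

# np.where picks "Yes" where the condition holds and "No" everywhere else.
df["Above Avg Off. Starts?"] = np.where(
    df["Off. Zone Starts/60"] > average_rate, "Yes", "No"
)

print(df)
```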

Having this new column will help visualize this question using a Plotly box chart:

Hockey Analytics Web Scraper - Adding New Column Box Chart

The variation in xGF/60 is considerably higher for the players who were given offensive zone starts at an above-average rate! To me, this suggests either one of two things: the Flames didn't deploy the right players enough for their offensive zone starts, or the team simply lacked the personnel needed to capitalize on the offensive zone starts they were given. Personally, I lean towards the latter as I think the Flames roster lacked scoring depth and relied on too few for too much on offense.

Just for fun, I used HDCF/60 instead of xGF/60 and converted this information to a combined histogram and box plot with Plotly. This allows us to see which percentages of the roster fall in various ranges of HDCF/60, as well as whether the value falls in the lower quartile, median, or the upper quartile of players.

Hockey Analytics Web Scraper - Adding New Column Histogram and Box Plot

Going back to xGF/60 and offensive zone starts, I continued to explore alternative ways of visualizing this information by using a Kernel Density Estimate chart in Seaborn. This allowed me to create a chart with smooth distribution.

Hockey Analytics Web Scraper - Adding New Columns KDE

How about defensive zone starts? Which players did the Flames rely on the most when starting in their own end? Let's see using a horizontal bar chart made with the Plotly library:

Hockey Analytics Web Scraper - Defensive Zone Starts Chart

A useful aspect of Plotly is that I can see the exact value for a bar by hovering my mouse over it, and don't need to create a table. According to the chart, Elias Lindholm was heavily relied on defensively, with a rate of 10.95 defensive zone starts for every 60 minutes of 5-on-5 ice time. Tkachuk, Backlund, Giordano, Tanev, and Andersson also saw a high rate of defensive zone starts. As young defensemen with fewer than 100 NHL games, Kylington and Valimaki were heavily sheltered and faced very few shift starts in the defensive zone.

Let's use Seaborn again to view the relationship between HDCA/60, xGA/60, and Defensive Zone Starts/60:

Hockey Analytics Web Scraper - DZS, xGA, HDCA
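One way to draw this kind of three-variable bubble chart in Seaborn, and to put a number on the relationship between the two axes, is sketched below. The column names and values are placeholder assumptions, and the Pearson correlation is an addition for illustration that the original analysis may not have computed.

```python
import pandas as pd
import seaborn as sns

# Placeholder data: values are invented, not real player results.
df = pd.DataFrame({
    "HDCA/60": [9.0, 10.0, 10.5, 11.0, 12.0, 13.0],
    "xGA/60": [1.7, 1.8, 1.9, 1.9, 2.1, 2.3],
    "Def. Zone Starts/60": [5.0, 8.0, 9.0, 8.5, 10.0, 6.0],
})

# Pearson correlation between chances against and expected goals against.
r = df["HDCA/60"].corr(df["xGA/60"])

# Bubble size encodes the third variable, defensive zone starts per 60.
ax = sns.scatterplot(
    data=df,
    x="HDCA/60",
    y="xGA/60",
    size="Def. Zone Starts/60",
    sizes=(40, 400),  # smallest and largest marker areas
)
ax.set_title("HDCA/60 vs xGA/60 (placeholder data)")
```

A value of `r` close to 1 would back up the visual impression of a positive relationship.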

As expected, there's a positive correlation between a player's high danger chances against and expected goals against. Compared to the chart for offensive zone starts, the bubble sizes indicating Defensive Zone Starts/60 appear more closely matched, with most of the players who were relied on more in the defensive zone bunched up between 1.8 and 2.0 expected goals against, with only a few outliers.

Just For Fun

As I played around with the data, I also wanted to see if I could separate xGF/60 by position.

Hockey Analytics Web Scraper - Position Box Chart

Of course, because Natural Stat Trick lists players under their official NHL.com position and not necessarily where they actually played during the year (eg. Tkachuk played RW during the last quarter of the season, Bennett played all three forward positions at various points), this isn't a great chart to use. The only way this chart may be useful is if someone wishes to slot these players only in their natural positions and wants to see who performed the best when deciding who to keep. This chart does highlight the deficiency of natural right-wingers on the Flames roster.

Summary

In terms of forwards, Mangiapane and Ryan's results were pleasant surprises. Mangiapane is emerging as a cornerstone player for the Flames, and given his performance at the 2021 IIHF Men's World Championship and the positive trend in his game, he could very well end up being the Flames' best player by the end of next season. Although it was unfortunate that Ryan left the team in the offseason, it isn't a huge loss in the grand scheme of things when you consider that he was a fourth line centre. Gaudreau and Tkachuk are important pieces of the Flames' top two lines, but Monahan is a big question mark with his underwhelming results on possession and offense. Backlund continues to be a strong two-way performer who produces results on offense despite being heavily relied on in the defensive zone. Lindholm might generate stronger results on both sides of the puck if the Flames acquired a star top line centre who could attract tougher matchups from opposing teams, allowing Lindholm to be slotted as a number 2 centre.

On defense, Tanev exceeded all expectations and was simply outstanding as a top pairing defenseman according to all metrics. Andersson had an extremely disappointing year and didn't yield the returns on offense and defense that a coach would expect with the ice time he received. Giordano was solid (especially on a pairing with Tanev), but his results may have suffered as a result of being paired with Andersson for most of the year. Kylington was definitely underutilized, and I'm extremely happy to see him play as well as he has in the 2021-22 season so far. Hanifin is a safe and steady minute-muncher who plays low-event hockey and provides stability to the defense corps. I would be interested to see whether he could generate more offense if he were given more offensive zone starts.

Outcome and Takeaways

This was a difficult but rewarding project that required lots of time, but also allowed me to connect my newfound knowledge of programming to my interest in hockey. I found that the various data exploration and visualization tools provided by the pandas, Plotly, Seaborn, and Matplotlib libraries were very effective in seeking answers to the questions I listed at the beginning of part 2 of this project. The data handling techniques I learned from the 100 Days of Code were extremely useful and I was satisfied with how I applied them for this project. It was amazing seeing how Selenium and Beautiful Soup combined to automate the process of opening a URL, selecting HTML elements, scraping data off the page, and creating a CSV with the data.

Admittedly, I was not previously a big analytics supporter. Although I still believe that the eye test is valuable, I now believe that analytics are also useful, as long as they are considered in the right context. Analytics in hockey only represent how a player performs in a specific role, on a specific team, under a specific coach, and are not necessarily a predictive indicator of how a player would do with a new role, team, or coach.

Possible Improvements Going Forward

A possible improvement for future data explorations would be to merge this advanced statistics DataFrame with another DataFrame that contains the actual goals, assists, and other counting stats that everyone sees, to show how puck possession time and chance generation translate into actual goals and assists for each player. I would also love to add historical data from previous seasons and create line charts of the year-to-year trends in player performance, as well as metrics for player pairings or trios to get a snapshot of which line combinations were effective or ineffective.
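The merge described above might look something like the following sketch. Both DataFrames, their column names, and their values are hypothetical; it assumes the two tables share a `Player` column with one row per player.

```python
import pandas as pd

# Hypothetical advanced-stats table (values invented for illustration).
advanced = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "xGF/60": [2.1, 2.6, 1.9],
})

# Hypothetical counting-stats table from a second source.
counting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player D"],
    "Goals": [10, 18, 4],
    "Assists": [15, 20, 6],
})

# A left merge keeps every player from the advanced-stats table;
# validate raises an error if a player appears more than once on either side.
merged = advanced.merge(counting, on="Player", how="left", validate="one_to_one")

# Players missing from the counting-stats table show up with NaN values.
unmatched = merged[merged["Goals"].isna()]["Player"].tolist()
```

Checking `unmatched` after the merge is a cheap way to catch name mismatches between sources, such as accented characters or abbreviated first names.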


External Links