Expected Goals leverages extreme gradient boosting, an advanced machine learning technique, to calculate the probability that an unblocked shot attempt will become a goal based on factors like shot distance, angle, and the event which occurred prior to the shot. Expected goals can be interpreted as "weighted shots."
A Brief History
Alan Ryder broke ground in 2004 when he published hockey's first expected goal model, titled "Shot Quality," but it wasn't until Dawson Sprigings and Asmae Toumi published their expected goal model in October of 2015 that expected goals rose to popularity in hockey. Today, expected goals have usurped Corsi (shot attempts) as the go-to underlying metric for analyzing teams and skaters, and NHL arenas have even featured expected goal data on the jumbotron at intermissions.
A part of me wishes that we had stuck to calling expected goal models "Shot Quality" models instead, because I think the term "Expected Goals" implies that these models are solely predictive in nature, which isn't necessarily the case. Even if expected goal shares were completely useless for predicting future goals at the team level, expected goals would still be extremely useful for describing past events: telling us which teams relied heavily on goaltending and shooting prowess, which teams were weighed down by poor shooting and goaltending, and even which shots the goaltender deserved most of the blame for.
Model Variables
I trained my model using extreme gradient boosting, a hyper-efficient machine learning technique commonly used for regression and binary classification problems. In other words, I showed my computer a bunch of shots, told it which of them were goals, and then used extremely powerful software to teach it to predict the outcome of new shots. I accounted for the following variables in my model:
Shot Distance & Angle
The two most important variables in determining goal probability
Shot Type
Wrist shot, slap shot, tip, backhand, etc.
Prior Event
The type of event which occurred most recently, its location and distance, how recently it occurred, and which team the perpetrator was on
Pre-shot Movement
The speed at which distance changed since the prior event (inspired by Peter Tanner of Moneypuck)
Home/Away Status
Whether the shooting team is at home
Game Context
Score, period, and seconds played in the game at the time the shot was taken
Off-Wing Shooting
Whether the shooter is shooting from their off-wing (e.g., a right-handed shooter from the left circle)
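To make this setup concrete, here is a minimal sketch of how a model like this can be trained with the xgboost Python package. The column names (adjusted_distance, is_goal, and so on) are hypothetical placeholders standing in for the variables listed above, not the actual fields from my data pipeline, and the parameter values are purely illustrative.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical shot-level dataset: one row per unblocked shot attempt.
shots = pd.read_csv("shots.csv")

feature_cols = [
    "adjusted_distance", "shot_angle",               # distance & angle
    "shot_type",                                     # wrist, slap, tip, backhand, ...
    "prior_event_type", "prior_event_distance",      # prior event and its location
    "seconds_since_prior_event", "prior_event_same_team",
    "speed_from_prior_event",                        # pre-shot movement
    "is_home_team",                                  # home/away status
    "score_differential", "period", "game_seconds",  # game context
    "is_off_wing",                                   # off-wing shooting
]

# One-hot encode the categorical fields so xgboost sees numeric inputs.
X = pd.get_dummies(shots[feature_cols],
                   columns=["shot_type", "prior_event_type"]).astype(float)
y = shots["is_goal"]  # 1 if the shot became a goal, 0 otherwise

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic",  # model the probability of a goal
          "eval_metric": "auc",
          "max_depth": 5, "eta": 0.05}
model = xgb.train(params, dtrain, num_boost_round=500)

# The predicted goal probability is the shot's expected goal (xG) value.
shots["xG"] = model.predict(dtrain)
```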
Scorekeeper Bias Adjustment
I chose to make an adjustment for scorekeeper bias. The adjustment was quite rudimentary: for the past 3 seasons, I subtracted the average shot distance (by both teams) in all of a team's away games from the average shot distance in all of their home games. The resulting value was that rink's scorekeeper bias adjustment factor, which I subtracted from the reported distance of every shot taken at that rink; the result was an "adjusted distance," which I used in my model in place of reported shot distance.
Example: Xcel Energy Center
The average reported distance of shots taken in games at Xcel Energy over the past 3 seasons was 2.54 feet further from the net than the average reported distance of shots taken in games where the Minnesota Wild were the away team. This gives an adjustment number of 2.54 feet. If the reported distance of a shot taken at Xcel Energy is 30 feet from the net, I subtract the adjustment number of 2.54 feet from the reported distance, and obtain an "adjusted distance" value of 27.46 feet from the net, which I use as the input for my model.
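A rough sketch of that adjustment, assuming a hypothetical shot table with reported_distance, home_team, and away_team columns, might look like this:

```python
import pandas as pd

# One row per shot over the past three seasons, with the reported distance
# (feet), the rink's home team, and the visiting team. Column names are
# hypothetical placeholders.
shots = pd.read_csv("shots.csv")

# Average reported distance of all shots (both teams) in each club's home games...
home_avg = shots.groupby("home_team")["reported_distance"].mean()

# ...minus the average in games where that same club was the road team.
away_avg = shots.groupby("away_team")["reported_distance"].mean()

# Positive values mean the rink's scorekeeper records shots as farther out than
# those same teams' shots are recorded elsewhere (e.g. ~2.54 ft at Xcel Energy).
adjustment = home_avg - away_avg

# Subtract each rink's adjustment from every shot recorded at that rink.
shots["adjusted_distance"] = (
    shots["reported_distance"] - shots["home_team"].map(adjustment)
)
```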
Training Methodology
Most expected goal models are trained on a minimum of five full seasons of data and then "tested" on an out-of-sample season. I tried building a model using this approach, but I kept running into a major issue: my expected goal values never added up to actual goals. The total number of expected goals I calculated for a given season was typically somewhere between 100 and 300 below the number of actual goals scored that season.
While the collective body of NHL shooters may perform better than the collective body of goaltenders over a given season, a massive discrepancy between the two persisting over multiple seasons suggests that the expected goal values are too low. Indeed, mine were, and the reason is that the NHL reduced maximum goaltender pant sizes prior to the 2017–2018 season and reduced maximum pad sizes prior to the 2018–2019 season. These changes play a big role in the high-scoring environment we've grown accustomed to over the past three seasons, and an expected goal model built on data from before these changes holds goaltenders to an unfairly high standard and shooters to an unfairly low standard by underestimating goal probability.
The data which every season's model was trained on varied greatly:
- 2007–2008 through 2009–2010: I removed 100 "target" games from the sample, trained the model on the remaining data, and then ran it on the target games I had removed. I repeated this process for every 100-game sample available in these 3 seasons. Unlike data from 2010–2011 onward, all shot location coordinates for these seasons were sourced exclusively from ESPN's XML reports, which vary slightly from the NHL's API reports.
- 2010–2011 through 2016–2017: I simply removed one target season from the sample, trained the model on the remaining seasons, and ran the model on the target season. I repeated the process for all 7 of these seasons.
- 2017–2018 through 2020–2021: I used the same process as for 2007–2008 through 2009–2010: remove 100 target games from the sample, train the model on the remaining games, and then run the model on the target games. I split these seasons off from 2010–2011 through 2016–2017 because of the adjustments the NHL made to goaltender equipment regulations starting in 2017–2018.
The modeling technique I used was similar to and inspired by the one used by Evolvingwild in their expected goal model, the write-up for which was my introduction to the concept of extreme gradient boosting. I began the training process with cross validation, testing the model on cross-validated samples with different parameters each time, with the goal of finding the parameters that would maximize area under the curve (AUC).
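A sketch of that parameter search, using xgboost's built-in cross validation on the training matrix from the earlier sketch; the grid of candidate values here is purely illustrative and much smaller than a real search would be.

```python
import itertools
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y)  # features and goal labels from the earlier sketch

best_auc, best_params = 0.0, None
for max_depth, eta in itertools.product([4, 5, 6], [0.03, 0.05, 0.10]):
    params = {"objective": "binary:logistic", "eval_metric": "auc",
              "max_depth": max_depth, "eta": eta}
    # 5-fold cross validation; early stopping keeps each run from overfitting.
    cv = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                metrics="auc", early_stopping_rounds=25)
    auc = cv["test-auc-mean"].max()  # best cross-validated AUC for this combination
    if auc > best_auc:
        best_auc, best_params = auc, params

print(best_params, best_auc)  # parameters that maximized cross-validated AUC
```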
Avoiding Overfitting
I couldn't just train the final model on every shot from the past 3 seasons and then test it on those same shots, as this would lead to overfitting. (To put this more simply, if I showed my computer the shots that I was trying to test it on, it would become "too smart" and "cheat" in predicting goal probability by considering results it isn't supposed to know.)
In order to avoid overfitting, I removed 100-game samples from my training data, trained my model on the other 3,524 games, "tested" it on the 100 games in question, and repeated this process until I had tested the model on every game and saved the results. In total, this technically means that I built 74 different models for the past three seasons: 37 for even strength and 37 for the power play. But because each model was trained on almost all of the same data, and each used the exact same parameters and variables, it's easier and still mostly accurate to say that I built just two models: one for even strength, and one for the power play.
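In code, that procedure looks roughly like the sketch below, again with a hypothetical column name (game_id) and using the parameters found during cross validation.

```python
import numpy as np
import xgboost as xgb

# Split the full list of games into consecutive chunks of roughly 100 games.
games = shots["game_id"].drop_duplicates().to_numpy()
chunks = np.array_split(games, int(np.ceil(len(games) / 100)))

shots["xG"] = np.nan
for chunk in chunks:
    holdout = shots["game_id"].isin(chunk)

    # Train on every game except the ~100 held-out target games...
    model = xgb.train(best_params, xgb.DMatrix(X[~holdout], label=y[~holdout]),
                      num_boost_round=500)

    # ...then generate out-of-sample xG values for the held-out games only.
    shots.loc[holdout, "xG"] = model.predict(xgb.DMatrix(X[holdout]))
```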
Special Situations
Penalty Shots & Shootouts
I spent some time working on an expected goal model for penalty shots and shootout attempts, but I was struck by how poorly it performed in testing. After some consideration, I decided to build the most rudimentary model possible for these shots: I assigned them all the same expected goal value of 0.31, which reflects the percentage of shootout and penalty shot attempts that became goals over the past three seasons.
I am comfortable doing this because, unlike shots in all other situations, where external variables shape the opportunity available to shooters to score and goaltenders to save, variables such as the location, angle, and shot type of a shootout attempt are influenced almost exclusively by the shooter and the goaltender.
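The penalty shot and shootout "model" therefore reduces to a single constant, which could be computed along these lines (the flag column is a hypothetical placeholder):

```python
# Hypothetical flag marking penalty shot and shootout attempts.
ps = shots["is_penalty_shot_or_shootout"]

# League-wide conversion rate on these attempts over the sample (~0.31)...
constant_xg = shots.loc[ps, "is_goal"].mean()

# ...assigned as the expected goal value for every such attempt.
shots.loc[ps, "xG"] = constant_xg
```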
Validation
I chose to test the results of my model using two target metrics. The first was area under the curve (AUC), which I also used as the target metric for the cross validation process. The second was that total expected goals should roughly equal actual goals over the aggregate sample; the "xG per Goal" column in the table below shows this ratio, with values close to 1 being ideal.
According to the documentation I referenced for AUC, a value between 0.6 and 0.7 is poor, a value between 0.7 and 0.8 is fair, and a value between 0.8 and 0.9 is good. This means that at all situations and at even strength, the model is fair, and closer to good than poor, in every single season.
| Season | Situation | AUC | xG per Goal |
|---|---|---|---|
| 2007–08 | All Situations | 0.769 | 0.985 |
| 2007–08 | Even Strength | 0.778 | 0.990 |
| 2007–08 | Power Play | 0.710 | 0.977 |
| 2007–08 | Shorthanded | 0.767 | 0.931 |
| 2008–09 | All Situations | 0.775 | 0.993 |
| 2008–09 | Even Strength | 0.783 | 0.991 |
| 2008–09 | Power Play | 0.713 | 1.000 |
| 2008–09 | Shorthanded | 0.815 | 0.983 |
| 2009–10 | All Situations | 0.760 | 1.033 |
| 2009–10 | Even Strength | 0.767 | 1.017 |
| 2009–10 | Power Play | 0.700 | 1.071 |
| 2009–10 | Shorthanded | 0.787 | 1.092 |
| 2010–11 | All Situations | 0.773 | 1.000 |
| 2010–11 | Even Strength | 0.782 | 0.987 |
| 2010–11 | Power Play | 0.715 | 1.031 |
| 2010–11 | Shorthanded | 0.773 | 1.075 |
| 2011–12 | All Situations | 0.775 | 0.999 |
| 2011–12 | Even Strength | 0.781 | 0.988 |
| 2011–12 | Power Play | 0.722 | 1.028 |
| 2011–12 | Shorthanded | 0.807 | 1.047 |
| 2012–13 | All Situations | 0.771 | 0.973 |
| 2012–13 | Even Strength | 0.779 | 0.981 |
| 2012–13 | Power Play | 0.697 | 0.929 |
| 2012–13 | Shorthanded | 0.780 | 1.168 |
| 2013–14 | All Situations | 0.773 | 0.991 |
| 2013–14 | Even Strength | 0.781 | 0.981 |
| 2013–14 | Power Play | 0.715 | 1.042 |
| 2013–14 | Shorthanded | 0.783 | 0.849 |
| 2014–15 | All Situations | 0.771 | 1.002 |
| 2014–15 | Even Strength | 0.778 | 0.996 |
| 2014–15 | Power Play | 0.706 | 1.013 |
| 2014–15 | Shorthanded | 0.812 | 1.083 |
| 2015–16 | All Situations | 0.771 | 1.024 |
| 2015–16 | Even Strength | 0.780 | 1.036 |
| 2015–16 | Power Play | 0.696 | 0.996 |
| 2015–16 | Shorthanded | 0.782 | 0.953 |
| 2016–17 | All Situations | 0.771 | 1.011 |
| 2016–17 | Even Strength | 0.777 | 1.017 |
| 2016–17 | Power Play | 0.709 | 1.002 |
| 2016–17 | Shorthanded | 0.789 | 0.925 |
| 2017–18 | All Situations | 0.766 | 1.032 |
| 2017–18 | Even Strength | 0.772 | 1.030 |
| 2017–18 | Power Play | 0.697 | 1.050 |
| 2017–18 | Shorthanded | 0.828 | 0.958 |
| 2018–19 | All Situations | 0.762 | 1.001 |
| 2018–19 | Even Strength | 0.771 | 1.004 |
| 2018–19 | Power Play | 0.676 | 0.991 |
| 2018–19 | Shorthanded | 0.798 | 0.974 |
| 2019–20 | All Situations | 0.772 | 0.989 |
| 2019–20 | Even Strength | 0.779 | 0.984 |
| 2019–20 | Power Play | 0.696 | 1.004 |
| 2019–20 | Shorthanded | 0.821 | 1.003 |
| 2020–21 | All Situations | 0.774 | 0.975 |
| 2020–21 | Even Strength | 0.782 | 0.966 |
| 2020–21 | Power Play | 0.706 | 0.994 |
| 2020–21 | Shorthanded | 0.813 | 1.107 |
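For reference, both target metrics in the table can be computed from the out-of-sample predictions along roughly these lines; the season and situation columns are hypothetical placeholders.

```python
from sklearn.metrics import roc_auc_score

for (season, situation), grp in shots.groupby(["season", "situation"]):
    auc = roc_auc_score(grp["is_goal"], grp["xG"])         # discrimination
    xg_per_goal = grp["xG"].sum() / grp["is_goal"].sum()   # calibration, ~1.0 is ideal
    print(season, situation, round(auc, 3), round(xg_per_goal, 3))
```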
The expected goal models performed marginally worse in the 2007–2008 through 2009–2010 seasons than they did in later years, but still better than I expected. Whatever issues are present in the location coordinates from those first 3 seasons are not problematic enough to prevent the model from posting a respectable performance. On the power play, though, the model rated as poor in 5 seasons and was closer to poor than good in the other 9.
These numbers bear out my general stance on public expected goal models: in the aggregate, they're fair, and I would say they're closer to good than poor. But on the power play in particular, they're missing a lot of important context.
Predictive Power: xG vs. Corsi
The predictive power of expected goals has recently been called into question. An analyst known as DragLikePull compared how well 5-on-5 score-adjusted Corsi shares and expected goal shares from the first half of the season predicted 5-on-5 goal shares in the second half, and found that Corsi was better overall at predicting second-half goals than expected goals were. Based on these findings, he concluded that fans and analysts should discard expected goals at the team level and return to using shot attempts.
My research has led me to different conclusions about the predictive power of expected goals. I tested how well expected goal shares could predict future goal shares, largely using the method DragLikePull outlined: calculating 5-on-5 goal shares, Corsi shares, and expected goal shares in the first half of the season, and comparing them to actual goal shares in the second half. My method differed in two ways: I did not apply a score-adjustment to any of the data, opting to compare the raw metrics to one another, and I also calculated a separate expected goal share with all rebounds removed. (I defined rebounds as shots where the prior event was a shot by the same team no more than two seconds earlier.)
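A sketch of that comparison, built on the hypothetical shot table from the earlier examples (the team, opponent, and game_number columns are placeholders):

```python
from scipy.stats import pearsonr

# Flag rebounds: prior event was a shot by the same team, at most two seconds earlier.
shots["is_rebound"] = (
    (shots["prior_event_type"] == "SHOT")
    & (shots["prior_event_same_team"] == 1)
    & (shots["seconds_since_prior_event"] <= 2)
)

def team_share(df, col):
    """Each team's share of `col` (goals, xG, etc.) across its games."""
    made = df.groupby("team")[col].sum()
    allowed = df.groupby("opponent")[col].sum()
    return made / (made + allowed)

# Split each season at the halfway mark of the schedule.
first = shots[shots["game_number"] <= 41]
second = shots[shots["game_number"] > 41]

target = team_share(second, "is_goal")  # second-half GF%

predictors = {
    "xGF%": team_share(first, "xG"),
    "Rebound-removed xGF%": team_share(first[~first["is_rebound"]], "xG"),
    "GF%": team_share(first, "is_goal"),
    # CF% is built the same way from a table of all shot attempts.
}

for name, pred in predictors.items():
    r, _ = pearsonr(pred, target.loc[pred.index])
    print(f"{name}: R^2 = {r ** 2:.2f}")
```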
First-Half to Second-Half Prediction (2014–2019)
| Predictor (first half) | R² with second-half GF% |
|---|---|
| Corsi (CF%) | 0.21 |
| xGF% | 0.29 |
| Rebound-removed xGF% | 0.32 |
| GF% | 0.18 |

The rebound-removed expected goal share was the most predictive of second-half goal share.
My expected goal values without a score-adjustment perform significantly better than Corsi does with a score-adjustment, and they blow unadjusted Corsi out of the water, so I am comfortable saying that they currently have more predictive power, especially the expected goal values with rebounds removed.
I would also like to credit Peter Tanner of Moneypuck for bringing to my attention that expected goals with rebounds removed are more predictive.
Why Predictive Power Has Changed
My expected goal model is not the only one that has pulled ahead of Corsi in the last five years; Natural Stat Trick's has done the same. This is partially because the predictive power of Corsi has declined and partially because the predictive power of expected goals has improved. I have a theory for why each of these changes has occurred.
Goodhart's Law states that "When a measure becomes a target, it ceases to be a good measure." In the early portion of the 2010s, Corsi gained ground as a measure that NHL front offices used to improve their teams and that player agents began using to make the case for their clients, right around the same time that Corsi's predictive power began to decline. I would not say that we're quite at the point where Corsi is no longer a good measure, but it has indisputably declined, and I believe that is because it's become a target.
I suspect that the predictive power of expected goals has improved because the quality of data provided by the NHL's Real-Time Scoring System has improved.
A Note on Rebounds
Going forward, should we only use expected goals with rebounds removed? No. Rebounds are real events that happened, and until the NHL decides that rebound goals no longer count, any descriptive metric of past events should include rebound shots.
If you're strictly looking to predict which team will be the best team in the future, it may be best to use a metric that excludes rebounds, but I don't think that is how most people do or should actually use expected goal models.
Limitations
The expected goal models can't be classified as "good" according to the AUC documentation, they miss a lot of important context on the power play in particular, and there is certainly room for improvement. However, they represent a meaningful step forward from raw shot counts and provide valuable insight into shot quality and team and player performance.