A causal inference approach to fouling when up by three
[Note: the results shown here come from preliminary work. I am showing them because the method is promising and the results are sensible. However, I am hoping to produce more accurate predictions with more covariates (for example, the FT% of the opposing players on the floor, or the fouling team's defensive rebounding rate) and a larger sample size (NBA data, for example).]
The question, and why current answers are not satisfactory
Coaches regularly have to decide whether to foul when their team is up by 3 points with less than 24 seconds left on the game clock. The rationale for fouling is that it prevents the opposing team from attempting a three to tie the game. Once you give your opponent two free-throws, the only way they can tie the game is by missing the second free-throw, grabbing the offensive rebound, and scoring, or by extending the game by fouling. This question has been widely studied, with different answers depending on the methods and assumptions used. A probability analysis by DePauw Coach Bill Fenlon led to the conclusion that teams should foul. Empirical explorations based on simple comparisons of outcomes did not find much difference between fouling and not fouling, with some variation according to the assumptions.
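To make this kind of probability reasoning concrete, here is a minimal sketch. It is not Coach Fenlon's actual analysis: the function name, the one-possession toy model (overtime treated as a coin flip, the second free-throw always missed on purpose, no extra fouls to extend the game) and every input value are illustrative assumptions of mine.

```python
def win_prob_up_three(p3, p_ft, p_oreb, p2, p_ot=0.5):
    """Leading team's win probability under each choice, in a one-possession toy model.

    p3:     trailing team's probability of making a three
    p_ft:   trailing team's probability of making the first free-throw
    p_oreb: chance the trailing team rebounds its own missed second free-throw
    p2:     chance it then scores a two-pointer
    p_ot:   leading team's chance of winning overtime (assumed coin flip)
    """
    # No foul: the trailing team ties only by making a three.
    p_tie_no_foul = p3
    # Foul: make the first free-throw, miss the second on purpose
    # (assumed to always miss), grab the rebound, and score.
    p_tie_foul = p_ft * p_oreb * p2
    return {
        "no_foul": 1 - p_tie_no_foul * (1 - p_ot),
        "foul": 1 - p_tie_foul * (1 - p_ot),
    }

# Hypothetical inputs, not estimates from data.
average_team = win_prob_up_three(p3=0.33, p_ft=0.78, p_oreb=0.15, p2=0.50)
strong_rebounders = win_prob_up_three(p3=0.33, p_ft=0.78, p_oreb=0.45, p2=0.50)
```

Changing only the rebounding input moves the fouling branch while leaving the no-foul branch untouched, so the size of the advantage from fouling depends on which rates are plugged in.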
There are two main issues with these methods, however. First, they fail to account for the multiplicity and diversity of scenarios and contexts. They use league-wide (and sometimes historical) offensive rebound rates on free-throws, for example, to infer the probability that a team can grab an offensive rebound. But if you are a coach, you probably want to base your decision on the opposing team's offensive rebounding strength, not the league average. This leads me to a broader point: the answer to whether to foul or not is clearly never going to be binary, and will always depend on game-specific (time left on the clock) and team-specific (FT%, 3FG%, rebounding rates, etc.) variables. A method that can take these variables as inputs to generate an optimal decision for a coach will be more valuable than a method based on historical and league-wide averages.
Second, existing empirical methods are based on observations of what actually happened: they compare outcomes when teams fouled to outcomes when teams did not foul. It is possible, however, that systematic differences exist between teams that decide to foul and teams that decide not to foul (better coaches, better shooters or rebounders, playing at home or away) or between situations where teams foul and situations where teams don't foul (time left on the clock). Teams with better coaches, for example, might foul more and also happen to win more (correlation), but we can't necessarily attribute this winning to the fact that they fouled (causation). What we would like to know, ideally, is a counterfactual: if the team fouled, what would have happened to that same team, in that same game, had it decided not to foul? Or, vice versa, if the team did not foul, what would have happened had it decided to foul? We obviously never observe the counterfactual, a team both fouling and not fouling in the same situation, which is the fundamental problem of causal inference. For this reason, we will never be able to know the exact effect a single fouling decision had on the outcome of one game. What we can try to do, however, is find the effect across all games (Average Treatment Effect, ATE, in causal language) or across all games sharing some characteristics (Conditional Average Treatment Effect, CATE).
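The better-coaches example can be illustrated with a small simulation. This is a sketch on made-up numbers: the fouling propensities, the win probabilities and the 2-point true effect of fouling are all assumptions chosen for illustration, not estimates from WNBA data.

```python
import random

def simulate(n=100_000, seed=0):
    """Simulate games where coach quality drives both fouling and winning."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        good_coach = rng.random() < 0.5
        # Assumed propensities: better coaches foul more often.
        fouled = rng.random() < (0.6 if good_coach else 0.1)
        # Assumed win probabilities: a good coach adds 15 percentage points,
        # fouling adds 2 percentage points.
        p_no_foul = 0.70 + 0.15 * good_coach
        p_foul = p_no_foul + 0.02
        u = rng.random()
        y0 = int(u < p_no_foul)            # potential outcome without a foul
        y1 = int(u < p_foul)               # potential outcome with a foul
        observed = y1 if fouled else y0    # only one of the two is ever seen
        rows.append((fouled, y0, y1, observed))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rows = simulate()
# True ATE: needs both potential outcomes, which real data never provides.
true_ate = mean(y1 - y0 for _, y0, y1, _ in rows)
# Naive estimate: compare observed outcomes of fouling and non-fouling teams.
naive = (mean(obs for t, _, _, obs in rows if t)
         - mean(obs for t, _, _, obs in rows if not t))
print(f"true ATE = {true_ate:.3f}, naive difference = {naive:.3f}")
```

In this toy setting the naive comparison overstates the benefit of fouling, because fouling teams are disproportionately the well-coached ones that would have won more often anyway.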
The maths behind the CATE
Formalizing the set-up can help clarify it (skip this section if you don't think it will help you, but I promise it can!). Let t be the treatment, in our case whether the team decided to foul or not. t can take two values: t = 0 (the team decided not to foul) or t = 1 (the team decided to foul). The team then either won or lost the game. We call the outcome of the game y, with y = 1 if the game is won and y = 0 if the game is lost. Because the decision to foul affects the result of the game, y is a function of t: y(t) = 1 or 0. So if the team decided to foul, we observe y(1): y(1) = 1 if the team that fouled won the game and y(1) = 0 if it lost. For that game, we never observe y(0), what would have happened had the team not fouled. The effect of fouling on the game outcome, y(1) - y(0), can therefore never be known.
Causal inference methods aim to estimate the Average Treatment Effect (ATE) over many games, E[Y(1)−Y(0)], and perfect identification requires the treatment (here, the choice to foul or not) to be randomized. Obviously, randomizing whether to foul is not an option. A growing literature, however, aims at estimating Conditional Average Treatment Effects (CATEs) in observational settings, i.e. settings without randomization of the treatment. For any set of covariates (for example, game situations and contexts, teams' measured skills), the CATE is the expected effect of fouling on the game result. The CATE function is formally defined as f(x) = E[Y(1)−Y(0)|X=x], where X is a vector of relevant covariates. Intuitively, the CATE function tries to answer the following question: for any game situation, with teams with given characteristics, do we expect a positive outcome from fouling when up by 3? The CATE function can be fit with any supervised learning method (usually BART or random forests) or regression method. Once fitted, the algorithm can predict, for any set of covariates X, whether fouling has a ‘positive treatment effect’ (f(x) > 0) and should therefore be done. See https://arxiv.org/pdf/1706.03461.pdf for an example and a more formal definition of such algorithms.
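As a minimal sketch of what f(x) looks like in the simplest possible case, suppose X is a single discrete label. The CATE can then be estimated by comparing win rates of fouled and non-fouled games within each value of X. The function name, the labels and the toy records are made up for illustration, and the estimate only has a causal reading if X captures everything that drives both the decision to foul and the result.

```python
from collections import defaultdict

def stratified_cate(records):
    """Estimate E[Y | T=1, X=x] - E[Y | T=0, X=x] for each discrete value x.

    records: iterable of (x, t, y) with t and y in {0, 1}.
    Strata missing either fouled or non-fouled games are skipped.
    """
    sums = defaultdict(lambda: [[0, 0], [0, 0]])  # x -> [sum_y, n] for t=0 and t=1
    for x, t, y in records:
        sums[x][t][0] += y
        sums[x][t][1] += 1
    cate = {}
    for x, ((s0, n0), (s1, n1)) in sums.items():
        if n0 and n1:
            cate[x] = s1 / n1 - s0 / n0
    return cate

# Made-up toy records: (opponent type, fouled, won).
toy = [
    ("strong_shooters", 1, 1), ("strong_shooters", 1, 1),
    ("strong_shooters", 0, 1), ("strong_shooters", 0, 0),
    ("strong_rebounders", 1, 0), ("strong_rebounders", 1, 1),
    ("strong_rebounders", 0, 1), ("strong_rebounders", 0, 1),
]
cate = stratified_cate(toy)
# Decision rule from the text: foul when the estimated effect is positive.
decision = {x: ("foul" if effect > 0 else "do not foul") for x, effect in cate.items()}
```

With many continuous covariates there are too few games per cell for this to work, which is why flexible learners such as BART or random forests are used to fit f(x) instead.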
The set-up with WNBA play-by-play data
I use play-by-play data from the 2002-2020 seasons. I start by isolating game situations in the final 30 seconds of the second half through the 2005 season, and in the final 24 seconds of the 4th quarter from the 2006 season onward (when the WNBA changed its shot clock and game format). I then keep the situations where the score differential between the two teams is 3. 612 games fit that description and are therefore selected. Finally, I look at whether the team that was up by 3 fouled or not. In only 23 of the 612 games did the team up by 3 decide to foul.
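The selection step can be sketched as follows. This is not the code used for the analysis: the dictionary keys (`season`, `period`, `seconds_left`, `score_diff`, `event`, `team`, `leader`), the use of "at most" for the time cut-offs and the sample rows are all hypothetical, since the layout of the play-by-play data is not described here.

```python
def is_candidate(play):
    """True if the play falls in the late-game, 3-point-margin window."""
    if play["season"] <= 2005:
        # Two-half format: final 30 seconds of the second half.
        in_window = play["period"] == 2 and play["seconds_left"] <= 30
    else:
        # Four-quarter format: final 24 seconds of the fourth quarter.
        in_window = play["period"] == 4 and play["seconds_left"] <= 24
    return in_window and play["score_diff"] == 3

def select_games(plays):
    """Set of game ids with at least one qualifying play."""
    return {p["game_id"] for p in plays if is_candidate(p)}

def leader_fouled(plays, game_id):
    """True if the leading team committed a foul on a qualifying play."""
    return any(
        p["game_id"] == game_id and is_candidate(p)
        and p["event"] == "foul" and p["team"] == p["leader"]
        for p in plays
    )

# Made-up rows for illustration only.
plays = [
    {"game_id": "A", "season": 2004, "period": 2, "seconds_left": 28,
     "score_diff": 3, "event": "foul", "team": "T1", "leader": "T1"},
    {"game_id": "B", "season": 2010, "period": 4, "seconds_left": 28,
     "score_diff": 3, "event": "shot", "team": "T2", "leader": "T3"},
    {"game_id": "C", "season": 2010, "period": 4, "seconds_left": 12,
     "score_diff": 3, "event": "shot", "team": "T2", "leader": "T3"},
    {"game_id": "D", "season": 2010, "period": 4, "seconds_left": 10,
     "score_diff": 5, "event": "foul", "team": "T3", "leader": "T3"},
]
selected = select_games(plays)
treatment = {g: leader_fouled(plays, g) for g in selected}
```

Game B drops out because 28 seconds is outside the 24-second window that applies from 2006 onward, and game D drops out because the margin is not 3.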
The next step is to collect the relevant covariates, i.e. any variable that we expect to be relevant in the decision to foul or not. I collect the following season-wide information about the team that is in position to be fouled (i.e. the team that is losing by 3): points per game, FG3%, FG2% and FT%, and offensive rebounds, fouls and turnovers per game. More or better variables (offensive rebound rate, field goal percentages in the game or by players in the game, etc.) will be considered, when available, in future iterations of this model. The FT% of the worst free-throw shooter among the players on the floor could be very relevant information - provided you have enough time to wait until she gets the ball to target and foul her. Some information about the team that is in position to foul (up by 3) is also relevant: I collect their season average PPG, the time left on the clock, and whether the team is playing at home and was the favorite to win the game. All of these variables, except for whether the team played at home and was the favorite, are standardized, meaning I look at the relative performance of a team compared to the others.
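One common way to standardize is a z-score: subtract the mean and divide by the standard deviation. The sketch below assumes that reading and pools all rows together. Whether the actual analysis standardizes this way, or within each season, is not stated, and the column names are hypothetical.

```python
from statistics import mean, pstdev

def standardize(rows, columns):
    """Return copies of `rows` with each listed column replaced by its z-score."""
    stats = {}
    for c in columns:
        values = [r[c] for r in rows]
        stats[c] = (mean(values), pstdev(values))
    out = []
    for r in rows:
        z = dict(r)  # copy so the binary flags and the input rows stay untouched
        for c, (mu, sigma) in stats.items():
            z[c] = (r[c] - mu) / sigma if sigma > 0 else 0.0
        out.append(z)
    return out

# Made-up rows: opponent points per game and a home flag for the leading team.
rows = [
    {"opp_ppg": 70.0, "home": 1},
    {"opp_ppg": 80.0, "home": 0},
    {"opp_ppg": 90.0, "home": 1},
]
scaled = standardize(rows, ["opp_ppg"])
```

The binary indicators (home, favorite) are left out of `columns`, so they keep their 0/1 coding.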
I use two outcome variables: whether the team in position to foul won the game (binary outcome), and by how much (continuous outcome). I am now ready to estimate the CATEs, using an X-learner with BART as the estimation procedure.
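For readers unfamiliar with the X-learner, here is a minimal sketch of its three stages. To keep it dependency-free, it swaps BART for a one-covariate least-squares line and replaces the propensity score with the overall share of fouled games. Both are simplifications of mine, and the data are synthetic.

```python
import random

def fit_line(xs, ys):
    """One-covariate least squares; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def x_learner(xs, ts, ys):
    """X-learner for a single covariate, with linear base learners."""
    treated = [(x, y) for x, t, y in zip(xs, ts, ys) if t]
    control = [(x, y) for x, t, y in zip(xs, ts, ys) if not t]
    x1, y1 = zip(*treated)
    x0, y0 = zip(*control)
    # Stage 1: separate outcome models for fouled and non-fouled games.
    mu1, mu0 = fit_line(x1, y1), fit_line(x0, y0)
    # Stage 2: impute each game's effect using the other group's model,
    # then fit one effect model per group.
    d1 = [y - mu0(x) for x, y in treated]   # observed minus predicted no-foul outcome
    d0 = [mu1(x) - y for x, y in control]   # predicted foul outcome minus observed
    tau1, tau0 = fit_line(x1, d1), fit_line(x0, d0)
    # Stage 3: blend the two effect models, weighting by the share of
    # fouled games (a constant stand-in for the propensity score).
    g = len(treated) / len(xs)
    return lambda x: g * tau0(x) + (1 - g) * tau1(x)

# Synthetic check: the true effect of fouling on the final margin is 1.5 * x,
# where x stands for a standardized covariate.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ts = [rng.random() < 0.3 for _ in xs]
ys = [2.0 + 0.5 * x + t * 1.5 * x + rng.gauss(0, 1) for x, t in zip(xs, ts)]
cate = x_learner(xs, ts, ys)
should_foul = cate(1.0) > 0  # decision rule: foul when the estimated CATE is positive
```

The cross-imputation in stage 2 is what makes the X-learner attractive when one group is much smaller than the other, as is the case here with few fouled games.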
I now have a function that provides an optimal decision (foul or not foul) for any given set of the covariates I used: whether you're at home, whether you're the favorite, your PPG, and your opponent's PPG, shooting percentages, offensive rebounds and turnovers. This model has decent predictive power (although a model fit on a larger sample, such as NBA data, would perform better), but that comes at the cost of interpretability: it is hard to describe the results of such a model. It can provide an optimal decision for a specific context, but because the answer depends so much on the context, presenting and discussing results can be hard.
The first thing I do is run the model on my existing 612 games to find what the optimal decision would have been. I find 94 positive CATEs out of the 612 games, meaning the model predicts teams should have fouled 94 times (remember, only 23 did). Teams made the optimal decision (fouling when they should have, not fouling when they shouldn't have) 80% of the time. Looking at the teams that faced at least 3 situations during a season where they were in position to foul up by 3, which teams made the best decisions? 32 of the 65 team-season combinations made the right decision 100% of the time. Among these, I'm only showing the teams that faced at least 5 game situations where they were in position to foul and got them all right:
Now, which teams got it 'wrong' the most? The NY Liberty faced the situation 4 times in 2016 and got it wrong 75% of the time, which would suggest that they lost 3 games because of suboptimal fouling decisions. The Minnesota Lynx might have lost 4 games in 2016 because of suboptimal decisions (two thirds of the 6 decisions they faced).
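The team-season shares discussed above boil down to comparing each observed decision with the sign of the estimated CATE. A minimal sketch, with hypothetical field names and made-up rows rather than the actual model output:

```python
from collections import defaultdict

def agreement_by_team_season(games, min_situations=3):
    """Share of situations where the observed choice matched the sign of the CATE.

    games: iterable of dicts with hypothetical keys team, season,
    fouled (bool) and cate (estimated effect of fouling).
    """
    tally = defaultdict(lambda: [0, 0])  # (team, season) -> [matches, total]
    for g in games:
        recommended = g["cate"] > 0      # the model says foul when the CATE is positive
        key = (g["team"], g["season"])
        tally[key][0] += int(g["fouled"] == recommended)
        tally[key][1] += 1
    # Keep only team-seasons with enough situations to be worth reporting.
    return {k: m / n for k, (m, n) in tally.items() if n >= min_situations}

# Made-up rows for illustration only.
games = [
    {"team": "Team A", "season": 2016, "fouled": False, "cate": 0.04},
    {"team": "Team A", "season": 2016, "fouled": False, "cate": 0.02},
    {"team": "Team A", "season": 2016, "fouled": False, "cate": 0.01},
    {"team": "Team A", "season": 2016, "fouled": False, "cate": -0.03},
    {"team": "Team B", "season": 2016, "fouled": True, "cate": 0.05},
    {"team": "Team B", "season": 2016, "fouled": False, "cate": -0.02},
]
shares = agreement_by_team_season(games)
```

A mismatch here means only that the decision disagreed with the model, so translating it into games lost takes an extra step that this sketch does not attempt.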
Again, the results here should be treated with caution and should mostly be seen as illustrations of what the model could do. I am working with a small sample of game situations that fit what I am trying to explore, because of the relatively small number of games in WNBA history. I would probably trust the predictions of a model fitted with NBA data slightly more.
The last thing I want to do today is compare the situations that lead to positive CATEs (i.e. the model telling you that you should foul) to the situations that lead to negative CATEs. For the graphs below, I classify each observed game situation into one of two categories: the model predicts you should foul, or the model predicts you should not foul. Note that I'm still only generating predictions for the 612 observed game situations. The 'magic' of CATE functions is that I could (and will) also generate predictions for any imaginary combination of the covariates. I then look at what the average covariates are in the two groups (positive CATEs and negative CATEs).
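This group comparison can be sketched as a difference in means with a Welch t statistic as a rough gauge of significance. The choice of test is my assumption, since the test behind the graphs is not specified, and the rows and column names below are made up.

```python
from math import sqrt
from statistics import mean, variance

def compare_groups(rows, covariate):
    """Difference in a covariate's mean between positive- and negative-CATE games.

    Returns (difference, Welch t statistic).
    """
    pos = [r[covariate] for r in rows if r["cate"] > 0]
    neg = [r[covariate] for r in rows if r["cate"] <= 0]
    diff = mean(pos) - mean(neg)
    # Welch standard error: the two groups may have unequal variances and sizes.
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return diff, diff / se

# Made-up rows: a standardized opponent 3FG% and the estimated CATE.
rows = [
    {"opp_fg3": 1.0, "cate": 0.03}, {"opp_fg3": 2.0, "cate": 0.05},
    {"opp_fg3": 3.0, "cate": 0.01}, {"opp_fg3": 0.0, "cate": -0.02},
    {"opp_fg3": 1.0, "cate": -0.04}, {"opp_fg3": 2.0, "cate": -0.01},
]
diff, t_stat = compare_groups(rows, "opp_fg3")
```

A positive difference means the covariate is higher, on average, in the situations where the model recommends fouling.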
Concretely, looking at the first graph below, I find that teams that score more (higher PPG), are playing at home, and were the favorite to win the game are less likely to want to foul (more negative CATEs). There is some evidence that the seconds left on the game clock also matter, in the direction one would expect (the more seconds left on the clock, the less likely you are to want to foul), but the result is not statistically significant.
The key variables in deciding to foul, however, are, as expected, some of the opposing team's stats. The graph below shows that, on average, teams should want to foul more when the opposing team has higher shooting percentages (3FG%, 2FG% and FT%). For 3FG% this is what one would expect, since good shooting teams are more likely to hit the game-tying three-pointer if you let them shoot. The FT% result is slightly more surprising, as one could think you would want to foul poor FT shooting teams more. But I suspect (i) FT% is heavily correlated with the more relevant 3FG% (and looking at the FT% of the worst FT shooter on the floor, for example, might be more informative) and (ii) when you foul a team, you are almost hoping they make their free-throws, otherwise you open yourself up to an offensive rebound. As expected, teams are less likely to want to foul when the opposing team is a good offensive rebounding team or commits a lot of turnovers.
These initial results make sense intuitively, which is reassuring. The next steps are going to be working with better predictive covariates (rebounding rates instead of rebounds/game, stats of the players on the floor or game-specific stats), and trying to replicate the analysis with a larger sample (such as NBA data) to improve the predictive accuracy!