Monte Carlo Simulation in RStudio: NBA Possessions
How to simulate NBA Possessions using RStudio
Before diving into the specifics of the following Possession function, it is worth taking a look at the variables that the function takes as parameters, and also the variables that the function returns. The next page shows a comprehensive list of all 22 variables, as well as an explanation of the parameters. For a detailed explanation of “returned” box score stats, visit https://www.nba.com/stats/help/glossary.
Similar to the previous free throw example, almost all of the returned variables are generated by sampling from a binomial distribution whose probability of success is the probability of the outcome of interest. The possession function resembles a (slightly complicated) binomial decision tree, which I explain in the following subsections.
Time & Possession Start:
Time is calculated by drawing from a chi-squared distribution with df = 5. Because possessions rarely take short amounts of time, I subtract the chi-squared value from 21.5 (a value found by experimentation, adjusted until the simulation produced an accurate number of possessions per game). This turns the right-skewed distribution into a left-skewed one. I then take the absolute value of the result, since time cannot be negative; this turns any small negative values into small positive values. The process was tuned by experimentation until the distribution matched the time distribution of NBA possessions (with inspiration and time examples from a statistician's analysis and from real-life data). The resulting distribution of possession time (before multipliers) is shown in the accompanying figure.
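The time draw described above can be sketched in a few lines of R. This is a minimal illustration, not the original function: the name `simulate_possession_time` and its argument names are hypothetical, while the shift of 21.5 and df = 5 come from the text.

```r
# Hypothetical helper: draw possession times in seconds (before multipliers).
# The shift (21.5) and degrees of freedom (5) are the values given in the text.
simulate_possession_time <- function(n = 1, shift = 21.5, df = 5) {
  chi <- rchisq(n, df = df)  # right-skewed draw with mean equal to df
  abs(shift - chi)           # flip to left-skewed, then fold negatives to positive
}

set.seed(1)
times <- simulate_possession_time(10000)
summary(times)  # the mean should sit near 21.5 - 5 = 16.5 seconds
```

Because a chi-squared draw with df = 5 exceeds 21.5 only very rarely, the absolute value affects very few samples, so the mean stays close to 16.5.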
Additionally, possessions take different amounts of time depending on the previous possession’s result. If the possession started with an inbound pass, the time remains unchanged. If the possession started with a defensive rebound, the time taken is multiplied by a factor of 110/171, which is derived from the league averages in 2022. If the possession started with a steal or any other turnover by the opposing team, the time is multiplied by a factor of 83/171, three-point percentage is boosted by a factor of 1.1, two-point percentage is boosted by a factor of 1.1, and teams are twice as likely to attempt a two-point shot. Turnovers and steals lead to these “fast break” multipliers because tendencies and shooting percentages are drastically different on fast breaks.
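A sketch of how these start-of-possession multipliers might be applied is shown below. The function name, argument names, and start-type labels are hypothetical. Representing "twice as likely to attempt a 2 point shot" as a doubled relative weight, and capping percentages at 1, are my assumptions rather than details from the original code.

```r
# Hypothetical helper: adjust time and shooting tendencies by possession start.
# start is one of "inbound", "def_rebound", "turnover" (illustrative labels).
apply_start_multipliers <- function(time, p3, p2, two_pt_weight,
                                    start = "inbound") {
  start <- match.arg(start, c("inbound", "def_rebound", "turnover"))
  if (start == "def_rebound") {
    time <- time * 110 / 171          # factor from the text
  } else if (start == "turnover") {   # steal or other turnover: fast break
    time <- time * 83 / 171
    p3 <- min(1, p3 * 1.1)            # cap at 1 is an assumption
    p2 <- min(1, p2 * 1.1)
    two_pt_weight <- two_pt_weight * 2  # assumed reading of "twice as likely"
  }
  list(time = time, p3 = p3, p2 = p2, two_pt_weight = two_pt_weight)
}

# Example with made-up inputs: a 17.1 second possession after a steal
fast_break <- apply_start_multipliers(17.1, p3 = 0.36, p2 = 0.54,
                                      two_pt_weight = 1, start = "turnover")
fast_break$time
```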
Game Flow (While Loop):
After the time is set, the function goes into a loop that will run as long as the possession is not over. This is what I call “game flow”, which is a series of game event simulations and conditional statements. First I will explain how the series of conditional statements works, and then I will explain how each individual outcome is calculated.
First, the possession starts with two possible outcomes: a shot or a turnover. If there is a turnover, it is due either to a steal or to an offensive error. If there is a shot, it is either a three pointer or a two pointer. A shot can be a make or a miss, and it can also draw a foul. If there is a foul, the possession ends, and the team attempts the appropriate number of free throws. When shots are missed, blocks are also simulated. If the shot is made, the possession ends, and an additional simulation is run to determine whether the shot was assisted. If there is a make and a foul, the possession ends and free throws are also attempted. If the shot is missed, there is either an offensive rebound (and the possession restarts) or a defensive rebound (and the possession ends). This loop iterates until a possession-ending event occurs, which is most often a defensive rebound. While the loop runs, all stats and outcomes are tracked, to be returned in a data frame at the end of the possession.
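The game flow described above can be sketched as a while loop. This is a simplified illustration under stated assumptions, not the original Possession function: every probability in `probs` is a placeholder, the names are hypothetical, only a subset of the 22 variables is tracked, and time and start-type multipliers are left out.

```r
# Bernoulli draw: TRUE with probability p
flip <- function(p) runif(1) < p

# Placeholder probabilities for illustration only (not the author's values)
probs <- list(turnover = 0.13, steal = 0.55, three = 0.40, make3 = 0.36,
              make2 = 0.54, foul = 0.10, assist = 0.60, block = 0.10,
              off_reb = 0.25, ft = 0.78)

simulate_possession <- function(p = probs) {
  s <- c(pts = 0, fga = 0, fgm = 0, fta = 0, ftm = 0, ast = 0,
         tov = 0, stl = 0, blk = 0, orb = 0, drb = 0)
  add <- function(stat, n = 1) s[stat] <<- s[stat] + n  # update the tracker
  over <- FALSE
  while (!over) {
    if (flip(p$turnover)) {              # turnover: steal or offensive error
      add("tov")
      if (flip(p$steal)) add("stl")
      over <- TRUE
      next
    }
    value  <- if (flip(p$three)) 3 else 2
    made   <- flip(if (value == 3) p$make3 else p$make2)
    fouled <- flip(p$foul)
    if (made || !fouled) add("fga")      # a fouled miss is not a field goal attempt
    if (made) {
      add("fgm"); add("pts", value)
      if (flip(p$assist)) add("ast")
    }
    if (fouled) {                        # and-one: 1 shot; fouled miss: 2 or 3
      n_ft <- if (made) 1 else value
      ft_made <- rbinom(1, n_ft, p$ft)
      add("fta", n_ft); add("ftm", ft_made); add("pts", ft_made)
    }
    if (made || fouled) {
      over <- TRUE                       # makes and fouls end the possession
    } else {
      if (flip(p$block)) add("blk")
      if (flip(p$off_reb)) add("orb") else { add("drb"); over <- TRUE }
    }
  }
  as.data.frame(as.list(s))              # one-row data frame of tracked stats
}

set.seed(42)
simulate_possession()
```

An offensive rebound leaves `over` as FALSE, so the loop restarts the possession, matching the description above.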
Calculating Specific Occurrences:
Although time is simulated with the chi-squared distribution (and a significant amount of effort went into tuning that distribution), the large majority of simulated game events use the binomial distribution. This is because almost all of the stats are discrete, or are calculated from discrete variables. For example, the continuous variable 3 point percentage is calculated by dividing the discrete variable “3 pointers made” by the discrete variable “3 pointers attempted”. When a single make or miss is simulated, the sample size is one, so the draw is a Bernoulli trial, which is the special case of the binomial distribution with n = 1.
Because of the nature of these distributions, I used the “coin flip” method in R: I would make a simple data frame with two outcomes and draw a single sample with probabilities weighted by the team’s tendencies. I would then test that result in a conditional statement and run the appropriate code based on its value. These discrete variables are often aggregated into quantities that follow approximately bell-shaped distributions (for example, score differential has a bell-shaped curve across thousands of game simulations). This follows a general theme of this project: starting with a simple idea, expanding on it, and then refining the simulation (and therefore the expansion) to be as accurate as possible.
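A minimal sketch of the “coin flip” method follows. The function name `coin_flip`, the outcome labels, and the 0.36 probability are illustrative assumptions; the two-outcome data frame and single weighted sample follow the description above.

```r
# Hypothetical helper: one weighted draw from a two-outcome data frame
coin_flip <- function(p_success) {
  outcomes <- data.frame(result = c("make", "miss"),
                         prob   = c(p_success, 1 - p_success),
                         stringsAsFactors = FALSE)
  sample(outcomes$result, size = 1, prob = outcomes$prob)
}

set.seed(7)
shot <- coin_flip(0.36)                  # 0.36 is a made-up three-point percentage
points <- if (shot == "make") 3 else 0   # branch on the single sample

# Summing many single draws gives a roughly bell-shaped total:
# made threes across 5000 simulated games of 35 attempts each
makes <- replicate(5000, sum(rbinom(35, 1, 0.36)))
mean(makes)  # should be near 35 * 0.36 = 12.6
```

The same draw could be written as `rbinom(1, 1, p_success)`; the data frame version mirrors the approach described in the text.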
There were three ways of calculating the probability of success. The first is based on a combination of the two teams’ stats. For example, shooting percentages are based on the offensive team’s shooting percentage, multiplied by the ratio of its offensive score to the opponent’s defensive score (as mentioned in the offensive and defensive rating section). Another example is the turnover percentage, which is the average of a team’s turnover percentage and the opponent’s forced turnover percentage. The second type is based on only one team’s tendencies; an example is a team’s free throw percentage. The third type is based on league averages, which I used when the box score stat was already captured in a different team metric. For example, if a team has a higher defensive rating and a higher forced turnover percentage, steals are already captured in both of those metrics, so I wouldn’t want to boost the likelihood of additional steals occurring, as they would already be happening at a higher rate.
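The three kinds of probability can be sketched as follows. All function names and input numbers here are hypothetical. I assume the offensive and defensive scores are on the same scale, so their ratio is near 1, and the cap at 1 is my addition.

```r
# Type 1a: offensive percentage scaled by the offense/defense score ratio
shot_prob <- function(off_pct, off_score, opp_def_score) {
  min(1, off_pct * off_score / opp_def_score)  # cap at 1 is an assumption
}

# Type 1b: average of a team's turnover rate and the opponent's forced rate
turnover_prob <- function(team_tov_pct, opp_forced_tov_pct) {
  mean(c(team_tov_pct, opp_forced_tov_pct))
}

# Type 2: a single team's own tendency, used as-is (made-up value)
team_ft_pct <- 0.78

# Type 3: a league average, the same for every team (made-up value)
league_steal_share <- 0.55

p_three    <- shot_prob(0.36, off_score = 110, opp_def_score = 100)
p_turnover <- turnover_prob(0.12, 0.14)
c(three = p_three, turnover = p_turnover,
  free_throw = team_ft_pct, steal = league_steal_share)
```

Each of these probabilities would then feed a single weighted draw, as in the “coin flip” method.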