Creating an NBA Injury Database

Fast Break: Compiling injury report data & other updates

Nov 17, 2024

(Sorry Kawhi, someone had to be the header image)

In this “fast-break” article, I’ll detail the importance of injury data. I’ve also added a new feature to the website, and I have plans for more!

Click here for the data, or keep reading for the full write-up.

All of my data, in one place

When putting together the injury database, I decided it would be worthwhile to share more of my data too. Many of my projects use unique datasets I’ve put together. You can now find many of them, for free, at the following page:

Data from Stat Surge

Vaughn Hajra

November 16, 2024


Why injury data?

On a day-to-day basis in the NBA, player availability is a huge factor in game outcomes. My recent pre-season forecasts heavily discounted teams like the Clippers and 76ers because their stars have a history of injuries. When players like Kawhi Leonard and Joel Embiid are healthy, they are undoubtedly stars. When they’re hurt, though, their teams tend to struggle.

So why do most studies and models ignore injuries? The short answer is that there’s not a great (free) data source. So, people typically go with the “good enough” approach of just using games played.

Therefore, I took it upon myself to build an injury database. I’ve archived the past three seasons of injury data, plus this season’s injury designations so far, and put them in one place:

Downloadable NBA Injury Datasets


Building an Injury Database

To put together the injury database, I used the NBA’s injury reports. Teams are required to submit their injury designations by 5pm the day before a game, unless they are playing the second game of a back-to-back. For day two of a back-to-back, teams submit injuries by 1pm on gameday. This means that at 2pm, a gameday’s entire injury report will be posted for all teams.

I wrote a script that extracts the text from these injury reports (posted as PDFs) and turns that text into a dataframe. I then went back through the past three seasons and built a dataset with over 35,000 injury designations.
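To give a feel for the text-to-rows step, here is a minimal sketch in Python. It is not the actual script. It assumes the PDF text has already been extracted, and that each designation ends up on one line shaped like "Last, First Status Reason". That line layout, the function names, and the sample players are all made up for illustration, and real report text is likely messier.

```python
import re
from typing import Optional

# Status labels used to split each line into player / status / reason.
STATUSES = ("Out", "Questionable", "Doubtful", "Probable", "Available")

# Assumed line shape: "<player> <status> <reason>" (hypothetical layout).
ROW = re.compile(
    r"^(?P<player>.+?)\s+(?P<status>" + "|".join(STATUSES) + r")\s+(?P<reason>.+)$"
)


def parse_line(line: str, report_date: str) -> Optional[dict]:
    """Return one record for a designation line, or None for any other line."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    return {"report_date": report_date, **match.groupdict()}


def parse_report(text: str, report_date: str) -> list:
    """Turn the extracted text of one report into a list of row dicts."""
    rows = (parse_line(line, report_date) for line in text.splitlines())
    return [row for row in rows if row is not None]


# Made-up sample text, not real injury data.
sample_text = """Injury Report (made-up sample)
Doe, John Out Injury/Illness - Right Knee; Soreness
Smith, Alex Questionable Injury/Illness - Left Ankle; Sprain
"""

records = parse_report(sample_text, "2024-11-16")
for record in records:
    print(record)
```

A list of dicts like this can be handed straight to most dataframe libraries, which is why the sketch stops there.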

I wrote another script that will build a dataset for the current season. This will be scheduled to update daily, using the 2pm injury reports.
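One thing a daily update has to get right is not double-counting rows if the job runs twice on the same report. A minimal sketch of that idea, assuming each row is a dict with report_date, team, and player fields (hypothetical column names, not necessarily the ones in the real dataset):

```python
def row_key(row: dict) -> tuple:
    """Identify a designation by date, team, and player (assumed unique)."""
    return (row["report_date"], row["team"], row["player"])


def merge_daily(existing: list, new_rows: list) -> list:
    """Append only the rows whose key is not already in the dataset."""
    seen = {row_key(row) for row in existing}
    merged = list(existing)
    for row in new_rows:
        if row_key(row) not in seen:
            seen.add(row_key(row))
            merged.append(row)
    return merged


# Made-up rows for illustration.
season = [{"report_date": "2024-11-15", "team": "Team A", "player": "Doe, John", "status": "Out"}]
today = [
    {"report_date": "2024-11-16", "team": "Team A", "player": "Doe, John", "status": "Out"},
    {"report_date": "2024-11-16", "team": "Team B", "player": "Smith, Alex", "status": "Questionable"},
]

season = merge_daily(season, today)
season = merge_daily(season, today)  # running the job again adds nothing
print(len(season))
```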

Next Steps

There are three additions I’d like to make to the current database:

Length of injury
Expected time remaining
Impact on expected wins

I’ll implement these three additions in a future post. Here’s a quick overview of what I hope to do:

Length of injury is super straightforward. How many days has it been since something first appeared on the injury report?
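As a rough sketch of that calculation, assuming each row has player, reason, and report_date fields (hypothetical names) and treating the earliest appearance of a player and reason pair as the start of the injury:

```python
from datetime import date


def days_on_report(records: list, as_of: date) -> dict:
    """Days since each (player, reason) pair first appeared on a report."""
    first_seen = {}
    for rec in records:
        key = (rec["player"], rec["reason"])
        seen_on = date.fromisoformat(rec["report_date"])
        if key not in first_seen or seen_on < first_seen[key]:
            first_seen[key] = seen_on
    return {key: (as_of - start).days for key, start in first_seen.items()}


# Made-up rows for illustration.
rows = [
    {"player": "Doe, John", "reason": "Right Knee; Soreness", "report_date": "2024-11-12"},
    {"player": "Doe, John", "reason": "Right Knee; Soreness", "report_date": "2024-11-10"},
    {"player": "Smith, Alex", "reason": "Left Ankle; Sprain", "report_date": "2024-11-15"},
]

lengths = days_on_report(rows, as_of=date(2024, 11, 16))
print(lengths)
```

This simple version would lump together two separate stints with the same reason, so a real implementation would probably need to reset the clock when a player drops off the report and later returns to it.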

Expected time remaining is a little trickier. I’d like to look at historical time missed for a specific injury, and give a range of dates for expected return.
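One possible way to turn historical time missed into a date range is to use the middle half of past durations for the same injury. The sketch below assumes that approach. The durations are made-up numbers, and the choice of the 25th and 75th percentiles is an assumption for illustration, not a decided methodology.

```python
import math
from datetime import date, timedelta
from statistics import quantiles


def return_window(first_seen: date, past_durations: list) -> tuple:
    """Estimate an (early, late) return date from past durations in days."""
    q1, _median, q3 = quantiles(past_durations, n=4)
    early = first_seen + timedelta(days=math.floor(q1))
    late = first_seen + timedelta(days=math.ceil(q3))
    return early, late


# Made-up durations (days missed) for one hypothetical injury type.
history = [7, 10, 14, 21, 28]
first_seen = date(2024, 11, 10)

early, late = return_window(first_seen, history)
print(early, late)
```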

Injury impact on expected wins is what I’m most excited for. With my weighted VORP system for predicting wins, I can remove players who are hurt. I’d do this for a distribution of possible return dates. By keeping track of games missed and the change in expected win percentage over the duration of an injury, you could then quantify the “cost” of that injury.
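To make the “cost” idea concrete, here is a small sketch of the arithmetic. It does not implement the weighted VORP system. It simply assumes some model has already produced a win probability for each upcoming game with and without the injured player, plus a probability for each possible number of games missed. All of the inputs and names below are hypothetical.

```python
def expected_injury_cost(win_prob_healthy, win_prob_injured, games_missed_probs):
    """Expected wins lost, averaged over possible numbers of games missed.

    win_prob_healthy / win_prob_injured: per-game win probabilities for the
    upcoming schedule, with and without the player (assumed model outputs).
    games_missed_probs: {games_missed: probability}, summing to 1.
    """
    cost = 0.0
    for games_missed, prob in games_missed_probs.items():
        wins_lost = sum(
            healthy - injured
            for healthy, injured in zip(
                win_prob_healthy[:games_missed], win_prob_injured[:games_missed]
            )
        )
        cost += prob * wins_lost
    return cost


# Made-up inputs: four upcoming games, a 50/50 chance of missing 2 or 4 of them.
healthy = [0.6, 0.6, 0.6, 0.6]
injured = [0.5, 0.5, 0.5, 0.5]
missed = {2: 0.5, 4: 0.5}

cost = expected_injury_cost(healthy, injured, missed)
print(round(cost, 3))
```

With these inputs the player is worth 0.1 wins per game, so missing two games costs 0.2 wins and missing four costs 0.4, and the 50/50 average is 0.3 expected wins.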

Thanks for giving this a read! For any methodology questions or to continue the conversation, reach out at vaughnhajra@gmail.com or on X/Twitter @vaughnhajra.