Defining NBA Player Roles with Machine Learning

Deep Dive: Using Linear Discriminant Analysis and a Gaussian Mixture Model to create an alternative to traditional NBA positions.

Feb 13, 2025

What do LeBron James, Austin Reaves, and JJ Redick all have in common?

(Trick question: I’m not thinking about the Lakers here)

All three have been listed as a “shooting guard” at some point in their careers. In fact, LeBron has been listed at every position from point guard to center, further clouding what it means to have a “position” in today’s NBA. Even if you think of the three general groups (guard/wing/big), I’m not convinced that you can assign players to just one category.

In this analysis, I create player roles as an alternative to positions. The goal is a more flexible perspective that captures a player’s complete skill set. This analysis also offers a starting point for future projects that consider roster construction, minutes projections, and more.

Project Background Overview

The fluidity of basketball makes it both fascinating and challenging from an analytics perspective. When choosing a five-player starting lineup from a roster of 15, there are 3,003 possible combinations of players.
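As a quick check of that count, here is a minimal sketch. It assumes a five-player lineup where the order of the starters does not matter and positions are ignored.

```python
from math import comb

ROSTER_SIZE = 15  # players on the roster
LINEUP_SIZE = 5   # players on the floor at once

# Unordered selections of 5 players out of 15: C(15, 5)
starting_lineups = comb(ROSTER_SIZE, LINEUP_SIZE)
print(starting_lineups)  # 3003
```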

In practice, this number of possibilities can be reduced. How many players are active? How good are the players? And what role can those players take on?

In this project, I offer the following approach to defining player roles:

Step 1: Linear Discriminant Analysis (LDA) – To reduce dimensionality and capture the most informative player features, I apply LDA, a supervised method, using each player’s listed position as the class label.1
Step 2: Gaussian Mixture Model (GMM) – After reducing dimensionality with LDA, I apply a Gaussian Mixture Model to cluster players based on their transformed attributes.2

These two steps allow me to define new roles while considering existing information, such as player tendencies and their listed position.
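The two steps can be sketched with scikit-learn. This is an illustration under stated assumptions, not the exact code behind this project: the data is synthetic, and the sample size, the random class means, and the choice of five mixture components are all placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 300 player-seasons,
# 7 features, and a listed position coded 0..4 that shifts the feature means.
n_players, n_features = 300, 7
positions = rng.integers(0, 5, size=n_players)
class_means = rng.normal(scale=1.5, size=(5, n_features))
X = class_means[positions] + rng.normal(size=(n_players, n_features))

# Step 1: supervised dimensionality reduction, with listed position as the label.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, positions)

# Step 2: cluster in the reduced (LD1, LD2) space.
gmm = GaussianMixture(n_components=5, random_state=0)
gmm.fit(X_lda)
role_probs = gmm.predict_proba(X_lda)      # one row per player, one column per role
dominant_role = role_probs.argmax(axis=1)  # the role each player aligns with most

print(X_lda.shape, role_probs.shape)
```

Each row of `role_probs` sums to 1, so it can be read as a player's blend of roles, while `dominant_role` keeps only the largest share.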

Positions in Two Dimensions

A common challenge when trying to cluster players with unsupervised machine learning is the ‘curse of dimensionality.’ As you add more variables (dimensions), it becomes increasingly difficult to get quality results.3

For this project, I aimed for a balanced mix of offense, defense, and rebounding. Using Linear Discriminant Analysis (LDA), I reduced the data from seven dimensions to just two while retaining 97.85% of the between-class variance (the variation that separates the listed positions), so most of the information that distinguishes positions was preserved for clustering.
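In scikit-learn, a figure like this can be read from the `explained_variance_ratio_` attribute of a fitted LDA, which reports the share of between-class variance captured by each discriminant. The sketch below uses synthetic data, and it is an assumption that the 97.85% figure was computed in an equivalent way.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Synthetic stand-in (assumption): 5 listed positions, 7 features per player.
positions = rng.integers(0, 5, size=400)
class_means = rng.normal(scale=2.0, size=(5, 7))
X = class_means[positions] + rng.normal(size=(400, 7))

# With 5 classes, LDA can produce at most 5 - 1 = 4 discriminants.
lda = LinearDiscriminantAnalysis().fit(X, positions)
ratios = lda.explained_variance_ratio_

# Share of between-class variance captured by LD1 and LD2 together.
retained = ratios[:2].sum()
print(f"LD1 + LD2 retain {retained:.2%} of the between-class variance")
```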

See the footnote for a glossary of the final variables used.4

Now that I’ve reduced the data to two dimensions, the next key step is interpreting what drives each component. To do this, I looked into the correlation between the input variables5 and the first two linear discriminants (LD1 and LD2).

Examining the correlations revealed that LD1 captures the distinction between perimeter-oriented and interior players. It shows a strong positive correlation with rebounding (TRB%) and shot-blocking (BLK%) while negatively correlating with three-point attempts (3PAr) and playmaking (AST%).

LD2 exhibited a strong positive correlation with assists (AST%) and a negative correlation with corner three attempts (corner_att). This suggests that players with higher LD2 values are more involved in facilitating and playmaking rather than operating off the ball or spacing the floor.
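For readers who want to reproduce this kind of check, here is a minimal sketch of correlating each input variable with each discriminant. The helper name `feature_component_correlations` and the toy numbers are hypothetical; in practice `X` would hold the seven input variables and `Z` the LD1 and LD2 scores.

```python
import numpy as np

def feature_component_correlations(X, Z):
    """Pearson correlation of every column in X with every column in Z.

    Returns an array of shape (n_features, n_components).
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Standardize each column, then average the products of the z-scores.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    return Xs.T @ Zs / len(X)

# Toy values for four players (not real data): two inputs, two components.
X = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]])
Z = np.array([[2.0, 1.0], [4.0, -1.0], [6.0, 1.0], [8.0, -1.0]])

corr = feature_component_correlations(X, Z)
print(corr.round(2))
```

In the toy data, the first input rises in lockstep with the first component (correlation 1) and the second input falls in lockstep with it (correlation -1), which is the pattern used above to label an axis.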

When interpreting the next chart, keep these concepts in mind:

Horizontal axis: Distinguishes perimeter-oriented players (left) from interior-oriented players (right)
Vertical axis: Differentiates off-ball players (bottom) from playmaking/facilitating players (top)

Note that these aren’t perfect representations of the axes but rather a general interpretation for comparing players in similar regions. For example, post players who don’t shoot corner threes may sit higher on the chart than their guard counterparts, but this doesn’t necessarily mean they are better playmakers.

Defining Player Roles

To define player roles, I implemented a Gaussian Mixture Model. This model was selected for its ability to capture the underlying structure of player data: it assigns each player a probability of belonging to every cluster, so a player can belong to multiple categories simultaneously. This approach provides a more nuanced classification than hard-assignment methods such as K-Means, which place each player in exactly one cluster, and it better represents the diverse and overlapping roles that players take on.6
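The idea of belonging to several categories at once can be illustrated with scikit-learn's `predict_proba`. This is a sketch on synthetic two-dimensional points standing in for (LD1, LD2) coordinates, with two components instead of five to keep it readable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic (LD1, LD2) coordinates (assumption): two groups of 150 players.
group_a = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(150, 2))
group_b = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(150, 2))
X_lda = np.vstack([group_a, group_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_lda)

# A player at the heart of one group versus a player midway between the two.
typical_player = np.array([[-2.0, 0.0]])
hybrid_player = np.array([[0.0, 0.0]])

typical_blend = gmm.predict_proba(typical_player)[0]
hybrid_blend = gmm.predict_proba(hybrid_player)[0]

print(typical_blend.round(3))  # almost entirely one component
print(hybrid_blend.round(3))   # shared between both components
```

A hard-assignment method would return a single label for the hybrid player; the mixture model keeps the split.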

Results, including each player’s dominant role (the group/cluster they most align with), as well as their unique blend of roles, are as follows:

Discussion of Roles

As you can see in the above chart, there are five general roles that a player can have. I’ve named them as follows:

Rim Runner / Post
Perimeter Playmaker
Interior Playmaker
Off-Ball Spacer
Point Forward

The table above outlines the average distribution of roles by position. Point guards and shooting guards primarily serve as playmakers, while off-ball spacers are mostly shooting guards, small forwards, and power forwards. Rim runners and post players are predominantly power forwards and centers. Interior playmakers are almost exclusively centers, with a few power forwards included. The point forward role, which blends elements of both perimeter and interior playmaking, is a unique category occupied by only a select few players.
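One plausible way to build a table like this is to average each player's role probabilities within a listed position. The sketch below uses made-up probabilities for four players; the numbers are placeholders, not results from this project.

```python
import numpy as np

ROLES = ["Rim Runner / Post", "Perimeter Playmaker", "Interior Playmaker",
         "Off-Ball Spacer", "Point Forward"]

# Hypothetical role blends (each row sums to 1), one row per player.
positions = np.array(["PG", "PG", "C", "C"])
role_probs = np.array([
    [0.00, 0.80, 0.00, 0.15, 0.05],
    [0.00, 0.60, 0.00, 0.35, 0.05],
    [0.70, 0.00, 0.25, 0.00, 0.05],
    [0.50, 0.00, 0.45, 0.00, 0.05],
])

# Average the role blend over all players who share a listed position.
avg_by_position = {
    str(pos): role_probs[positions == pos].mean(axis=0)
    for pos in np.unique(positions)
}

for pos, blend in avg_by_position.items():
    print(pos, dict(zip(ROLES, blend.round(2).tolist())))
```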

The table above highlights the distinctions between the groups. For instance, players in the "perimeter playmaker" category attempt 21.2% of their three-pointers from the corner, whereas off-ball spacers take 34.8% of their threes from the corner. Likewise, while interior and perimeter playmakers share a similar assist percentage, interior playmakers tend to shoot from much closer range and attempt more two-pointers than three-pointers.

Countless comparisons can be drawn between groups. The great thing about this analysis is that a player isn’t constrained to just one role; they can take on parts of all five.

Applications and Future Improvements

This project is designed with future applications in mind. The groupings and roles established here serve as a foundational framework that can be refined and expanded over time. There are several applications that I find particularly exciting, which I’ll outline below.

The first, and my original inspiration for this project, is predicting minutes and rotations. When a player is signed, released, traded, or injured, teams must reassess their lineups and adjust accordingly. In a future project, I plan to build on this concept to develop a more systematic approach to rotation projections.

Another key application of this project is analyzing roster construction. Similar studies have explored this approach, and I believe there is much to uncover. One notable takeaway is the relative scarcity of high-level point forwards or interior playmakers. From a team-building perspective, it may be strategic to allocate salary toward those rare skill sets while filling wing positions with replacement-level 3-and-D players. A facilitating point guard would also be valuable, but if I had to choose just one, I’d prioritize a well-rounded two-way playmaker.

Because this analysis captures multiple years of data, it could also be used to track player progression over time.

There are several ways to improve this process, and I’m open to additional suggestions. Player roles are influenced by more factors than those included in this analysis, and incorporating more variables could enhance the framework. The project’s effectiveness also depends on the quality of the underlying data.

One challenge with this type of analysis is that, unlike supervised machine learning, there is no clear target variable to evaluate model performance. Instead, assessing quality relies on understanding player tendencies, which introduces the possibility of bias. Incorporating methods to systematically validate the results would be a useful addition.
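One option for a more systematic check, offered here as a suggestion rather than something used in this project, is to compare mixture models with different numbers of components using the Bayesian Information Criterion (BIC), where lower values indicate a better trade-off between fit and complexity. The data below is synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Synthetic 2-D data with three clear groups, standing in for (LD1, LD2).
centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
X = np.vstack([rng.normal(loc=c, scale=0.8, size=(120, 2)) for c in centers])

# Fit one mixture per candidate count and record its BIC (lower is better).
candidate_counts = range(1, 7)
bic_scores = {
    k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
    for k in candidate_counts
}
best_k = min(bic_scores, key=bic_scores.get)

for k, score in bic_scores.items():
    print(k, round(score, 1))
print("lowest BIC at k =", best_k)
```

BIC only measures statistical fit, so it would complement basketball judgment about whether the resulting roles make sense, not replace it.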

Final Thoughts

Acknowledgment: This project, more than some of my others, is built upon existing research. Thank you to those who have shared similar analyses, and I appreciate the countless conversations and perspectives that helped me throughout this project.

If you’ve made it this far, thank you for taking the time to give this a read! If you’d like to further the discussion or have questions related to my methods, I welcome the conversation.

1. This article is slightly dated but does a good job of implementing LDA. They used K-Means clustering to define positions, and I built upon their findings by using more data and a different clustering algorithm.

2. This article uses GMM, which I decided was my preferred clustering method for this project. I hoped to build upon their findings by using LDA.

3. https://www.geeksforgeeks.org/curse-of-dimensionality-in-machine-learning/

4. Data via Basketball Reference, and note that some variables are listed as estimates.

The variables used in the final analysis are as follows:

Average field goal attempt distance [Dist.]
Percentage of field goals attempted that are threes [3PAr]
Percentage of three-pointers attempted that are taken from the corner [corner_pct]
Percentage of total rebounds while on the court [TRB%]
Percentage of teammate field goals a player assisted while on the court [AST%]
Percentage of opponent possessions that end with a player stealing the ball while on the court [STL%]
Percentage of opponent’s two-point field goals a player blocks while on the court [BLK%]

5. The average field goal attempt distance is not included in this chart. It correlates negatively with both LD1 (-0.52) and LD2 (-0.39).

6. For more on Gaussian Mixture Models, this resource is great:
https://scikit-learn.org/stable/modules/mixture.html
To define the name of each cluster, I used a combination of my basketball knowledge, cluster characteristics, and the factors that correlate with LD1 and LD2.