Machine Learning and the WNBA (K-Means Clustering in R)
Using unsupervised machine learning to identify potential WNBA trade targets.
This analysis aims to use pre-All-Star Break (ASB) stats from the WNBA to find similar athletes and group them accordingly. There are many potential benefits to this, from roster construction to trades. In a previous post, I attempted to find similar WNBA player archetypes based on their box score stats. Although I was generally pleased with much of that analysis, there was much room for improvement. This write-up details a follow-up analysis, building upon the previous post.
The biggest improvement over the previous post is normalizing the data. Instead of raw box score totals, I now use per-36-minute statistics. I also include minutes played in the analysis (I’ll explain why later). Next, I removed the contribution of three-pointers made (3PM) from the points column and included 3PM as a separate stat. This ensures the two variables do not overlap.
The previous analysis struggled to capture defensive contributions effectively. I attempt to tackle this with two new ideas. First, I split offensive and defensive rebounds into two separate stats. I then added defensive win shares (DWS) per 36 minutes as another proxy for defense. I hope these improvements will do a better job of capturing an athlete’s defensive contribution.
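To make the feature changes above concrete, here is a minimal sketch in base R. The players and totals are made up, the column names are my own, and I am assuming that removing threes from points means subtracting three points per made three; the original code may do this differently.

```r
# Hypothetical season totals for three made-up players (not real WNBA data).
totals <- data.frame(
  player = c("Player A", "Player B", "Player C"),
  min    = c(720, 360, 180),   # total minutes played
  pts    = c(400, 150, 60),    # total points, threes included
  fg3m   = c(40, 10, 0),       # three-pointers made
  orb    = c(20, 30, 10),      # offensive rebounds
  drb    = c(100, 60, 25),     # defensive rebounds
  dws    = c(1.2, 0.9, 0.2)    # defensive win shares
)

# Scale a counting stat to a per-36-minute rate.
per36 <- function(x, minutes) x / minutes * 36

features <- data.frame(
  player   = totals$player,
  min      = totals$min,       # minutes stay as a raw total
  # Assumption: "removing" threes means subtracting 3 points per made three.
  pts_non3 = per36(totals$pts - 3 * totals$fg3m, totals$min),
  fg3m     = per36(totals$fg3m, totals$min),
  orb      = per36(totals$orb, totals$min),
  drb      = per36(totals$drb, totals$min),
  dws      = per36(totals$dws, totals$min)
)
print(features)
```

Keeping offensive and defensive rebounds as separate columns lets the clustering treat them as different signals rather than one combined rebounding number.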
The final improvement of this analysis is using Datawrapper to create tables. This allows for a streamlined write-up. Instead of listing every athlete multiple times, I put all the information in one table (at the top of this page). You can search by name, sort, and view all you need in one place.
In simple terms, k-means clustering finds similar data points and groups them. Behind the scenes there is a lot more going on, but that should suffice for understanding this analysis. If you’re interested in learning more or trying it out, I’ve linked a few articles at the bottom of this post. Additionally, if you have any suggestions for improvement, please comment!
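For readers who want to try it, here is a minimal sketch of k-means in base R on randomly generated data. The feature names are hypothetical, and standardizing with scale() before clustering is my assumption for the sketch, not a confirmed detail of the original analysis.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical feature table: 60 made-up players, 3 features.
n <- 60
features <- data.frame(
  pts_non3 = runif(n, 4, 20),   # non-three points per 36
  fg3m     = runif(n, 0, 3),    # threes made per 36
  min      = runif(n, 50, 800)  # total minutes played
)

# Standardize each column so minutes (hundreds) do not swamp
# threes per 36 (single digits) in the distance calculation.
scaled <- scale(features)

# Five clusters to match the write-up; nstart = 25 reruns the algorithm
# from 25 random starts and keeps the best result.
fit <- kmeans(scaled, centers = 5, nstart = 25)

features$cluster <- fit$cluster
print(table(features$cluster))
```

Under the hood, kmeans() repeatedly assigns each player to the nearest cluster center and then moves each center to the mean of its assigned players until the assignments settle.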
Clustering Overview
When it comes to the actual clusters, some interesting patterns emerge. In the last post, I referred to these clusters as “tiers,” which is something I’m less inclined to do here. If I had to assign them general labels, though, this is how I’d do it:
General Cluster Labels:
Cluster 1: Role Players & “Hidden Gems”
Cluster 2: Star Players
Cluster 3: Inefficient (Mostly Bench) Players
Cluster 4: Quality Starters (generally offense-focused)
Cluster 5: Quality Starters (generally defense-focused)
Note that to be included in the analysis, a player had to play at least ten minutes this year and appear in at least three different games. This removes most of the distortion that per-36 stats can introduce for players with very small samples. As with any clustering analysis, there are shortcomings. Think of these groups as players with either (1) similar playstyles or (2) similar opportunities within a coach’s system. There are definitely players in the wrong category (Brittney Griner is an obvious example), but in general I’m pretty pleased with how this turned out. When investigating the cluster averages, it seems there were three main separating factors.
The first separating factor is scoring. Most of the clusters differ in scoring output (both points and three-pointers made), with cluster two the clear leader in this category. I think this separation between clusters may reflect a difference in offensive skill. Obviously the other variables are considered too, but it’s good to see this pass the eye test.
The second separating factor is defensive win shares per 36 minutes, with clusters three and four standing out here. This is the best addition to the analysis: instead of grouping mostly by scoring, the clusters now capture defense better.
The third, and clearest, is minutes played. I went back and forth on including this stat but ultimately decided to leave it in. An average player on a bad team will get fewer minutes, and a great player on a good team won’t contribute exactly the same as if they were the only option (think the NY Liberty), but in general minutes played is a good enough proxy for skill when all of the other variables are also included.
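The cluster averages behind these three factors can be computed in one line once each player has a cluster label. This is a sketch on made-up numbers with hypothetical column names, not the actual averages from this analysis.

```r
# Hypothetical clustered players (made-up values, not real WNBA data).
players <- data.frame(
  cluster  = c(1, 1, 2, 2, 3, 3),
  pts_non3 = c(8, 10, 16, 18, 5, 7),               # non-three points per 36
  dws      = c(0.05, 0.07, 0.06, 0.08, 0.02, 0.04), # DWS per 36
  min      = c(200, 240, 700, 760, 150, 170)        # total minutes
)

# Mean of every feature within each cluster.
cluster_means <- aggregate(. ~ cluster, data = players, FUN = mean)
print(cluster_means)
```

Reading down each column of the result shows which features actually separate the clusters and which are roughly flat across them.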
You could argue that minutes per game would be a better stat to use than raw minutes (Cameron Brink and Temi Fagbenle are good examples of the shortcomings), but raw minutes also capture a player’s true contribution over the entire season. What I am trying to avoid is inflated minutes per game: think of someone who signs a hardship contract, plays quality minutes for a week in place of an injured starter, and then goes back to not being on a team.
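A small made-up example shows the difference between the two measures, along with the inclusion rule mentioned earlier (at least ten minutes and at least three games). The players and numbers are hypothetical.

```r
# Hypothetical players (made-up values, not real WNBA data).
players <- data.frame(
  player = c("Season-long starter", "Hardship signing", "Deep bench"),
  games  = c(24, 3, 2),
  min    = c(720, 84, 8)   # total minutes played
)
players$mpg <- players$min / players$games

# Inclusion rule from the write-up: at least 10 minutes and at least 3 games.
eligible <- subset(players, min >= 10 & games >= 3)
print(eligible)
```

The hardship signing averages almost as many minutes per game as the starter (28 versus 30) but has played under an eighth of the starter’s total minutes, which is the gap raw minutes preserves.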
The last point I’d like to make is that this is not a tier list. It isn’t a perfect representation of quality (Brittney Griner stands out as an outlier) but rather a general grouping of play style based on box-score statistics and defensive win shares. The best part? That’s okay. This analysis is meant to serve as one tool in the belt, not a catch-all ranking system. In future posts I’ll dive into other tools and compare their accuracy, precision, and general usefulness. When it comes to how this analysis could be used, I’d like to zero in on cluster one.
Trade Potential and Undervalued Players
When investigating cluster averages, cluster one stands out as an interesting group. These are players who have very similar stats to cluster five but far fewer minutes. Although there may be some injury issues here, there are also what I’d dub “hidden gems”. These players aren’t superstars by any means, and more often are just solid rotation players, but to this point in the season they’ve been underutilized relative to their potential.
I think the true value of this analysis lies in identifying potential rotation players. For better or worse, we already have a good understanding of who today’s superstars are. We can also probably guess fairly accurately who tomorrow’s superstars will be. Like it or not, superstars being traded in the WNBA is relatively uncommon, and teams have the most room to improve through role player acquisitions.
A group of low-minute players whose stats are comparable to those of their high-minute counterparts suggests room for improvement through acquisitions. My intended use of this write-up is to build a list of potential trade candidates from this clustering and then narrow it to a shortlist, and I’m interested to hear others’ thoughts too.
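One possible way to turn the clusters into a shortlist is to rank cluster-one players by how close their stats sit to the average cluster-five profile. This is my own illustrative approach on made-up, already-standardized numbers, not a step from the original analysis.

```r
# Hypothetical standardized per-36 features (made-up values).
players <- data.frame(
  player   = c("A", "B", "C", "D", "E", "F"),
  cluster  = c(1, 1, 1, 5, 5, 5),
  pts_non3 = c(0.2, -0.5, 0.9, 0.3, 0.5, 0.4),
  dws      = c(0.8, -0.2, 0.1, 0.9, 0.7, 0.8)
)
stat_cols <- c("pts_non3", "dws")  # minutes left out on purpose

# Average profile of the high-minute comparison group (cluster five).
target <- colMeans(players[players$cluster == 5, stat_cols])

# Euclidean distance from each cluster-one player to that profile.
candidates <- players[players$cluster == 1, ]
diffs <- sweep(as.matrix(candidates[, stat_cols]), 2, target)
candidates$distance <- sqrt(rowSums(diffs^2))

# Smallest distance first: the closest statistical matches.
shortlist <- candidates[order(candidates$distance), c("player", "distance")]
print(shortlist)
```

Minutes are excluded from the distance because low minutes are exactly what defines the candidates; the ranking is only a starting point for the qualitative review.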
As I mentioned above, this analysis is inconclusive on its own, but paired with the right qualitative analysis it could be extremely valuable. Furthermore, similar techniques could be used to explore potential position changes or lineup switches. As always, if you have any suggestions for improvement or have spotted gaps in the analysis that I’ve missed, please comment! If you’re interested in the code for this project, leave a comment and I can provide it.