2 Comments
Casey

I don't mean to be a jerk, but I don't think you discovered anything here except the Central Limit Theorem. Of course sample standard deviation depends on sample size! Sometimes it's instructive to make up some data. (That's the great thing about data science: statisticians are supposed to know this stuff cold, but we can always code up a quick computer simulation.)

Imagine two players: one shoots 30 threes a week, one shoots 15 a week, and both shoot 40%. Simulate their shooting percentage for 24 weeks.

import numpy as np

# 24 weeks of simulated 3P% for each player; a make (1) comes up with probability 0.4
bleph_curry = [sum(np.random.choice(2, 30, p=[.6, .4])) / 30 for _ in range(24)]

klay_chompson = [sum(np.random.choice(2, 15, p=[.6, .4])) / 15 for _ in range(24)]

# can you guess what this will be? it's not a trick question!
# (it should land near 1/sqrt(2) ≈ 0.71, since the SD of a proportion scales like 1/sqrt(n))
np.std(bleph_curry) / np.std(klay_chompson)

Shooting percentage will also affect standard deviation: the closer it is to 50%, the higher the variance, because that's how the binomial distribution works (the variance of the make count is n*p*(1-p), so the variance of the weekly percentage is p*(1-p)/n). So bad shooters are going to look a bit more consistent than good shooters.
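To make that concrete, a quick sketch in the same style as above (both shooters here are hypothetical and take 30 attempts a week):

import numpy as np

rng = np.random.default_rng(0)

# 24 weeks of simulated 3P% for a 30% shooter and a 40% shooter, 30 attempts each
bad_shooter = [rng.binomial(30, 0.30) / 30 for _ in range(24)]
good_shooter = [rng.binomial(30, 0.40) / 30 for _ in range(24)]

# theoretical SD of the weekly percentage is sqrt(p*(1-p)/n):
# about 0.084 at p=0.30 vs. about 0.089 at p=0.40, so the better shooter looks slightly streakier
print(np.std(bad_shooter), np.std(good_shooter))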

Finally, even if you had the same sample size for each player, you'd want to show confidence intervals for the SD. Based on a quick look at the data, the 95% CIs for the SD of Klay Thompson [.06, .11] and Luka Doncic [.077, .14] overlap, so the difference isn't statistically significant.
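For reference, a minimal sketch of one way to get that kind of interval (not necessarily how the numbers above were computed), assuming the weekly percentages are roughly normal, which is what the usual chi-square interval for a standard deviation requires:

import numpy as np
from scipy import stats

def sd_confidence_interval(weekly_pcts, alpha=0.05):
    # chi-square based CI for the SD of a player's weekly shooting percentage
    x = np.asarray(weekly_pcts)
    n = x.size
    s2 = x.var(ddof=1)  # sample variance
    lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
    return np.sqrt(lower), np.sqrt(upper)

# e.g. on the simulated data above:
print(sd_confidence_interval(bleph_curry))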

Vaughn Hajra

Hi Casey, thank you for the feedback—getting a different perspective is always valuable! I appreciate your time and effort in suggesting simulations and bringing up statistical concepts like the Central Limit Theorem and confidence intervals. Let me clarify a few points about my approach and address the critiques directly:

1. Assumption of Consistent Shooting Distributions:

Your example assumes that players shoot a consistent 40% and that all variability comes from sample size and the binomial distribution. However, a key premise of my analysis is that real-world shooting percentages are not constant over a season: players experience hot and cold streaks due to external factors such as fatigue, offensive system changes, or even life events away from basketball. Week-to-week variance may not be a perfect measure of that inconsistency, but it's a solid starting point.

While the Central Limit Theorem explains the relationship between sample size and standard deviation, it doesn't capture these real-world deviations from the binomial assumption. My analysis aims to measure this variability in practice, not under idealized conditions. Therefore, the question isn’t just whether standard deviation depends on sample size (which it does) but how much players' actual shooting consistency deviates from what we’d expect if they were truly constant shooters.

2. Simulations Are Valuable, But Context Matters: Your simulation with “Bleph Curry” and “Klay Chompson” is a helpful demonstration of theoretical variability. However, applying this to my analysis overlooks that my data already incorporates variability from real players with fluctuating shooting performances. Simulating players with static 40% accuracy ignores the inherent inconsistency that I’m interested in capturing.

For example, if I had modeled shooting purely as binomial distributions with fixed probabilities, the critique would be entirely valid. However, I used real-world data to explore patterns that are more complex than what theory alone predicts. A cool follow-up (which you may have already thought of) would be adding "shocks" to a player's simulated three-point percentage in this type of simulation; I've sketched what I mean after this list.

3. Confidence Intervals for Standard Deviation: You’re correct that confidence intervals for the standard deviation could provide more statistical rigor. My choice to omit formal hypothesis testing and CI analysis wasn’t due to oversight but because the primary goal of this piece was to serve as an exploratory analysis, inviting discussion and potential improvements. I deliberately framed the analysis as a starting point, not a definitive conclusion. Your suggestion to incorporate confidence intervals for standard deviation is a great next step, and I’ll look into it.

4. Practical Applications vs. Statistical Purity: While you’re right that statisticians often “know this stuff cold,” the focus of my article wasn’t solely statistical rigor—it was also about practical insights for understanding NBA player consistency. As you noted, "the great thing about data science" is being able to quickly explore ideas, and this piece is just that: a practical exploration rather than a definitive statistical study.

5. Volume vs. Consistency Adjustment: I completely agree with your point about sample size being a confounder. As I noted in the article, good shooters tend to take more shots, leading to naturally lower standard deviations. This is why I suggested potential next steps like fitting a trendline for standard deviation vs. volume and analyzing the residuals (also sketched below). Addressing this issue is on my list for future iterations, and your feedback emphasizes the importance of doing so.
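
On the "shocks" idea from point 2, here's a rough sketch of what I mean (hypothetical numbers; the shock is just week-to-week noise on the player's true percentage):

import numpy as np

rng = np.random.default_rng(1)

n_weeks, attempts, base_p = 24, 30, 0.40

# a steady shooter: true 40% every single week (pure binomial noise)
steady = [rng.binomial(attempts, base_p) / attempts for _ in range(n_weeks)]

# a streaky shooter: same 40% on average, but the true percentage gets a weekly shock
shocked_p = np.clip(base_p + rng.normal(0, 0.05, n_weeks), 0, 1)
streaky = [rng.binomial(attempts, p) / attempts for p in shocked_p]

# the streaky shooter's week-to-week SD should exceed the pure-binomial baseline
print(np.std(steady), np.std(streaky))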
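
And on the volume adjustment from point 5, roughly what I have in mind (simulated placeholder data; in practice the inputs would be each player's SD of weekly 3P% and his average attempt volume):

import numpy as np

rng = np.random.default_rng(2)

# simulated placeholder league: each "player" gets a weekly attempt volume and a true 3P%
volume = rng.integers(5, 35, size=100)        # average 3P attempts per week
true_p = rng.uniform(0.30, 0.42, size=100)    # true shooting percentage

# each player's SD of weekly 3P% over a 24-week season
weekly_sd = np.array([
    np.std([rng.binomial(n, p) / n for _ in range(24)])
    for n, p in zip(volume, true_p)
])

# since SD scales roughly like 1/sqrt(attempts), fit the trendline against 1/sqrt(volume)
coeffs = np.polyfit(1 / np.sqrt(volume), weekly_sd, deg=1)
residuals = weekly_sd - np.polyval(coeffs, 1 / np.sqrt(volume))

# positive residual = less consistent than volume alone predicts
print(residuals[:5])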

Thanks again for engaging with the piece so thoroughly and for sharing your thoughts! It’s always helpful to hear from others who are willing to dig into the details. I’ll take your suggestions into account as I refine this analysis and explore related questions in future work.
