NBA Data Analysis
Abstract
This project analyzes data from the 2019-2020 NBA season to explore the relationships among players' salaries, statistical performance, and physical attributes. Using two datasets from Kaggle, the study applies supervised learning (linear and logistic regression) and unsupervised learning (hierarchical clustering). The goals are to determine whether a player's statistics are linearly related to salary, to predict draft status from physical characteristics, and to group players by salary and performance metrics. The findings show that both salary and draft status are difficult to predict from these variables, highlighting the challenge of relating performance metrics to compensation and draft outcomes in the NBA.
Introduction
The National Basketball Association (NBA) is a premier basketball league in which player valuation is a critical aspect of team management. In this project, we investigate the 2019-2020 NBA season to understand the relationship between player salaries and on-court performance. We ask whether a player's salary can be predicted from their statistics, and we explore the relationship between a player's physical attributes and their draft status. This analysis helps clarify the dynamics of player valuation in the NBA and could offer insights into effective team management strategies.
Methodology
Data Acquisition and Preparation
We utilized two datasets from Kaggle, focusing on the 2019-2020 NBA season. The datasets included player statistics and salaries, encompassing variables like age, height, weight, college, country, draft details, performance metrics, and salary. After merging and cleaning the datasets, we had 529 entries with 22 variables.
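The merge-and-clean step can be sketched as follows. This is a minimal illustration, assuming pandas and hypothetical column names (`player`, `pts`, `reb`, `ast`, `salary`); the report does not give the actual Kaggle file or column names, so small inline tables stand in for them, and an inner join followed by dropping incomplete rows is only one plausible cleaning strategy.

```python
import pandas as pd

# Hypothetical stand-ins for the two Kaggle tables; real column names may differ.
stats = pd.DataFrame({
    "player": ["A. Example", "B. Sample", "C. Placeholder"],
    "pts": [25.1, 8.4, 12.0],
    "reb": [7.2, 3.1, None],   # missing value to illustrate cleaning
    "ast": [6.0, 1.5, 4.2],
})
salaries = pd.DataFrame({
    "player": ["A. Example", "B. Sample", "D. Unmatched"],
    "salary": [30_000_000, 1_500_000, 2_000_000],
})


def merge_and_clean(stats: pd.DataFrame, salaries: pd.DataFrame) -> pd.DataFrame:
    """Inner-join the two tables on player name, then drop incomplete rows."""
    merged = stats.merge(salaries, on="player", how="inner")
    return merged.dropna().reset_index(drop=True)


merged = merge_and_clean(stats, salaries)
print(merged.shape)  # (rows, columns) after merging and cleaning
```

Joining on a name column is fragile when spellings differ between sources, so in practice a shared player ID would be preferable if the datasets provide one.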
Analytical Techniques
- Supervised Learning (Linear and Logistic Regression): Linear regression was used to model salary as a function of player statistics (points, rebounds, assists). Logistic regression was used to predict draft status from physical attributes.
- Unsupervised Learning (Hierarchical Clustering): This method grouped players based on their attributes, such as salary and performance metrics, to identify patterns and trends.
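As a rough illustration of the linear-regression step, the sketch below fits ordinary least squares with NumPy and reports R² for both raw and transformed salary. The data are synthetic, the helper `fit_ols` is hypothetical, and the natural-log transform is an assumption: the report does not say which transformation was applied to salary.

```python
import numpy as np


def fit_ols(X: np.ndarray, y: np.ndarray) -> tuple[np.ndarray, float]:
    """Fit ordinary least squares with an intercept; return (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)            # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    return beta, 1.0 - ss_res / ss_tot


# Synthetic data only; columns stand in for points, rebounds, and assists.
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(200, 3))
log_salary = (13.0 + 0.08 * X[:, 0] + 0.03 * X[:, 1] + 0.04 * X[:, 2]
              + rng.normal(0, 0.5, size=200))
salary = np.exp(log_salary)

_, r2_raw = fit_ols(X, salary)           # response on the original scale
_, r2_log = fit_ols(X, np.log(salary))   # response after an assumed log transform
print(f"R^2 (raw salary): {r2_raw:.3f}")
print(f"R^2 (log salary): {r2_log:.3f}")
```

The two R² values describe fit on different response scales, so they indicate how well each model explains its own response rather than forming a strict like-for-like comparison.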
Results
Supervised Learning
- Linear Regression: The best model achieved an R² of 0.5264 after transforming the response variable (salary). The predictors therefore explain roughly half of the variance in transformed salary, a moderate fit that suggests salary cannot be predicted from performance metrics alone.
- Logistic Regression: The model predicted most players in the dataset to be drafted, so it did little to separate drafted from undrafted players. This suggests that physical attributes alone are not strong predictors of draft status.
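When a classifier assigns most observations to one class, it is worth comparing its accuracy with a majority-class baseline, which always predicts the most common label. The sketch below shows that comparison on synthetic labels; the 85/15 split is illustrative only and is not taken from the dataset, and both helper functions are hypothetical.

```python
from collections import Counter


def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true: list[int]) -> float:
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Synthetic labels: 1 = drafted, 0 = undrafted (proportions are illustrative).
y_true = [1] * 85 + [0] * 15
y_pred = [1] * 100  # a model that predicts "drafted" for every player

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)
print(f"model accuracy:    {model_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
```

If the two numbers are equal, the model's accuracy comes entirely from the class proportions, and the predictors have added no discriminating information.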
Unsupervised Learning
- Hierarchical Clustering: The clustering revealed patterns such as a group of high-salary players with below-average height and above-average points averages. It also indicated that more experienced players tend to have higher salaries.
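A minimal sketch of agglomerative hierarchical clustering with SciPy is shown below. The feature values are synthetic, the helper `cluster_players` is hypothetical, and both Ward linkage and z-score standardization are assumptions, since the report does not state which linkage method or scaling was used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_players(features: np.ndarray, n_clusters: int) -> np.ndarray:
    """Standardize features, build a Ward linkage tree, and cut it into n_clusters."""
    # Standardize so that salary (large values) does not dominate the distances.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    tree = linkage(z, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic rows: [salary in $M, points per game]; values are illustrative only.
features = np.array([
    [30.0, 27.0], [28.0, 25.0], [32.0, 29.0],   # high salary, high scoring
    [2.0, 6.0], [1.5, 4.0], [3.0, 7.5],         # low salary, low scoring
])
labels = cluster_players(features, n_clusters=2)
print(labels)  # one cluster label per player
```

Summarizing each cluster by the mean of its features (for example salary, height, and points) is one way to surface patterns like those described above.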
Conclusion
The study highlights the difficulty of predicting NBA players' salaries from their on-court performance and physical attributes. While the linear regression model provided some insight, its moderate R² value suggests that player valuation in the NBA depends on more than performance metrics. Logistic regression indicated that physical attributes alone do not reliably predict draft status, while hierarchical clustering linked higher salaries to scoring and experience. Together, the results underscore the multifaceted nature of player valuation in professional basketball, where numerous variables influence a player's market value. Future research could incorporate more diverse data, including off-court factors, to develop more comprehensive models of player valuation.