# HDSC Winter ’23 Premiere Project

## by Beautiful Soup Group

# Introduction:

The NBA, or National Basketball Association, is a premier men’s professional basketball league in North America, comprising 30 teams, with 29 based in the United States and one in Canada. Established in 1946, the NBA is renowned for its skilled players, who are paid a salary each season. The dataset used here lists NBA players and their salaries for the 2019–20 to 2024–25 seasons, with pay varying widely based on factors like experience, performance, position, and team affiliation.

# Aim and Objectives:

The main aim of this project is to build a machine-learning model that predicts the salaries of basketball players. The objectives include:

- Data Sourcing
- Data Preparation
- Model Training
- Model Evaluation
- Model Deployment

**Data Sourcing**

The dataset was sourced from Kaggle, where it had been compiled by web scraping Basketball Reference. The data source is available at:

https://www.kaggle.com/datasets/abdurahmanmaarouf/nba-player-salaries-as-at-2020

**Data Preparation**

To ensure clean data for modeling, the dataset goes through the steps highlighted below.

**Data Collection:** The data is gathered from the link and loaded into the notebook for further analysis using the pandas library.

**Data Discovery and Profiling:** The dataset consists of eleven (11) features and includes a total of 568 observations or samples. These features encompass player-related information such as Player, Player Rank, Player Team, Salary for various seasons (from 2019–2020 to 2024–2025), Signed Using, and Guaranteed.

One noteworthy preprocessing step was cleaning the salary columns. The dollar signs originally attached to the numbers were removed, and the columns were converted from the object type to a numeric type (first float, then integer format) for accurate analysis.

Moreover, more than 5% of the salary values for the last three seasons were missing. Lastly, the ‘Signed Using’ column contained inconsistently written values, which were standardized to lowercase for consistency.
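The salary and ‘Signed Using’ clean-up described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, the column names use plain hyphens, the thousands separators are an assumption about the raw format, and `clean_salary` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows that mimic the raw layout described above (not real data).
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "2019-20": ["$40,231,758", "$1,620,564", "$9,258,000"],
    "Signed Using": ["Bird Rights", "1st Round Pick", "bird rights"],
})


def clean_salary(column: pd.Series) -> pd.Series:
    """Strip '$' and ',' characters, then convert the strings to numbers."""
    stripped = column.str.replace(r"[$,]", "", regex=True)
    # Values that cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(stripped, errors="coerce")


cleaned = raw.copy()
cleaned["2019-20"] = clean_salary(cleaned["2019-20"])
# Lowercasing makes "Bird Rights" and "bird rights" the same category.
cleaned["Signed Using"] = cleaned["Signed Using"].str.lower()

print(cleaned)
```

After this step, the salary column can be summed, averaged, or correlated like any other numeric column.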

**Data Cleaning:** The data cleaning and wrangling process involved addressing outliers and handling skewness. Within the dataset, six columns contained null values. Among these, three columns (‘2022–23’, ‘2023–24’, ‘2024–25’) had null value percentages exceeding 80%, so they were dropped to keep the results accurate. For the ‘2020–21’ and ‘2021–22’ columns, the missing values were imputed with the respective column means. Additionally, the null values in the ‘Signed Using’ column were filled with the mode value, which is “1st round pick.”

**Data Structuring:** The cleaned data is stored in a new comma-separated (CSV) file, while a copy is used for modeling.

**Data Transformation:** The data transformation involves adding a feature that helps in understanding the behavior of the dataset.

**Data Visualization:** This last step helps us understand the prepared dataset through plots such as histograms, scatter plots, box plots, line plots, and bar charts. After proper cleaning of the dataset, an exploratory data analysis (EDA) is performed to detect skewness and outliers. Some of the related questions to be explored in the dataset are:

- Which players earn the highest salaries?
- What is the average salary paid to the players?
- Does the rank of a player affect the player’s salary?
- Does the type of contract have any effect on the salary?
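The null-handling steps described under Data Cleaning above can be sketched as follows. This is an illustrative sketch under stated assumptions: the small DataFrame is made up, the column names use plain hyphens, and the 80% threshold is applied as a simple share of missing values per column.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kinds of gaps described above (not real data).
df = pd.DataFrame({
    "2019-20": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "2020-21": [12.0, np.nan, 33.0, 45.0, np.nan, 66.0],
    "2022-23": [np.nan, np.nan, np.nan, np.nan, np.nan, 70.0],
    "Signed Using": ["1st round pick", None, "bird rights",
                     "1st round pick", "cap space", "1st round pick"],
})

# 1. Drop columns in which more than 80% of the values are missing.
null_share = df.isna().mean()
mostly_empty = null_share[null_share > 0.8].index.tolist()
df = df.drop(columns=mostly_empty)

# 2. Fill the remaining numeric gaps with each column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# 3. Fill the categorical gaps with the most frequent value (the mode).
df["Signed Using"] = df["Signed Using"].fillna(df["Signed Using"].mode()[0])

print(mostly_empty)
print(df)
```

Mean imputation keeps every row but pulls the filled values toward the center of the distribution, which is worth remembering when reading the outlier plots.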

During the process of addressing outliers, it was observed that most numerical columns contained a notable number of outliers, with the exception of the Rank column, which displayed a normal distribution. The ‘2022–23’ column had the highest number of outliers, likely due to the high percentage of null values present in the column. These null values were subsequently replaced with the mean value of the column.

It’s worth noting that while outliers can distort the analysis, their effect can often be mitigated through data transformation and standardization.

Upon examining the dataset for relationships, the heatmap displayed the correlations between variables. Notably, all the income columns exhibited positive correlations with each other. However, none of the categorical variables appeared to have a substantial influence on income. Additionally, there was a positive correlation observed between the player Rank and the method of contract signing, referred to as “Signed Using.”
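The correlations behind such a heatmap can be computed with pandas. The sketch below is illustrative only: the values are made up, the column names use plain hyphens, and it prints the Pearson correlation matrix rather than drawing it (a plotting library would typically render this matrix as the heatmap).

```python
import pandas as pd

# Made-up numeric sample: rank 1 is the top player (not real data).
df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "2019-20": [40.0, 35.0, 30.0, 20.0, 10.0],
    "2020-21": [43.0, 38.0, 31.0, 22.0, 12.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(numeric_only=True)

print(corr.round(2))
```

Values close to 1 indicate columns that rise together, and values close to -1 indicate columns that move in opposite directions.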

**Question:** Which contract type has the highest income? The chart below indicates that players signed under the maximum salary contract type are the highest paid.
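The comparison behind such a chart can be sketched with a group-by over the contract type. The rows below are made up for illustration, and the column names and category labels are assumptions about how the cleaned data is laid out.

```python
import pandas as pd

# Made-up salaries per contract type (not real data).
df = pd.DataFrame({
    "Signed Using": ["maximum salary", "1st round pick", "maximum salary",
                     "minimum salary", "1st round pick"],
    "2019-20": [38_000_000, 4_000_000, 34_000_000, 1_500_000, 6_000_000],
})

# Average salary for each contract type, highest first.
mean_by_contract = (
    df.groupby("Signed Using")["2019-20"]
    .mean()
    .sort_values(ascending=False)
)
top_contract = mean_by_contract.index[0]

print(mean_by_contract)
print("Highest-paying contract type:", top_contract)
```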

**Question:** What is the relationship between the salary and the rank of the players? Salaries are negatively correlated with rank: a player ranked 1 (the top player) has the highest salary, while lower-ranked players earn less.

**Model Training**

**Aim:** To predict the salaries of basketball players for the 2019–20 and 2020–21 seasons. To achieve this aim, the steps listed below are followed.

- Data collection
- Normalizing Data and Feature Selection
- Training the model
- Evaluating the model

**Data Collection:** The cleaned data resulting from the exploratory data analysis (EDA) is used to develop the machine learning model. In this process, the columns labeled “2022–23”, “2023–24”, and “2024–25” were removed because more than 80% of their values were missing.

Since machine learning algorithms typically work with numerical inputs for efficient computation, the dataset, which includes both categorical and numerical variables, will undergo a transformation to convert categorical variables into numerical values. This transformation ensures compatibility with the machine learning algorithms.

**Categorical Variables:** In the dataset, the categorical variables are the type of contract the player signed (Signed Using) and the team of each player. These variables are converted into numerical variables using One-Hot Encoder and Label Encoder.

**Label Encoder:** used to represent categorical values as numbers.

**One-Hot Encoder:** represents categorical data as binary values, 0s and 1s.

Fig 1: Converting Categorical Variables to Numerical Variables
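As a rough illustration of the two encoders, the sketch below applies scikit-learn's `LabelEncoder` and `OneHotEncoder` to a handful of contract types. The sample values are made up, and how the notebook assigned each encoder to each column is not stated here, so this only shows what each encoder produces.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up contract types (not real data).
contracts = np.array(["bird rights", "1st round pick", "cap space", "1st round pick"])

# Label encoding: each category becomes one integer (categories are sorted first).
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(contracts)

# One-hot encoding: each category becomes its own 0/1 column.
# The encoder expects a 2-D input and returns a sparse matrix by default.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(contracts.reshape(-1, 1)).toarray()

print(label_encoder.classes_)
print(labels)
print(one_hot)
```

Label encoding is compact but implies an order between categories, while one-hot encoding avoids that at the cost of one extra column per category.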

**Numerical Variables:** These include the rank of each player and the players’ salaries. The heatmap depicting the correlation between the variables is shown below.

Observation: From the heatmap above, a strong positive correlation exists between salaries for each season. A correlation of 0.49 exists between rank and the type of contract signed by the basketball player.

**Normalizing the Data and Feature Selection:**

The values of the features are shifted and scaled to the range 0 to 1. The chosen independent and dependent variables are shown as x and y, respectively, in the image below.
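Scaling to the range 0 to 1 is commonly done with min-max normalization, where each value x becomes (x - min) / (max - min) within its column. The sketch below assumes that is the scaling used here; `min_max_scale` is a hypothetical helper and the feature values are made up.

```python
import numpy as np


def min_max_scale(values) -> np.ndarray:
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (values - col_min) / col_range


# Made-up features: [rank, salary in millions] (not real data).
features = np.array([[1, 10.0], [2, 20.0], [3, 40.0]])
scaled = min_max_scale(features)

print(scaled)
```

Because every feature ends up on the same scale, error metrics computed on scaled targets (such as an MAE of 0.07) are expressed in that 0 to 1 range rather than in dollars.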

**Training the data:**

The data is split into a training set and a test set, with the former taking 70% of the data and the latter 30%.

The train_test_split function from the model_selection module of the sklearn library is used to split the data. Of the 568 samples in the dataset, 397 are used for training while 171 are set aside for testing.

The shape of the dataset is displayed below.
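The 397/171 split can be reproduced on placeholder data of the same size. In this sketch the arrays are dummy values standing in for the real features and target, and `random_state=42` is an arbitrary assumption added only to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 568 rows and 2 features (not the real dataset).
X = np.arange(568 * 2).reshape(568, 2)
y = np.arange(568)

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (397, 2) (171, 2)
```

The test size is rounded up (0.3 × 568 = 170.4 becomes 171), which is why the training set holds 397 rows.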

# Defining, Fitting, and Making Predictions:

Linear Regression is the machine learning model used to predict the salaries of basketball players. The fit method is used to train the model on the training set, while the predict method is used to make predictions on the test set.
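A minimal sketch of the fit and predict steps with scikit-learn's `LinearRegression` is shown below. The data is synthetic and generated from a known linear rule, so it only demonstrates the workflow and says nothing about the real salary results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, already-scaled features and a noise-free linear target (not real data).
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 0.3 * X[:, 0] + 0.6 * X[:, 1] + 0.05

# Simple 70/30 split by position, for illustration only.
X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

model = LinearRegression()
model.fit(X_train, y_train)          # learn the coefficients from the training set
predictions = model.predict(X_test)  # predict targets for unseen rows

print(model.coef_, model.intercept_)
```

Since the synthetic target is exactly linear, the fitted coefficients recover the generating rule; real salary data would leave residual error.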

**Evaluating the Model:**

The model is evaluated using two metrics: mean absolute error (MAE) and R-squared (the coefficient of determination).

**Mean Absolute Error:** the average of the absolute differences between the predicted values and the true values.

**R-squared:** used to determine the goodness of fit of the model, that is, the proportion of the variance in the target that the model explains.
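Both metrics are simple enough to compute by hand, which can help in reading the reported scores. The sketch below uses plain Python and made-up values; the function names are illustrative, and scikit-learn provides equivalent functions in its metrics module.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up scaled salaries and predictions (not real results).
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]

mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(round(mae, 4), round(r2, 4))  # 0.05 0.95
```

A lower MAE and an R-squared closer to 1 both indicate a better fit.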

**Predicting the 2020–21 salaries based on the previous (2019–20) salary and the rank of the player**

# Splitting the Data:

The features used as independent variables are “rank” and “2019–20”. The target variable is the 2020–21 season salary for the players. After normalizing the features, the dataset is split into train and test sets.

# Model Training, Prediction, and Evaluation:

This follows the same procedure as the model above, again using linear regression. The mean absolute error on the predictions is 0.07, while the coefficient of determination (R-squared) is 0.84.

# Decision Tree Algorithm

A model to predict the salaries for the 2020–21 season is developed using the decision tree algorithm.
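A decision tree regressor follows the same fit and predict pattern as the linear model. The sketch below uses scikit-learn's `DecisionTreeRegressor` on synthetic data; the features stand in for scaled rank and 2019–20 salary, and the `max_depth=4` and `random_state=0` settings are assumptions chosen for illustration, not the values used in the project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for scaled [rank, 2019-20 salary] (not real data).
rng = np.random.default_rng(0)
X = rng.random((200, 2))
# Made-up target: mostly driven by the previous salary, slightly lower for worse ranks.
y = 0.1 + 0.8 * X[:, 1] - 0.1 * X[:, 0]

X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

# Limiting the depth keeps the tree from memorizing the training rows.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)

print(predictions[:5])
```

Unlike linear regression, a tree predicts the mean target of the training rows in each leaf, so its predictions never fall outside the range of salaries seen during training.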

# Conclusion:

Two machine learning algorithms, linear regression and decision tree, were employed for our predictions. The first model aimed to predict players’ salaries for the 2019–20 season, achieving a mean absolute error of 0.1 and an R-squared of 0.65.

The second and third models were developed to predict salaries for the 2020–21 season using player rank and the previous season’s salary as inputs. The linear regression model achieved a coefficient of determination (R-squared) of 0.84, higher than the 0.65 of the first model, along with a mean absolute error of 0.07.

In conclusion, our predictions indicate that a player’s salary tends to increase with each successive season, particularly for top players. However, it’s important to note that the predictions are limited by the availability of independent variables and the presence of missing data, relying mainly on the player’s previous season salary and rank for forecasting.