The aim of this project was to build an accurate model for predicting the resale price of used cars from multiple parameters, and to make those predictions easy for a user to obtain.
Source code:
https://github.com/sanyogthescholar/used_car_price_prediction
Deployment:
http://sanyog.pythonanywhere.com/
Introduction
This project aims to build an accurate car price prediction system by using datasets from various car manufacturers.
Objectives
The objective was to find the best-performing algorithm for predicting used car prices, so that the price estimates shown to car buyers are as accurate as possible.
Dataset used for the analysis:
We combined 6 datasets, one per brand:
https://www.kaggle.com/mysarahmadbhat/linear-regression-in-depth/data
Exploration of the dataset
We needed a dataset containing used car prices along with multiple parameters describing each car. To get this, we merged 6 datasets, each covering a different brand.
The brands our model was trained on are:
- Audi
- BMW
- Ford
- Hyundai
- Mercedes
- Toyota
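As a rough illustration of the merge, the sketch below stacks per-brand tables with pandas and tags each row with its brand. The tiny in-memory frames, the column names, and the helper `merge_brand_frames` are assumptions made for the example, not the project's actual code.

```python
import pandas as pd

# Tiny stand-in frames. The real project would load one table per brand
# (for example with pd.read_csv); the columns here are hypothetical.
brand_frames = {
    "Audi": pd.DataFrame({"year": [2017, 2019], "price": [12500, 21000]}),
    "BMW": pd.DataFrame({"year": [2016], "price": [14000]}),
}


def merge_brand_frames(frames):
    """Stack per-brand frames into one frame, tagging each row with its brand."""
    tagged = [df.assign(brand=brand) for brand, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


cars = merge_brand_frames(brand_frames)
print(cars)
```

Keeping the brand as its own column means it can later be treated like any other categorical feature.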
Steps in EDA:
- Inspected the dataset by checking the number of rows and columns and producing a statistical summary.
- Cleaned the dataset: checked for missing values and found none.
- Visualized the dataset using bar charts, pie charts, line charts, scatter plots, and histograms.
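The inspection and missing-value checks above can be sketched in a few lines of pandas. The toy frame, its column names, and the helper `summarize` are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the merged dataset (column names are assumed).
cars = pd.DataFrame(
    {
        "year": [2012, 2015, 2018, 2020],
        "mileage": [90000, 60000, 30000, 10000],
        "price": [5000, 9000, 15000, 22000],
    }
)


def summarize(df):
    """Report row count, column count, and total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isnull().sum().sum()),
    }


summary = summarize(cars)
print(summary)
print(cars.describe())  # count, mean, std, min, quartiles, max per numeric column
```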
Insights from the dataset
Here are some of the insights we drew from our datasets.
The visualization above shows the price distribution of cars.
The visualization above shows:
- There is a strong negative correlation of -0.75 between car production year and car mileage
- There is a positive correlation of 0.52 between car production year and car price
- There is a positive correlation of 0.61 between car price and engine size
- There is a strong negative correlation of -0.88 between the petrol and diesel fuel types
- There is a weak correlation between car transmission types (automatic, semi-auto, manual, etc.) and car price
The visualization above shows that cars produced in recent years cost more than older cars, with the exception of vintage cars.
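Pairwise correlations like the ones listed above can be computed with pandas. This is a minimal sketch on made-up numbers, so the values it prints will not match the figures reported here; the column names are assumptions.

```python
import pandas as pd

# Made-up rows; only the direction of the relationships is meaningful here.
cars = pd.DataFrame(
    {
        "year": [2012, 2015, 2018, 2020],
        "mileage": [90000, 60000, 30000, 10000],
        "price": [5000, 9000, 15000, 22000],
        "model": ["A3", "Fiesta", "Yaris", "A4"],
    }
)

# Pearson correlation is only defined for numeric columns, so select those first.
corr = cars.select_dtypes("number").corr()
print(corr.round(2))
```

A matrix like this is typically what a correlation heatmap is drawn from.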
The above visualization shows the average car prices by year.
After merging all 6 datasets, we one-hot encoded all the categorical variables (such as brand, fuel type, and transmission), converting them into a numerical form that can be given directly as input to our ML model.
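One common way to one-hot encode categorical columns is `pandas.get_dummies`. The sketch below assumes hypothetical column names `brand` and `fuelType`; whether the project used this function or another encoder is not stated.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        "brand": ["Audi", "BMW", "Audi"],
        "fuelType": ["Petrol", "Diesel", "Petrol"],
        "price": [12500, 14000, 21000],
    }
)

# Each category becomes its own 0/1 indicator column; numeric columns pass through.
encoded = pd.get_dummies(cars, columns=["brand", "fuelType"], dtype=int)
print(encoded)
```

Because a row has exactly one value per categorical column, its indicator columns are mutually exclusive, which is why indicators such as petrol and diesel tend to be strongly negatively correlated.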
We experimented with GridSearchCV for hyperparameter tuning across several machine learning models, including Linear Regression, Lasso Regression, and Decision Tree.
After experimenting, we settled on Random Forest Regressor, as it gave us the highest accuracy. We then moved on to deployment, for which we created a web app that is user friendly and easy to use.
On the front end, we used Bulma (a CSS framework), HTML, and CSS.
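A model comparison of this kind might look roughly like the following sketch. The synthetic data, the parameter grids, and the helper `compare_models` are all assumptions chosen to keep the example small; they are not the grids or data used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two features with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Candidate models paired with illustrative hyperparameter grids.
candidates = {
    "linear": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4, 8]}),
}


def compare_models(candidates, X, y, cv=3):
    """Return the best cross-validated R^2 score found for each candidate."""
    scores = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=cv)
        search.fit(X, y)
        scores[name] = search.best_score_
    return scores


scores = compare_models(candidates, X, y)
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

For regressors, GridSearchCV scores with R^2 by default, so "best" here means the highest mean R^2 across the cross-validation folds.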
On the back end, Flask is used as the web framework. The ML model is saved in a pickle file, which is loaded only once at runtime. To get a prediction, the user fills in the details of their car and clicks submit.
The model takes the input data from the form and predicts a price, which is sent back to the front end.
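The request flow described here can be sketched as below. To stay self-contained, the example trains a tiny Random Forest and pickles it in memory instead of reading the project's pickle file. The `/predict` route, the form fields `year` and `mileage`, and the JSON response are assumptions (the real app presumably renders an HTML page).

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the saved model: fit a tiny forest and pickle it in memory.
# The real app would open its pickle file instead.
X = np.array([[2015, 60000], [2018, 30000], [2020, 10000], [2012, 90000]])
y = np.array([9000, 15000, 22000, 5000])
model_bytes = pickle.dumps(
    RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
)

app = Flask(__name__)
model = pickle.loads(model_bytes)  # unpickled once at import, reused by every request


@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical form fields; the real form has more inputs.
    features = [[float(request.form["year"]), float(request.form["mileage"])]]
    price = float(model.predict(features)[0])
    return jsonify({"predicted_price": round(price, 2)})


# Exercise the route without starting a server.
with app.test_client() as client:
    response = client.post("/predict", data={"year": "2019", "mileage": "20000"})
    print(response.get_json())
```

Loading the model at module level rather than inside the view function is what keeps unpickling from being repeated on each request.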
Conclusion
From the exploratory analysis, one interesting insight was that car prices declined steadily from the 1970s until the mid-1990s, fluctuated for a couple of years, and then started rising steadily again.