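Note: This article is reproduced from Venus AI, a professional artificial intelligence community.
For more AI material, see the original site ([www.aideeplearning.cn]).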
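Problem Description
Car price prediction aims to estimate the selling price of a car on the used-car market. It involves analyzing the factors that influence price, such as brand, age, and performance specifications. Accurate predictions matter both to sellers setting a price and to buyers planning a budget.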
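Project Goal
The main goal of this project is to build a model that predicts a car's market value from its features. The model should handle both numerical and categorical data, and it should balance prediction accuracy against computational cost.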
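Project Applications
- Used-car trading: help buyers and sellers understand the fair market value of a specific vehicle.
- Vehicle appraisal: give appraisal companies an automated valuation tool.
- Market analysis: analyze market trends and forecast future values.
- Personal decision support: help individuals make better-informed decisions when buying or selling a car.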
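Dataset Description
The dataset contains the following features:
car ID, symboling (insurance risk rating), car name, fuel type, aspiration, number of doors, body style, drive wheels, engine location, wheelbase, car length, car width, car height, curb weight, engine type, number of cylinders, engine size, fuel system, bore ratio, stroke, compression ratio, horsepower, peak RPM, city mpg, and highway mpg. The remaining column, price, is the prediction target.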
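Model Selection and Library Dependencies
Models used in this project:
- Linear regression
- Decision tree regression
- Random forest regression
Scientific computing libraries this project depends on: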
- matplotlib==3.7.1
- pandas==2.0.2
- scikit_learn==1.2.2
- seaborn==0.13.0
Detailed Project Code
#imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
data = pd.read_csv('car_price.csv')
data.head(10)
1. Exploring the Data
print("Rows: ",data.shape[0])
print("Columns: ",data.shape[1])
Rows: 205
Columns: 26
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB
data.isna().sum()
# No missing values
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
data.duplicated().sum()
# No duplicate rows
0
data.groupby("CarName").sum(numeric_only=True)
# Drop car_ID and CarName because they add little value to the regression task
data = data.drop(['car_ID','CarName'],axis=1)
data.head(1)
sns.histplot(data=data, x="price")
<Axes: xlabel='price', ylabel='Count'>
plt.figure(figsize=(15,7))
sns.heatmap(data.corr(numeric_only=True), annot=True)
plt.title("Data Correlation",size=15)
plt.show()
# Effect of fuel type on price
sns.barplot(x="fueltype", y="price", data=data)
<Axes: xlabel='fueltype', ylabel='price'>
# Effect of body style on price
sns.boxplot(x ="carbody", y ="price", data = data)
<Axes: xlabel='carbody', ylabel='price'>
# Effect of the number of doors on price
sns.boxplot(x ="doornumber", y ="price", data = data)
<Axes: xlabel='doornumber', ylabel='price'>
# Effect of drive wheels (FWD, RWD, AWD) on price
sns.boxplot(x ="drivewheel", y ="price", data = data)
<Axes: xlabel='drivewheel', ylabel='price'>
# Plot pairwise relationships between the attributes that the heatmap shows as most correlated
columns=data[['wheelbase','carlength','carwidth','curbweight','price']]
sns.pairplot(columns)
plt.show()
#linear relationship
columns=data[['horsepower','citympg','highwaympg','price']]
sns.pairplot(columns)
plt.show()
#linear relationship
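The following groups of attributes show linear relationships:
1. Wheelbase, car length, car width, curb weight, and price (essentially all of the physical dimensions)
2. Horsepower, city mpg, highway mpg, and price (essentially all of the attributes related to engine power)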
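The visual impression from the pair plots can be checked numerically with Pearson's correlation coefficient. The following is a minimal sketch on a made-up frame: the name `toy` and its values are illustrative assumptions, not rows from car_price.csv. It ranks each column by the strength of its linear relationship with price.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not taken from car_price.csv.
toy = pd.DataFrame({
    "curbweight": [2000, 2300, 2600, 2900, 3200],
    "citympg":    [38, 31, 26, 22, 18],
    "price":      [6000, 8500, 12000, 16500, 21000],
})

# Pearson correlation of every column with price, strongest (by absolute value) first.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

A coefficient near +1 or -1 indicates a strong linear relationship. In this toy frame the fuel-economy column is negatively correlated with price, so the sign matters as much as the magnitude.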
2. Training the Models
encoder = LabelEncoder()
data['fueltype'] = encoder.fit_transform(data['fueltype'])
fueltype = {index : label for index, label in enumerate(encoder.classes_)}
data['aspiration'] = encoder.fit_transform(data['aspiration'])
aspiration = {index : label for index, label in enumerate(encoder.classes_)}
data['doornumber'] = encoder.fit_transform(data['doornumber'])
doornumber = {index : label for index, label in enumerate(encoder.classes_)}
data['carbody'] = encoder.fit_transform(data['carbody'])
carbody = {index : label for index, label in enumerate(encoder.classes_)}
data['drivewheel'] = encoder.fit_transform(data['drivewheel'])
drivewheel = {index : label for index, label in enumerate(encoder.classes_)}
data['enginelocation'] = encoder.fit_transform(data['enginelocation'])
enginelocation = {index : label for index, label in enumerate(encoder.classes_)}
data['fuelsystem'] = encoder.fit_transform(data['fuelsystem'])
fuelsystem = {index : label for index, label in enumerate(encoder.classes_)}
data['enginetype'] = encoder.fit_transform(data['enginetype'])
enginetype = {index : label for index, label in enumerate(encoder.classes_)}
data['cylindernumber'] = encoder.fit_transform(data['cylindernumber'])
cylindernumber = {index : label for index, label in enumerate(encoder.classes_)}
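LabelEncoder assigns integer codes in alphabetical order of the labels. Tree-based models usually tolerate this, but a linear model reads the codes as magnitudes, which implies an ordering the categories do not have. One common alternative, which the original project does not use, is one-hot encoding. Below is a minimal sketch with `pandas.get_dummies` on a made-up frame; `toy` and `categorical_cols` are names assumed for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "carbody": ["sedan", "hatchback", "wagon"],
    "horsepower": [111, 56, 102],
})

categorical_cols = ["fueltype", "carbody"]

# One indicator column per category; drop_first avoids a redundant column per feature.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With `drop_first=True`, the first category of each feature (in sorted order) becomes the implicit baseline, which avoids perfectly collinear columns in a linear regression.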
x = data.drop('price', axis=1)
y = data['price']
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(x)
#train, test split (use the scaled features X; an integer test_size holds out that many rows, here 30)
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=30,random_state=0)
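The scaler above is fitted on all rows before the split, so the minimum and maximum of the held-out rows influence how the training data is scaled. A common way to avoid this is to split first and fit the scaler on the training rows only, for example by wrapping it in a pipeline. The sketch below assumes synthetic data; `X_demo`, `y_demo`, and `pipe` are illustrative names, not part of the original project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the car data: 205 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(205, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1, size=205)

# Split first, holding out 30 rows as in the article.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=30, random_state=0)

# The pipeline fits MinMaxScaler on the training rows only, then the regressor.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
print(f"test R^2: {test_r2:.3f}")
```

At prediction time the pipeline applies the training-set scaling to the test rows, so no information from the held-out rows reaches the fitting step.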
- Random forest regression
rf = RandomForestRegressor(n_estimators=100,max_depth=5, random_state=33)
rf.fit(x_train, y_train)
print("Training r2_score: ",rf.score(x_train, y_train))
print("Testing r2_score: ",rf.score(x_test, y_test))
Training r2_score: 0.9753559007565417
Testing r2_score: 0.87367804775233
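A fitted random forest also exposes impurity-based feature importances through its `feature_importances_` attribute, which can hint at which columns drive the predictions. The sketch below is an illustration on synthetic data, not the article's result: `demo`, `target`, and `forest` are assumed names, and the target is built to depend on one column only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on enginesize only; doornumber is noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "enginesize": rng.uniform(60, 330, 200),
    "doornumber": rng.integers(0, 2, 200),
})
target = 150 * demo["enginesize"] + rng.normal(0, 500, 200)

# Same hyperparameters as the article's random forest.
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=33)
forest.fit(demo, target)

# Importances are normalized to sum to 1; larger means more impurity reduction.
importances = pd.Series(forest.feature_importances_, index=demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking rather than an exact measure.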
- Decision tree regression
dt = DecisionTreeRegressor(max_depth=5, random_state=33)
dt.fit(x_train, y_train)
print('Training r2_score: ' , dt.score(x_train, y_train))
print('Testing r2_score: ' , dt.score(x_test, y_test))
Training r2_score: 0.9735394081185511
Testing r2_score: 0.8226507572837073
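With 205 rows and a 30-row test set, a single split gives a fairly noisy estimate, and the gap between the training and testing scores suggests some overfitting. K-fold cross-validation is one way to get a more stable figure. The sketch below runs on synthetic data; `X_demo`, `y_demo`, `cv`, and `tree` are assumed names, and the printed scores are not results for the car dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 205 rows, 4 features, a nonlinear target with noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(205, 4))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(0, 0.5, size=205)

# 5 shuffled folds; every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
tree = DecisionTreeRegressor(max_depth=5, random_state=33)

scores = cross_val_score(tree, X_demo, y_demo, cv=cv, scoring="r2")
print(f"mean R^2: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The standard deviation across folds gives a sense of how much a single train/test split could vary by chance.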
- Linear regression
def evaluate(model, x_train, y_train, x_test, y_test, y_predict):
    # Print R^2 on the training set and on the test-set predictions
    print(f'train r2_score:{r2_score(y_train, model.predict(x_train))}')
    print(f'test r2_score : {r2_score(y_test, y_predict)}')
model = LinearRegression()
model.fit(x_train,y_train)
y_predict=model.predict(x_test)
evaluate(model,x_train , y_train, x_test , y_test, y_predict)
train r2_score:0.889157847638672
test r2_score : 0.7289860743863041
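The imports at the top also bring in `mean_absolute_error` and `mean_squared_error`, which the evaluation above does not use. Because R² is unitless, reporting errors in price units alongside it can be easier to interpret. The sketch below uses made-up prices and predictions; `y_true` and `y_pred` are illustrative values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true prices and predictions, for illustration only.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 23000.0])

mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large errors more
r2 = r2_score(y_true, y_pred)                        # share of variance explained

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

RMSE is never smaller than MAE, and a large gap between the two indicates that a few predictions are far off.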
Project Resources
For details, see the car price regression prediction project on VenusAI (aideeplearning.cn).