四个模型建模及数据分析整理(基于Titanic数据集)

目录

介绍: 

 二、数据

2.1引用数据

2.2检查缺失数据

2.2.1手动检查缺失数据

2.2.2查看某一个特征值为空数据

2.3补充缺失数据

 2.3.1盒图

 2.3.2手动用均值填补缺失数据

2.3.3手动用类别填补缺失数据

三、数据分析

3.1男女生存比例

3.2男女生存数

3.3船舱级别生存比例

 3.4船舱生存与死亡比例

3.5票价与生存关系

3.6年龄与生存关系

3.7性别年龄与生存关系

3.8性别、登口岸、年龄与生存关系

3.9性别、票价、年龄与生存关系

四、数据处理

 4.1处理性别

 4.2热图表相关性

4.3姓名处理

4.4处理家庭大小

4.5清除票

4.5处理票价

4.6数据整理

4.7用RandomForestRegressor预测年龄填补缺失数据

 五、建模

5.1LogisticRegression建模 

5.2Decision Tree建模

5.3randomforest建模

5.4Adaboost建模


介绍: 

AdaBoost(Adaptive Boosting)是一种集成学习方法,用于提高机器学习算法的准确性。它通过训练一系列弱分类器(即准确率略高于随机猜测的分类器),然后将它们组合成一个强分类器。

AdaBoost的基本思想是对训练样本进行加权,将分类错误的样本权值增大,再训练下一个分类器。经过多轮迭代,每个样本都会得到不同的权值,并且每个分类器都有一定的权重。最终,AdaBoost通过对所有分类器的加权求和,得到最终的分类结果。

在AdaBoost中,每个弱分类器的训练过程如下:

  1. 初始化样本权值为相等值。
  2. 使用权值训练一个弱分类器。
  3. 根据分类错误率计算该分类器的权重。
  4. 更新样本权值,增加分类错误的样本权值,并减小分类正确的样本权值。
  5. 重复步骤2到4,直到达到指定的弱分类器个数或达到某个条件。

最终的强分类器是将每个弱分类器的输出加权求和,并根据总体输出判断样本的分类。

AdaBoost模型有一些特点:

  1. 适用于二分类和多分类问题。
  2. 对于弱分类器的选择没有严格的限制,可以使用任何分类器作为基分类器。
  3. 在训练过程中,每个样本都被赋予不同的权重,以便更加关注错误分类的样本。
  4. 可以减少过拟合问题,因为它将注意力集中在难以分类的样本上。
  5. AdaBoost的训练速度较快,预测速度也较快。

总的来说,AdaBoost是一种强大的集成学习算法,可以通过组合多个弱分类器来提高分类的准确性。它在实际应用中取得了很好的效果,并广泛应用于各个领域。

 二、数据

2.1引用数据

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inlineimport warnings
warnings.filterwarnings('ignore')##忽略警告train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")passengerid = test.PassengerIdprint(train.info())
print('*'*80)
print(test.info())
'''结果:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):#   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  0   PassengerId  891 non-null    int64  1   Survived     891 non-null    int64  2   Pclass       891 non-null    int64  3   Name         891 non-null    object 4   Sex          891 non-null    object 5   Age          714 non-null    float646   SibSp        891 non-null    int64  7   Parch        891 non-null    int64  8   Ticket       891 non-null    object 9   Fare         891 non-null    float6410  Cabin        204 non-null    object 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
********************************************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):#   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  0   PassengerId  418 non-null    int64  1   Survived     418 non-null    int64  2   Pclass       418 non-null    int64  3   Name         418 non-null    object 4   Sex          418 non-null    object 5   Age          332 non-null    float646   SibSp        418 non-null    int64  7   Parch        418 non-null    int64  8   Ticket       418 non-null    object 9   Fare         417 non-null    float6410  Cabin        91 non-null     object 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
None
'''

2.2检查缺失数据

2.2.1手动检查缺失数据

def missing_percentage(df):#显示空缺数据total = df.isnull().sum().sort_values(ascending=False)percent = round(df.isnull().sum().sort_values(ascending=False)/len(df)*100,2)return pd.concat([total,percent],axis=1,keys=['Total','Percent'])missing_percentage(train)#检查缺失数据
'''结果:Total 	Percent
Cabin 	687 	77.10
Age 	177 	19.87
Embarked 	2 	0.22
PassengerId 	0 	0.00
Survived 	0 	0.00
Pclass 	0 	0.00
Name 	0 	0.00
Sex 	0 	0.00
SibSp 	0 	0.00
Parch 	0 	0.00
Ticket 	0 	0.00
Fare 	0 	0.00
'''missing_percentage(test)
'''结果:Total 	Percent
Cabin 	327 	78.23
Age 	86 	20.57
Fare 	1 	0.24
PassengerId 	0 	0.00
Survived 	0 	0.00
Pclass 	0 	0.00
Name 	0 	0.00
Sex 	0 	0.00
SibSp 	0 	0.00
Parch 	0 	0.00
Ticket 	0 	0.00
Embarked 	0 	0.00
'''
def percent_value_counts(df,feature):#统计一个特征值的特征类别percent = pd.DataFrame(round(df.loc[:,feature].value_counts(dropna=False,normalize=True)*100,2))total = pd.DataFrame(df.loc[:,feature].value_counts(dropna=False))total.columns=["Total"]percent.columns=['Percent']return pd.concat([total,percent],axis=1)percent_value_counts(train,'Embarked')'''结果:Total 	Percent
S 	644 	72.28
C 	168 	18.86
Q 	77 	    8.64
NaN 2 	    0.22
'''

2.2.2查看某一个特征值为空数据

train[train.Embarked.isnull()]#查看某一特征值为空的数据

2.3补充缺失数据

 2.3.1盒图

sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize=(16,12),ncols=2)#前面表示图的大小,后面表示两个图ax1 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=train, ax = ax[0]); 
ax2 = sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=test,ax = ax[1]); ax1.set_title("Training set", fontsize=18)
ax2.set_title('Test set', fontsize=18)## Fixing Legends
leg_1 = ax1.get_legend() 
leg_1.set_title("PClass") 
legs = leg_1.texts
legs[0].set_text('upper') 
legs[1].set_text('Middle') 
legs[2].set_text('Lower')fig.show( )#可以看均值,通过均值补充空的值

train.Embarked.fillna("C",inplace=True)#填充空数据##缺失数据的Pclass为1,票价为80,根据盒图可以看出Pclass为1,票价均值为80的c
#所以用c补充数据

 2.3.2手动用均值填补缺失数据

print("Train Cabin missing:" + str(train.Cabin.isnull().sum()/len(train.Cabin))) 
#结果:Train Cabin missing:0.7710437710437711'''缺失数据量大,把训练集和测试集合并'''
survivers = train.Survived
train.drop(["Survived"],axis=1, inplace=True)all_data = pd.concat([train,test],ignore_index=False)#train和test连接起来#* Assign all the null values to N
all_data.Cabin.fillna("N", inplace=True)#先用N补充空数据percent_value_counts(all_data,"Cabin")#统计一个特征值的特征个数
'''结果:Total 	Percent
N 	1014 	77.46
C 	94 	    7.18
B 	65 	    4.97
D 	46 	    3.51
E 	41 	    3.13
A 	22 	    1.68
F 	21 	    1.60
G 	5 	    0.38
T 	1 	    0.08
'''all_data.groupby("Cabin")['Fare'].mean().sort_values()#每个舱位的票价均值
'''结果:
Cabin
G     14.205000
F     18.079367
N     19.132707
T     35.500000
A     41.244314
D     53.007339
E     54.564634
C    107.926598
B    122.383078
Name: Fare, dtype: float64
'''
def cabin_estimator(i):#"""grouping cabin feature by the first letter"""a = 0if i<16:a="G"elif i>=16 and i<27:a="F"elif i>=27 and i<38:a="T"elif i>=38 and i<47:a="A"elif i>= 47 and i<53:a="E"elif i>= 53 and i<54:a="D"elif i>=54 and i<116:a='C'else:a="B"return a#把为N的挑出来,不为N挑出来
with_N = all_data[all_data.Cabin == "N"]without_N = all_data[all_data.Cabin != "N"]##opplying cabin estimator function.
with_N['Cabin'] = with_N.Fare.apply(lambda x: cabin_estimator(x))#根据票价赋值## getting back train.
all_data = pd.concat([with_N, without_N],axis=0)#连接在一起## PassengerId helps us separate train and test.
all_data.sort_values(by ='PassengerId', inplace=True)## Separating train and test from all_data. 
train = all_data[:891]#再分训练集和测试集test = all_data[891:]#adding saved target variable with train. 
train['Survived']=survivers

2.3.3手动用类别填补缺失数据

test[test.Fare.isnull()]
'''结果:PassengerId 	Pclass 	Name 	Sex 	Age 	SibSp 	Parch 	Ticket 	Fare 	Cabin 	Embarked 	Survived
152 	1044 	3 	Storey, Mr. Thomas 	male 	60.5 	0 	0 	3701 	NaN 	B 	S 	0.0
'''missing_value = test[(test.Pclass == 3)&(test.Embarked == 'S')&(test.Sex == 'male')].Fare.mean()#求这类乘客的票价均值test.Fare.fillna(missing_value,inplace=True)#将均值赋给空

三、数据分析

3.1男女生存比例

import seaborn as sns
pal = {'male':"green", 'female':"Pink"}
sns.set(style="darkgrid")
plt.subplots(figsize = (15,8))
ax = sns.barplot(x = "Sex", y = "Survived", data=train, palette = pal,linewidth=5,order = ['female','male'],capsize = .05,)plt.title("Survived/Non-Survived Passenger Gender Distribution", fontsize = 25,loc = 'center', pad = 40)
plt.ylabel("% of passenger survived", fontsize = 15, )
plt.xlabel("Sex",fontsize = 15);

3.2男女生存数

pal = {1:"seagreen", 0:"gray"}
sns.set(style="darkgrid")
plt.subplots(figsize = (15,8))
ax = sns.countplot(x = "Sex", hue="Survived",data = train, linewidth=4, palette = pal
)## Fixing title, xlabel and ylabel
plt.title("Passenger Gender Distribution - Survived vs Not-survived", fontsize = 25, pad=40)
plt.xlabel("Sex", fontsize = 15);
plt.ylabel("# of Passenger Survived", fontsize = 15)## Fixing xticks
#labels = ['Female', 'Male']
#plt.xticks(sorted(train.Sex.unique()), labels)## Fixing legends
leg = ax.get_legend()
leg.set_title("Survived")
legs = leg.texts
legs[0].set_text("No")
legs[1].set_text("Yes")
plt.show()

 

3.3船舱级别生存比例

plt.subplots(figsize = (15,10))
sns.barplot(x = "Pclass", y = "Survived", data=train, linewidth=6,capsize = .05,errcolor='blue',errwidth = 3)
plt.title("Passenger Class Distribution - Survived vs Non-Survived", fontsize = 25, pad=40)
plt.xlabel("Socio-Economic class", fontsize = 15);
plt.ylabel("% of Passenger Survived", fontsize = 15);
names = ['Upper', 'Middle', 'Lower']
#val = sorted(train.Pclass.unique())
val = [0,1,2] ## this is just a temporary trick to get the label right. 
plt.xticks(val, names);

 

 3.4船舱生存与死亡比例

# Kernel Density Plot
fig = plt.figure(figsize=(15,8),)
## I have included to different ways to code a plot below, choose the one that suites you. 
ax=sns.kdeplot(train.Pclass[train.Survived == 0] , color='blue',shade=True,label='not survived')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Pclass'] , color='g',shade=True, label='survived', )
plt.title('Passenger Class Distribution - Survived vs Non-Survived', fontsize = 25, pad = 40)
plt.ylabel("Frequency of Passenger Survived", fontsize = 15, labelpad = 20)
plt.xlabel("Passenger Class", fontsize = 15,labelpad =20)
## Converting xticks into words for better understanding
labels = ['Upper', 'Middle', 'Lower']
plt.xticks(sorted(train.Pclass.unique()), labels);

3.5票价与生存关系

# Kernel Density Plot
fig = plt.figure(figsize=(15,8),)
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'Fare'] , color='blue',shade=True,label='not survived')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Fare'] , color='g',shade=True, label='survived')
plt.title('Fare Distribution Survived vs Non Survived', fontsize = 25, pad = 40)
plt.ylabel("Frequency of Passenger Survived", fontsize = 15, labelpad = 20)
plt.xlabel("Fare", fontsize = 15, labelpad = 20);

 

3.6年龄与生存关系

# Kernel Density Plot
fig = plt.figure(figsize=(15,8),)
ax=sns.kdeplot(train.loc[(train['Survived'] == 0),'Age'] , color='blue',shade=True,label='not survived')
ax=sns.kdeplot(train.loc[(train['Survived'] == 1),'Age'] , color='g',shade=True, label='survived')
plt.title('Age Distribution - Surviver V.S. Non Survivors', fontsize = 25, pad = 40)
plt.xlabel("Age", fontsize = 15, labelpad = 20)
plt.ylabel('Frequency', fontsize = 15, labelpad= 20);

 

3.7性别年龄与生存关系

pal = {1:"seagreen", 0:"gray"}
g = sns.FacetGrid(train,  col="Sex", row="Survived", margin_titles=True, hue = "Survived",palette=pal)
g = g.map(plt.hist, "Age", edgecolor = 'white');
g.fig.suptitle("Survived by Sex and Age", size = 25)
plt.subplots_adjust(top=0.90)

3.8性别、登口岸、年龄与生存关系

g = sns.FacetGrid(train, col="Sex", row="Embarked", margin_titles=True, hue = "Survived",palette = pal)
g = g.map(plt.hist, "Age", edgecolor = 'white').add_legend();
g.fig.suptitle("Survived by Sex and Age", size = 25)
plt.subplots_adjust(top=0.90)

3.9性别、票价、年龄与生存关系

g = sns.FacetGrid(train,hue="Survived", col ="Sex", margin_titles=True,palette=pal,)
g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
g.fig.suptitle("Survived by Sex, Fare and Age", size = 25)
plt.subplots_adjust(top=0.85)

 

四、数据处理

 4.1处理性别

male_mean = train[train['Sex'] == 1].Survived.mean()female_mean = train[train['Sex'] == 0].Survived.mean()
print ("Male survival mean: " + str(male_mean))
print ("female survival mean: " + str(female_mean))print ("The mean difference between male and female survival rate: " + str(female_mean - male_mean))'''结果:
Male survival mean: 0.18890814558058924
female survival mean: 0.7420382165605095
The mean difference between male and female survival rate: 0.5531300709799203
'''# separating male and female dataframe. 
import random
male = train[train['Sex'] == 1]
female = train[train['Sex'] == 0]## empty list for storing mean sample
m_mean_samples = []
f_mean_samples = []for i in range(50):m_mean_samples.append(np.mean(random.sample(list(male['Survived']),50,)))f_mean_samples.append(np.mean(random.sample(list(female['Survived']),50,)))# Print them out
print (f"Male mean sample mean: {round(np.mean(m_mean_samples),2)}")
print (f"Female mean sample mean: {round(np.mean(f_mean_samples),2)}")
print (f"Difference between male and female mean sample mean: {round(np.mean(f_mean_samples) - np.mean(m_mean_samples),2)}")'''结果:
Male mean sample mean: 0.18
Female mean sample mean: 0.74
Difference between male and female mean sample mean: 0.56
'''
# Placing 0 for female and 
# 1 for male in the "Sex" column. 
train['Sex'] = train.Sex.apply(lambda x: 0 if x == "female" else 1)
test['Sex'] = test.Sex.apply(lambda x: 0 if x == "female" else 1)
train.describe()#数据

survived_summary = train.groupby("Survived")
survived_summary.mean().reset_index()
'''结果:Survived 	PassengerId 	Pclass 	Sex 	Age 	SibSp 	Parch 	Fare
0 	0 	447.016393 	2.531876 	0.852459 	30.626179 	0.553734 	0.329690 	22.117887
1 	1 	444.368421 	1.950292 	0.318713 	28.343690 	0.473684 	0.464912 	48.395408
'''survived_summary = train.groupby("Sex")
survived_summary.mean().reset_index()
'''结果:Sex 	PassengerId 	Pclass 	Age 	SibSp 	Parch 	Fare 	Survived
0 	0 	431.028662 	2.159236 	27.915709 	0.694268 	0.649682 	44.479818 	0.742038
1 	1 	454.147314 	2.389948 	30.726645 	0.429809 	0.235702 	25.523893 	0.188908
'''survived_summary = train.groupby("Pclass")
survived_summary.mean().reset_index()
'''结果:Pclass 	PassengerId 	Sex 	Age 	SibSp 	Parch 	Fare 	Survived
0 	1 	461.597222 	0.564815 	38.233441 	0.416667 	0.356481 	84.154687 	0.629630
1 	2 	445.956522 	0.586957 	29.877630 	0.402174 	0.380435 	20.662183 	0.472826
2 	3 	439.154786 	0.706721 	25.140620 	0.615071 	0.393075 	13.675550 	0.242363
'''pd.DataFrame(abs(train.corr()["Survived"]).sort_values(ascending=False))#相关性
'''结果:Survived
Survived 	1.000000
Sex 	0.543351
Pclass 	0.338481
Fare 	0.257307
Parch 	0.081629
Age 	0.077221
SibSp 	0.035322
PassengerId 	0.005007
'''

 4.2热图表相关性

corr = train.corr()**2
corr.Survived.sort_values(ascending = False)## heatmeap to see the correlatton between features
# Generate a mask for the upper triangle (taken from seaborn example gallery)
import numpy as np
mask = np.zeros_like(train.corr(),dtype=bool)
mask[np.triu_indices_from(mask)]  = True
sns.set_style("whitegrid")plt.subplots(figsize = (15,12))
sns.heatmap(train.corr(),annot=True,mask = mask,cmap = 'RdBu', ## in order to reverse the bar replace "RdBu" with "RdBur'linewidths=.9,linecolor="white",fmt='.2g',center = 0,square=True)
plt.title("correlations Among Features", y = 1.03,fontsize = 20, pad = 40);

4.3姓名处理

# Creating a new colomn with a 
train['name_length'] = [len(i) for i in train.Name]
test['name_length'] = [len(i) for i in test.Name]def name_length_group(size):a = ''if (size <=20):a = 'short'elif (size <=35):a = 'medium'elif (size <=45):a = 'good'else:a = 'long'return atrain['nLength_group'] = train['name_length'].map(name_length_group)
test['nLength_group'] = test['name_length'].map(name_length_group)## Here "map" is python's built-in function. 
## "map" function basically takes a function and 
## returns an iterable list/tuple or in this case series. 
## However,"map" can also be used like map(function) e.g. map(name_length_group) 
## or map(function, iterable{list, tuple}) e.g. map(name_length_group, train[feature]]). 
## However, here we don't need to use parameter("size") for name_length_group because when we 
## used the map function like ".map" with a series before dot, we are basically hinting that series 
## and the iterable. This is similar to .append approach in python. list.append(a) meaning applying append on list. ## cuts the column by given bins based on the range of name_length
#group_names = ['short', 'medium', 'good', 'long']
#train['name_len_group'] = pd.cut(train['name_length'], bins = 4, labels=group_names)
## get the title from the name
train["title"] = [i.split('.')[0] for i in train.Name]
train["title"] = [i.split(',')[1] for i in train.title]
## Whenever we split like that, there is a good change that 
#we will end up with white space around our string values. Let's check that.print(train.title.unique())
'''结果:
[' Mr' ' Mrs' ' Miss' ' Master' ' Don' ' Rev' ' Dr' ' Mme' ' Ms' ' Major'' Lady' ' Sir' ' Mlle' ' Col' ' Capt' ' the Countess' ' Jonkheer']
'''## Let's fix that
train.title = train.title.apply(lambda x: x.strip())## We can also combile all three lines above for test set here
test['title'] = [i.split('.')[0].split(',')[1].strip() for i in test.Name]## However it is important to be able to write readable code, and the line above is not so readable. ## Let's replace some of the rare values with the keyword 'rare' and other word choice of our own. 
## train Data
train["title"] = [i.replace('Ms', 'Miss') for i in train.title]
train["title"] = [i.replace('Mlle', 'Miss') for i in train.title]
train["title"] = [i.replace('Mme', 'Mrs') for i in train.title]
train["title"] = [i.replace('Dr', 'rare') for i in train.title]
train["title"] = [i.replace('Col', 'rare') for i in train.title]
train["title"] = [i.replace('Major', 'rare') for i in train.title]
train["title"] = [i.replace('Don', 'rare') for i in train.title]
train["title"] = [i.replace('Jonkheer', 'rare') for i in train.title]
train["title"] = [i.replace('Sir', 'rare') for i in train.title]
train["title"] = [i.replace('Lady', 'rare') for i in train.title]
train["title"] = [i.replace('Capt', 'rare') for i in train.title]
train["title"] = [i.replace('the Countess', 'rare') for i in train.title]
train["title"] = [i.replace('Rev', 'rare') for i in train.title]## Now in programming there is a term called DRY(Don't repeat yourself), whenever we are repeating  
## same code over and over again, there should be a light-bulb turning on in our head and make us think
## to code in a way that is not repeating or dull. Let's write a function to do exactly what we 
## did in the code above, only not repeating and more interesting.## we are writing a function that can help us modify title column
def fuse_title(feature):"""This function helps modifying the title column"""result = ''if feature in ['the Countess','Capt','Lady','Sir','Jonkheer','Don','Major','Col', 'Rev', 'Dona', 'Dr']:result = 'rare'elif feature in ['Ms', 'Mlle']:result = 'Miss'elif feature == 'Mme':result = 'Mrs'else:result = featurereturn resulttest.title = test.title.map(fuse_title)
train.title = train.title.map(fuse_title)

4.4处理家庭大小

## bin the family size. 
def family_group(size):"""This funciton groups(loner, small, large) family based on family size"""a = ''if (size <= 1):a = 'loner'elif (size <= 4):a = 'small'else:a = 'large'return a## apply the family_group function in family_size
train['family_group'] = train['family_size'].map(family_group)
test['family_group'] = test['family_size'].map(family_group)train['is_alone'] = [1 if i<2 else 0 for i in train.family_size]
test['is_alone'] = [1 if i<2 else 0 for i in test.family_size]

4.5清除票

train.drop(['Ticket'],axis=1,inplace=True)
test.drop(['Ticket'],axis=1,inplace=True)

4.5处理票价

## Calculating fare based on family size. 
train['calculated_fare'] = train.Fare/train.family_size
test['calculated_fare'] = test.Fare/test.family_sizedef fare_group(fare):"""This function creates a fare group based on the fare provided"""a= ''if fare <= 4:a = 'Very_low'elif fare <= 10:a = 'low'elif fare <= 20:a = 'mid'elif fare <= 45:a = 'high'else:a = "very_high"return atrain['fare_group'] = train['calculated_fare'].map(fare_group)
test['fare_group'] = test['calculated_fare'].map(fare_group)#train['fare_group'] = pd.cut(train['calculated_fare'], bins = 4, labels=groups)

4.6数据整理

train.drop(['PassengerId'], axis=1, inplace=True)test.drop(['PassengerId'], axis=1, inplace=True)train = pd.get_dummies(train, columns=['title',"Pclass", 'Cabin','Embarked','nLength_group', 'family_group', 'fare_group'], drop_first=False)
test = pd.get_dummies(test, columns=['title',"Pclass",'Cabin','Embarked','nLength_group', 'family_group', 'fare_group'], drop_first=False)
train.drop(['family_size','Name', 'Fare','name_length'], axis=1, inplace=True)
test.drop(['Name','family_size',"Fare",'name_length'], axis=1, inplace=True)## rearranging the columns so that I can easily use the dataframe to predict the missing age values. 
train = pd.concat([train[["Survived", "Age", "Sex","SibSp","Parch"]], train.loc[:,"is_alone":]], axis=1)
test = pd.concat([test[["Age", "Sex"]], test.loc[:,"SibSp":]], axis=1)train

4.7用RandomForestRegressor预测年龄填补缺失数据

#用RandomForestRegressor预测年龄
# Importing RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor## writing a function that takes a dataframe with missing values and outputs it by filling the missing values. 
def completing_age(df):## gettting all the features except survivedage_df = df.loc[:,"Age":] temp_train = age_df.loc[age_df.Age.notnull()] ## df with age valuestemp_test = age_df.loc[age_df.Age.isnull()] ## df without age valuesy = temp_train.Age.values ## setting target variables(age) in y x = temp_train.loc[:, "Sex":].valuesrfr = RandomForestRegressor(n_estimators=1500, n_jobs=-1)rfr.fit(x, y)predicted_age = rfr.predict(temp_test.loc[:, "Sex":])df.loc[df.Age.isnull(), "Age"] = predicted_agereturn df## Implementing the completing_age function in both train and test dataset. 
completing_age(train)
completing_age(test);## Let's look at the his
plt.subplots(figsize = (22,10),)
sns.distplot(train.Age, bins = 100, kde = True, rug = False, norm_hist=False);

## create bins for age
def age_group_fun(age):"""This function creates a bin for age"""a = ''if age <= 1:a = 'infant'elif age <= 4: a = 'toddler'elif age <= 13:a = 'child'elif age <= 18:a = 'teenager'elif age <= 35:a = 'Young_Adult'elif age <= 45:a = 'adult'elif age <= 55:a = 'middle_aged'elif age <= 65:a = 'senior_citizen'else:a = 'old'return a## Applying "age_group_fun" function to the "Age" column.
train['age_group'] = train['Age'].map(age_group_fun)
test['age_group'] = test['Age'].map(age_group_fun)## Creating dummies for "age_group" feature. 
train = pd.get_dummies(train,columns=['age_group'], drop_first=True)
test = pd.get_dummies(test,columns=['age_group'], drop_first=True);

 五、建模

# separating our independent and dependent variable
X = train.drop(['Survived'], axis = 1)
y = train["Survived"]from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = .33, random_state=0)# Feature Scaling 数据差异大,进行标准化
## We will be using standardscaler to transform
from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()## transforming "train_x"
X_train = std_scale.fit_transform(X_train)
## transforming "test_x"
X_test = std_scale.transform(X_test)## transforming "The testset"
#test = st_scale.transform(test)

5.1LogisticRegression建模 

##LogisticRegression建模# import LogisticRegression model in python. 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score## call on the model object
logreg = LogisticRegression(solver='liblinear',penalty= 'l1',random_state = 42)## fit the model with "train_x" and "train_y"
logreg.fit(X_train,y_train)y_pred = logreg.predict(X_test)from sklearn.metrics import classification_report, confusion_matrix
# printing confision matrix
pd.DataFrame(confusion_matrix(y_test,y_pred),\columns=["Predicted Not-Survived", "Predicted Survived"],\index=["Not-Survived","Survived"] )'''结果:Predicted Not-Survived 	Predicted Survived
Not-Survived 	156 	           28
Survived 	     26 	           85
'''
from sklearn.metrics import classification_report, balanced_accuracy_score
print(classification_report(y_test, y_pred))'''结果:precision    recall  f1-score   support0       0.86      0.85      0.85       1841       0.75      0.77      0.76       111accuracy                           0.82       295macro avg       0.80      0.81      0.81       295
weighted avg       0.82      0.82      0.82       295
'''from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
#结果:0.8169491525423729

5.2Decision Tree建模

##Decision Tree建模
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
max_depth = range(1,30)
max_feature = [21,22,23,24,25,26,28,29,30,'auto']
criterion=["entropy", "gini"]param = {'max_depth':max_depth, 'max_features':max_feature, 'criterion': criterion}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid = param, verbose=False, cv=StratifiedShuffleSplit(n_splits=20, random_state=15),n_jobs = -1)
grid.fit(X, y) print( grid.best_params_)
print (grid.best_score_)
print (grid.best_estimator_)
'''结果:
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 30}
0.828888888888889
DecisionTreeClassifier(criterion='entropy', max_depth=5, max_features=30)
'''dectree_grid = grid.best_estimator_
## using the best found hyper paremeters to get the score. 最好的模型参数
dectree_grid.score(X,y)#结果:0.8585858585858586

5.3randomforest建模

##randomforest建模
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
n_estimators = [140,145,150,155,160];
max_depth = range(1,10);
criterions = ['gini', 'entropy'];
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)parameters = {'n_estimators':n_estimators,'max_depth':max_depth,'criterion': criterions}
grid = GridSearchCV(estimator=RandomForestClassifier(max_features='auto'),param_grid=parameters,cv=cv,n_jobs = -1)
grid.fit(X,y) print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)
'''结果:
0.8384328358208956
{'criterion': 'entropy', 'max_depth': 7, 'n_estimators': 160}
RandomForestClassifier(criterion='entropy', max_depth=7, max_features='auto',n_estimators=160)
'''rf_grid = grid.best_estimator_
rf_grid.score(X,y)
#结果:0.8742985409652076

5.4Adaboost建模

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier# 创建一个决策树分类器作为基本估计器
base_estimator = DecisionTreeClassifier()# 创建AdaBoost分类器并将决策树分类器作为基本估计器
ada_boost = AdaBoostClassifier(base_estimator=base_estimator)# 使用AdaBoost分类器进行训练和预测
ada_boost.fit(X_train, y_train)predictions = ada_boost.predict(X_test)from sklearn.metrics import accuracy_scoreaccuracy_score(y_test, y_pred)
#结果:0.8169491525423729from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplitn_estimators = [80,100,140,145,150,160, 170,175,180,185];
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
learning_r = [0.1,1,0.01,0.5]parameters = {'n_estimators':n_estimators,'learning_rate':learning_r}base_estimator = DecisionTreeClassifier()grid = GridSearchCV(AdaBoostClassifier(base_estimator= base_estimator, ## If None, then the base estimator is a decision tree.),param_grid=parameters,cv=cv,n_jobs = -1)
grid.fit(X,y) ada_grid = grid.best_estimator_
ada_grid.score(X,y)
#结果:0.9887766554433222

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/601386.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

ROS学习笔记(11)进一步深入了解ROS第五步

0.前提 我在学习宾夕的ROS公开课的时候发现&#xff0c;外国的对计算机的教育和国内的是完全不一样的&#xff0c;当你接触了外国的课程后回头看自己学的会发现好像自己啥也没学。我这里可以放出来给大家看一下。 1.Python and C 2.Python PDB Tutorial&#xff1a;Python Deb…

PyTorch|构建自己的卷积神经网络

如何搭建网络&#xff0c;这在深度学习中非常重要。简单来讲&#xff0c;我们是要实现一个类&#xff0c;这个类中有属性和方法&#xff0c;能够进行计算。 一般来讲&#xff0c;使用PyTorch创建神经网络需要三步&#xff1a; 继承基类&#xff1a;nn.Module 定义层属性 实现…

动态规划(不同路径1,不同路径2,整数拆分)

62.不同路径 力扣题目链接(opens new window) 一个机器人位于一个 m x n 网格的左上角 &#xff08;起始点在下图中标记为 “Start” &#xff09;。 机器人每次只能向下或者向右移动一步。机器人试图达到网格的右下角&#xff08;在下图中标记为 “Finish” &#xff09;。…

手机上连网络转接app,电脑连接手机,共用网络转接app的办法

方法一&#xff0c;&#xff08;不推荐&#xff09; 因为太简单了所以写一下 电脑安装MuMu模拟器&#xff0c;之后安装网络转接app&#xff0c;这个模拟器设置了从电脑上安装app和&#xff0c;安卓与电脑同步文件夹功能&#xff0c;实现文件共享。所以直接用就可以了。 方法二…

启动 Mac 时显示闪烁的问号

启动 Mac 时显示闪烁的问号 如果启动时在 Mac 屏幕上看到闪烁的问号&#xff0c;这意味着你的 Mac 无法找到自身的系统软件。 如果 Mac 启动时出现闪烁的问号且无法继续启动&#xff0c;请尝试以下步骤。 1.通过按住其电源按钮几秒钟来关闭 Mac。 2.按一下电源按钮&#xf…

你珍藏的那个表情包女孩,现在滤镜碎了一地。

♥ 为方便您进行讨论和分享&#xff0c;同时也为能带给您不一样的参与感。请您在阅读本文之前&#xff0c;点击一下“关注”&#xff0c;非常感谢您的支持&#xff01; 文|猴哥聊娱乐 编辑|侯欢庭 七年前&#xff0c;一个年仅三岁的小女孩以其无邪的表情包风靡网络&#xff0…

FindMy技术用于键盘

键盘是我们生活中不可或缺的输入工具&#xff0c;是人与计算机之间沟通的桥梁&#xff0c;无论是编写文档、浏览网页、玩游戏、或是进行复杂的数据分析&#xff0c;键盘都在其中发挥着关键的作用。此外&#xff0c;键盘还是各种软件的快捷键操作的关键。通过熟练地运用快捷键&a…

大学物理-实验篇——测量误差与数据处理(测量分类、误差、有效数字、逐差法)

目录 测量分类 测量次数角度 测量条件角度 误差 误差分类 系统误差 随机误差 异常值 误差描述 精密度&#xff08;Precision&#xff09; 正确度&#xff08;Trueness&#xff09; 准确度/精确度&#xff08;Accuracy&#xff09; 随机误差的处理 直接测量 算术…

Opencv(C++)学习之cv::calcHist 任意bin数量进行直方图计算

**背景&#xff1a;**当前网上常见的直方图使用方法都是默认使用256的范围&#xff0c;而对于使用特定范围的直方图方法讲的不够清楚。仔细研究后总结如下&#xff1a; 1、常见使用方法&#xff0c;直接对灰度图按256个Bin进行计算。 Mat mHistUn; int channels[1] { 0 }; {…

rollup 原理解析

✨专栏介绍 Rollup专栏是一个专门介绍Rollup打包工具的系列文章。Rollup是一个现代化的JavaScript模块打包工具&#xff0c;它可以将多个模块打包成一个或多个文件&#xff0c;以提高应用程序的性能和加载速度。 在Rollup专栏中&#xff0c;您将学习到如何安装和配置Rollup&a…

Ubuntu18 安装chatglm2-6b

记了下Ubuntu18 上安装chatglm2-6遇到的问题。 环境&#xff1a;Ubuntu18.04 V100(显卡) nvcc 11.6 显卡驱动cudacudnnaniconda chatglm6b 的安装 网上有很多&#xff0c; 不记录 了。 chatglm2-6b 我从别的地方拷贝的&#xff0c; 模型也包含了。 遇到的问题&#xf…

vue简单实现滚动条

背景&#xff1a;产品提了一个需求在一个详情页&#xff0c;一个form表单元素太多了&#xff0c;需要滚动到最下面才能点击提交按钮&#xff0c;很不方便。他的方案是&#xff0c;加一个滚动条&#xff0c;这样可以直接拉到最下面。 优化&#xff1a;1、支持滚动条&#xff0c;…

阿里云2核2G3M配置的云服务器可以搭建几个网站?

阿里云2核2G3M配置的云服务器可以搭建几个网站&#xff1f;对于这个问题&#xff0c;没有一个具体的答案&#xff0c;因为这取决于您的网站的流量和复杂程度。在一般情况下&#xff0c;这个配置可以支持搭建几个中小型网站。若您的网站需要大量的资源或处理高并发请求&#xff…

机器学习股票崩盘预测模型(企业建模_论文科研)AI model for stock crash prediction

对齐颗粒度&#xff0c;打通股票崩盘底层逻辑&#xff0c;形成一套组合拳&#xff0c;形成信用评级机制良性生态圈&#xff0c;重振股市信心&#xff01;--中国股市新展望&#xff01;By Toby&#xff01;2024.1.3 综合介绍 股票崩盘&#xff0c;是指证券市场上由于某种原因&a…

ctfshow——PHP特性

文章目录 web 89web 90web 91web 92web 93web 94web 95web 96web 97web 98web 99web 100——优先级、eval()用法web 101——RefelctionClass反射类web 102——php伪协议、hex2bin()web103web 104——sha1绕过web 105 web 89 使用人工分配 ID 键的数值型数组绕过preg_match. 两个…

Vue2 实现内容拖拽或添加 HTML 到 Tinymce 富文本编辑器的高级功能详解

在 Web 开发中&#xff0c;Tinymce 被广泛应用作为富文本编辑器。除了基础的文本编辑功能&#xff0c;Tinymce 还提供了一系列高级功能&#xff0c;使得文本编辑更加灵活和便捷。本文将介绍如何在 Tinymce 中实现一些高级功能&#xff0c;并深入了解每个工具的使用。 Tinymce …

Python中的eval和exec函数:深度解析两者的区别与使用场景

概要 Python中的eval和exec函数&#xff0c;它们都是非常强大的工具&#xff0c;用于动态执行代码。然而&#xff0c;它们在用途、用法和安全性方面存在显著的区别。在本文中&#xff0c;将深入探讨eval和exec函数的区别、用法以及示例代码&#xff0c;以帮助大家更好地理解和…

用单片机设计PLC电路图

自记&#xff1a; 以下为PMOS推挽输出及集成块光耦&#xff1a;

MediaPipeUnityPlugin Win10环境搭建(22年3月的记录,新版本已完全不同,这里只做记录)

https://github.com/homuler/MediaPipeUnityPlugin You cannot build libraries for Android with the following steps. 1、安装msys2配置系统环境变量Path添加 C:\msys64\usr\bin 执行 pacman -Su 执行 pacman -S git patch unzip 2、安装Python3.9.10 勾选系统环境变量 …

【性能测试入门】详解客户端性能测试和服务器端性能测试!

一&#xff1a;客户端性能测试和服务器端性能测试 客户端性能测试和服务器端性能测试是两个不同但相关的概念: 客户端性能测试: - 测试应用程序客户端(如Web浏览器、移动应用等)的性能,例如加载时间,响应时间等。 - 测试在不同系统配置(CPU、内存、网络等)下客户端的运行性…