Happiness and Life Satisfaction

What is the purpose of life? Is it to be happy? Why do people go through all the pain and hardship? Is it to achieve happiness in some way?

I’m not the only person who believes the purpose of life is happiness. If you look around, most people are pursuing happiness in their lives.

On March 20th, the world celebrates the International Day of Happiness. The 2020 World Happiness Report ranked 156 countries by how happy their citizens perceive themselves to be, based on their evaluations of their own lives. The rankings of national happiness are based on a Cantril ladder survey: nationally representative samples of respondents are asked to imagine a ladder on which the best possible life for them is a 10 and the worst possible life is a 0, and then to rate their own current lives on that 0-to-10 scale. The report correlates the results with various life factors. In the reports, experts in economics, psychology, survey analysis, and national statistics describe how well-being measurements can be used effectively to assess national progress, among other topics.

So, how happy are people today? Were people happier in the past? How satisfied with their lives are people in different societies? How do our living conditions affect all of this?

Features Analyzed

  • GDP: GDP per capita is a measure of a country’s economic output that accounts for its number of people.

  • Support: Social support means having friends and other people, including family, to turn to in times of need or crisis, giving you a broader focus and a positive self-image. Social support enhances quality of life and provides a buffer against adverse life events.

  • Health: Healthy life expectancy is the average number of years that a newborn can expect to live in “full health”, that is, not hampered by disabling illnesses or injuries.

  • Freedom: Freedom of choice describes an individual’s opportunity and autonomy to perform an action selected from at least two available options, unconstrained by external parties.

  • Generosity: Defined as the residual of regressing the national average of responses to the question “Have you donated money to a charity in the past month?” on GDP per capita.

  • Corruption: The Corruption Perceptions Index (CPI) is an index published annually by Transparency International since 1995, which ranks countries “by their perceived levels of public sector corruption, as determined by expert assessments and opinion surveys.”

Outline:

  1. Import Modules, Read the Dataset and Define an Evaluation Table
  2. Define a Function to Calculate the Adjusted R²
  3. How Happiness Score Is Distributed
  4. The Relationship Between Different Features and Happiness Score
  5. Visualize and Examine Data
  6. Multiple Linear Regression
  7. Conclusion

Grab yourself a coffee, and join me on this journey towards predicting happiness!

1. Import Modules, Read the Dataset and Define an Evaluation Table

To do the analysis, we need to set up our environment. First, we import some modules and read the data. The output below is the head of the data; if you want to see more details, you can remove the # signs in front of df_15.describe() and df_15.info().

# FOR NUMERICAL ANALYTICS
import numpy as np
# TO STORE AND PROCESS DATA IN DATAFRAME
import pandas as pd
import os
# BASIC VISUALIZATION PACKAGE
import matplotlib.pyplot as plt
# ADVANCED PLOTTING
import seaborn as seabornInstance
# TRAIN TEST SPLIT AND CROSS-VALIDATION
from sklearn.model_selection import train_test_split, cross_val_score
# INTERACTIVE VISUALIZATION
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import statsmodels.formula.api as stats
from statsmodels.formula.api import ols
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from discover_feature_relationships import discover

# 2015 DATA
df_15 = pd.read_csv('2015.csv')
#df_15.describe()
#df_15.info()
usecols = ['Rank', 'Country', 'Score', 'GDP', 'Support',
           'Health', 'Freedom', 'Generosity', 'Corruption']
df_15.drop(['Region', 'Standard Error', 'Dystopia Residual'], axis=1, inplace=True)
df_15.columns = ['Country', 'Rank', 'Score', 'Support',
                 'GDP', 'Health',
                 'Freedom', 'Generosity', 'Corruption']
df_15['Year'] = 2015  # add year column
df_15.head()
[Output: df_15.head()]

I only present the 2015 data code as an example; the other years are handled the same way. Columns whose names start with Happiness, Whisker, and Dystopia Residual are different targets. Dystopia Residual compares each country's score to that of the theoretical unhappiest country in the world. Since the files from different years use slightly different naming conventions, we abstract them to a common set of column names.

target = ['Top', 'Top-Mid', 'Low-Mid', 'Low']
target_n = [4, 3, 2, 1]
df_15["target"] = pd.qcut(df_15['Rank'], len(target), labels=target)
df_15["target_n"] = pd.qcut(df_15['Rank'], len(target), labels=target_n)
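
The 2016-2019 files are read, trimmed, and renamed the same way as df_15 above, so that every year ends up with the common column names. A minimal sketch for 2016 follows; the exact columns to drop depend on that year's CSV schema, so treat the drop list as an assumption to check against your copy of the Kaggle files.

# 2016 DATA (SKETCH) - ADJUST THE DROP LIST TO THE ACTUAL 2016.csv COLUMNS
df_16 = pd.read_csv('2016.csv')
df_16.drop(['Region', 'Lower Confidence Interval', 'Upper Confidence Interval',
            'Dystopia Residual'], axis=1, inplace=True)  # assumed extra columns for 2016
df_16.columns = ['Country', 'Rank', 'Score', 'Support',
                 'GDP', 'Health',
                 'Freedom', 'Generosity', 'Corruption']   # same renaming pattern as df_15
df_16['Year'] = 2016  # add year column
# df_17, df_18, and df_19 are prepared with the same pattern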

We then combine all the data files into finaldf.

# APPENDING ALL TOGETHER
finaldf = df_15.append([df_16, df_17, df_18, df_19])
# finaldf.dropna(inplace = True)

# CHECKING FOR MISSING DATA
finaldf.isnull().any()
# FILLING MISSING VALUES OF CORRUPTION PERCEPTION WITH ITS MEAN
finaldf.Corruption.fillna((finaldf.Corruption.mean()), inplace=True)
finaldf.head(10)
[Output: finaldf.head(10)]

We can see the statistical details of our dataset by using the describe() function:

[Output: finaldf.describe()]

Further, we define an empty dataframe. This dataframe includes the Root Mean Squared Error (RMSE), R-squared, adjusted R-squared, and the mean of the R-squared values obtained by k-fold cross-validation, which are the essential metrics for comparing different models. An R-squared value closer to one and a smaller RMSE indicate a better fit. In the following sections, we will fill this dataframe with the results.

evaluation = pd.DataFrame({'Model': [],
                           'Details': [],
                           'Root Mean Squared Error (RMSE)': [],
                           'R-squared (training)': [],
                           'Adjusted R-squared (training)': [],
                           'R-squared (test)': [],
                           'Adjusted R-squared (test)': [],
                           '5-Fold Cross Validation': []})

2. Define a Function to Calculate the Adjusted R²

R-squared increases when the number of features increases. Sometimes a more robust evaluator is preferred to compare the performance of different models. This evaluator is called adjusted R-squared, and it only increases if the added variable reduces the MSE. The definition of the adjusted R² is:

Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables.
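
The regression cells in later sections call an adjustedR2 helper defined in this step; the original cell did not survive extraction, so here is a minimal sketch consistent with how it is called later (an R² value, the number of observations n, and the number of predictors k):

# ADJUSTED R-SQUARED: PENALIZES R2 FOR THE NUMBER OF PREDICTORS USED
def adjustedR2(r2, n, k):
    # n = number of observations, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)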

3. How Happiness Score Is Distributed

As we can see below, the Happiness Score takes values above 2.85 and below 7.76, so no country has a Happiness Score above 8.

[Figure: distribution of the Happiness Score]
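
The plotting cell for this figure is not shown in the text; a minimal sketch that produces a comparable histogram from finaldf (using the seaborn alias imported above; distplot is deprecated in newer seaborn versions in favour of histplot) is:

# DISTRIBUTION OF THE HAPPINESS SCORE ACROSS ALL YEARS
plt.figure(figsize=(12, 6))
seabornInstance.distplot(finaldf['Score'], bins=30)
plt.xlabel('Happiness Score', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.show()
print('Lowest score:', finaldf['Score'].min(), '| Highest score:', finaldf['Score'].max())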

4. The Relationship Between Different Features and Happiness Score

We want to predict the Happiness Score, so our dependent variable here is Score; other features such as GDP, Support, Health, and so on are our independent variables.

GDP per capita

We first use scatter plots to observe relationships between variables.

[Animated scatter: Happiness Score vs. GDP per capita by year (GIF by author)]
'''Happiness score vs GDP per capita'''
px.scatter(finaldf, x="GDP", y="Score", animation_frame="Year",
           animation_group="Country",
           size="Rank", color="Country", hover_name="Country",
           trendline="ols")

train_data, test_data = train_test_split(finaldf, train_size=0.8, random_state=3)
lr = LinearRegression()
X_train = np.array(train_data['GDP'], dtype=pd.Series).reshape(-1, 1)
y_train = np.array(train_data['Score'], dtype=pd.Series)
lr.fit(X_train, y_train)

X_test = np.array(test_data['GDP'], dtype=pd.Series).reshape(-1, 1)
y_test = np.array(test_data['Score'], dtype=pd.Series)
pred = lr.predict(X_test)

# ROOT MEAN SQUARED ERROR
rmsesm = float(format(np.sqrt(metrics.mean_squared_error(y_test, pred)), '.3f'))
# R-SQUARED (TRAINING)
rtrsm = float(format(lr.score(X_train, y_train), '.3f'))
# R-SQUARED (TEST)
rtesm = float(format(lr.score(X_test, y_test), '.3f'))
# 5-FOLD CROSS VALIDATION
cv = float(format(cross_val_score(lr, finaldf[['GDP']], finaldf['Score'], cv=5).mean(), '.3f'))

print("Average Score for Test Data: {:.3f}".format(y_test.mean()))
print('Intercept: {}'.format(lr.intercept_))
print('Coefficient: {}'.format(lr.coef_))

r = evaluation.shape[0]
evaluation.loc[r] = ['Simple Linear Regression', '-', rmsesm, rtrsm, '-', rtesm, '-', cv]
evaluation

By using these values and the definition below, we can estimate the Happiness Score manually. The equation we use for our estimations is called the hypothesis function and is defined as

Happiness Score ≈ θ₀ + θ₁ · GDP, where θ₀ is the intercept and θ₁ is the coefficient printed above.
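
For example, with the intercept and coefficient printed above (the exact numbers depend on the train/test split), a manual estimate looks like this; gdp_value here is an illustrative number, not one taken from the dataset:

# MANUAL PREDICTION WITH THE FITTED SIMPLE MODEL
gdp_value = 1.0  # hypothetical GDP-per-capita value, chosen only for illustration
manual_score = lr.intercept_ + lr.coef_[0] * gdp_value
print('Estimated Happiness Score for GDP = 1.0:', round(manual_score, 3))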

We also printed the intercept and coefficient for the simple linear regression.

[Output: average test score, intercept, and coefficient of the simple linear regression]

Let’s show the result, shall we?

Since the simple regression has just two dimensions, it is easy to draw. The chart below shows the result of the simple regression. It does not look like a perfect fit, but when we work with real-world datasets, an ideal fit is not easy to obtain.

seabornInstance.set_style(style='whitegrid')
plt.figure(figsize=(12, 6))
plt.scatter(X_test, y_test, color='blue', label="Data", s=12)
plt.plot(X_test, lr.predict(X_test), color="red", label="Predicted Regression Line")
plt.xlabel("GDP per Capita", fontsize=15)
plt.ylabel("Happiness Score", fontsize=15)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.legend()
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['top'].set_visible(False)
[Figure: simple regression line, Happiness Score vs. GDP per capita]

GDP per capita (the economic output of the country) has a strong positive correlation with the Happiness Score: if a country's GDP per capita is higher, its Happiness Score is likely to be higher as well.

Support

To keep the article short, I won’t include the code in this part. The code is similar to the GDP feature above, and I recommend you try to implement it yourself. I will include the link at the end of this article for reference.

The Social Support of countries also has a strong and positive relationship with the Happiness Score. So it makes sense that we need social support to be happy. People are also wired for emotions, and we experience those emotions within a social context.

Healthy life expectancy

Healthy life expectancy has a strong and positive relationship with the Happiness Score: if a country has a high life expectancy, it can also have a high Happiness Score. Being happy doesn’t just improve the quality of a person’s life; it may increase its length as well. I will also be happy if I get a long, healthy life. Won't you?

Freedom to make life choices

Freedom to make life choices has a positive relationship with the Happiness Score. Choice and autonomy are more directly related to happiness than having lots of money: they give us options to pursue meaning in our lives and to find activities that stimulate and excite us, which is an essential aspect of feeling happy.

Generosity

Generosity has only a weak linear relationship with the Happiness Score. Why does charity have no direct relationship with the Happiness Score? Generosity scores are calculated based on the countries that give the most to nonprofits around the world, and countries that are not generous are not necessarily unhappy.

Perceptions of corruption

[Figure: distribution of Perceptions of corruption]

The distribution of Perceptions of corruption is right-skewed, which means that very few countries score high on this measure; in other words, most countries appear to have corruption problems.
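
A quick sanity check of that claim (a small sketch using pandas' built-in sample skewness) is:

# A CLEARLY POSITIVE SKEWNESS VALUE CONFIRMS THE RIGHT SKEW
print('Skewness of Corruption:', round(finaldf['Corruption'].skew(), 3))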

How does the corruption feature impact the Happiness Score?

Since the Perceptions of corruption data are highly skewed, it is no wonder the feature has only a weak linear relationship with the score. Still, as we can see in the scatter plot, most of the data points are on the left side: most countries with a low corruption-perception value have a Happiness Score between 4 and 6, while countries with a high value tend to have a Happiness Score above 7.

[Figure: Perceptions of corruption vs. Happiness Score]

5. Visualize and Examine Data

We do not have big data with too many features, so we have a chance to plot most of them and reach some useful analytical results. Drawing charts and examining the data before applying a model is good practice, because we may detect possible outliers or decide to apply normalization. This step is not a must, but getting to know the data is always useful. We start with the histograms of the dataframe.

# DISTRIBUTION OF ALL NUMERIC DATA
plt.rcParams['figure.figsize'] = (15, 15)
df1 = finaldf[['GDP', 'Health', 'Freedom',
               'Generosity', 'Corruption']]
h = df1.hist(bins=25, figsize=(16, 16),
             xlabelsize='10', ylabelsize='10')
seabornInstance.despine(left=True, bottom=True)
[x.title.set_size(12) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()]
[Figure: histograms of the numeric features]

Next, to give a more appealing view of where each country is placed in the world ranking report, we use darker blue for countries that have the highest rating in the report (i.e., are the “happiest”), while lighter blue represents countries with a lower ranking. We can see that countries in the European and American regions have a considerably higher ranking than those in the Asian and African regions.

'''World Map
Happiness Rank Across the World'''
happiness_rank = dict(type='choropleth',
                      locations=finaldf['Country'],
                      locationmode='country names',
                      z=finaldf['Rank'],
                      text=finaldf['Country'],
                      colorscale='Blues_',
                      autocolorscale=False,
                      reversescale=True,
                      marker_line_color='darkgray',
                      marker_line_width=0.5)
layout = dict(title='Happiness Rank Across the World',
              geo=dict(showframe=False,
                       projection={'type': 'equirectangular'}))
world_map_1 = go.Figure(data=[happiness_rank], layout=layout)
iplot(world_map_1)
[Figure: choropleth of Happiness Rank across the world]

Let’s check which countries are better positioned in each of the aspects being analyzed.

fig, axes = plt.subplots(nrows=3, ncols=2, constrained_layout=True, figsize=(10, 10))

seabornInstance.barplot(x='GDP', y='Country',
                        data=finaldf.nlargest(10, 'GDP'),
                        ax=axes[0, 0], palette="Blues_r")
seabornInstance.barplot(x='Health', y='Country',
                        data=finaldf.nlargest(10, 'Health'),
                        ax=axes[0, 1], palette='Blues_r')
seabornInstance.barplot(x='Score', y='Country',
                        data=finaldf.nlargest(10, 'Score'),
                        ax=axes[1, 0], palette='Blues_r')
seabornInstance.barplot(x='Generosity', y='Country',
                        data=finaldf.nlargest(10, 'Generosity'),
                        ax=axes[1, 1], palette='Blues_r')
seabornInstance.barplot(x='Freedom', y='Country',
                        data=finaldf.nlargest(10, 'Freedom'),
                        ax=axes[2, 0], palette='Blues_r')
seabornInstance.barplot(x='Corruption', y='Country',
                        data=finaldf.nlargest(10, 'Corruption'),
                        ax=axes[2, 1], palette='Blues_r')
[Figure: top 10 countries for each feature]

Checking Out the Correlation Among Explanatory Variables

mask = np.zeros_like(finaldf[usecols].corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(16, 12))
plt.title('Pearson Correlation Matrix', fontsize=25)
seabornInstance.heatmap(finaldf[usecols].corr(), linewidths=0.25, vmax=0.7,
                        square=True, cmap="Blues", linecolor='w', annot=True,
                        annot_kws={"size": 8}, mask=mask, cbar_kws={"shrink": .9});
[Figure: Pearson correlation matrix]

It looks like GDP, Health, and Support are strongly correlated with the Happiness Score. Freedom correlates quite well with the Happiness Score; however, Freedom also correlates quite well with all the other features. Corruption has only a mediocre correlation with the Happiness Score.
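
A one-liner to back this reading up (a sketch that simply ranks the feature-score correlations, using only the numeric columns defined earlier):

# CORRELATION OF EACH FEATURE WITH THE HAPPINESS SCORE, STRONGEST FIRST
num_cols = ['Score', 'GDP', 'Support', 'Health', 'Freedom', 'Generosity', 'Corruption']
print(finaldf[num_cols].corr()['Score'].sort_values(ascending=False))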

Beyond Simple Correlation

In the scatterplots, we see that GDP, Health, and Support are quite linearly correlated, with some noise. We find the behaviour of Corruption fascinating here: over most of its range everything looks bad, but when the corruption measure is high, the distribution is all over the place. It seems to act as a negative indicator of a threshold rather than a linear predictor.

[Figure: pairwise scatterplots of the features]

I found an exciting package by Ian Ozsvald, discover_feature_relationships, that trains random forests to predict features from each other, going a bit beyond simple correlation.

# visualize hidden relationships in data
classifier_overrides = set()
df_results = discover.discover(finaldf.drop(['target', 'target_n'], axis=1).sample(frac=1),
                               classifier_overrides)

We use heat maps here to visualize how our features are clustered or vary over space.

fig, ax = plt.subplots(ncols=2, figsize=(24, 8))
seabornInstance.heatmap(df_results.pivot(index='target',
                                         columns='feature',
                                         values='score').fillna(1).loc[
                            finaldf.drop(['target', 'target_n'], axis=1).columns,
                            finaldf.drop(['target', 'target_n'], axis=1).columns],
                        annot=True, center=0, ax=ax[0], vmin=-1, vmax=1, cmap="Blues")
seabornInstance.heatmap(df_results.pivot(index='target',
                                         columns='feature',
                                         values='score').fillna(1).loc[
                            finaldf.drop(['target', 'target_n'], axis=1).columns,
                            finaldf.drop(['target', 'target_n'], axis=1).columns],
                        annot=True, center=0, ax=ax[1], vmin=-0.25, vmax=1, cmap="Blues_r")
plt.plot()
[Figure: heat maps of the feature-to-feature prediction scores]

This gets more interesting. Corruption is a better predictor of the Happiness Score than Support. Possibly because of the ‘threshold’ we previously discovered?

Moreover, although Social Support correlates quite well, it does not have substantial predictive value. I guess this is because the distributions of the quartiles are quite close together in the scatterplot.

6. Multiple Linear Regression

In an earlier section of this article, we used a simple linear regression to examine the relationship between the Happiness Score and a single feature, and we found a poor fit. To improve this model, we want to add more features. Now it is time to create some more complex models.

We determined the features at first sight by looking at the previous sections and used them in our first multiple linear regression. As in the simple regression, we printed the coefficients the model uses for its predictions. However, this time we must use the definition below if we want to make the calculations manually.

Happiness Score ≈ θ₀ + θ₁x₁ + θ₂x₂ + ... + θₖxₖ, where the xᵢ are the selected features and θ₀, ..., θₖ are the intercept and coefficients.

We create a model with all features.

# MULTIPLE LINEAR REGRESSION 1
train_data_dm, test_data_dm = train_test_split(finaldf, train_size=0.8, random_state=3)

independent_var = ['GDP', 'Health', 'Freedom', 'Support', 'Generosity', 'Corruption']
complex_model_1 = LinearRegression()
complex_model_1.fit(train_data_dm[independent_var], train_data_dm['Score'])

print('Intercept: {}'.format(complex_model_1.intercept_))
print('Coefficients: {}'.format(complex_model_1.coef_))
# THE COEFFICIENTS ARE PRINTED IN THE SAME ORDER AS independent_var
print('Happiness score = ', np.round(complex_model_1.intercept_, 4),
      '+', np.round(complex_model_1.coef_[0], 4), '* GDP',
      '+', np.round(complex_model_1.coef_[1], 4), '* Health',
      '+', np.round(complex_model_1.coef_[2], 4), '* Freedom',
      '+', np.round(complex_model_1.coef_[3], 4), '* Support',
      '+', np.round(complex_model_1.coef_[4], 4), '* Generosity',
      '+', np.round(complex_model_1.coef_[5], 4), '* Corruption')

pred = complex_model_1.predict(test_data_dm[independent_var])
rmsecm = float(format(np.sqrt(metrics.mean_squared_error(
    test_data_dm['Score'], pred)), '.3f'))
rtrcm = float(format(complex_model_1.score(
    train_data_dm[independent_var],
    train_data_dm['Score']), '.3f'))
artrcm = float(format(adjustedR2(complex_model_1.score(
    train_data_dm[independent_var],
    train_data_dm['Score']),
    train_data_dm.shape[0],
    len(independent_var)), '.3f'))
rtecm = float(format(complex_model_1.score(
    test_data_dm[independent_var],
    test_data_dm['Score']), '.3f'))
artecm = float(format(adjustedR2(complex_model_1.score(
    test_data_dm[independent_var], test_data_dm['Score']),
    test_data_dm.shape[0],
    len(independent_var)), '.3f'))
cv = float(format(cross_val_score(complex_model_1,
                                  finaldf[independent_var],
                                  finaldf['Score'], cv=5).mean(), '.3f'))

r = evaluation.shape[0]
evaluation.loc[r] = ['Multiple Linear Regression-1', 'selected features', rmsecm, rtrcm, artrcm, rtecm, artecm, cv]
evaluation.sort_values(by='5-Fold Cross Validation', ascending=False)
[Output: updated evaluation table]

We knew that GDP, Support, and Health are quite linearly correlated with the score. This time, we create a model with just these three features.

# MULTIPLE LINEAR REGRESSION 2
train_data_dm, test_data_dm = train_test_split(finaldf, train_size=0.8, random_state=3)

independent_var = ['GDP', 'Health', 'Support']
complex_model_2 = LinearRegression()
complex_model_2.fit(train_data_dm[independent_var], train_data_dm['Score'])

print('Intercept: {}'.format(complex_model_2.intercept_))
print('Coefficients: {}'.format(complex_model_2.coef_))
print('Happiness score = ', np.round(complex_model_2.intercept_, 4),
      '+', np.round(complex_model_2.coef_[0], 4), '* GDP',
      '+', np.round(complex_model_2.coef_[1], 4), '* Health',
      '+', np.round(complex_model_2.coef_[2], 4), '* Support')

pred = complex_model_2.predict(test_data_dm[independent_var])
rmsecm = float(format(np.sqrt(metrics.mean_squared_error(
    test_data_dm['Score'], pred)), '.3f'))
rtrcm = float(format(complex_model_2.score(
    train_data_dm[independent_var],
    train_data_dm['Score']), '.3f'))
artrcm = float(format(adjustedR2(complex_model_2.score(
    train_data_dm[independent_var],
    train_data_dm['Score']),
    train_data_dm.shape[0],
    len(independent_var)), '.3f'))
rtecm = float(format(complex_model_2.score(
    test_data_dm[independent_var],
    test_data_dm['Score']), '.3f'))
artecm = float(format(adjustedR2(complex_model_2.score(
    test_data_dm[independent_var], test_data_dm['Score']),
    test_data_dm.shape[0],
    len(independent_var)), '.3f'))
cv = float(format(cross_val_score(complex_model_2,
                                  finaldf[independent_var],
                                  finaldf['Score'], cv=5).mean(), '.3f'))

r = evaluation.shape[0]
evaluation.loc[r] = ['Multiple Linear Regression-2', 'selected features', rmsecm, rtrcm, artrcm, rtecm, artecm, cv]
evaluation.sort_values(by='5-Fold Cross Validation', ascending=False)
[Output: updated evaluation table]

When we look at the evaluation table, Multiple Linear Regression-2 (selected features) is the best. However, I have doubts about its reliability, and I would prefer the multiple linear regression with all features.

X = finaldf[['GDP', 'Health', 'Support', 'Freedom', 'Generosity', 'Corruption']]
y = finaldf['Score']

'''
This function takes the features as input and
returns the normalized values, the mean, as well
as the standard deviation for each feature.
'''
def featureNormalize(X):
    mu = np.mean(X)            ## Define the mean
    sigma = np.std(X)          ## Define the standard deviation
    X_norm = (X - mu) / sigma  ## Scaling function
    return X_norm, mu, sigma

m = len(y)                           ## length of the training data
X = np.hstack((np.ones([m, 1]), X))  ## Append the bias term (column of ones) to X
y = np.array(y).reshape(-1, 1)       ## Reshape y to an m x 1 array
theta = np.zeros([7, 1])             ## Initialize theta (the coefficients) to a 7x1 zero vector

'''
This function takes in the values for
the training set as well as initial values
of theta and returns the cost (J).
'''
def cost_function(X, y, theta):
    h = X.dot(theta)                       ## The hypothesis
    J = 1/(2*m) * (np.sum((h - y)**2))     ## Implementing the cost function
    return J

'''
This function takes in the values of the set,
as well as the initial theta values (coefficients), the
learning rate, and the number of iterations. The output
is a new set of coefficients (theta), optimized
for making predictions, as well as the array of the cost
as it decreases on each iteration.
'''
num_iters = 2000  ## Initialize the iteration parameter
alpha = 0.01      ## Initialize the learning rate
def gradientDescentMulti(X, y, theta, alpha, iterations):
    m = len(y)
    J_history = []
    for _ in range(iterations):
        temp = np.dot(X, theta) - y
        temp = np.dot(X.T, temp)
        theta = theta - (alpha/m) * temp
        J_history.append(cost_function(X, y, theta))  ## Append the cost to the J_history array
    return theta, J_history

## Run gradient descent; without this call theta would remain the zero vector
theta, J_history = gradientDescentMulti(X, y, theta, alpha, num_iters)

print('Happiness score = ', np.round(theta[0], 4),
      '+', np.round(theta[1], 4), '* GDP',
      '+', np.round(theta[2], 4), '* Health',
      '+', np.round(theta[3], 4), '* Support',
      '+', np.round(theta[4], 4), '* Freedom',
      '+', np.round(theta[5], 4), '* Generosity',
      '+', np.round(theta[6], 4), '* Corruption')
[Output: hypothesis equation with the coefficients found by gradient descent]

Print out J_history; the final cost is approximately 0.147. We can conclude that gradient descent gives us a well-fitting model to predict the Happiness Score.
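
A short sketch for checking the final cost and the convergence curve (J_history is the list returned by gradientDescentMulti above):

# FINAL COST AND CONVERGENCE CURVE OF GRADIENT DESCENT
print('Cost at the last iteration:', round(float(J_history[-1]), 3))
plt.figure(figsize=(10, 5))
plt.plot(J_history)
plt.xlabel('Iteration')
plt.ylabel('Cost J')
plt.title('Gradient Descent Convergence')
plt.show()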

We can also use statsmodels to predict the Happiness Score.

# MULTIPLE LR
import statsmodels.api as sm

X_sm = X = sm.add_constant(X)
model = sm.OLS(y, X_sm)
model.fit().summary()
[Output: statsmodels OLS regression summary]

Our coefficients are very close to what we get here.
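
To put the two sets of estimates side by side, the fitted statsmodels results object can be inspected directly (a sketch; results is simply the value returned by model.fit() above, theta comes from the gradient-descent cell, and both use the same X with its leading bias column):

# COMPARE GRADIENT-DESCENT COEFFICIENTS WITH THE OLS ESTIMATES
results = model.fit()
print('statsmodels coefficients (intercept first):')
print(results.params)
print('Gradient-descent coefficients (theta[0] is the intercept):')
print(theta.ravel())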

The code in this note is available on Github.

7. Conclusion

It seems like the common criticism of “The World Happiness Report” is quite valid: it focuses heavily on GDP and on strongly correlated features such as family and life expectancy.

Does a high GDP make you happy? In the end, we are who we are. I spent my July analyzing why people are not satisfied or don’t live fulfilling lives; I cared about the “why.” After finishing the analysis of the data, one question remains: how can we chase happiness?

Just a few months ago, when the lockdown started, I did everything I could to chase happiness.

  • I define myself as a minimalist, but I started to purchase many things, and I thought that would make me happy.

  • I spent a lot of time on social media, I stayed connected with others, and I expected that to make me happy.

  • I started writing on Medium; I got volunteer jobs that match my skills and interests, and I believed money and work would make me happy.

  • I tried to go vegan, and I hoped that would make me happy.

At the end of the day, I am lying in my bed, alone, and thinking, “Is happiness achievable? What’s next in this endless pursuit of happiness?”

Well, I realize I am chasing random things that I myself believe make me happy.

While writing this article, I ran into a quote by Ralph Waldo Emerson:

“The purpose of life is not to be happy. It is to be useful, to be honorable, to be compassionate, to have it make some difference that you have lived and lived well.”

The dots are finally connected. Happiness can’t be a goal in itself; it is merely a byproduct of usefulness.

What makes me happy is when I’m useful. OK, but how? One day I woke up and thought to myself: “What am I doing for this world?” And the answer was nothing. That same day, I started writing on Medium. I turned my school notes into articles in the hope that people will learn a thing or two after reading them. For you, it can be anything: painting, creating a product, supporting your family and friends, anything you feel like doing. Please don’t take it too seriously. And most important, don’t overthink it. Just do something useful. Anything.

Stay safe and healthy!

  1. World Happiness Report 2020.

  2. Kaggle World Happiness Report.

  3. The happiest countries in the world in 2020.

  4. Engineering a happiness prediction model.

Translated from: https://towardsdatascience.com/happiness-and-life-satisfaction-ecdc7d0ab9a5
