Beginner's Guide: Exploratory Data Analysis in R

When I started on my journey to learn data science, I read through multiple articles that stressed the importance of understanding your data. It didn’t make sense to me. I was naive enough to think that we are handed data which we push through an algorithm, and then we hand over the results.

Yes, I wasn’t exactly the brightest. But I’ve learned my lesson, and today I want to share what I picked up during sleepless nights spent trying to figure out my data. I am going to use the R language to demonstrate EDA.

WHY R?

Because it was built from the get-go with data science in mind. It’s easy to pick up and get your hands dirty, and it doesn’t have a steep learning curve, *cough* Assembly *cough*.

Before I start: this article is a guide for people classified under the tag of ‘Data Science infants.’ I believe both Python and R are great languages, and what matters most is the story you tell from your data.

Why this dataset?

Well, it’s where I think most aspiring data scientists would start: the Kaggle House Prices data (the train.csv we load below). This data set is a good place to warm up your engines and start thinking like a data scientist, while at the same time being novice-friendly enough to help you breeze through the exercise.

How do we approach this data?

  • Will this variable help us predict house prices?
  • Is there a correlation between these variables?
  • Univariate Analysis
  • Multivariate Analysis
  • A bit of Data Cleaning
  • Conclude by proving the relevance of our selected variables.

Best of luck on your journey to master Data Science!

Now, we start with importing packages. I’ll explain why these packages are present along the way…

easypackages::libraries("dplyr", "ggplot2", "tidyr", "corrplot", "corrr", "magrittr",
                        "e1071", "RColorBrewer", "viridis", "gridExtra")
options(scipen = 5)              # force R to avoid scientific notation
dataset <- read.csv("train.csv") # gridExtra supplies grid.arrange(), used later
str(dataset)

Here, in the above snippet, we use scipen to avoid scientific notation. We import our data and use the str() function to get the gist of the variables the dataset offers and their respective data types.
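As a toy illustration (a made-up data frame, not part of our dataset), str() reports each column's type and its first few values:

```r
# str() on a small hand-made data frame
df <- data.frame(id = 1:3, grade = c("A", "B", "A"))
str(df)
# 'data.frame': 3 obs. of  2 variables:
#  $ id   : int  1 2 3
#  $ grade: chr  "A" "B" "A"
# (R >= 4.0 keeps strings as chr unless stringsAsFactors = TRUE)
```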

[Output of str(dataset)]

The variable SalePrice is the dependent variable around which we are going to base all our assumptions and hypotheses. So it’s good to first understand more about this variable. For this, we’ll use a histogram and fetch a frequency distribution to get a visual understanding of the variable. You’ll notice there’s another function, summary(), which serves essentially the same purpose but without any form of visualization. With experience, you’ll be able to understand and interpret this form of information better.

ggplot(dataset, aes(x = SalePrice)) +
  theme_bw() +
  geom_histogram(aes(y = ..density..), color = 'black', fill = 'white', binwidth = 50000) +
  geom_density(alpha = .2, fill = 'blue') +
  labs(title = "Sales Price Density", x = "Price", y = "Density")

summary(dataset$SalePrice)
[Plot: Sales Price Density; summary() output]

So it is pretty evident that you’ll find many properties in the sub-$200,000 range. There are properties over $600,000, and we can try to understand why that is and what makes these homes so ridiculously expensive. That can be another fun exercise…

Which variables do you think are most influential when deciding the price of a house you are looking to buy?

Now that we have a basic idea about SalePrice, we will try to visualize this variable in terms of some other variables. Please note that it is very important to understand what type of variable you are working with. I would like you to refer to this amazing article, which covers the topic in more detail here.

Moving on, we will be dealing with two kinds of variables.

  • Categorical Variables
  • Numeric Variables

Looking back at our dataset, we can discern between these variables. For starters, we run a coarse comb across the dataset and hand-pick some variables which have the highest chance of being relevant. Note that these are just assumptions, and we are exploring this dataset to verify them. The variables I selected are:

  • GrLivArea
  • TotalBsmtSF
  • YearBuilt
  • OverallQual

So which ones are quantitative and which ones are qualitative out of the lot? If you look closely at the OverallQual and YearBuilt variables, you will notice that they should never be treated as quantitative. Year and quality are both categorical by the nature of this data; however, R doesn’t know that. For that, we use the factor() function to convert a numerical variable to a categorical one, so R can interpret the data better.

dataset$YearBuilt <- factor(dataset$YearBuilt)
dataset$OverallQual <- factor(dataset$OverallQual)

Now when we run str() on our dataset, we will see both YearBuilt and OverallQual as factor variables.
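A quick toy check (independent of our dataset) of what factor() actually does:

```r
# factor() turns raw values into a categorical variable with discrete levels
years <- c(2001, 2005, 2001)
f <- factor(years)
class(f)     # "factor"
levels(f)    # "2001" "2005"
```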

We can now start plotting our variables.

Relationships are (NOT) so complicated

Taking YearBuilt as our first candidate, we start plotting.

ggplot(dataset, aes(y = SalePrice, x = YearBuilt, group = YearBuilt, fill = YearBuilt)) +
  theme_bw() +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 1) +
  theme(legend.position = "none") +
  scale_fill_viridis(discrete = TRUE) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Year Built vs. Sale Price", x = "Year", y = "Price")
[Plot: Year Built vs. Sale Price]

Old houses sell for less compared to recently built houses. And as for OverallQual,

ggplot(dataset, aes(y = SalePrice, x = OverallQual, group = OverallQual, fill = OverallQual)) +
  geom_boxplot(alpha = 0.3) +
  theme(legend.position = "none") +
  scale_fill_viridis(discrete = TRUE, option = "B") +
  labs(title = "Overall Quality vs. Sale Price", x = "Quality", y = "Price")
[Plot: Overall Quality vs. Sale Price]

This was expected, since you’d naturally pay more for a house of better quality. You wouldn’t want your foot to break through the floorboard, would you? Now that the qualitative variables are out of the way, we can focus on the numeric variables. The very first candidate here is GrLivArea.

ggplot(dataset, aes(x = SalePrice, y = GrLivArea)) +
  theme_bw() +
  geom_point(colour = "Blue", alpha = 0.3) +
  theme(legend.position = 'none') +
  labs(title = "General Living Area vs. Sale Price", x = "Price", y = "Area")
[Plot: General Living Area vs. Sale Price]

I would be lying if I said I didn’t expect this. The very first instinct of a customer is to check the area of the rooms. And I think the result will be the same for TotalBsmtSF. Let’s see…

ggplot(dataset, aes(x = SalePrice, y = TotalBsmtSF)) +
  theme_bw() +
  geom_point(colour = "Blue", alpha = 0.3) +
  theme(legend.position = 'none') +
  labs(title = "Total Basement Area vs. Sale Price", x = "Price", y = "Area")
[Plot: Total Basement Area vs. Sale Price]

So what can we say about our cherry-picked variables?

GrLivArea and TotalBsmtSF were both found to be in a linear relation with SalePrice. As for the categorical variables, we can say with confidence that the two variables we picked are also related to SalePrice.

But these are not the only variables, and there’s more here than meets the eye. So to tread through this many variables, we’ll take help from a correlation matrix to see how each variable correlates with the others and get better insight.

Time for Correlation Plots

So what is Correlation?

Correlation is a measure of how strongly two variables are related to each other. There is positive as well as negative correlation.

If you want to read more on correlation, then take a look at this article. So let’s create a basic correlation matrix.
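Before building the full matrix, here is a minimal sketch (toy vectors, not from our dataset) of what positive and negative correlation look like numerically:

```r
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 6, 8, 10)   # rises with x  -> correlation of +1
y_down <- c(10, 8, 6, 4, 2)   # falls with x  -> correlation of -1

cor(x, y_up)    # 1
cor(x, y_down)  # -1
```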

# Encode character and factor columns as numerics so cor() can digest them
M <- dataset %>% mutate_if(is.character, as.factor)
M <- M %>% mutate_if(is.factor, as.numeric)
M <- cor(M, use = "pairwise.complete.obs")   # pairwise handles the NAs in the data

# plotting the correlation matrix
corrplot(M, method = "color", tl.col = 'black', is.corr = FALSE)
[Correlation matrix plot]

Please don’t close this tab. I promise it gets better.

But worry not, because now we’re going to get our hands dirty and make this plot interpretable and tidy.

M[lower.tri(M, diag = TRUE)] <- NA      # remove self-correlations and duplicates
M[M == 1] <- NA
M <- as.data.frame(as.table(M))         # turn into a 3-column table
M <- na.omit(M)                         # remove the NA values from above
M <- subset(M, abs(Freq) > 0.5)         # keep significant values, in this case |r| > 0.5
M <- M[order(-abs(M$Freq)), ]           # sort by highest correlation
mtx_corr <- reshape2::acast(M, Var1 ~ Var2, value.var = "Freq")       # turn M back into a matrix
corrplot(mtx_corr, is.corr = TRUE, tl.col = "black", na.label = " ")  # plot correlations visually
[Tidied correlation plot]

Now, this looks much better and more readable.

Looking at our plot, we can see numerous other variables that are highly correlated with SalePrice. We pick these variables and then create a new dataframe that includes only them.

Now that we have our suspect variables, we can use a pair plot to visualize all of them in conjunction with each other.

newData <- data.frame(dataset$SalePrice, dataset$TotalBsmtSF,
                      dataset$GrLivArea, dataset$OverallQual,
                      dataset$YearBuilt, dataset$FullBath,
                      dataset$GarageCars)

pairs(newData[1:7],
      col = "blue",
      main = "Pairplot of our new set of variables")
[Pairplot of our new set of variables]

While you’re at it, clean your data

We should remove some useless variables which we are sure won’t be of any use. Don’t apply changes to the original dataset, though. Always create a new copy in case you remove something you shouldn’t have.

clean_data <- dataset[, !grepl("^Bsmt", names(dataset))]   # remove Bsmt* variables

# Use bare column names here; "clean_data$PoolQC" would never match names(clean_data)
drops <- c("PoolQC", "PoolArea", "FullBath", "HalfBath")
clean_data <- clean_data[, !(names(clean_data) %in% drops)]   # the variables in 'drops' are removed

Univariate Analysis

Taking a look back at our old friend SalePrice, we see some extremely expensive houses. We haven’t delved into why that is. We do know, though, that these extremely pricey houses don’t follow the pattern the other house prices follow. The reason for such high prices could well be justified, but for the sake of our analysis we have to drop them. Such records are called outliers.

A simple way to understand outliers is to think of them as that one guy (or more) in your group who likes to eat noodles with a spoon instead of a fork.

So first we catch these outliers, and then remove them from our dataset if need be. Let’s start with the catching part.

# Univariate Analysis
clean_data$price_norm <- scale(clean_data$SalePrice)   # normalizing the price variable

summary(clean_data$price_norm)

plot1 <- ggplot(clean_data, aes(x = factor(1), y = price_norm)) +
  theme_bw() +
  geom_boxplot(width = 0.4, fill = "blue", alpha = 0.2) +
  geom_jitter(width = 0.1, size = 1, aes(colour = "red")) +
  geom_hline(yintercept = 6.5, linetype = "dashed", color = "red") +
  theme(legend.position = 'none') +
  labs(title = "Hunt for Outliers", x = NULL, y = "Normalized Price")

plot2 <- ggplot(clean_data, aes(x = price_norm)) +
  theme_bw() +
  geom_histogram(color = 'black', fill = 'blue', alpha = 0.2) +
  geom_vline(xintercept = 6.5, linetype = "dashed", color = "red") +
  geom_density(aes(y = 0.4 * ..count..), colour = "red", adjust = 4) +
  labs(title = "", x = "Price", y = "Count")

grid.arrange(plot1, plot2, ncol = 2)   # grid.arrange() comes from gridExtra
[Plots: Hunt for Outliers]

The very first thing I did here was normalize SalePrice so that it’s more interpretable and easier to home in on the outliers. The normalized SalePrice has mean = 0 and SD = 1. Running a quick summary() on this new variable price_norm gives us this…

[summary() output for price_norm]

So now we know for sure that there ARE outliers present here. But do we really need to get rid of them? From the previous scatterplots we can say that these outliers still follow the trend and don’t need purging yet. Deciding what to do with outliers can be quite complex at times. You can read more on outliers here.
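As a rough numeric sketch of "catching" them (using the ad-hoc 6.5 cutoff drawn as the dashed line above; the counts depend on the data, so none are shown here):

```r
# price_norm is SalePrice scaled to mean 0, sd 1
sum(clean_data$price_norm > 6.5)       # observations beyond our dashed line
sum(abs(clean_data$price_norm) > 3)    # a more conventional |z| > 3 cutoff
```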

Bi-Variate Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of a relationship between two variables: whether an association exists and how strong it is, or whether there are differences between the two variables and how significant those differences are. There are three types of bivariate analysis.

  • Numerical & Numerical
  • Categorical & Categorical
  • Numerical & Categorical

The very first pair of variables we will analyze here is SalePrice and GrLivArea. Both variables are numerical, so using a scatter plot is a good idea!

ggplot(clean_data, aes(y = SalePrice, x = GrLivArea)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha = 1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price", y = "Price", x = "Area")
[Plot: General Living Area vs. Sale Price]

Immediately, we notice that 2 houses don’t follow the linear trend and affect both our results and assumptions. These are our outliers. Since our future results are prone to being negatively affected by them, we will remove them.

clean_data <- clean_data[!(clean_data$GrLivArea > 4000), ]   # remove outliers

ggplot(clean_data, aes(y = SalePrice, x = GrLivArea)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha = 1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price [Outlier Removed]", y = "Price", x = "Area")
[Plot: General Living Area vs. Sale Price, outliers removed]

The outliers are removed and the x-scale is adjusted. The next pair of variables we will analyze is SalePrice and TotalBsmtSF.

ggplot(clean_data, aes(y = SalePrice, x = TotalBsmtSF)) +
  theme_bw() +
  geom_point(aes(color = SalePrice), alpha = 1) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "Total Basement Area vs. Sale Price", y = "Price", x = "Basement Area")
[Plot: Total Basement Area vs. Sale Price]

The observations here adhere to our assumptions and don’t need purging. If it ain’t broke, don’t fix it. I did mention that it is important to tread very carefully when working with outliers: you don’t get to remove them every time.

Time to dig a bit deeper

We based a ton of visualization around SalePrice and other important variables, but what if I said that’s not enough? It’s not, because there’s more to dig out of this pit. There are 4 horsemen of Data Analysis which I believe people should remember.

  • Normality: When we talk about normality, what we mean is that the data should look like a normal distribution. This is important because many statistical tests depend on it (for example, t-statistics). First, we check normality with just the single variable SalePrice (it’s usually better to start with a single variable). One shouldn’t assume that univariate normality proves the existence of multivariate normality (which is comparatively more sought after), but it helps. Another thing to note is that for larger samples, i.e. more than 200 observations, normality is not such an issue. Still, a lot of problems can be avoided if we solve normality, which is one of the reasons we work on it.

  • Homoscedasticity: Homoscedasticity refers to the ‘assumption that one or more dependent variables exhibit equal levels of variance across the range of predictor variables’. If we want the error term to be the same across all values of the independent variable, then homoscedasticity has to be checked.

  • Linearity: If you want to assess the linearity of your data, I believe scatter plots should be the first choice. Scatter plots can quickly show a linear relationship (if it exists). Where patterns are not linear, it is worthwhile to explore data transformations. However, we need not check for this again, since our previous plots have already shown a linear relationship.

  • Absence of correlated errors: When working with errors, if you notice a pattern where one error is correlated with another, then there’s a relationship between these variables. For example, if one positive error systematically produces a negative error elsewhere, that would imply a relationship between errors. This phenomenon is more evident with time-sensitive data. If you do find yourself working with such data, try to add a variable that can explain your observations.
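For the normality horseman, a quick numeric companion to the plots that follow (a sketch: skewness() comes from the e1071 package we loaded at the start, shapiro.test() is base R):

```r
# Skewness is 0 for a symmetric distribution; a right-skewed
# variable such as SalePrice gives a positive value
e1071::skewness(clean_data$SalePrice)

# Shapiro-Wilk test of normality: a small p-value rejects normality
shapiro.test(clean_data$SalePrice)
```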

I think we should start doing rather than saying

Starting with SalePrice. Do keep an eye on the overall distribution of our variable.

plot3 <- ggplot(clean_data, aes(x = SalePrice)) +
  theme_bw() +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  geom_density(color = "black", alpha = 1, adjust = 5, lwd = 1.2) +
  labs(title = "Sale Price Density", x = "Price", y = "Density")

plot4 <- ggplot(clean_data, aes(sample = SalePrice)) +
  theme_bw() +
  stat_qq(color = "#69b3a2") +
  stat_qq_line(color = "black", lwd = 1, lty = 2) +
  labs(title = "Probability Plot for SalePrice")

grid.arrange(plot3, plot4, ncol = 2)
[Plots: Sale Price Density; Probability Plot for SalePrice]

SalePrice is not normal! But we have another trick up our sleeves: the log transformation. One great thing about the log transformation is that it can deal with skewed data and make it normal. So now it’s time to apply the log transformation to our variable.

clean_data$log_price <- log(clean_data$SalePrice)

plot5 <- ggplot(clean_data, aes(x = log_price)) +
  theme_bw() +
  geom_density(fill = "#69b3a2", color = "#e9ecef", alpha = 0.8) +
  geom_density(color = "black", alpha = 1, adjust = 5, lwd = 1) +
  labs(title = "Sale Price Density [Log]", x = "Price", y = "Density")

plot6 <- ggplot(clean_data, aes(sample = log_price)) +
  theme_bw() +
  stat_qq(color = "#69b3a2") +
  stat_qq_line(color = "black", lwd = 1, lty = 2) +
  labs(title = "Probability Plot for SalePrice [Log]")

grid.arrange(plot5, plot6, ncol = 2)
[Plots: Sale Price Density [Log]; Probability Plot for SalePrice [Log]]

Now repeat the process with the rest of our variables.

We go with GrLivArea first

[Plots for GrLivArea]

After the log transformation

[Plots for GrLivArea, log-transformed]

Now for TotalBsmtSF

[Plots for TotalBsmtSF]

Hold on! We’ve got something interesting here.

Looks like TotalBsmtSF has some zeroes. That doesn’t bode well for the log transformation, since log(0) is undefined. We’ll have to do something about it. To apply a log transformation here, we’ll create a binary variable that captures the effect of having or not having a basement. Then we’ll apply the log transformation to all the non-zero observations, ignoring those with value zero. This way we can transform the data without losing the effect of having or not having a basement.

# Create a flag that says which rows to transform and which to ignore
clean_data <- transform(clean_data, cat_bsmt = ifelse(TotalBsmtSF > 0, 1, 0))

# Log-transform only the rows that actually have a basement
clean_data <- transform(clean_data,
                        totalbsmt_log = ifelse(cat_bsmt == 1, log(TotalBsmtSF), 0))

plot13 <- ggplot(clean_data, aes(x = totalbsmt_log)) +
  theme_bw() +
  geom_density(fill = "#ed557e", color = "#e9ecef", alpha = 0.5) +
  geom_density(color = "black", alpha = 1, adjust = 5, lwd = 1) +
  labs(title = "Total Basement Area Density [transformed]", x = "Area", y = "Density")

plot14 <- ggplot(clean_data, aes(sample = totalbsmt_log)) +
  theme_bw() +
  stat_qq(color = "#ed557e") +
  stat_qq_line(color = "black", lwd = 1, lty = 2) +
  labs(title = "Probability Plot for TotalBsmtSF [transformed]")

grid.arrange(plot13, plot14, ncol = 2)
[Plots for TotalBsmtSF, transformed]

We can still see the ignored data points on the chart, but hey, I can trust you with this, right?
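A common alternative to the flag-and-ignore trick above (my note, not from the original article) is log1p(), which computes log(1 + x) and maps zero to zero with no special casing:

```r
# log1p(x) = log(1 + x); handles zeros gracefully
x <- c(0, 100, 1000)
log1p(x)          # 0.000000 4.615121 6.908755
expm1(log1p(x))   # inverts the transform: 0 100 1000
```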

Homoscedasticity — Wait, is my spelling correct?

The best way to look for homoscedasticity is to try and visualize the variables using charts. A scatter plot should do the job. Notice the shape the data forms when plotted: it could look like an equal dispersion shaped like a cone, or it could very well look like a diamond, where a large number of data points spread around the centre.

Starting with ‘SalePrice’ and ‘GrLivArea’…

# grlive_log is the log-transformed GrLivArea created in the "repeat the process" step above
ggplot(clean_data, aes(x = grlive_log, y = log_price)) +
  theme_bw() +
  geom_point(colour = "#e34262", alpha = 0.3) +
  theme(legend.position = 'none') +
  labs(title = "Homoscedasticity : Living Area vs. Sale Price", x = "Area [Log]", y = "Price [Log]")
[Plot: Homoscedasticity, Living Area vs. Sale Price]

We plotted ‘SalePrice’ and ‘GrLivArea’ before, so why does this plot look different? That’s right: because of the log transformation.

If we go back to the previously plotted graphs of the same variables, it is evident that the data had a conical shape. After the log transformation, the conic shape is no more. Here we solved the homoscedasticity problem with just one transformation. Pretty powerful, eh?
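If you want something more formal than eyeballing the cone (a sketch assuming the lmtest package is installed; the original analysis does not use it), the Breusch-Pagan test checks whether residual variance changes with the predictor:

```r
library(lmtest)

# Regress log price on log area, then test the residuals:
# a large p-value means no evidence against homoscedasticity
fit <- lm(log_price ~ grlive_log, data = clean_data)
bptest(fit)
```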

Now let’s check ‘SalePrice’ against ‘TotalBsmtSF’.

ggplot(clean_data, aes(x = totalbsmt_log, y = log_price)) +
  theme_bw() +
  geom_point(colour = "#e34262", alpha = 0.3) +
  theme(legend.position = 'none') +
  labs(title = "Homoscedasticity : Total Basement Area vs. Sale Price", x = "Area [Log]", y = "Price [Log]")
[Plot: Homoscedasticity, Total Basement Area vs. Sale Price]
Please take care of 0 for me :)

That’s it, we’ve reached the end of our analysis. Now all that’s left is to get the dummy variables and… you know the rest. :)

This work was possible thanks to Pedro Marcelino. I found his analysis of this dataset in Python and wanted to re-write it in R. Give him some love!

Translated from: https://medium.com/@unkletam/beginners-guide-exploratory-data-analysis-in-r-47dac64d95fe

