Beginner's Guide: Exploratory Data Analysis in R

When I started my journey into data science, I read multiple articles stressing the importance of understanding your data. It didn't make sense to me. I was naive enough to think that we are simply handed data, push it through an algorithm, and hand over the results.

Yes, I wasn't exactly the brightest. But I've learned my lesson, and today I want to share what I picked up during the sleepless nights I spent trying to figure out my data. I am going to use the R language to demonstrate EDA.

WHY R?

Because it was built from the get-go with data science in mind. It's easy to pick up and get your hands dirty with, and it doesn't have a steep learning curve, *cough* Assembly *cough*.

Before I start: this article is a guide for people classified under the tag of 'Data Science infants.' I believe both Python and R are great languages, and what matters most is the story you tell from your data.

Why this dataset?

Well, it's where I think most aspiring data scientists would start. This dataset is a good place to warm up your engines and start thinking like a data scientist, while its novice-friendly nature helps you breeze through the exercise.

How do we approach this data?

  • Will this variable help us predict house prices?
  • Is there a correlation between these variables?
  • Univariate Analysis
  • Multivariate Analysis
  • A bit of Data Cleaning
  • Conclude by proving the relevance of our selected variables.

Best of luck on your journey to master Data Science!

Now, we start by importing packages. I'll explain why these packages are present along the way…

easypackages::libraries("dplyr", "ggplot2", "tidyr", "corrplot", "corrr", "magrittr",
                        "e1071", "RColorBrewer", "viridis")
options(scipen = 5)  #To force R to not use scientific notation

dataset <- read.csv("train.csv")
str(dataset)

Here, in the above snippet, we use scipen to avoid scientific notation. We import our data and use the str() function to get the gist of the variables the dataset offers and their respective data types.
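As a quick, self-contained illustration of what scipen does (on toy numbers, independent of the dataset): it is a penalty against scientific notation, so raising it makes R prefer fixed decimal output when printing.

```r
# scipen is a penalty applied to scientific notation: the higher it is,
# the more R prefers fixed (decimal) notation when formatting numbers.
options(scipen = 0)   # default: scientific notation wins for wide numbers
format(1e9)           # "1e+09"

options(scipen = 5)   # penalize scientific notation, as this article does
format(1e9)           # "1000000000"
```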


The variable SalePrice is the dependent variable around which we will base all our assumptions and hypotheses, so it's good to first understand more about it. For this, we'll use a histogram and a frequency distribution to get a visual understanding of the variable. You'll notice there's another function, summary(), which is essentially used for the same purpose but without any form of visualization. With experience, you'll be able to understand and interpret this form of information better.

ggplot(dataset, aes(x=SalePrice)) +
  theme_bw()+
  geom_histogram(aes(y=..density..), color = 'black', fill = 'white', binwidth = 50000)+
  geom_density(alpha=.2, fill='blue') +
  labs(title = "Sales Price Density", x="Price", y="Density")

summary(dataset$SalePrice)

So it is pretty evident that you'll find many properties in the sub-$200,000 range. There are properties over $600,000, and we could try to understand why that is and what makes these homes so ridiculously expensive. That can be another fun exercise…

Which variables do you think are most influential when deciding the price of a house you are looking to buy?

Now that we have a basic idea about SalePrice, we will try to visualize this variable in terms of some other variables. Please note that it is very important to understand what type of variable you are working with. I would like you to refer to this amazing article, which covers the topic in more detail.

Moving on, we will be dealing with two kinds of variables.

  • Categorical Variable
  • Numeric Variable

Looking back at our dataset, we can discern between these variables. For starters, we run a coarse comb across the dataset and hand-pick some variables that have the highest chance of being relevant. Note that these are just assumptions, and we are exploring this dataset to test them. The variables I selected are:

  • GrLivArea
  • TotalBsmtSF
  • YearBuilt
  • OverallQual

So which ones are quantitative and which ones are qualitative out of the lot? If you look closely at the OverallQual and YearBuilt variables, you will notice that they shouldn't be treated as quantitative. Year and quality are both categorical by the nature of this data; however, R doesn't know that. For that, we use the factor() function to convert a numerical variable to a categorical one so R can interpret the data better.

dataset$YearBuilt <- factor(dataset$YearBuilt)
dataset$OverallQual <- factor(dataset$OverallQual)

Now when we run str() on our dataset, we will see both YearBuilt and OverallQual as factor variables.

We can now start plotting our variables.

Relationships are (NOT) so complicated

Taking YearBuilt as our first candidate, we start plotting.

ggplot(dataset, aes(y=SalePrice, x=YearBuilt, group=YearBuilt, fill=YearBuilt)) +
  theme_bw()+
  geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=1)+
  theme(legend.position="none")+
  scale_fill_viridis(discrete = TRUE) +
  theme(axis.text.x = element_text(angle = 90))+
  labs(title = "Year Built vs. Sale Price", x="Year", y="Price")

Old houses sell for less compared to recently built houses. And as for OverallQual,

ggplot(dataset, aes(y=SalePrice, x=OverallQual, group=OverallQual, fill=OverallQual)) +
  geom_boxplot(alpha=0.3)+
  theme(legend.position="none")+
  scale_fill_viridis(discrete = TRUE, option="B") +
  labs(title = "Overall Quality vs. Sale Price", x="Quality", y="Price")

This was expected, since you'd naturally pay more for a house of better quality. You wouldn't want your foot to break through the floorboard, would you? Now that the qualitative variables are out of the way, we can focus on the numeric variables. The very first candidate we have here is GrLivArea.

ggplot(dataset, aes(x=SalePrice, y=GrLivArea)) +
  theme_bw()+
  geom_point(colour="Blue", alpha=0.3)+
  theme(legend.position='none')+
  labs(title = "General Living Area vs. Sale Price", x="Price", y="Area")

I would be lying if I said I didn't expect this. The very first instinct of a customer is to check the area of the rooms. And I think the result will be the same for TotalBsmtSF. Let's see…

ggplot(dataset, aes(x=SalePrice, y=TotalBsmtSF)) +
  theme_bw()+
  geom_point(colour="Blue", alpha=0.3)+
  theme(legend.position='none')+
  labs(title = "Total Basement Area vs. Sale Price", x="Price", y="Area")

So what can we say about our cherry-picked variables?

GrLivArea and TotalBsmtSF were both found to be in a linear relationship with SalePrice. As for the categorical variables, we can say with confidence that the two variables we picked are related to SalePrice.

But these are not the only variables, and there's more here than meets the eye. So to tread over these many variables, we'll take help from a correlation matrix to see how each variable correlates and get better insight.

Time for Correlation Plots

So what is correlation?

Correlation is a measure of how well two variables are related to each other. There are positive as well as negative correlations.
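As a tiny, self-contained illustration (toy vectors, not the housing data): cor() returns values between -1 and 1, and the sign gives the direction of the relationship.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)    # moves with x: perfect positive correlation
z <- c(10, 8, 6, 4, 2)    # moves against x: perfect negative correlation

cor(x, y)                 # 1
cor(x, z)                 # -1
cor(x, c(3, 1, 4, 1, 5))  # something in between: a weaker relationship
```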

If you want to read more on correlation, take a look at this article. So let's create a basic correlation matrix.

M <- dataset %>% mutate_if(is.character, as.factor)  #encode character columns as factors
M <- M %>% mutate_if(is.factor, as.numeric)          #then factors as numbers, so cor() works
M <- cor(M)
print(M)

#plotting the correlation matrix
corrplot(M, method = "color", tl.col = 'black', is.corr=FALSE)

Please don't close this tab. I promise it gets better.

But worry not, because now we're going to get our hands dirty and make this plot interpretable and tidy.

M[lower.tri(M, diag=TRUE)] <- NA   #remove self-correlations and duplicates
M[M == 1] <- NA

M <- as.data.frame(as.table(M))    #turn into a 3-column table
M <- na.omit(M)                    #remove the NA values from above
M <- subset(M, abs(Freq) > 0.5)    #select significant values, in this case > 0.5
M <- M[order(-abs(M$Freq)),]       #sort by highest correlation

mtx_corr <- reshape2::acast(M, Var1~Var2, value.var="Freq")     #turn M back into a matrix
corrplot(mtx_corr, is.corr=TRUE, tl.col="black", na.label=" ")  #plot correlations visually

Now, this looks much better and more readable.

Looking at our plot, we can see numerous other variables that are highly correlated with SalePrice. We pick these variables and then create a new dataframe including only this selection.

Now that we have our suspect variables, we can use a pair plot to visualize all of them in conjunction with each other.

newData <- data.frame(dataset$SalePrice, dataset$TotalBsmtSF,
                      dataset$GrLivArea, dataset$OverallQual,
                      dataset$YearBuilt, dataset$FullBath,
                      dataset$GarageCars)

pairs(newData[1:7],
      col = "blue",
      main = "Pairplot of our new set of variables")

While you're at it, clean your data

We should remove some variables that we are sure will not be of any use. Don't apply changes to the original dataset, though. Always create a new copy in case you remove something you shouldn't have.

clean_data <- dataset[, !grepl("^Bsmt", names(dataset))]     #remove Bsmt* variables

drops <- c("PoolQC", "PoolArea", "FullBath", "HalfBath")     #plain column names, so %in% matches
clean_data <- clean_data[, !(names(clean_data) %in% drops)]  #the variables in 'drops' are removed

Univariate Analysis

Taking a look back at our old friend SalePrice, we see some extremely expensive houses. We haven't delved into why that is. We do know, though, that these extremely pricey houses don't follow the pattern the other house prices follow. The reasons for such high prices could be justified, but for the sake of our analysis we may have to drop them. Such records are called outliers.

A simple way to understand outliers is to think of them as that one guy (or more) in your group who likes to eat noodles with a spoon instead of a fork.

So first we catch these outliers, and then remove them from our dataset if need be. Let's start with the catching part.

#Univariate Analysis
library(gridExtra)                                    #for grid.arrange(); not loaded at the top

clean_data$price_norm <- scale(clean_data$SalePrice)  #normalizing the price variable
summary(clean_data$price_norm)

plot1 <- ggplot(clean_data, aes(x=factor(1), y=price_norm)) +
  theme_bw()+
  geom_boxplot(width = 0.4, fill = "blue", alpha = 0.2)+
  geom_jitter(width = 0.1, size = 1, aes(colour = "red"))+
  geom_hline(yintercept=6.5, linetype="dashed", color = "red")+
  theme(legend.position='none')+
  labs(title = "Hunt for Outliers", x=NULL, y="Normalized Price")

plot2 <- ggplot(clean_data, aes(x=price_norm)) +
  theme_bw()+
  geom_histogram(color = 'black', fill = 'blue', alpha = 0.2)+
  geom_vline(xintercept=6.5, linetype="dashed", color = "red")+
  geom_density(aes(y=0.4*..count..), colour="red", adjust=4) +
  labs(title = "", x="Price", y="Count")

grid.arrange(plot1, plot2, ncol=2)

The very first thing I did here was normalize SalePrice so that it's more interpretable and it's easier to home in on these outliers. The normalized SalePrice has mean = 0 and SD = 1. Running a quick summary() on this new variable price_norm shows that the bulk of the values sit within a few standard deviations of zero, while the maximum lies far beyond them.

So now we know for sure that there ARE outliers present. But do we really need to get rid of them? From the previous scatterplots, we can say that these outliers still follow the general trend and don't need purging yet. Deciding what to do with outliers can be quite complex at times. You can read more on outliers here.
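To make the "catching" concrete, here is a minimal sketch on toy prices. The 6.5 cutoff drawn in the plot above is specific to this dataset; the 2-SD threshold below is just for this small example.

```r
# Toy prices: most values cluster together, one extreme value stands out.
prices <- c(150000, 180000, 200000, 175000, 160000, 190000, 165000, 800000)
z <- as.vector(scale(prices))   # normalize: mean 0, sd 1

# Flag anything beyond 2 standard deviations from the mean.
outliers <- which(abs(z) > 2)
prices[outliers]                # the one suspicious record: 800000
```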

Bi-Variate Analysis

Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of a relationship between two variables: whether there exists an association and the strength of this association, or whether there are differences between the two variables and the significance of these differences. There are three types of bivariate analysis.

  • Numerical & Numerical
  • Categorical & Categorical
  • Numerical & Categorical

The very first pair of variables we will analyze here is SalePrice and GrLivArea. Both variables are numerical, so using a scatter plot is a good idea!

ggplot(clean_data, aes(y=SalePrice, x=GrLivArea)) +
  theme_bw()+
  geom_point(aes(color = SalePrice), alpha=1)+
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price", y="Price", x="Area")

Immediately, we notice that two houses don't follow the linear trend and affect both our results and assumptions. These are our outliers. Since our future results are prone to be affected negatively by them, we will remove them.

clean_data <- clean_data[!(clean_data$GrLivArea > 4000),]   #remove outliers

ggplot(clean_data, aes(y=SalePrice, x=GrLivArea)) +
  theme_bw()+
  geom_point(aes(color = SalePrice), alpha=1)+
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "General Living Area vs. Sale Price [Outlier Removed]", y="Price", x="Area")

The outliers are removed and the x-scale adjusts accordingly. The next pair of variables we will analyze is SalePrice and TotalBsmtSF.

ggplot(clean_data, aes(y=SalePrice, x=TotalBsmtSF)) +
  theme_bw()+
  geom_point(aes(color = SalePrice), alpha=1)+
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07")) +
  labs(title = "Total Basement Area vs. Sale Price", y="Price", x="Basement Area")

The observations here adhere to our assumptions and don't need purging. If it ain't broke, don't fix it. I did mention that it is important to tread very carefully when working with outliers; you don't get to remove them every time.

Time to dig a bit deeper

We based a ton of visualization around SalePrice and other important variables, but what if I said that's not enough? It isn't, because there's more to dig out of this pit. There are four horsemen of data analysis that I believe people should remember.

  • Normality: When we talk about normality, what we mean is that the data should look like a normal distribution. This is important because a lot of statistical tests depend on it (for example, t-statistics). First, we check normality with just the single variable SalePrice (it's usually better to start with a single variable). One shouldn't assume that univariate normality proves the existence of multivariate normality (which is comparatively more sought after), but it helps. Another thing to note is that for larger samples, i.e. more than about 200 observations, normality is not such an issue. Still, a lot of problems can be avoided if we solve normality, which is one of the reasons we work on it.

  • Homoscedasticity: Homoscedasticity refers to the assumption that one or more dependent variables exhibit equal levels of variance across the range of predictor variables. If we want the error term to be the same across all values of the independent variable, then homoscedasticity has to be checked.

  • Linearity: If you want to assess the linearity of your data, then I believe scatter plots should be the first choice. Scatter plots can quickly show a linear relationship (if it exists). In cases where the pattern is not linear, it is worthwhile to explore data transformations. However, we need not check for this again, since our previous plots have already shown the existence of a linear relationship.

  • Absence of correlated errors: When working with errors, if you notice a pattern where one error is correlated with another, then there's a relationship between these variables. For example, if one positive error systematically produces a negative error, that implies a relationship between the errors. This phenomenon is more evident with time-sensitive data. If you do find yourself working with such data, try to add a variable that can explain your observations.
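Normality can also be checked numerically rather than visually. One quick check is skewness; the e1071 package loaded at the top provides skewness(), and the helper below computes essentially the same moment-based statistic in base R so this sketch stays self-contained. A strongly right-skewed variable, as prices tend to be, moves toward symmetry under a log transform:

```r
# Moment-based sample skewness; e1071::skewness() computes essentially the same thing.
skew <- function(v) mean((v - mean(v))^3) / (sqrt(mean((v - mean(v))^2)))^3

set.seed(1)
x <- rlnorm(1000, meanlog = 12, sdlog = 0.4)   # right-skewed, price-like values

skew(x)        # clearly positive: a long right tail
skew(log(x))   # near zero: roughly symmetric after the transform
```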

I think we should start doing rather than saying

Starting with SalePrice. Do keep an eye on the overall distribution of our variable.

plot3 <- ggplot(clean_data, aes(x=SalePrice)) +
  theme_bw()+
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  geom_density(color="black", alpha=1, adjust = 5, lwd=1.2)+
  labs(title = "Sale Price Density", x="Price", y="Density")

plot4 <- ggplot(clean_data, aes(sample=SalePrice))+
  theme_bw()+
  stat_qq(color="#69b3a2")+
  stat_qq_line(color="black",lwd=1, lty=2)+
  labs(title = "Probability Plot for SalePrice")

grid.arrange(plot3, plot4, ncol=2)

SalePrice is not normal! But we have another trick up our sleeves: the log transformation. One great thing about the log transformation is that it can deal with skewed data and make it more normal. So now it's time to apply it to our variable.

clean_data$log_price <- log(clean_data$SalePrice)

plot5 <- ggplot(clean_data, aes(x=log_price)) +
  theme_bw()+
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  geom_density(color="black", alpha=1, adjust = 5, lwd=1)+
  labs(title = "Sale Price Density [Log]", x="Price", y="Density")

plot6 <- ggplot(clean_data, aes(sample=log_price))+
  theme_bw()+
  stat_qq(color="#69b3a2")+
  stat_qq_line(color="black",lwd=1, lty=2)+
  labs(title = "Probability Plot for SalePrice [Log]")

grid.arrange(plot5, plot6, ncol=2)

Now repeat the process with the rest of our variables.

We go with GrLivArea first

After Log Transformation
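The code for this repeated transformation isn't shown in the original, but the grlive_log column it creates is used in the homoscedasticity plots further below. A minimal sketch of the assumed step (GrLivArea is strictly positive, so a plain log() is safe; the toy data frame here just stands in for the article's clean_data):

```r
# Stand-in for the article's clean_data, which comes from train.csv.
clean_data <- data.frame(GrLivArea = c(896, 1329, 1629, 2524))

# GrLivArea has no zeroes, so no special handling is needed
# (unlike TotalBsmtSF in the next section).
clean_data$grlive_log <- log(clean_data$GrLivArea)

all(is.finite(clean_data$grlive_log))   # TRUE
```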

Now for TotalBsmtSF

Hold on! We've got something interesting here.

It looks like TotalBsmtSF has some zeroes, which doesn't bode well for a log transformation (log(0) is undefined). We'll have to do something about it. To apply a log transformation here, we'll create a binary variable capturing the effect of having or not having a basement. Then we'll apply the log transformation to all the non-zero observations, ignoring those with value zero. This way we can transform the data without losing the effect of having or not having a basement.

#Create a new variable to dictate which rows to transform and which to ignore
clean_data <- transform(clean_data, cat_bsmt = ifelse(TotalBsmtSF > 0, 1, 0))

#Now we can do the log transformation, on the non-zero rows only
clean_data <- transform(clean_data, totalbsmt_log = ifelse(cat_bsmt == 1, log(TotalBsmtSF), 0))

plot13 <- ggplot(clean_data, aes(x=totalbsmt_log)) +
  theme_bw()+
  geom_density(fill="#ed557e", color="#e9ecef", alpha=0.5)+
  geom_density(color="black", alpha=1, adjust = 5, lwd=1)+
  labs(title = "Total Basement Area Density [transformed]", x="Area", y="Density")

plot14 <- ggplot(clean_data, aes(sample=totalbsmt_log))+
  theme_bw()+
  stat_qq(color="#ed557e")+
  stat_qq_line(color="black",lwd=1, lty=2)+
  labs(title = "Probability Plot for TotalBsmtSF [transformed]")

grid.arrange(plot13, plot14, ncol=2)

We can still see the ignored data points on the chart, but hey, I can trust you with this, right?

Homoscedasticity — wait, is my spelling correct?

The best way to look for homoscedasticity is to try to visualize the variables using charts. A scatter plot should do the job. Notice the shape the data forms when plotted. It could look like an equal dispersion resembling a cone, or it could very well look like a diamond, where a large number of data points are spread around the centre.

Starting with 'SalePrice' and 'GrLivArea'…

#grlive_log is the log-transformed GrLivArea from the previous section
ggplot(clean_data, aes(x=grlive_log, y=log_price)) +
  theme_bw()+
  geom_point(colour="#e34262", alpha=0.3)+
  theme(legend.position='none')+
  labs(title = "Homoscedasticity : Living Area vs. Sale Price", x="Area [Log]", y="Price [Log]")

We plotted 'SalePrice' and 'GrLivArea' before, so why does the plot look different? That's right: because of the log transformation.

If we go back to the previously plotted graphs showing the same variables, it is evident that the data had a conical shape. After the log transformation, the conic shape is no more. Here we solved the homoscedasticity problem with just one transformation. Pretty powerful, eh?

Now let's check 'SalePrice' against 'TotalBsmtSF'.

ggplot(clean_data, aes(x=totalbsmt_log, y=log_price)) +
  theme_bw()+
  geom_point(colour="#e34262", alpha=0.3)+
  theme(legend.position='none')+
  labs(title = "Homoscedasticity : Total Basement Area vs. Sale Price", x="Area [Log]", y="Price [Log]")

Please take care of 0 for me :)

That's it, we've reached the end of our analysis. Now all that's left is to get the dummy variables and… you know the rest. :)

This work was possible thanks to Pedro Marcelino. I found his analysis of this dataset in Python and wanted to re-write it in R. Give him some love!

Original article: https://medium.com/@unkletam/beginners-guide-exploratory-data-analysis-in-r-47dac64d95fe
