Introduction
Welcome to this lesson on calculating p-values.
Before we jump into how to calculate a p-value, it’s important to think about what the p-value is really for.
Hypothesis Testing Refresher
Without going into too much detail for this post: when establishing a hypothesis test, you will determine a null hypothesis. Your null hypothesis represents the world in which the two variables you’re assessing have no relationship. Conversely, the alternative hypothesis represents the world where there is a relationship between them. The test asks whether the data give you enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
Diving Deeper
Before we move on from the idea of hypothesis testing… think about what we just said. You effectively need to show, with little room for error, that what we’re seeing in the real world would be very unlikely in a world where these variables are unrelated, that is, where they are independent.
Sometimes when learning concepts in statistics, you hear the definition but take little time to conceptualize it. There is often a lot of memorization of rule sets… I find that understanding the intuitive foundation of these principles will serve you far better when finding their practical applications.
Continuing in this vein of thought: if you want to compare your real-world statistic with the one from a fake world, that’s exactly what you should do.
As you’d guess, we can calculate our observed statistic by creating a linear regression model where we explain our response variable as a function of our explanatory variable. Once we’ve done this, we can quantify the relationship between these two variables using the slope, or coefficient, identified through our OLS regression.
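To make the idea of the slope as our statistic concrete, here is a minimal sketch on simulated data. The variables x and y and the true slope of 0.8 are made up for illustration; they are not the housing data. Under those assumptions, it shows that the coefficient reported by lm matches the closed-form OLS slope, cov(x, y) / var(x).

```r
# Simulated stand-in data; x, y and the 0.8 slope are assumptions for illustration
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 0.8 * x + rnorm(n, sd = 0.5)

# Slope from an OLS fit
fit_demo <- lm(y ~ x)
slope_lm <- unname(coef(fit_demo)["x"])

# Same slope from the closed-form OLS formula
slope_formula <- cov(x, y) / var(x)

print(c(lm = slope_lm, formula = slope_formula))
```

Both numbers agree, which is a useful reminder that the slope is just a summary of how x and y move together.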
But now we need to come up with this idea of the null world… or the world where these variables are independent. This is something we don’t have, so we’ll need to simulate it. For our convenience, we’re going to leverage the infer package.
Let’s Calculate our Observed Statistic
First things first, let’s get our observed statistic!
The dataset we’re working with is a Seattle home prices dataset. I’ve used this dataset many times before and find it particularly flexible for demonstration. Each record in the dataset is a home, with details such as price, square footage, number of beds, number of baths, and so forth.
Through the course of this post, we’ll be trying to explain price as a function of square footage.
Let’s create our regression model
fit <- lm(price_log ~ sqft_living_log,
data = housing)
summary(fit)
In the summary output, the statistic we’re after is the Estimate for our explanatory variable, sqft_living_log.
A very clean way to do this is to tidy our results (tidy comes from the broom package) so that, rather than a linear model object, we get a tibble. Tibbles, tables, and data frames are a lot easier for us to interact with systematically.
We’ll then want to filter down to the sqft_living_log term, and we’ll wrap it up by using the pull function to return the estimate itself. This will return the slope as a number, which will make it easier to compare with our null distribution later on.
Take a look!
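library(dplyr)  # %>%, filter(), pull()
library(broom)  # tidy()

# Store the observed slope; it is reused as obs_slope further down
obs_slope <- lm(price_log ~ sqft_living_log,
                data = housing) %>%
  tidy() %>%
  filter(term == 'sqft_living_log') %>%
  pull(estimate)
obs_slope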
Time to Simulate!
To kick things off, you should know there are various types of simulation. The one we’ll be using here is what’s called permutation.
Permutation is particularly helpful when it comes to showing a world where variables are independent of one another.
While we won’t be going into the specifics of how a permutation sample is created under the hood, it’s worth noting that the distribution of the statistic across permuted samples will be roughly bell-shaped and centered around 0.
In this case, the slope would center around 0 as we’re operating under the premise that there is no relationship between our explanatory and response variables.
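For the curious, the core idea of permutation can be sketched in a few lines of base R. This is only an illustration on simulated data (x, y, the 0.5 slope, and the helper perm_slope are all made up here), and it is not a description of how infer implements it internally. Shuffling the response breaks any link to the explanatory variable, so the refitted slopes scatter around 0.

```r
# Simulated stand-in data; names and numbers are assumptions for illustration
set.seed(1)
n <- 300
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

# Observed slope on the real (unshuffled) pairing
obs_demo <- unname(coef(lm(y ~ x))["x"])

# One permuted slope: shuffle y so any link to x is broken, then refit
perm_slope <- function() {
  y_shuffled <- sample(y)
  unname(coef(lm(y_shuffled ~ x))["x"])
}

# Repeat to build a null distribution of slopes
null_slopes <- replicate(500, perm_slope())

mean(null_slopes)  # close to 0
obs_demo           # far from 0 here, because x and y really are related
```

Calling hist(null_slopes) on the result gives the same kind of picture as the infer-based histogram in this post.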
Infer Fundamentals
A few things for you to know:
- specify is how we determine the relationship we’re modeling: price_log ~ sqft_living_log
- hypothesize is where we designate independence
- generate is how we determine the number of replications of our dataset we want to make. Note that if you generated one replicate and did not call calculate, it would return a sample dataset of the same size as the original dataset.
- calculate allows you to determine the calculation in question (slope, mean, median, diff in means, etc.)
library(infer)
set.seed(1)
perm <- housing %>%
  specify(price_log ~ sqft_living_log) %>%
  hypothesize(null = 'independence') %>%
  generate(reps = 100, type = 'permute') %>%
  calculate('slope')
perm
hist(perm$stat)
Same distribution with 1000 reps
Null Sampling Distribution
OK, we’ve done it! We’ve created what is known as the null sampling distribution. What we’re seeing above is a distribution of 1000 slopes, one fitted to each of 1000 permuted versions of the data in which the variables are independent.
This gives us just what we needed: a simulated world against which we can compare reality.
Taking the visual we just made, let’s use a density plot and add a vertical line for our observed slope, marked in red.
library(ggplot2)

# obs_slope is the observed slope computed earlier
ggplot(perm, aes(stat)) +
  geom_density() +
  geom_vline(xintercept = obs_slope, color = 'red')
Visually, you can see that our observed slope sits far beyond anything produced by random chance.
As you can guess from looking at this, the p-value here is going to be 0. That is to say, 0% of the null sampling distribution is greater than or equal to our observed statistic. (Strictly speaking, the true p-value is just too small for this many replicates to resolve.)
If we were in fact seeing many cases where the slope from our permuted data was greater than or equal to our observed statistic, we would conclude that our observed slope could plausibly be due to chance alone.
To reiterate the message here, the purpose of the p-value is to give you an idea of how plausible it is that we would see such a slope through random chance alone, rather than because of a real relationship.
Calculating P-value
While we know what our p-value will be here, let’s get you set up with the calculation for p-value.
To re-prime this idea: the p-value is the proportion of replicates that were (randomly) greater than or equal to our observed slope.
You’ll see in our summarise function that we’re checking whether our stat, or slope, is greater than or equal to the observed slope. Each record will be assigned TRUE or FALSE accordingly. When you wrap that in a mean function, TRUE counts as 1 and FALSE as 0, resulting in the proportion of cases where stat was greater than or equal to our observed slope. The code then multiplies that proportion by 2, a common way of turning a one-sided proportion into a two-sided p-value.
perm %>%
summarise(p_val = 2 * round(mean(stat >= obs_slope),2))
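The TRUE/FALSE-to-proportion step is easy to check by hand. Here is a tiny sketch with ten made-up null slopes and a made-up observed slope (null_stats and obs_demo are illustrative values, not results from the housing data).

```r
# Made-up null slopes and observed slope, purely for illustration
null_stats <- c(-0.03, 0.01, 0.04, -0.02, 0.06, 0.00, -0.05, 0.02, 0.03, -0.01)
obs_demo <- 0.035

# Each comparison yields TRUE or FALSE
is_extreme <- null_stats >= obs_demo
is_extreme

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs
p_one_sided <- mean(is_extreme)
p_one_sided  # 2 of the 10 values qualify, so 0.2
```

Only 0.04 and 0.06 are at least as large as 0.035, so the proportion is 2 out of 10.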
To see a case with a weaker relationship, one where we would not have sufficient evidence to reject the null hypothesis, let’s look at price explained as a function of the year the home was built.
Using the same calculation as above, this results in a p-value of 12%, which at the standard 5% significance level (a 95% confidence level) is not sufficient evidence to reject the null hypothesis.
Final Notes on P-value Interpretation
One final thing I want to highlight one more time:
The meaning of 12%: we saw that when we randomly generated independent samples, a whole 12% of the time our randomly generated slope was as extreme as, or more extreme than, the one we observed.
You might see such a result as much as 12% of the time just due to random chance.
Conclusion
That’s it! You’re a master of calculating and understanding the p-value.
In a few short minutes we have learned a lot:
- hypothesis testing
- linear regression refresher
- sampling explanation
- learning about the infer package
- building a sampling distribution
- visualizing p-value
- calculating p-value
It’s easy to get lost when dissecting statistics concepts like p-value. My hope is that having a strong foundational understanding of the need and the corresponding execution allows you to understand and correctly apply this to a wide variety of problems.
If this was helpful, feel free to check out my other posts at https://medium.com/@datasciencelessons. Happy Data Science-ing!
Translated from: https://towardsdatascience.com/getting-to-the-bottom-of-p-value-the-intuitive-explanation-calculation-fec46bb15a92