Performing a Bayesian Analysis by Hand

Introduction

Bayesian analysis offers the possibility to get more insights from your data compared to the pure frequentist approach. In this post, I will walk you through a real life example of how a Bayesian analysis can be performed. I will demonstrate what may go wrong when choosing a wrong prior and we will see how we can summarize our results. For you to follow this post, I assume you are familiar with the foundations of Bayesian statistics and with Bayes' theorem.

Scenario

As an example analysis, we will discuss a real life problem from a physics lab. No worries, you don't need any physics knowledge for that. We want to determine the efficiency of a particle detector. A particle detector is a sensor that may produce a measurable signal when certain particles traverse it. The efficiency of the detector we want to evaluate is the chance that the detector actually measures the traversing particle. In order to measure this, we put the detector that we want to evaluate in between two other sensors in a sandwich-like structure. If we measure a signal in the top and bottom sensors we know that a particle should have also traversed the detector in the middle. A picture of the experimental setup is shown below.

For the measurement, we count the number of traversing particles N in a certain time (as reported by the top and bottom sensors) as well as the number of signals measured in our detector r. For this example, we assume N=100 and r=98.

Frequentist Result

In a frequentist approach, we could use our measured data and arrive at the conclusion that the efficiency of the detector is e = r/N = 98%. This gives us only a point estimate. If we want to answer more complicated questions, for example: "What is the probability that the efficiency of the detector is above 99%", then we need a more complex analysis.

The Bayesian Analysis

The goal of the Bayesian approach is to derive the full posterior probability distribution of the efficiency of the detector given our data, p(e|D). In order to do so, we need Bayes' theorem:

Bayes' Theorem: p(e|D) = p(D|e)p(e) / p(D)

We will go over the different terms in the following.

Probability Model / Likelihood: p(D|e)

As always in a Bayesian analysis, we need to select a model that describes the process we want to analyse, called the likelihood. For our problem, we can interpret the efficiency as the chance of a success (r) out of a certain number of trials (N). This class of problems, similar to determining the chance of a coin showing heads, can be modeled by the binomial distribution:

Binomial distribution: p(D|e) = C(N, r) e^r (1-e)^(N-r), where C(N, r) = N! / (r! (N-r)!)
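
To make the likelihood concrete, here is a minimal sketch that evaluates the binomial probability for a few candidate efficiencies, using the N=100 and r=98 from our measurement. The function name binomial_likelihood is my own choice for illustration and not taken from the original notebook.

```python
from math import comb

def binomial_likelihood(e, n_trials, n_success):
    """p(D|e): probability of n_success detections in n_trials for efficiency e."""
    return comb(n_trials, n_success) * e**n_success * (1 - e)**(n_trials - n_success)

N, r = 100, 98  # measurement from the text

# Evaluate the likelihood for a few candidate efficiencies
for e in (0.95, 0.98, 0.99):
    print(f"e = {e:.2f}: p(D|e) = {binomial_likelihood(e, N, r):.4f}")
```

Note that for a fixed efficiency the likelihood is a proper probability distribution over r, but as a function of e it is not normalized. That is why we need the marginal likelihood later.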

Prior: p(e)

Next, we need to define a prior. Here, we start with the most trivial choice, a flat prior. We will discuss the influence of a different prior choice later.

Flat prior: p(e) = 1 for 0 <= e <= 1, and 0 otherwise

Marginal Likelihood: p(D)

The marginal likelihood is the denominator in Bayes' theorem. Luckily, it is just a normalization constant and does not depend on the efficiency. We can determine it numerically by finding the constant that normalizes the posterior to 1.
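
One way to do this numerically is to evaluate likelihood times prior on a fine grid of efficiencies and sum it up. The sketch below assumes a midpoint grid with 10,000 steps (my choice, not from the original notebook) and a flat prior; the variable names are illustrative only.

```python
from math import comb

N, r = 100, 98                  # measurement from the text
STEPS = 10_000                  # grid resolution (an assumption of this sketch)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]   # midpoints of [0, 1]

def likelihood(e):
    return comb(N, r) * e**r * (1 - e)**(N - r)

def flat_prior(e):
    return 1.0

# Numerator of Bayes' theorem on the grid
unnormalized = [likelihood(e) * flat_prior(e) for e in grid]

# p(D) is the constant that makes the posterior integrate to 1
marginal = sum(unnormalized) * de
posterior = [u / marginal for u in unnormalized]

mode = grid[posterior.index(max(posterior))]
print(f"p(D) ~ {marginal:.5f}, most probable efficiency ~ {mode:.3f}")
```

For the flat prior the integral can also be done by hand and gives p(D) = 1/(N+1), which is a handy cross-check for the grid result.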

Results

Now we can calculate the posterior following Bayes' theorem.

Figure: Posterior distribution p(e|D) for N=100, r=98 with a flat prior

You can see that the most probable value is e=98%, which is the same as the intuitive frequentist result. But we obtained much more information here, as we got the full posterior probability distribution. For example, we can see that the distribution is asymmetric: an efficiency below 97% has a higher probability than an efficiency above 99%, and we can assign exact numbers to both probabilities. How did we get this extra information? We put more information in: we assumed that the behaviour of the detector follows a binomial distribution, and we assumed a flat prior distribution.
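
Assigning numbers to such statements amounts to integrating the posterior over the range of interest. Here is a minimal sketch under the same assumptions as before (flat prior, a midpoint grid of my choosing); the helper name probability is hypothetical.

```python
N, r = 100, 98
STEPS = 10_000                  # grid resolution (an assumption of this sketch)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]

# Flat prior: the posterior is the normalized likelihood.
# The binomial coefficient is constant in e and cancels in the normalization.
unnorm = [e**r * (1 - e)**(N - r) for e in grid]
norm = sum(unnorm) * de
posterior = [u / norm for u in unnorm]

def probability(lower, upper):
    """Posterior probability that lower < e < upper (midpoint-rule sum)."""
    return sum(p for e, p in zip(grid, posterior) if lower < e < upper) * de

p_above_99 = probability(0.99, 1.0)
p_below_97 = probability(0.0, 0.97)
print(f"P(e > 99%) ~ {p_above_99:.3f}, P(e < 97%) ~ {p_below_97:.3f}")
```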

Influence of the Prior

The prior plays an important role in a Bayesian analysis. In the following, we will see what happens if we change it. Let’s say we find a statement in the datasheet of the detector that the efficiency can be assumed to be Gaussian distributed around 98% with a standard deviation of s=1%. In an older version of the datasheet, however, we find that the efficiency of the detector should be Gaussian distributed around 92% with the same standard deviation of s=1%. We incorporate this information into the posterior by changing the priors accordingly. The results for both cases can be seen below.

Figure: Posterior and prior probabilities for different priors

Here, the posteriors are shown in the top panel and the corresponding priors in the panel below. The black curve shows the previous result with the flat prior. When changing the prior to a Gaussian one with mean m=98% (green), the posterior peaks again at 98% and our confidence in the estimate is stronger than with the flat prior: the prior supports our data. While an efficiency below 95% still had a reasonable probability in the case of the flat prior, it is nearly excluded now. Taking the prior from the old datasheet that peaked at an efficiency of 92% (red), we can see that the posterior differs significantly from the other two. The most probable value is around 93%, completely changing our results. How can this be? The problem is that the wrong prior and the data are not consistent with each other. This example shows that choosing a wrong prior may have catastrophic consequences. It is important to always evaluate the consistency between the prior, the probability model and the posterior.
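
The effect of the prior can be reproduced with a few lines of code. The sketch below compares the most probable efficiency for the flat prior and the two Gaussian priors (mean 98% or 92%, s=1%). The grid resolution and all function names are my own assumptions, and the Gaussians are simply cut off outside [0, 1].

```python
from math import exp

N, r = 100, 98
STEPS = 10_000                  # grid resolution (an assumption of this sketch)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]

def gaussian_prior(mean, sigma):
    # Unnormalized Gaussian; the constant cancels when the posterior is normalized
    return lambda e: exp(-0.5 * ((e - mean) / sigma) ** 2)

def posterior_on_grid(prior):
    # The binomial coefficient is constant in e, so it is dropped here
    unnorm = [e**r * (1 - e)**(N - r) * prior(e) for e in grid]
    norm = sum(unnorm) * de
    return [u / norm for u in unnorm]

def mode(density):
    return grid[density.index(max(density))]

priors = {
    "flat": lambda e: 1.0,
    "gaussian, m=98%": gaussian_prior(0.98, 0.01),
    "gaussian, m=92%": gaussian_prior(0.92, 0.01),
}
modes = {name: mode(posterior_on_grid(prior)) for name, prior in priors.items()}
for name, value in modes.items():
    print(f"{name}: most probable efficiency ~ {value:.3f}")
```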

Incorporating Additional Measurements

Another use case for a prior is an additional measurement. Imagine your colleague measured the same detector. He measured N1=300 and r1=280. How can we correctly make use of this data? We can use it as a prior for our analysis. The results are shown below.

Figure: Using a previous measurement as a prior

You can see the posterior distributions of our measurement (black) and our colleague's measurement (blue), both using flat priors. If we use our colleague's measurement as a prior for our analysis, we arrive at the green curve. The most probable value of the green curve lies between those of the other two curves, but closer to the blue curve, as our colleague's measurement has more data. Also, the green curve is slightly narrower than the other two. Side note: The resulting posterior again has the binomial form e^r (1-e)^(N-r) as a function of e (strictly speaking, as a distribution in e it is a beta distribution). Moreover, we arrive at the same posterior as if we redid the analysis assuming a single measurement with N=N1+N2=400 and r=r1+r2=378, where N2=100 and r2=98 denote our own measurement. As you would expect, the result is also independent of the order in which the two measurements were performed. This can easily be verified analytically.
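
The equivalence can also be checked numerically. In this sketch, the colleague's posterior (from a flat prior) serves as the prior for our measurement, and the result is compared with a single pooled measurement of N=400, r=378. The helper name normalized and the grid settings are assumptions of mine.

```python
STEPS = 10_000                  # grid resolution (an assumption of this sketch)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]

def normalized(values):
    norm = sum(values) * de
    return [v / norm for v in values]

def likelihood_shape(n, k):
    # Binomial likelihood without the constant binomial coefficient
    return [e**k * (1 - e)**(n - k) for e in grid]

N1, r1 = 300, 280               # colleague's measurement
N2, r2 = 100, 98                # our measurement

# Step 1: colleague's posterior with a flat prior
colleague = normalized(likelihood_shape(N1, r1))

# Step 2: use it as the prior for our own measurement
combined = normalized([l * p for l, p in zip(likelihood_shape(N2, r2), colleague)])

# Alternative: treat everything as one measurement with a flat prior
pooled = normalized(likelihood_shape(N1 + N2, r1 + r2))

max_diff = max(abs(a - b) for a, b in zip(combined, pooled))
combined_mode = grid[combined.index(max(combined))]
print(f"largest difference: {max_diff:.2e}, mode ~ {combined_mode:.3f}")
```

Swapping the roles of the two measurements in steps 1 and 2 leads to the same curve, since the posterior only depends on the product of the two likelihoods.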

How to present your results

After calculating the posterior, we now want to present our results. Ideally, you want to show the full posterior distribution, as this reflects the full information. However, this is not always possible and you may want to summarize it with a set of values. Often you want to give a point estimate along with an interval that summarizes the width of the distribution. There are different ways to do this. Popular choices include:

  • Expectation value & standard deviation
  • Median & central interval
  • Mode & smallest interval

Additionally, we need to select how much probability should be included in the intervals (often used: 68% or 90%).

For a normal distribution, all three choices of point estimate and credible interval give identical results. However, for a skewed distribution like ours they do not.

Figure: Different combinations of point estimates and corresponding intervals in order to summarize a posterior

You can see that all three choices lead to different results. None of these is wrong or correct; it is just important to report exactly which point estimate you used and how you constructed your interval. Here we could say, for example, that the most probable value (mode) of our posterior is 0.98 with a credible interval of 0.962-0.991 (smallest interval including 68% of the probability).
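
The three summaries can be computed from a gridded posterior as sketched below, for the flat-prior case and a 68% level. This is only one possible implementation with names of my choosing: the smallest interval is built by collecting grid cells in order of decreasing density, which yields a single interval only because our posterior has one peak.

```python
N, r = 100, 98
STEPS = 10_000                  # grid resolution (an assumption of this sketch)
LEVEL = 0.68                    # probability content of the intervals
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]

unnorm = [e**r * (1 - e)**(N - r) for e in grid]
norm = sum(unnorm) * de
post = [u / norm for u in unnorm]

# 1) Expectation value and standard deviation
mean = sum(e * p for e, p in zip(grid, post)) * de
std = (sum((e - mean) ** 2 * p for e, p in zip(grid, post)) * de) ** 0.5

# 2) Median and central interval (equal probability cut off on both sides)
def quantile(q):
    total = 0.0
    for e, p in zip(grid, post):
        total += p * de
        if total >= q:
            return e
    return grid[-1]

median = quantile(0.5)
central = (quantile((1 - LEVEL) / 2), quantile((1 + LEVEL) / 2))

# 3) Mode and smallest interval (highest-density cells first)
mode = grid[post.index(max(post))]
covered, chosen = 0.0, []
for p, e in sorted(zip(post, grid), reverse=True):
    chosen.append(e)
    covered += p * de
    if covered >= LEVEL:
        break
smallest = (min(chosen), max(chosen))

print(f"mean {mean:.3f} +/- {std:.3f}")
print(f"median {median:.3f}, central interval {central[0]:.3f}-{central[1]:.3f}")
print(f"mode {mode:.3f}, smallest interval {smallest[0]:.3f}-{smallest[1]:.3f}")
```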

Conclusions

We performed a full Bayesian analysis, from setting up a probability model and choosing appropriate priors all the way to summarizing the posterior with a point estimate and a corresponding interval. The advantage of the Bayesian approach is that we gain access to the full posterior probability distribution. This enabled us to elegantly incorporate prior knowledge, for example the manufacturer's information or a previous measurement. Furthermore, we saw that the choice of a wrong prior may have a significant influence on our results, highlighting that a careful choice of the prior and an evaluation of its consistency with the probability model and the posterior are of high importance in any Bayesian analysis.

A Python notebook producing the numbers and figures can be found here.

Translated from: https://towardsdatascience.com/performing-a-bayesian-analysis-by-hand-c589ab992916
