Data Science and Statistics: Statistics in Data Science

Statistics

Statistics is used to tackle complex real-world problems so that data scientists and analysts can look for meaningful patterns and changes in data. Put simply, statistics lets you derive meaningful insights from data by performing mathematical computations on it. Various statistical functions, principles, and algorithms are applied to analyze raw data, build a statistical model, and infer or predict an outcome. The purpose of this article is to give a broad overview of the fundamentals of statistics that you'll need to start your data science journey.

Data Types

  1. Numerical:

    Data expressed with digits; it is quantifiable. It can be either discrete (a finite or countable number of values) or continuous (infinitely many values within a range).

  2. Categorical:

    Qualitative data grouped into categories. It can be nominal (no inherent order) or ordinal (ordered data).

Measures of Central Tendency

  • Mean: The average of a dataset.

  • Median: The middle value of an ordered dataset; less susceptible to outliers than the mean.

  • Mode: The most common value in a dataset; only relevant for discrete data.
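As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module on a small made-up sample. The values are hypothetical and include one deliberate outlier to show why the median is more robust than the mean.

```python
import statistics

# Hypothetical sample; 95 is a deliberate outlier.
values = [2, 3, 3, 5, 7, 8, 95]

mean = statistics.mean(values)      # arithmetic average, pulled upward by the outlier
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(f"mean={mean:.2f}, median={median}, mode={mode}")
```

Here the mean lands well above most of the data because of the single outlier, while the median stays at the middle observation.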


Measures of Variability

  • Range: The difference between the highest and lowest values in a dataset.

  • Variance (σ²): Measures how spread out a set of data is relative to the mean.

  • Standard Deviation (σ): Another measure of how spread out the values in a dataset are; it is the square root of the variance.

  • Z-score: The number of standard deviations a data point lies from the mean.

  • R-Squared: A statistical measure of fit that indicates how much of the variation in a dependent variable is explained by the independent variable(s); best suited to simple linear regression, since it never decreases when predictors are added.

  • Adjusted R-squared: A modified version of R-squared that has been adjusted for the number of predictors in the model; it increases if a new term improves the model more than would be expected by chance, and decreases otherwise.
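The measures above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the sample is made up, the population formulas (`pvariance`, `pstdev`) are assumed to match the σ notation, and the helper names `z_score`, `r_squared`, and `adjusted_r_squared` are hypothetical.

```python
import statistics

data = [4, 8, 6, 5, 3, 10]  # hypothetical sample

data_range = max(data) - min(data)
mean = statistics.mean(data)
variance = statistics.pvariance(data)  # population variance, sigma squared
std_dev = statistics.pstdev(data)      # population standard deviation, sigma


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for k predictors fitted on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


print(data_range, variance, std_dev, z_score(10, mean, std_dev))
print(adjusted_r_squared(0.9, n=11, k=2))
```

Note that `statistics.variance` and `statistics.stdev` would give the sample versions (dividing by n - 1) instead.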

Measurement of Relationships between Variables

  • Covariance: Measures how two (or more) variables vary together. If it is positive, the variables tend to move in the same direction; if it is negative, they tend to move in opposite directions; and if it is zero, they have no linear relationship.

  • Correlation: Measures the strength of the relationship between two variables and ranges from -1 to 1; it is the normalized version of covariance. As a rule of thumb, a correlation of ±0.7 or beyond indicates a strong relationship between two variables, whereas correlations between -0.3 and 0.3 indicate little to no relationship.
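To make the link between the two measures concrete, here is a small sketch that computes sample covariance by hand and then normalizes it into a Pearson correlation. The function names and the `hours`/`scores` data are hypothetical, chosen only for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


hours = [1, 2, 3, 4, 5]         # hypothetical study hours
scores = [52, 55, 61, 64, 70]   # hypothetical exam scores

print(covariance(hours, scores))   # positive: the variables move together
print(correlation(hours, scores))  # close to 1: a strong positive relationship
```

Because correlation divides out the units of both variables, it can be compared across datasets, whereas the magnitude of a covariance depends on the scale of the data.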

Probability Distribution Functions

  • Probability Density Function (PDF): A function for continuous data whose value at any point can be interpreted as the relative likelihood that the random variable takes a value near that point.

  • Probability Mass Function (PMF): A function for discrete data that gives the probability of a given value occurring.

  • Cumulative Distribution Function (CDF): A function that gives the probability that a random variable is less than or equal to a given value; for continuous data it is the integral of the PDF.
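The three functions can be illustrated with a short sketch. It assumes a fair six-sided die for the discrete case and the standard normal distribution (via `statistics.NormalDist`) for the continuous case; `die_pmf` and `die_cdf` are hypothetical names.

```python
from fractions import Fraction
from statistics import NormalDist

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def die_cdf(x):
    """P(X <= x) for the die: the running sum of the PMF."""
    return sum(p for face, p in die_pmf.items() if face <= x)


standard_normal = NormalDist(mu=0, sigma=1)
density_at_zero = standard_normal.pdf(0)  # a density, not a probability
prob_below_zero = standard_normal.cdf(0)  # P(X <= 0)

print(die_cdf(3), density_at_zero, prob_below_zero)
```

The PDF value at a point is not itself a probability; probabilities for continuous data come from differences of the CDF, for example `standard_normal.cdf(1) - standard_normal.cdf(-1)`.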

Continuous Data Distributions

  • Uniform Distribution: A probability distribution in which all outcomes are equally likely.

  • Normal/Gaussian Distribution: Commonly referred to as the bell curve and closely tied to the central limit theorem; it is defined by its mean and standard deviation, and the standard normal distribution has a mean of 0 and a standard deviation of 1.


  • T-Distribution: A probability distribution used to estimate population parameters when the sample size is small and/or when the population variance is unknown.

  • Chi-Square Distribution: The distribution of the chi-square statistic.

Discrete Data Distributions

  • Poisson Distribution: A probability distribution that expresses the probability of a given number of events occurring within a fixed period of time.

  • Binomial Distribution: The probability distribution of the number of successes in a sequence of n independent trials, each with a Boolean-valued outcome (success with probability p, failure with probability 1-p).
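Both distributions have simple closed-form probability mass functions, sketched below with the standard `math` module. The function names and the example parameters are hypothetical; `lam` stands for the average number of events per interval.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate lam per interval."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(exactly k successes) in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Hypothetical examples: 2 arrivals when 3 are expected; 2 heads in 4 fair coin flips.
print(poisson_pmf(2, lam=3))
print(binomial_pmf(2, n=4, p=0.5))
```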

Moments

Moments describe different aspects of the nature and shape of a distribution. The first moment is the mean, the second moment is the variance, the third moment is the skewness, and the fourth moment is the kurtosis.
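As a rough sketch, the higher moments can be computed directly from their definitions. The code assumes the usual conventions: the variance is the second central moment, and skewness and kurtosis are the third and fourth central moments standardized by the variance. The helper names are hypothetical, and population formulas (dividing by n) are used throughout.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Standardized third central moment; 0 for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Standardized fourth central moment (about 3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


symmetric = [1, 2, 3, 4, 5]    # hypothetical symmetric sample
right_skewed = [1, 1, 1, 10]   # hypothetical sample with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed))
```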

Probability

Conditional Probability [P(A|B)] is the probability of an event occurring, given that another event has already occurred.

Independent Events are events whose outcomes do not influence the probability of each other's outcomes; P(A|B) = P(A).

Mutually Exclusive events are events that cannot occur at the same time; P(A|B) = 0.

Bayes' Theorem: A mathematical formula for determining conditional probability. "The probability of A given B is equal to the probability of B given A times the probability of A over the probability of B".
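The quoted statement of Bayes' theorem translates directly into code. The sketch below applies it to a made-up diagnostic-test scenario; the prevalence and test rates are hypothetical numbers, and `bayes` is an illustrative helper name.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical scenario: A = "has the condition", B = "tests positive".
p_disease = 0.01            # prior P(A)
p_pos_given_disease = 0.99  # P(B|A)
p_pos_given_healthy = 0.05  # false positive rate

# Law of total probability gives P(B).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 4))
```

With these assumed numbers, a positive result still leaves the probability of having the condition at about one in six, because the condition is rare to begin with.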


Accuracy

  • True positive: Detects the condition when the condition is present.

  • True negative: Does not detect the condition when the condition is absent.

  • False positive: Detects the condition when the condition is absent.

  • False negative: Does not detect the condition when the condition is present.

  • Sensitivity: Also known as recall; measures the ability of a test to detect the condition when the condition is present; Sensitivity = TP/(TP+FN)

  • Specificity: Measures the ability of a test to correctly exclude the condition when the condition is absent; Specificity = TN/(TN+FP)

  • Predictive value positive: Also known as precision; the proportion of positive results that correspond to the presence of the condition; PVP = TP/(TP+FP)

  • Predictive value negative: The proportion of negative results that correspond to the absence of the condition; PVN = TN/(TN+FN)
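The four formulas above can be sketched as a single helper that takes the confusion-matrix counts. The function name, dictionary keys, and the counts in the example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four ratios from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "pvp": tp / (tp + fp),          # precision
        "pvn": tn / (tn + fn),
    }


# Hypothetical test results for 100 patients.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```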


Hypothesis Testing and Statistical Significance

  • Null Hypothesis: The hypothesis that sample observations result purely from chance.

  • Alternative Hypothesis: The hypothesis that sample observations are influenced by some non-random cause.

  • P-value: The probability of obtaining results at least as extreme as the observed results of a test, assuming that the null hypothesis is correct; a smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

  • Alpha: The significance level; the probability of rejecting the null hypothesis when it is true, also known as a Type 1 error.

  • Beta: The probability of a Type 2 error, that is, failing to reject a false null hypothesis.

Steps to Hypothesis Testing

  1. State the null and alternative hypotheses

  2. Determine the size of the test (the significance level) and whether it is a one-tailed or two-tailed test

  3. Compute the test statistic and the p-value

  4. Analyze the results and either reject or fail to reject the null hypothesis
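The four steps can be sketched with a two-tailed one-sample z-test, chosen here only because it needs nothing beyond the standard library; the article does not prescribe a particular test. The function name, the default `alpha` of 0.05, and the sample figures are assumptions for illustration, and the test itself assumes the population standard deviation is known.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n, alpha=0.05):
    """Two-tailed z-test of H0: the true mean equals pop_mean."""
    # Step 3: the test statistic and its p-value under the null hypothesis.
    z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 4: reject H0 when the p-value falls below the significance level.
    return z, p_value, p_value < alpha


# Steps 1 and 2 (hypothetical): H0 says the mean is 100, H1 says it is not;
# the test is two-tailed with alpha = 0.05.
z, p_value, reject = one_sample_z_test(sample_mean=103, pop_mean=100, pop_std=15, n=100)
print(f"z={z:.2f}, p={p_value:.4f}, reject H0: {reject}")
```

For small samples or an unknown population variance, the t-distribution would normally replace the normal distribution in step 3.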

Translated from: https://www.includehelp.com/data-science/statistics.aspx
