97%. That’s the percentage of data that sits unused by organizations according to Gartner, making up so-called “dark data”.
Data has overtaken oil as the world’s most valuable resource, but nearly all of it is still unused by organizations. Gartner estimates that 87% of organizations have “low BI and analytics maturity”.
A possible explanation for this paradox is that not all data are created equal. The value of data can vary drastically from one organization to another, and within the same organization from one project to another.
In order to make the required investment in business intelligence and analytics, organizations should be able to anticipate, with reasonable accuracy, the business impact of doing so, and such an investment should be expected to yield a high enough ROI.
Data Cannot Become the New Oil Without The Machine Learning Equivalent of Exploration Geophysics
Assuming that data is always valuable and attempting to extract that value by trial and error as part of an AI project, whether or not it is driven by an AutoML platform, is wasteful at best and can have a negative ROI in the worst-case scenario.
To continue the oil analogy, this would be equivalent to assuming that there is always oil in the ground wherever one looks, and that the only factor driving the amount of oil extracted from the ground is the extraction technique one uses.
Over the years, an entire field of study at the intersection of geophysics and economics, namely exploration geophysics, has been devoted to reducing the business risk in oil production. Exploration geophysics relies on inductive reasoning (as opposed to deductive reasoning) to detect the presence of and to estimate the amount of valuable geological deposits at a given location, without incurring the upfront cost and risk of building an exploitation site.
Similarly, in order to reduce the business risk in investing in AI projects, it is crucial to develop inductive reasoning methods for quantifying the value of data, prior to and independently from doing any predictive modeling, a phase we refer to as pre-learning.
Whereas the deductive approach for valuing a dataset would consist of first allocating resources to put the dataset to use and then monitoring the business impact for a while, the inductive approach consists of using mathematical reasoning to infer from the dataset of interest the highest performance that can be achieved by any predictive model, inexpensively, and without having to train any predictive model.
In this article, we summarize the theoretical foundations that enable pre-learning, and we illustrate how to quantify the juice in various datasets using the open-source kxy python package.
What is juice to start with?
The juice in data refers to the amount of information in data that is useful for solving a particular problem at hand.
The same way the juice in an orange (or the oil in the ground) exists in its own right, whether or not and however it is extracted, it is important to realize that every dataset can conceptually be thought of as the combination of a (possibly empty) part that is useful for solving the problem at hand, and a useless remainder.
In this respect, two points are worth stressing. First, what’s useful (resp. useless) is problem-specific. The part of a dataset that is useless for solving a specific problem can be useful for solving another problem.
Second, what’s useful in a dataset for solving a given problem is not tied to a specific method for solving the problem. The same way the total amount of juice contained in a given orange is the maximum amount of liquid that can be extracted out of the orange, irrespective of how it is squeezed, the total amount of juice in a dataset is the maximum amount of utility that can be extracted out of the dataset to solve a specific problem, no matter what machine learning model is used to solve the problem.
To lay out the foundations of pre-learning, we need to formalize what we mean by ‘problem’ and ‘useful’.
The problems we are concerned with are classification and regression problems, without restriction on the type of inputs or outputs. Specifically, we consider predicting a business outcome y using a vector of inputs x. The choice of y is intrinsic to the business problem we are interested in solving, while inputs x represent the dataset we are considering using to solve our problem.
As usual, we represent our uncertainty about the values of y and x by modeling them as random variables.¹ Saying that our dataset is useful for solving the problem of interest is equivalent to saying that inputs x are informative about the label/output y.
Fortunately, notions of informativeness and association are fully formalized by Information Theory. You’ll find a gentle primer on information theory here. For the purpose of this article, if we denote by h(z) the entropy of a random variable z (differential entropy for continuous variables, Shannon entropy for categorical ones), it suffices to recall that the canonical measure of how informative x is about y is their mutual information, defined as
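$$I(y; x) = h(y) + h(x) - h(y, x) = h(y) - h(y \mid x).$$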
Some Key Properties of Mutual Information
The mutual information I(y; x) is well defined whether the output is categorical or continuous and whether inputs are continuous, categorical, or a combination of both. For some background reading on why this is the case, check out this primer and references therein.
It is always non-negative and it is equal to 0 if and only if y and x are statistically independent (i.e. there is no relationship whatsoever between y and x).
Additionally, the mutual information is invariant under lossless feature transformations. Indeed, if f and g are two one-to-one maps, then
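$$I\big(f(y); g(x)\big) = I(y; x).$$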
A more general result, known as the Data Processing Inequality, states that transformations applied to x can only reduce its mutual information with y. Specifically,
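$$I\big(y; f(x)\big) \le I(y; x),$$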
and the equality holds when f is either a one-to-one map, or y and x are statistically independent given f(x)
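$$I\big(y; x \mid f(x)\big) = 0,$$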
meaning that all the information about y that is contained in x is fully reflected in f(x) or, said differently, the transformation f preserves all the juice despite being lossy.
Thus, when mutual information is used as a proxy for quantifying the amount of juice in a dataset, effective feature engineering neither decreases nor increases the amount of juice, which is fairly intuitive. Feature engineering merely turns inputs and/or outputs into a representation that makes it easier to train a specific machine learning model.
From Mutual Information to Maximum Achievable Performance
Although it captures the essence of the amount of juice in a dataset, a mutual information value, typically expressed in bits or nats, means little to a business analyst or decision-maker.
Fortunately, it can be used to calculate the highest performance that can be achieved when using x to predict y, for a variety of performance metrics (R², RMSE, classification accuracy, log-likelihood per sample, etc.), which in turn can be translated into business outcomes. We provide a brief summary below, but you can find out more here.
We consider a predictive model M with predictive probability distribution p_M(y|x), where f(x) denotes the model’s prediction for the output associated with inputs x. The model is a classifier when y is categorical and a regression model when y is continuous.
Maximum Achievable R²
In an additive regression model
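$$y = f(x) + \epsilon,$$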
the population version of the R² is defined as
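$$R^2 = 1 - \frac{\mathrm{Var}\big(y - f(x)\big)}{\mathrm{Var}(y)}.$$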
The ratio in the formula above represents the fraction of variance of the output that cannot be explained using inputs, under our model.
Although variance is as good a measure of uncertainty as it gets for Gaussian distributions, it is a weak measure of uncertainty for other distributions, unlike the entropy.
Considering that the entropy (in nats) has the same unit/scale as the (natural) logarithm of the standard deviation, we generalize the R² as follows:
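$$R^2_{\mathrm{IA}} = 1 - e^{2\left(h(y \mid f(x)) - h(y)\right)} = 1 - e^{-2 I\left(y; f(x)\right)},$$

obtained by replacing each variance in the ratio above with its entropy-based analogue e^{2h}.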
Note that when (y, f(x)) is jointly Gaussian (e.g. Gaussian Process regression, including linear regression with Gaussian additive noise), the information-adjusted R² above is identical to the original R².
More generally, this information-adjusted R² applies to both regression and classification, with continuous inputs, categorical inputs, or a combination of both.
A direct application of the data processing inequality gives us the maximum R² any model predicting y with x can achieve:
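$$\bar{R}^2 = 1 - e^{-2 I(y; x)}.$$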
It is important to stress that this optimal R² is not just an upper bound; it is achieved by any model whose predictive distribution is the true (data generating) conditional distribution p(y|x).
Minimum Achievable RMSE
The population version of the Root Mean Square Error of the model above reads
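$$\mathrm{RMSE} = \sqrt{\mathbb{E}\left[\big(y - f(x)\big)^2\right]}.$$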
In the same vein, we may define its information-adjusted generalization as
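$$\mathrm{RMSE}_{\mathrm{IA}} = \frac{e^{h(y \mid f(x))}}{\sqrt{2\pi e}} = \frac{e^{h(y) - I(y; f(x))}}{\sqrt{2\pi e}},$$

which coincides with the usual RMSE when (y, f(x)) is jointly Gaussian.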
A direct application of the data processing inequality gives us the smallest RMSE any model using x to predict y can achieve:
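$$\overline{\mathrm{RMSE}} = \frac{e^{h(y) - I(y; x)}}{\sqrt{2\pi e}}.$$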
Maximum Achievable True Log-Likelihood
Similarly, the sample log-likelihood per observation of our model can be defined as
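$$\widehat{\mathrm{LL}}(M) = \frac{1}{n} \sum_{i=1}^{n} \log p_M\left(y_i \mid x_i\right),$$

where n is the number of observations.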
Its population equivalent, which we refer to as the true log-likelihood per observation, namely
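$$\mathrm{LL}(M) = \mathbb{E}\left[\log p_M\left(y \mid x\right)\right],$$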
satisfies the inequality
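$$\mathrm{LL}(M) \le -h(y) + I(y; x) = -h(y \mid x).$$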
This inequality stems from the Gibbs and data processing inequalities. See here for more details.
Note that the inequality above holds true for both regression and classification problems, and that the upper bound is achieved by a model using as its predictive distribution the true conditional p(y|x).
The term -h(y) represents the best true log-likelihood per observation that one can achieve without any data, and can be regarded as a naive log-likelihood benchmark, while the mutual information term I(y; x) represents the boost that can be attributed to our data.
Maximum Achievable Classification Accuracy
In a classification problem where the output y can take up to q distinct values, it is also possible to express the highest classification accuracy that can be achieved by a model using x to predict y.
Let us consider the function
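$$D_q(a) = -a \log a - (1 - a) \log \frac{1 - a}{q - 1}, \qquad a \in \left[\frac{1}{q}, 1\right],$$

which, in the notation we use here, is the entropy of a distribution over q outcomes whose most likely outcome has probability a, the remaining mass being spread evenly over the other q - 1 outcomes; D_q is decreasing on this interval.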
For a given entropy value h, the best accuracy that can be achieved when predicting the outcome of a discrete distribution that takes q distinct values and has entropy h is given by
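$$D_q^{-1}(h),$$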
where the function D_q⁻¹ is the inverse of D_q on [1/q, 1] (where D_q is decreasing) and is easily evaluated numerically. You can find more details here.
The figure below provides an illustration of the function above for various q.
More generally, the accuracy that can be achieved by a classification model M using x to predict a categorical output y taking q distinct values satisfies the inequality:
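$$\mathrm{Accuracy}(M) \le D_q^{-1}\big(h(y) - I(y; x)\big).$$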
The entropy term h(y) reflects the accuracy of the naive strategy consisting of always predicting the most frequent outcome, namely
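$$D_q^{-1}\big(h(y)\big),$$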
whereas the mutual information term I(y; x) accounts for the maximum amount of additional insights we can derive from our data.
To sum up: in both classification and regression problems, the highest value achievable by virtually any population-based performance metric can be expressed as a function of the mutual information I(y; x) under the true data-generating distribution and, when the naive (inputs-less) predictive strategy does not have a null performance, a measure of the variability of the output (such as its entropy h(y), variance, or standard deviation).
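As a concrete illustration, here is a minimal Python sketch (our own, not part of the kxy package) that turns an estimate of the mutual information I(y; x), in nats, and the output entropy h(y) into the achievable-performance bounds above; all function names are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def max_r_squared(mi):
    """Highest achievable R^2 given the mutual information I(y; x), in nats."""
    return 1.0 - np.exp(-2.0 * mi)

def min_rmse(mi, h_y):
    """Smallest achievable RMSE given I(y; x) and the output entropy h(y), in nats."""
    return np.exp(h_y - mi) / np.sqrt(2.0 * np.pi * np.e)

def mode_entropy(a, q):
    """Entropy (nats) of a q-ary distribution whose mode has probability a and
    whose remaining mass is spread evenly over the other q - 1 outcomes."""
    h = -a * np.log(a)
    if a < 1.0:
        h -= (1.0 - a) * np.log((1.0 - a) / (q - 1))
    return h

def max_accuracy(mi, h_y, q):
    """Highest achievable accuracy for a q-class problem: invert mode_entropy at
    the residual entropy h(y) - I(y; x), on [1/q, 1] where it is decreasing."""
    h_residual = max(h_y - mi, 0.0)
    if h_residual < 1e-10:
        return 1.0
    return brentq(lambda a: mode_entropy(a, q) - h_residual, 1.0 / q, 1.0 - 1e-12)

# Example: a balanced 3-class problem (h(y) = log 3) with inputs worth 0.6 nats.
print(max_accuracy(mi=0.6, h_y=np.log(3), q=3))                 # ~0.86
print(max_r_squared(mi=0.6))                                    # ~0.70
print(min_rmse(mi=0.6, h_y=0.5 * np.log(2 * np.pi * np.e)))     # standard-normal output: ~0.55
```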
Model-Free Mutual Information Estimation
Finally, we can address the elephant in the room. Clearly, it all boils down to estimating the mutual information I(y; x) under the true data-generating distribution. However, we do not know the true joint distribution of (y, x). If we knew it, we would already have access to the best possible predictive model, namely the model whose predictive distribution is the true conditional p(y|x)!
Fortunately, we do not need to know or learn the true joint distribution (y, x); this is where the inductive reasoning approach we previously mentioned comes into play.
The inductive approach we adopt consists of measuring a wide enough range of properties of the data that indirectly reveal the structure/patterns therein, and of inferring the mutual information that is consistent with the observed properties, without making any additional arbitrary assumptions. The more expressive the properties we observe empirically, the more structure we will capture in the data, and the closer our estimate will be to the true mutual information.
To do so effectively, we rely on a few tricks.
Trick I: Working in the copula-uniform dual space.
First, we recall that the mutual information between y and x is invariant under one-to-one maps and, in particular, when y and x are ordinal, the mutual information between y and x is equal to the mutual information between their copula-uniform dual representations, defined by applying the probability integral transform to each coordinate.
Instead of estimating the mutual information between y and x directly, we estimate the mutual information between their copula-uniform dual representations; we refer to this as working in the copula-uniform dual space.
This allows us to completely bypass marginal distributions, and perform inference in a unit/scale/representation-free fashion — in the dual space, all marginal distributions are uniform on [0,1]!
Trick II: Revealing patterns through pairwise Spearman rank correlations.
We reveal structures in our data by estimating all pairwise Spearman rank correlations in the primal space, defined for two ordinal scalars x and y as
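$$\hat{\rho}_s(x, y) = \mathrm{Corr}\big(\mathrm{rank}(x), \mathrm{rank}(y)\big) = 1 - \frac{6 \sum_{i=1}^{n} \big(\mathrm{rank}(x_i) - \mathrm{rank}(y_i)\big)^2}{n (n^2 - 1)},$$

the second equality holding in the absence of ties.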
It measures the propensity for two variables to be monotonically related. Its population version is solely a function of the copula-uniform dual representations u and v of x and y and reads
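$$\rho_s(x, y) = 12\, \mathbb{E}[u v] - 3 = 12\, \mathrm{Cov}(u, v).$$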
In other words, using Spearman’s rank correlation, we may work in the dual space (a.k.a. Trick I) while efficiently estimating properties of interest in the primal space.
Without this trick, we would need to estimate marginal CDFs and explicitly apply the probability integral transform to inputs to be able to work in the dual space, which would defeat the purpose of the first trick.
After all, given that the mutual information between two variables does not depend on their marginal distributions, it would be a shame to have to estimate marginal distributions to calculate it.
Trick III: Expanding the input space to capture non-monotonic patterns.
Pairwise Spearman rank correlations fully capture patterns of the type ‘the output decreases/increases with certain inputs’ for regression problems, and ‘we can tell whether a bit of the encoded output is 0 or 1 based on whether certain inputs take large or small values’ for classification problems.
To capture patterns beyond these types of monotonic associations, we need to resort to another trick. We note that, for any function f (injective or not), we have
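$$I\big(y; x, f(x)\big) = I(y; x),$$

since the map x ↦ (x, f(x)) is always one-to-one, whatever f is.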
Thus, instead of estimating I(y; x), we may equivalently estimate I(y; x, f(x)).
While pairwise Spearman rank correlations between y and x reveal monotonic relationships between y and x, f can be chosen so that pairwise Spearman correlations between y and f(x) capture a range of additional non-monotonic relationships between y and x.
A good example of f is the function
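$$f(x) = \lvert x - m \rvert,$$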
where m can be chosen to be the sample mean, median, or mode.
Indeed, if y = x² for some mean zero and skew zero random variable x, then the Spearman rank correlation between y and x, which can be found to be 0 by symmetry, fails to reveal the structure in the data. On the other hand, the Spearman rank correlation between y and f(x) (with m=0), which is 1, better reflects how informative x is about y.
This choice of f captures patterns of the type ‘the output tends to decrease/increase as an input departs from a canonical value’. Many more types of patterns can be captured by using the same trick, including periodicity/seasonality, etc.
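To make this point concrete, here is a small numerical check of the y = x² example above (our own sketch, using numpy and scipy):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # symmetric around m = 0
y = x ** 2                     # non-monotonic relationship between y and x

# The Spearman rank correlation between y and x is ~0 and misses the structure...
rho_raw, _ = spearmanr(y, x)
print(round(rho_raw, 3))       # ~0.0

# ...whereas with f(x) = |x - 0| it is 1, revealing how informative x is about y.
rho_abs, _ = spearmanr(y, np.abs(x))
print(round(rho_abs, 3))       # 1.0
```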
Trick IV: Putting everything together using the maximum entropy principle as a way of avoiding arbitrary assumptions.
To summarize, we define z=(x, f(x)), and we estimate the Spearman rank auto-correlation matrix of the vector (y, z), namely S(y, z).
We then use as the density of the copula-uniform representation of (y, z) the copula density that, among all copula densities matching the patterns observed through the Spearman rank auto-correlation matrix S(y, z), has the highest entropy — i.e. the most uncertain about every pattern but the patterns observed through S(y, z).
Assuming (y, z) is d-dimensional, the resulting variational optimization problem reads:
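$$\max_{c} \; -\int_{[0,1]^d} c(u) \log c(u)\, du \qquad \text{subject to} \qquad 12 \int_{[0,1]^d} u_i\, u_j\, c(u)\, du - 3 = S_{ij}(y, z) \;\; \text{for all } i \neq j,$$

where the maximization is over copula densities c on [0, 1]^d and each constraint restates one of the observed Spearman rank correlations using the population formula given earlier.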
We then use the learned joint pdf to estimate the needed mutual information as
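$$\hat{I}(y; x) = \hat{I}(y; z) = \hat{h}(u_z) - \hat{h}(u_y, u_z),$$

where u_y and u_z denote the copula-uniform dual representations of y and z, both entropies are evaluated under the learned maximum-entropy copula density, and the term h(u_y) drops out because u_y is uniform on [0, 1].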
Making Sense of The Maximum-Entropy Variational Problem
In the absence of the Spearman rank correlation constraints, the solution to the maximum-entropy problem above is the pdf of the standard uniform distribution, which corresponds to assuming that y and x are statistically independent, and have 0 mutual information. This makes intuitive sense, as we have no reason to believe x is informative about y until we gather empirical evidence.
When we observe S(y, z), the new solution to the variational maximum-entropy problem deviates from the uniform distribution just enough to reflect the patterns captured by S(y, z). Hence, our approach should not be expected to overestimate the true mutual information I(y; x).
Additionally, so long as S(y, z) is expressive enough, which we can control through the choice of the function f, all types of patterns will be reflected in S(y, z), and our estimated mutual information should not be expected to underestimate the true mutual information I(y; x).²
Application To An Ongoing Kaggle Competition
We built the KxY platform, and the accompanying open-source python package to help organizations of all sizes cut down the risk and cost of their AI projects by focusing on high ROI projects and experiments. Of particular interest is the implementation of the approach described in this post to quantify the amount of juice in your data.
The kxy package can be installed from PyPi (pip install kxy) or GitHub, and is also accessible through our pre-configured Docker image on DockerHub; for more details on how to get started, read this. Once installed, the kxy package requires an API key to run. You can request one by filling out our contact form, or by emailing us at demo@kxy.ai.
We use the House Price Advanced Regression Techniques Kaggle competition as an example. The problem consists of predicting house sale prices using a comprehensive list of 79 explanatory variables of which 43 are categorical and 36 are ordinal.
We find that when the one-hot encoding method is used to represent categorical variables, a near-perfect prediction can be achieved.
In fact, when we select explanatory variables one at a time in a greedy fashion, always choosing the variable that generates the highest amount of incremental juice among the ones that haven’t yet been selected, we find that with only 17 out of the 79 variables, we can achieve near-perfect prediction — in the R² sense.
The figure below illustrates the result of the full variable selection analysis. Interestingly, the top of the current Kaggle leaderboard has managed to generate an RMSE of 0.00044, which is somewhere between the best that can be done using the top 15 variables, and the best that can be done using the top 16 variables.
You’ll find the code used to generate the results above as a Jupyter notebook here.
Footnotes:
[¹] It is assumed that observations (inputs, output) can be regarded as independent draws from the same random variable (y, x). In particular, the problem should either not exhibit any temporal dependency, or observations (inputs, output) should be regarded as samples from a stationary and ergodic time series, in which case your sample size should be long enough to span several multiples of the system’s memory.
[²] The requirements of the previous footnote have very practical implications. The whole inference pipeline relies on the accurate estimation of the Spearman rank correlation matrix S(y, z). When observations can be regarded as i.i.d. samples, reliable estimation of S(y, z) only requires a small sample size and is indicative of the true latent phenomenon. On the other hand, when observations exhibit temporal dependency, if estimating S(y, z) using disjoint subsets of our data yields very different values, then the time series is either not stationary and ergodic or it is stationary and ergodic but we do not have a long enough history to characterize S(y, z); either way, the estimated S(y, z) will not accurately characterize the true latent phenomenon and the analysis should not be applied.
Source: https://towardsdatascience.com/how-much-juice-is-there-in-your-data-d3e76393ca9d