Hotelling-T² based variable selection in partial least squares (PLS)

Background

One of the most common challenges in modeling spectroscopic data is selecting, out of a large number of variables (i.e., wavelengths), a subset that is associated with the response variable. Spectroscopic data typically have many more variables than observations. In such a situation, selecting a smaller number of variables is crucial, especially if we want to reduce computation time and improve the model's stability and interpretability. Typically, variable selection methods are classified into two groups:

• Filter-based methods: the most relevant variables are selected in a preprocessing step, independently of the prediction model.
• Wrapper-based methods: variables are selected using the supervised learning model itself.

Hence, any PLS-based variable selection is a wrapper method. Wrapper methods need a selection criterion that relies solely on the characteristics of the data at hand.

Method

Let us consider a regression problem in which the relation between the response variable y (n × 1) and the predictor matrix X (n × p) is assumed to be explained by the linear model y = Xβ, where β (p × 1) is the vector of regression coefficients. Our dataset comprises n = 466 observations from various plant materials, and y corresponds to the concentration of calcium (Ca) in each plant. The matrix X contains our measured LIBS (laser-induced breakdown spectroscopy) spectra, which include p = 7151 wavelength variables. Our objective is therefore to find column subsets of X with satisfactory predictive power for the Ca content.

ROBPCA modeling

Let's first perform robust principal component analysis (ROBPCA) to help visualize our data and detect any unusual structure or pattern. The resulting scores are shown in the scatterplot below, in which the ellipses represent the 95% and 99% confidence regions from Hotelling's T². Most observations fall within the 95% ellipse, although some seem to cluster in the top-right corner of the scores scatterplot.

Figure: ROBPCA scores scatterplot.
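
The confidence ellipses can be reproduced from the scores alone. The following is a minimal sketch, not the article's own code: it uses classical PCA via SVD as a stand-in for ROBPCA (no robust estimator is applied), synthetic data instead of the LIBS spectra, and one common F-based form of the T² limit, all of which are assumptions.

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(scores):
    """T² of each observation from mean-centered, uncorrelated scores."""
    var = scores.var(axis=0, ddof=1)
    return np.sum(scores**2 / var, axis=1)

def t2_limit(n, k, alpha):
    """F-based T² limit for k components and n observations (one common form)."""
    return k * (n - 1) / (n - k) * f.ppf(1 - alpha, k, n - k)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # synthetic stand-in for the spectra
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]                  # scores on the first two components

t2 = hotelling_t2(scores)
lim95 = t2_limit(len(X), 2, 0.05)
lim99 = t2_limit(len(X), 2, 0.01)
inside95 = np.mean(t2 <= lim95)

# Semi-axes of the 95% ellipse along each component, useful for plotting
semi_axes95 = np.sqrt(lim95 * scores.var(axis=0, ddof=1))
print(f"share of observations inside the 95% ellipse: {inside95:.2f}")
```

An observation lies outside an ellipse exactly when its T² exceeds the corresponding limit, so the plot and the numeric test carry the same information.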

However, when looking more closely, for instance using the outlier map, we can see that ultimately there are only three observations that seem to pose a problem. We have two observations flagged as orthogonal outliers and only one as a bad leverage point. Some observations are flagged as good leverage points, whilst most are regular observations.

Figure: ROBPCA outlier map.
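
The four categories of the outlier map follow from two distances per observation: the score distance (within the PCA subspace) and the orthogonal distance (from the observation to that subspace). The sketch below only illustrates the labeling rule; the distances and cutoff values are hypothetical, and the function name is not taken from any library.

```python
import numpy as np

def classify_outlier_map(sd, od, sd_cut, od_cut):
    """Label observations from score distance (sd) and orthogonal distance (od)."""
    sd, od = np.asarray(sd), np.asarray(od)
    labels = np.full(sd.shape, "regular", dtype=object)
    labels[(sd > sd_cut) & (od <= od_cut)] = "good leverage"
    labels[(sd <= sd_cut) & (od > od_cut)] = "orthogonal outlier"
    labels[(sd > sd_cut) & (od > od_cut)] = "bad leverage"
    return labels

# Hypothetical distances and cutoffs, for illustration only
sd = [0.8, 3.5, 1.1, 4.2]
od = [0.5, 0.7, 2.9, 3.3]
labels = classify_outlier_map(sd, od, sd_cut=2.7, od_cut=2.0)
print(list(labels))
# ['regular', 'good leverage', 'orthogonal outlier', 'bad leverage']
```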

PLS modeling

It is worth mentioning that in our regression problem, ordinary least squares (OLS) fitting is not an option since n ≪ p. PLS resolves this by searching for a small set of so-called latent variables (LVs) that perform a simultaneous decomposition of X and y, with the constraint that these components explain as much as possible of the covariance between X and y. The figures below show the results obtained from the PLS model. We obtained an R² of 0.85, with an RMSE and MAE of 0.08 and 0.06, respectively, which corresponds to a mean absolute percentage error (MAPE) of approximately 7%.

Figure: Observed vs. predicted plot (full dataset).

Figure: Residual plot (full dataset).
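
To make the latent-variable idea concrete, here is a minimal sketch of PLS with a single response (PLS1) using the NIPALS algorithm, followed by the four metrics quoted above. It is an illustration under assumptions: the data are synthetic with n < p, the function name `pls1_nipals` is hypothetical, and the article does not say which PLS implementation was actually used.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Fit PLS1 with NIPALS; returns loading weights W, loadings P, beta, intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    p = X.shape[1]
    W = np.zeros((p, n_components))
    P = np.zeros((p, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = Xk.T @ yk                      # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                         # scores of the a-th latent variable
        tt = t @ t
        P[:, a] = Xk.T @ t / tt            # X loadings
        q[a] = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, P[:, a])     # deflate X
        yk = yk - q[a] * t                 # deflate y
        W[:, a] = w
    beta = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ beta
    return W, P, beta, intercept

# Synthetic data with three underlying latent variables and n < p
rng = np.random.default_rng(1)
n, p, A = 50, 300, 3
T_true = rng.normal(size=(n, A))
X = T_true @ rng.normal(size=(A, p)) + 0.01 * rng.normal(size=(n, p))
y = 10 + T_true @ np.array([0.5, 1.0, 1.5]) + 0.01 * rng.normal(size=n)

W, P, beta, intercept = pls1_nipals(X, y, n_components=A)
resid = y - (X @ beta + intercept)
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
rmse = np.sqrt(np.mean(resid**2))
mae = np.mean(np.abs(resid))
mape = 100 * np.mean(np.abs(resid / y))
print(f"R2={r2:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.2f}%")
```

These are in-sample metrics on toy data; in practice the number of LVs would be chosen by cross-validation. The matrix `W` returned here is the loading weight matrix that the variable selection step operates on.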

Similarly to the ROBPCA outlier map, the PLS residual plot flags three observations that exhibit high standardized residuals. Another way to check for outliers is to calculate Q-residuals and Hotelling's T² from the PLS model, then define a criterion for deciding whether an observation is an outlier. A high Q-residual corresponds to an observation that is not well explained by the model, while a high Hotelling's T² indicates an observation that is far from the center of the regular observations (i.e., score = 0). The results are plotted below.

Figure: Q residuals vs. Hotelling's T² plot (full dataset).
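
Both statistics can be computed from the scores T and loadings P of any bilinear model X ≈ T Pᵀ with uncorrelated scores. The sketch below is an assumption-laden illustration: it uses PCA scores from an SVD on synthetic data rather than the article's PLS model, and the empirical 95th-percentile cutoffs are a hypothetical choice, since the article does not state its criterion.

```python
import numpy as np

def q_and_t2(Xc, T, P):
    """Q residuals and Hotelling's T² for a bilinear model Xc ≈ T Pᵀ."""
    E = Xc - T @ P.T                              # part of X not explained by the model
    q_res = np.sum(E**2, axis=1)                  # squared residual norm per observation
    t2 = np.sum(T**2 / T.var(axis=0, ddof=1), axis=1)
    return q_res, t2

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 20))
X += 0.05 * rng.normal(size=(100, 20))
X[0] += 2.0 * rng.normal(size=20)                 # push one observation off the model plane

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T, P = U[:, :2] * s[:2], Vt[:2].T

q_res, t2 = q_and_t2(Xc, T, P)
q_flag = q_res > np.percentile(q_res, 95)         # hypothetical empirical cutoff
t2_flag = t2 > np.percentile(t2, 95)              # hypothetical empirical cutoff
print("largest Q residual at observation", int(np.argmax(q_res)))
```

The perturbed observation stands out in Q because its extra variation lies outside the two-component model, which is the situation a high Q-residual is meant to reveal.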

Hotelling-T² based variable selection

Let’s now perform variable selection from our PLS model, which is carried out by computing the T² statistic (for more details see Mehmood, 2016),

[Equation: T² statistic]

where W is the loading weight matrix and C is the covariance matrix. Thus, a variable is selected based on the following criterion,

[Equation: selection criterion]

where A is the number of LVs in our PLS model, and 1 − α is the confidence level (with α equal to 0.05 or 0.01) of the F-distribution.
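
For reference, a form of the statistic and of the threshold that is consistent with the symbols defined above can be written as follows. This is an assumption based on the usual Hotelling's T² (a Mahalanobis distance of each row w_i of W from the mean row w̄, with C taken as the covariance matrix of W) and a standard F-based limit; Mehmood (2016) should be consulted for the exact definition.

```latex
T_i^2 = (\mathbf{w}_i - \bar{\mathbf{w}})^{\top} \mathbf{C}^{-1} (\mathbf{w}_i - \bar{\mathbf{w}}),
\qquad i = 1, \dots, p

T_i^2 > \frac{A\,(p - 1)}{p - A}\; F_{A,\; p - A,\; 1 - \alpha}
```

Under this reading, a wavelength is kept when its loading weights across the A latent variables lie unusually far from the bulk of the other wavelengths.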

Thus, from 7151 variables in our original dataset, only 217 were selected based on the aforementioned selection criterion. The observed vs. predicted plot is displayed below along with the model’s R² and RMSE.

Figure: Observed vs. predicted plot (selected variables).
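
The selection step itself can be sketched in a few lines. The code below assumes that T² is the Mahalanobis distance of each row of W from the mean row and that the threshold is A(p − 1)/(p − A) times the F quantile; both are assumptions about the method's exact form, the function name is hypothetical, and W is synthetic rather than taken from the article's model.

```python
import numpy as np
from scipy.stats import f

def t2_variable_selection(W, alpha=0.05):
    """Select variables whose loading-weight row has T² above an F-based limit."""
    p, A = W.shape
    D = W - W.mean(axis=0)                           # center the rows of W
    C = np.cov(W, rowvar=False).reshape(A, A)        # covariance of the loading weights
    t2 = np.einsum("ia,ab,ib->i", D, np.linalg.inv(C), D)
    limit = A * (p - 1) / (p - A) * f.ppf(1 - alpha, A, p - A)
    return np.flatnonzero(t2 > limit), t2, limit

# Synthetic loading weights: 500 variables, 3 LVs, the first 10 made influential
rng = np.random.default_rng(3)
W = rng.normal(size=(500, 3))
W[:10] += np.array([8.0, -8.0, 8.0])

selected, t2, limit = t2_variable_selection(W, alpha=0.05)
print(f"{selected.size} of {W.shape[0]} variables selected (limit = {limit:.2f})")
```

After selection, the PLS model would be refitted on `X[:, selected]` only; a smaller α gives a stricter threshold and fewer selected variables.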

In the results below, the three observations that were flagged as outliers were removed from the dataset. The mean absolute percentage error is 6%.

Figure: Observed vs. predicted plot (selected variables, outliers removed).

Figure: Residual plot (selected variables, outliers removed).

Summary

In this article, we successfully performed Hotelling-T² based variable selection using partial least squares. We obtained a large reduction (about 97%, from 7151 to 217) in the number of variables compared to the model built on the full dataset.

Translated from: https://towardsdatascience.com/hotelling-t%C2%B2-based-variable-selection-in-partial-least-square-pls-165880272363
