Detecting Anomalies Using Machine Learning

What is Anomaly Detection?

Anomaly detection is a classic, frequently explored problem in machine learning. An anomaly is any unusual sequence or pattern inside a large corpus of data. Left unresolved, anomalies usually cause unexpected and complex errors or inefficiencies. Searching a corpus for them by hand might be easy when the corpus is relatively small, but that approach becomes unreasonable once the corpus grows to an enormous size. For example, finding a grammatical mistake in a 200-word paragraph is pretty easy, but imagine trying to find every grammatical error in a 5,000-page encyclopedia. The problem becomes much more difficult for humans. Fortunately, with the help of machine learning, we can solve this problem much more easily (kind of).

First of all, what is machine learning? Machine learning is essentially using statistics to build a model of how a system (or corpus) normally behaves, trained on a training set (the background data set). Afterwards, we can compare a system suspected of behaving abnormally (the target data set) against that model of normal behavior and try to uncover anomalies in the target. Although the main idea sounds easy and intuitive, the process comes with many complexities, such as finding a background data set that is representative of the whole population and distributing the computation across machines for large data sets. These are difficult obstacles that software engineers have to tackle before producing a polished machine learning model, but I will not cover them here. Instead, I will focus on applying machine learning to find anomalies.
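To make the background/target idea concrete, here is a minimal sketch in Python. It assumes the simplest possible "model" of normal behavior, the mean and standard deviation of a numeric background set, and flags target values by z-score. The function names, the sample numbers, and the threshold of 3 standard deviations are all illustrative assumptions, not something the original article prescribes.

```python
from statistics import mean, stdev

def fit_normal_model(background):
    """Summarise normal behaviour as (mean, standard deviation) of the background set."""
    return mean(background), stdev(background)

def find_anomalies(target, model, z_threshold=3.0):
    """Return (index, value) pairs whose z-score against the model exceeds z_threshold."""
    mu, sigma = model
    if sigma == 0:
        # A constant background gives no scale to measure deviation against.
        raise ValueError("background has zero variance")
    return [(i, x) for i, x in enumerate(target) if abs(x - mu) / sigma > z_threshold]

# Hypothetical data: the background describes normal readings, the target is what we inspect.
background = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]
target = [10.0, 9.9, 14.5, 10.2]

model = fit_normal_model(background)
anomalies = find_anomalies(target, model)
print(anomalies)  # [(2, 14.5)]
```

Real systems usually need a richer model than a single mean and variance, but the two-step shape stays the same: fit on the background, then score the target.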

Types of Anomaly Detection Problems

Structured Anomalies in a Known Corpus of Data

There are four main types of anomaly detection problems. The first (and also easiest) type is detecting structured anomalies in a known corpus. These are problems where you know what the structure of the anomalies will be and you know the format of the corpus. As a simplified analogy, suppose the corpus is a string of strictly increasing numbers and the task is to detect any number that is smaller than the one before it. In this example, we know the pattern of normal behavior (strictly increasing numbers) and we are looking for a known anomaly (a decrease between adjacent numbers). This problem is relatively easy because we have a clear structure to compare against, so we can measure and know for sure when something is an anomaly. In this case, it is relatively easy to build a high-performance machine learning algorithm with negligible false negatives.
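The increasing-numbers analogy is simple enough to write down directly. The sketch below is only an illustration of that analogy (the function name and sample corpus are made up here); it returns the positions where a value is smaller than its predecessor.

```python
def find_decreases(sequence):
    """Return the indices whose value is smaller than the value just before it."""
    return [i for i in range(1, len(sequence)) if sequence[i] < sequence[i - 1]]

# Hypothetical corpus: mostly increasing, with two decreases planted in it.
corpus = [1, 2, 3, 7, 5, 8, 9, 6, 12]
decreases = find_decreases(corpus)
print(decreases)  # [4, 7]
```

Because both the corpus format and the anomaly are fully specified, the check is exact: every decrease is found and nothing else is reported.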

Structured Anomalies in an Unknown Corpus of Data

The second type is detecting a structured anomaly in an unknown corpus. These problems are harder than the first type because we now have to work out how to parse and evaluate the corpus before we can uncover the anomalies. They are not that much harder, though: we still know the structure of the anomalies, so once the parsing problem is solved, this type becomes identical to the first. However, because the target corpus has an unknown structure, there will most likely be more false negatives than in the first type.
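As a rough sketch of how this type reduces to the first, the Python below adds a parsing step in front of the same decrease check. The regular expression and the messy sample text are assumptions made for illustration; a real unknown corpus would need a parser designed around whatever the data actually looks like.

```python
import re

# Assumed heuristic: any run of digits, optionally signed and with a decimal part, is a value.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_numbers(raw_text):
    """Pull numeric values out of free-form text, in the order they appear."""
    return [float(token) for token in NUMBER.findall(raw_text)]

def find_decreases(sequence):
    """Return the indices whose value is smaller than the value just before it."""
    return [i for i in range(1, len(sequence)) if sequence[i] < sequence[i - 1]]

# Hypothetical corpus with inconsistent separators and line breaks.
raw = "id=1; id=2 | id=3\nid=7,id=5 ; id=8"
values = parse_numbers(raw)
decreases = find_decreases(values)
print(values)     # [1.0, 2.0, 3.0, 7.0, 5.0, 8.0]
print(decreases)  # [4]
```

The extra false negatives come from the parser: a value written in a form the heuristic does not recognize (for example, spelled out as a word) never reaches the detector, so an anomaly involving it is silently missed.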

Unstructured Anomalies in a Known Corpus of Data

The third type is detecting an unstructured anomaly in a known corpus. Again, this type of problem is more complex than the previous one. Although we have a defined structure on which to build our parsing algorithm, the anomalies are unstructured, meaning that we have to truly understand the heuristics of the background corpus in order to evaluate the target corpus against them. In this case, we start to have false positives in addition to false negatives, because the program has no reliable way to confirm, without human interaction, that the anomalies it detects are in fact true positives.
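One possible heuristic for this setting, offered here purely as an assumption and not as the article's method, is to learn how often each token appears in the background corpus and score target lines by how surprising their tokens are. The sketch below uses whitespace-separated log lines as the known format, add-one smoothing for unseen tokens, and an arbitrary threshold of 2.0.

```python
import math
from collections import Counter

def fit_token_model(background_lines):
    """Count how often each whitespace-separated token occurs in the background corpus."""
    counts = Counter(tok for line in background_lines for tok in line.split())
    return counts, sum(counts.values())

def surprise(line, model):
    """Average negative log-probability of the line's tokens, with add-one smoothing."""
    counts, total = model
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(-math.log((counts[t] + 1) / (total + vocab)) for t in tokens) / len(tokens)

# Hypothetical background and target logs.
background = [
    "user login ok",
    "user logout ok",
    "user login ok",
    "user login failed",
]
target = ["user login ok", "root shell spawned"]

model = fit_token_model(background)
threshold = 2.0  # assumed cut-off; in practice it has to be tuned
flagged = [line for line in target if surprise(line, model) > threshold]
print(flagged)  # ['root shell spawned']
```

Note that nothing here knows what an anomaly looks like, only what is rare. A legitimate but unusual line would be flagged just the same, which is exactly where the false positives come from.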

Unstructured Anomalies in an Unknown Corpus of Data

The remaining type is, of course, detecting unstructured anomalies in an unknown corpus. It is the toughest anomaly detection problem and is still being researched and improved today. In this case, not only do we have to understand the heuristics of the corpus, we also have to create many measures based on those heuristics to evaluate how anomalous each segment of the target corpus is. For each of these measures, we need to set a threshold above which we classify a segment as an anomaly. Each threshold has its own trade-offs, and finding the optimal combination requires evaluating performance in a multi-dimensional space in which each dimension represents one of the thresholds. Additionally, after exploring this multi-dimensional space, one might realize that the machine learning model did not properly represent the heuristics of the background corpus, and have to start over and think of another way to quantify or identify the patterns of the corpus. Because of this performance feedback loop, the whole process can be really complex and frustrating. This type of anomaly detection, although very difficult, can potentially yield amazing results.
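The threshold search can be pictured with a small grid search. The sketch below assumes two hypothetical anomaly measures per segment, a handful of hand-labeled segments to evaluate against, and F1 as the performance metric; it flags a segment when either measure exceeds its threshold. All of these choices are illustrative assumptions.

```python
from itertools import product

def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean ground-truth labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_thresholds(segments, labels, grid_a, grid_b):
    """Try every threshold pair and return (best F1, (threshold_a, threshold_b))."""
    best_score, best_pair = 0.0, None
    for ta, tb in product(grid_a, grid_b):
        # A segment is flagged when either measure exceeds its own threshold.
        predicted = [a > ta or b > tb for a, b in segments]
        score = f1_score(predicted, labels)
        if score > best_score:
            best_score, best_pair = score, (ta, tb)
    return best_score, best_pair

# Hypothetical (measure_a, measure_b) scores per segment, with hand-assigned labels.
segments = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.3, 0.3), (0.4, 0.1), (0.1, 0.9)]
labels = [False, True, True, False, False, True]

score, thresholds = best_thresholds(segments, labels, [0.25, 0.5, 0.75], [0.25, 0.5, 0.75])
print(score, thresholds)  # 1.0 (0.5, 0.5)
```

With two measures the grid is a plane; every additional measure adds a dimension, so the number of combinations grows multiplicatively. That growth, plus the need for labeled data to score each combination, is a large part of why the feedback loop is slow.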

Conclusion

Understandably, the more we ignore the structure of the anomalies and the corpus, the harder the algorithm is to create. The more specific we are about the structure of the anomalies and the corpus, the easier the machine learning algorithm is to build. The less structure we assume, the wider the range of problems the algorithm can be applied to. However, accuracy and precision also become issues as the structure of the anomalies and corpus becomes more vague. In an ideal world, if we built a super generic and accurate machine learning algorithm and tuned it perfectly to fit every situation, we would be able to apply it to any problem in the world. In the field of health and medicine, we could detect problematic sub-sequences in genomes and catch illnesses like cancer well before they become an issue. In the field of technology, we could apply the algorithm to a real-time logging system and uncover hackers or malicious activity the instant it occurs. There are many other fields that anomaly detection can be applied to, and if we can one day perfect it, we can solve many issues that are stumping scientists, engineers, and researchers today.

Translated from: https://towardsdatascience.com/detecting-anomalies-using-machine-learning-e3495f79718