Detecting Anomalies Using Machine Learning

What is Anomaly Detection?

Anomaly detection is a classic, frequently studied problem in machine learning. An anomaly is any unusual sequence or pattern inside a large corpus of data. Left unresolved, anomalies tend to cause unexpected, complex errors or inefficiencies. Searching for them by hand is feasible when the corpus is small, but that approach stops being reasonable once the corpus grows to an enormous size. For example, finding a grammatical mistake in a 200-word paragraph is pretty easy, but imagine trying to find every grammatical error in a 5,000-page encyclopedia. At that scale the problem becomes much harder for humans. Fortunately, with the help of machine learning, we can solve this problem much more easily (kind of).

First of all, what is machine learning? Machine learning essentially uses statistics to build a model of how a system (or corpus) normally behaves, trained on a training set (the background data set). We can then compare a system that is behaving abnormally (the target data set) against that model of normal behavior and try to uncover anomalies in the target. Although the main idea sounds easy and intuitive, the process comes with many complications, such as finding a background data set that is representative of the whole population, or distributing the computation across machines for large data sets. These are difficult obstacles that software engineers have to tackle before they can produce a polished machine learning model, but I will not cover them here; instead I will focus on applying machine learning to find anomalies.
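As a rough illustration of the background/target idea, here is a minimal Python sketch. It assumes the "model of normal behavior" is nothing more than a mean and standard deviation, and that a value is anomalous when its z-score exceeds a cutoff. The function names, the sample readings, and the cutoff of 3.0 are all made up for this example; a real model would usually be richer.

```python
import statistics


def fit_baseline(background):
    """Summarize normal behavior as (mean, standard deviation)."""
    return statistics.mean(background), statistics.stdev(background)


def find_anomalies(target, baseline, z_threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the cutoff."""
    mean, stdev = baseline
    if stdev == 0:
        raise ValueError("background has no variation; z-scores are undefined")
    return [
        (i, x)
        for i, x in enumerate(target)
        if abs(x - mean) / stdev > z_threshold
    ]


# Hypothetical readings: the background set is assumed to be "normal".
background = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
target = [10.0, 9.9, 14.5, 10.1]

baseline = fit_baseline(background)
anomalies = find_anomalies(target, baseline)
print(anomalies)
```

With these numbers the baseline comes out to a mean of about 10.0 and a standard deviation of about 0.2, so only the 14.5 reading is flagged. Everything depends on the background set actually being representative, which is exactly the difficulty mentioned above.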

Types of Anomaly Detection Problems

Structured Anomalies in a Known Corpus of Data

There are four main types of anomaly detection problems. The first (and easiest) type is detecting structured anomalies in a known corpus. In these problems you know what structure the anomalies will have and you know the format of the corpus. As a simplified analogy, suppose the corpus is a sequence of strictly increasing numbers and the task is to detect any number that is smaller than the one before it. Here we know the pattern of normal behavior (strictly increasing numbers) and we are looking for a known anomaly (a decrease between adjacent numbers). This problem is relatively easy because we have a clear structure to compare against, so we can measure and know for sure when something is an anomaly. In this case it is relatively easy to build a high-performing machine learning algorithm with negligible false negatives.
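The increasing-numbers analogy is simple enough to write out directly. The sketch below is only an illustration of that analogy, with a made-up function name and sample sequence; no learning is involved because both the normal pattern and the anomaly are fully specified.

```python
def find_decreases(numbers):
    """Return each index i where numbers[i] is smaller than numbers[i - 1]."""
    return [i for i in range(1, len(numbers)) if numbers[i] < numbers[i - 1]]


# Hypothetical corpus: meant to be strictly increasing, with two dips.
corpus = [1, 2, 3, 7, 5, 8, 9, 4, 10]
decreases = find_decreases(corpus)
print(decreases)  # indices of the values 5 and 4
```

The check uses `<`, matching the "decrease" in the analogy. Switching to `<=` would also flag repeated values, which likewise break a strictly increasing sequence.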

Structured Anomalies in an Unknown Corpus of Data

The second type is detecting a structured anomaly in an unknown corpus. These problems are harder than the first type because we now also have to work out how to parse and evaluate the corpus before we can uncover the anomalies. They are not much harder, though: we still know the structure of the anomalies, so once the parsing problem is solved, this type becomes identical to the previous one. However, because the target corpus has an unknown structure, there will most likely be more false negatives than in the first type.
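To make the "parse first, then detect" split concrete, here is a small Python sketch. It assumes, purely for illustration, that the unknown corpus is free-form text and that pulling out every integer with a regular expression is an adequate parsing rule. The raw string and both function names are hypothetical.

```python
import re


def extract_numbers(raw_text):
    """Parsing step: pull every integer out of free-form text."""
    return [int(token) for token in re.findall(r"-?\d+", raw_text)]


def find_decreases(numbers):
    """Detection step: indices where a value is smaller than its predecessor."""
    return [i for i in range(1, len(numbers)) if numbers[i] < numbers[i - 1]]


# Hypothetical corpus whose layout we do not control.
raw = "t=1 t=2; t:3 | 7, 5 then 8"
numbers = extract_numbers(raw)
decreases = find_decreases(numbers)
print(numbers, decreases)
```

Once `extract_numbers` has run, the detection step is the same as in the known-corpus case. Anything the parser fails to recognize (a number spelled out as a word, for instance) silently drops out of the sequence, which is one plausible source of the extra false negatives.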

Unstructured Anomalies in a Known Corpus of Data

The third type is detecting an unstructured anomaly in a known corpus. Again, this type of problem is more complex than the previous one. Although the corpus has a defined structure on which we can build our parsing algorithm, the anomalies are unstructured, which means we have to truly understand the heuristics of the background corpus in order to evaluate the target corpus against them. In this case we start to see false positives in addition to false negatives, because without human involvement the program has no reliable way to check whether the anomalies it detects are in fact true positives.
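One way to picture this is a detector that learns token frequencies from the background corpus and flags anything rare in the target. The Python sketch below is an assumption-laden illustration, not a recommended method: the token lists, the 0.1 frequency cutoff, and the human labels are all invented, and "rare means anomalous" is just one possible heuristic.

```python
from collections import Counter


def fit_frequencies(background_tokens):
    """Learn the relative frequency of each token in the background corpus."""
    counts = Counter(background_tokens)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


def flag_rare(target_tokens, freqs, min_freq=0.1):
    """Flag positions whose token is rare or unseen in the background."""
    return [
        i for i, token in enumerate(target_tokens)
        if freqs.get(token, 0.0) < min_freq
    ]


def confusion(flagged, human_labeled):
    """Compare detector output against labels supplied by a human reviewer."""
    flagged, human_labeled = set(flagged), set(human_labeled)
    return {
        "tp": len(flagged & human_labeled),
        "fp": len(flagged - human_labeled),
        "fn": len(human_labeled - flagged),
    }


# Hypothetical background: 15 GET, 4 POST, 1 PUT request.
background = ["GET"] * 15 + ["POST"] * 4 + ["PUT"]
target = ["GET", "PUT", "DELETE", "POST", "GET"]

freqs = fit_frequencies(background)
flagged = flag_rare(target, freqs)

# Hypothetical review: the DELETE and the POST were the real problems.
human_labeled = [2, 3]
result = confusion(flagged, human_labeled)
print(flagged, result)
```

Here the detector flags the harmless but rare PUT (a false positive) and misses the harmful but common-looking POST (a false negative). Neither mistake is visible to the program itself; it only shows up once a human supplies labels.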

Unstructured Anomalies in an Unknown Corpus of Data

The last type is the toughest anomaly detection problem and is still being researched and improved today: detecting unstructured anomalies in an unknown corpus. Here we not only have to understand the heuristics of the corpus, we also have to create many measures based on those heuristics to evaluate how anomalous each segment of the target corpus is. For each of these measures we need to set a threshold at which a segment is classified as an anomaly. Each threshold has its own trade-offs, and finding the optimal thresholds means operating and evaluating performance in a multi-dimensional space where each dimension represents one threshold. Additionally, after exploring this space, one might realize that the machine learning model did not properly represent the heuristics of the background corpus, and have to start over and think of another way to quantify or identify the patterns of the corpus. Because of this performance feedback loop, the whole process can be really complex and frustrating. This type of anomaly detection, although very difficult, can potentially yield amazing results.
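The threshold search can be sketched with just two measures, which makes the "multi-dimensional space" a two-dimensional grid. In this Python illustration the segment scores, the labels, the grid values, the rule "flag if either measure exceeds its threshold", and the choice of F1 as the performance metric are all assumptions made for the example.

```python
from itertools import product

# Hypothetical labeled segments: (score_a, score_b, is_anomaly).
segments = [
    (0.1, 0.2, False),
    (0.9, 0.1, True),
    (0.2, 0.8, True),
    (0.4, 0.3, False),
    (0.6, 0.2, False),
    (0.3, 0.5, False),
    (0.7, 0.9, True),
]


def f1_score(segments, thr_a, thr_b):
    """F1 of the rule: flag a segment if either score exceeds its threshold."""
    tp = fp = fn = 0
    for score_a, score_b, is_anomaly in segments:
        flagged = score_a > thr_a or score_b > thr_b
        if flagged and is_anomaly:
            tp += 1
        elif flagged:
            fp += 1
        elif is_anomaly:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def grid_search(segments, grid):
    """Try every threshold pair on the grid; return the first best pair."""
    return max(product(grid, grid), key=lambda pair: f1_score(segments, *pair))


grid = [0.1, 0.3, 0.5, 0.7, 0.9]
best_a, best_b = grid_search(segments, grid)
best_f1 = f1_score(segments, best_a, best_b)
print(best_a, best_b, best_f1)
```

Very low thresholds flag every segment (perfect recall, poor precision), and very high ones flag nothing, which is the trade-off in miniature. The toy data happens to be cleanly separable; with real data the best grid point is usually a compromise, and the grid grows exponentially with the number of measures. If no threshold pair performs acceptably, the measures themselves are suspect, which is the restart described above.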

Conclusion

Understandably, the degree to which we can ignore the structure of the anomalies and the corpus is proportional to the difficulty of creating the algorithm. The more specific we are about the structure of the anomalies and the corpus, the easier the machine learning algorithm is to build. The less structure we assume, the wider the range of problems the algorithm can be applied to. However, accuracy and precision also become issues as the structure of the anomalies and corpus becomes more vague. In an ideal world, if we built a super generic and accurate machine learning algorithm and tuned it perfectly to fit every situation, we would be able to apply it to any problem in the world. In health and medicine, we could detect problematic sub-sequences in genomes to catch illnesses like cancer well before they become an issue. In technology, we could apply the algorithm to a real-time logging system and uncover hackers or malicious activity the instant it occurs. There are many other fields where anomaly detection can be applied, and if we can one day perfect it, we can solve many issues that are stumping scientists, engineers, and researchers today.

Translated from: https://towardsdatascience.com/detecting-anomalies-using-machine-learning-e3495f79718
