The Apriori Algorithm and Its Link to Recommendation Algorithms

This time we are diving into the world of data mining. Let’s begin with a short but informative definition.

What is data mining?

Technically, it is a deep dive into datasets in search of correlations, rules, anomalies, and more. It is a way to do simple but effective machine learning without resorting to heavier approaches such as regular neural networks or, more complex still, convolutional and recurrent neural networks (we will go through those thoroughly in future articles).

Data mining algorithms differ from one another, and each has its own advantages and disadvantages. I will not go through them in this article, but the first one to focus on is the classical Apriori algorithm, as it is the gateway to the data mining world.

But before going any further, there is some special data mining vocabulary that we need to get familiar with:

k-itemset: an itemset is just a set of items; k refers to its order/length, that is, the number of items contained in the itemset.
Transaction: a captured record of data, for example the items purchased together in a store. Note that the Apriori algorithm operates on datasets containing thousands or even millions of transactions.
Association rule: an antecedent → consequent relationship between two itemsets, written X → Y.

It implies the presence of the itemset Y (consequent) in the considered transaction given the itemset X (antecedent).

Support: represents the popularity/frequency of an itemset, calculated this way:

support(X) = (number of transactions containing X) / (total number of transactions)

Confidence(X → Y): shows how reliable a rule is, in other words the likelihood of finding the consequent itemset in a transaction that already contains the antecedent, calculated this way:

confidence(X → Y) = support(X ∪ Y) / support(X)

A rule is called a strong rule if its confidence is equal to 1.


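Lift(X → Y): a measure of performance that indicates the quality of an association rule:

lift(X → Y) = confidence(X → Y) / support(Y)

A lift above 1 means that X and Y appear together more often than would be expected if they were independent.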

MinSup: a user-specified variable that stands for the minimum support threshold for itemsets.
MinConf: a user-specified variable that stands for the minimum confidence threshold for rules.
Frequent itemset: an itemset whose support is equal to or higher than the chosen MinSup.
Infrequent itemset: an itemset whose support is lower than the chosen MinSup.
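
To make these definitions concrete, here is a minimal sketch of the three measures in plain Python. The toy transactions and the function names are assumptions made up for illustration; they are not taken from the dataset or the implementation discussed later.

```python
# Hypothetical toy data: each transaction is a set of item names.
transactions = [
    {"milk", "sugar"},
    {"milk", "almonds"},
    {"milk", "sugar", "almonds"},
    {"sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X) for the rule X → Y."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(X → Y) / support(Y)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

print(support({"milk", "sugar"}, transactions))   # 2 of 4 transactions -> 0.5
print(confidence({"sugar"}, {"milk"}, transactions))
print(lift({"sugar"}, {"milk"}, transactions))
```

With a MinSup of 0.5, the itemset {milk, sugar} would count as frequent in this toy example, since its support is exactly 0.5.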

So… how does Apriori work?

Starting with a historical glimpse, the algorithm was first proposed by the computer scientists Agrawal and Srikant in 1994. It proceeds this way:

Generates the possible combinations of k-itemsets (starting with k = 1)
Calculates the support of each itemset
Eliminates the infrequent itemsets
Increments k and repeats the process

Now, how do we generate those itemsets?

For itemsets of length k = 2, every possible combination of two items must be considered (order does not matter, so no permutations are needed). For k > 2, two conditions must be satisfied first:

The combined itemset must be formed from two frequent itemsets of length k-1; let’s call them subsets.
Both subsets must share the same prefix of length k-2.
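
The two conditions above can be sketched as a small join function. This is only an illustration under stated assumptions: itemsets are represented as sorted tuples, and the name `generate_candidates` is hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets that share the same prefix of length k-2.

    Itemsets are assumed to be sorted tuples, so two itemsets with the same
    prefix differ only in their last item.
    """
    frequent_prev = sorted(frequent_prev)
    candidates = []
    for a, b in combinations(frequent_prev, 2):
        if a[:k - 2] == b[:k - 2]:
            # Extend the first itemset with the last item of the second one.
            candidates.append(a + (b[-1],))
    return candidates

# Hypothetical frequent 2-itemsets.
frequent_2 = [("almonds", "milk"), ("almonds", "sugar"), ("milk", "sugar")]
print(generate_candidates(frequent_2, 3))
# [('almonds', 'milk', 'sugar')]
```

For k = 2 the required prefix has length 0, so every pair of frequent items is combined, which matches the special case described above.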

If you think about it, these steps simply extend the previously found frequent itemsets; this is called the ‘bottom-up’ approach. It also shows that the Apriori algorithm respects the monotone property:

All subsets of a frequent itemset must also be frequent.

As well as the anti-monotone property, which is the contrapositive of the same statement:

All supersets of an infrequent itemset must also be infrequent.
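
One practical consequence of these properties is that a candidate can be discarded without counting its support as soon as one of its (k-1)-subsets is known to be infrequent. A minimal sketch of that check, assuming itemsets are sorted tuples and using a hypothetical helper name:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of `candidate` is missing from `frequent_prev`."""
    k = len(candidate)
    return any(sub not in frequent_prev
               for sub in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets: ("almonds", "sugar") is NOT among them.
frequent_2 = {("almonds", "milk"), ("milk", "sugar")}
print(has_infrequent_subset(("almonds", "milk", "sugar"), frequent_2))  # True
```

Here the candidate {almonds, milk, sugar} can be dropped immediately, because its subset {almonds, sugar} is infrequent.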

Okay, but wait a minute, this seems infinite!

No, luckily it is not infinite. The algorithm stops at a certain order k if:

All the generated itemsets of length k are infrequent.
No common prefix of length k-2 is found, which makes it impossible to generate new itemsets of length k.
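
Putting the steps and the two stopping conditions together, the whole level-wise search can be sketched as follows. This is an iterative illustration written for this explanation, not the implementation described later in the article; the function name `apriori`, the tuple representation and the toy data are all assumptions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support} for every frequent itemset."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [(item,) for item in items]  # k = 1
    frequent = {}
    k = 1
    while candidates:  # stops when no new itemset can be generated
        # Calculate the support of each candidate and keep the frequent ones.
        level = {}
        for c in candidates:
            sup = sum(set(c) <= t for t in transactions) / n
            if sup >= min_sup:
                level[c] = sup
        if not level:  # stops when all k-itemsets are infrequent
            break
        frequent.update(level)
        # Increment k and join the survivors that share a (k-2)-prefix.
        k += 1
        prev = sorted(level)
        candidates = [a + (b[-1],)
                      for a, b in combinations(prev, 2)
                      if a[:k - 2] == b[:k - 2]]
    return frequent

# Hypothetical toy data.
transactions = [
    {"milk", "sugar"},
    {"milk", "almonds"},
    {"milk", "sugar", "almonds"},
    {"sugar"},
]
frequent_itemsets = apriori(transactions, min_sup=0.5)
for itemset, sup in frequent_itemsets.items():
    print(itemset, sup)
```

On this toy data the search ends at k = 3: the two frequent 2-itemsets, (almonds, milk) and (milk, sugar), have no prefix in common, so no 3-itemset can be generated.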

Sure… it’s not rocket science! But how about an example to make this clearer?

Here’s a small transaction table in binary format: the value of an item is 1 if it is present in the considered transaction, otherwise it is 0.

Great… it’s time for some association rule mining!

Once you reach this part, all that is left to do is take one frequent k-itemset at a time and generate all its possible rules using binary partitioning.

If the 3-itemset {Almonds-Sugar-Milk} from the previous example were a frequent itemset, then the generated rules would look like:
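
As a sketch of binary partitioning, the snippet below enumerates every way of splitting an itemset into a non-empty antecedent and a non-empty consequent. The function name `binary_partitions` is hypothetical, and the output is simply the mechanical enumeration for this itemset.

```python
from itertools import combinations

def binary_partitions(itemset):
    """List every (antecedent, consequent) split of `itemset`, both non-empty."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

rules = binary_partitions({"Almonds", "Sugar", "Milk"})
for antecedent, consequent in rules:
    print(", ".join(antecedent), "->", ", ".join(consequent))
```

A k-itemset yields 2^k - 2 candidate rules, so this 3-itemset gives six. Each rule would then be kept only if its confidence reaches MinConf.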

An overview of my Apriori simulation, using Python

Dataset

The dataset is in CSV (comma-separated values) format and contains 7501 transactions of items purchased in a supermarket. Restructuring it with the transaction encoder class from the mlxtend library made it much easier to use and manipulate. The resulting structure occupies 871.8 KB, with 119 columns indexed by food name, from “Almonds” to “Zucchini”.
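
To illustrate what this restructuring amounts to, here is a dependency-free sketch that turns lists of purchased items into a 0/1 table with one column per item. It is not mlxtend’s implementation, only an approximation of the kind of table such an encoder produces; the function name `one_hot_encode` and the sample transactions are made up for illustration.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): one sorted column per item, one 0/1 row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[1 if item in t else 0 for item in columns] for t in transactions]
    return columns, rows

# Hypothetical raw transactions, as they might be read from a CSV file.
raw = [
    ["Milk", "Sugar"],
    ["Almonds", "Milk"],
    ["Zucchini"],
]
columns, rows = one_hot_encode(raw)
print(columns)
for row in rows:
    print(row)
```

Once the data is in this shape, the support of an itemset can be obtained by counting the rows in which all of its columns are 1.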

Here’s an overview of the transaction table before and after :


Implementing the algorithm

I will not be posting any code fragments, as the approach was straightforward: the procedure is recursive and calls the functions responsible for itemset generation, support calculation, elimination, and association rule mining, in that order.

The execution took 177 seconds, which seemed optimised and efficient thanks to Pandas and NumPy’s ability to perform quick element-wise operations. All the association rules found were saved in an HTML file for later use.