Efficient Pattern Mining Methods

@(Pattern Discovery in Data Mining) 本文介绍了几个模式挖掘的高效算法。主要以Apriori思想为框架，主要讲解了FP-Growth算法。

The Downward Closure Property of Frequent Patterns

Property The downward closure (also called “Apriori”) property of frequent patterns:

Efficient mining methodology

If any subset of an itemset S is infrequent, then there is no chance for S to be frequent — why do we even have to consider S!? (It is an efficient way to prune)

Principle Apriori pruning principle: If there is any itemset which is infrequent, its superset should not even be generated! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

Scalable mining Methods

The Apriori Algorithm

Outline of Apriori (level-wise, candidate generation and test)

Initially, scan DB once to get frequent 1-itemsetRepeat Until no frequent or candidate set can be generatedReturn all the frequent itemsets derived

Psuedo Code

Tricks joining & pruning

这里，对于某此迭代产生的joining结果，检验任何一个k-1的子集是否在候选集中。即上图pruning的过程。

Extensions or Improvements of AprioriReduce passes of transaction database scans Partitioning (e.g., Savasere, et al., 1995)Dynamic itemset counting (Brin, et al., 1997) —> one of Google’s cofounderShrink the number of candidates Hashing (e.g., DHP: Park, et al., 1995)Pruning by support lower bounding (e.g., Bayardo 1998) Sampling (e.g., Toivonen, 1996)Exploring special data structures

Mining Frequent Patterns by Exploring Vertical Data Format

FPGrowth: A Frequent Pattern-Growth Approach构造FP-Tree，快速迭代生成frequent patterns 什么是FP-Tree？如何构造FP-Tree？

生成结果如下所示：

生成Frequent Itemset 利用分治的方法进行迭代计算。过程（设min_sup=2，以e后缀为例）： 1）得到e的前缀路径子树

2）计算e的频数，判断e是否是frequent item。方法是遍历e节点的链表（虚线连接）计算节点数目，得sup(e)=3 > 2，所以继续下述步骤。 3）因为e是频繁的，找到所有以e结尾的frequent itemlist，也就是说，分拆问题，，进行迭代。这里我们首先需要拿到e的Conditional FP-Tree。 4）Conditional FP-Tree的生成：结果：比较挫，直接看图

步骤： 1 – 更新e的前缀路径子树中的support值

2 – 删除包含e的节点

3 – 删除不频繁(infrequent)的节点，这里的c和d根据前述计算频数的方法知道满足最小support条件。至此，已经得到了关于e的Conditional FP-Tree。

5）利用前面得到的关于e的CFPT，找到所有以de、ce、ae结尾（be不考虑因为b已经被删除）的frequent itemlist。这里直接调用分治的过程进行递归。例如对于de来说，在e的CFPT中找到关于de的前缀路径子树……得到de的CFPT。例如，

讨论：对于单枝前缀路径子树，一次就能生成所有frequent patterns。例如：

* 此处红框选中的子树是m-cond. based，单枝树为{}-f3-c3-a3. * 这是一个迭代的过程，节点a产生了第二棵树{}-f3-c3. 节点c产生了树{}-f3. 然后第二棵树中的节点c产生了最后一棵树{}-f3. 节点f无法再产生新的树。 * 第一棵树是m-cond. based，产生了组合fm，cm，am * 第二棵树是am-cond. based，产生了fam，cam * 第三棵树是cm-cond. based，产生了fcm * 最后一棵树产生了fcam * 所以我们可以得到并集 fm, cm, am, fam, fcm, cam, fcam。

课程习题：

功夫不负有心人。

相关文章：

你感兴趣的文章：

标签云：