《机器学习实战》学习笔记：决策树的实现

决策树是个极其易懂的算法，也是最常用的数据挖掘算法，决策树允许机器根据数据集创造规则，其实这就是机器学习的过程。专家系统中经常会使用到决策树及其变种，而且决策树给出的结果往往可以匹敌在当前领域具有几十年工作经验的专家。

这一章节的主要任务是讨论决策树的方法，以及编写构造决策树的python代码，使用递归调用建立分类器，最后可使用Matplotlib绘制决策树图。算法验证部分，对决策树输入一些隐形眼镜的处方数据，并由训练好的分类器预测需要的镜片类型。

以下是决策树的Python实现：

一、首先编写几个函数对数据进行预处理：

1.计算熵函数：熵代表集合的无序程度，集合越的分布无序，则熵越大；

from math import log :log2 = lambda x:log(x)/log(2)results = {}for row in sample:r = row[len(row)-1]results[r] = results.get(r, 0) + 1ent = r in results.keys():p = float(results[r]) / len(sample)ent -= p * log2(p)return ent

2.按属性和值获取数据集函数，可以看到使用python编写的函数十分简洁，该函数的意义是从数据集sample序列中取得第k列的值为v的子集，并从获得的子集中去掉第k列：

:return [d[:k] + d[k+1:] for d in sample if d[k] == v]

3.计算最大概率属性的相关函数，在构建决策树时，在处理所有决策属性后，还不能唯一区分数据时，，我们采用多数表决的方法来选择最终分类：

:classCount = {}for cla in classList:classCount[cla] = classCount.get(cla, 0) + 1sortedClassCount = sorted(classCount.items(), key=lambda d: d[1], reverse = True)return sortedClassCount[0][0]

二.选取最优数据特征划分方式的函数

在构造决策树之前，首先需要评估每个特性的作用，思考以哪一列的值划分集合，从而获得最大的信息增益，以下函数实现了这一功能，输入数据集，输出最优的特征值。

:ent, feature = 100000000, -1for i in range(len(sample[0]) – 1):feat_list = [e[i] for e in sample]unq_feat_list = set(feat_list)ent_t = 0.0for f in unq_feat_list:sub_data = fetch_subdataset(sample, i, f)ent_t += entropy(sub_data) * len(sub_data) / len(sample)if ent_t < ent:ent, feature = ent_t, ireturn feature

三.使用递归构建决策树

:cla = [c[-1] for c in sample]if len(cla) == cla.count(cla[0]):return cla[0]if len(sample[0]) == 1:return MaxFeature(sample)feature = DecisionFeature(sample)featureLabel = datalabel[feature]decisionTree = {featureLabel:{}}del(datalabel[feature])featValue = [d[feature] for d in sample]UniqueFeatValue = set(featValue)for value in UniqueFeatValue:subLabel = datalabel[:]decisionTree[featureLabel][value] = buildTree( \fetch_subdataset(sample, feature, value), subLabel)return decisionTree

四.使用决策树

:label = decisionTree.keys()[0]next_dict = decisionTree[label]feat_index = featLabels.index(label)for key in next_dict.keys():if test[feat_index] == key:if type(next_dict[key]).__name__ == ‘dict’:c_label = classify(next_dict[key], featLabels, test)else:c_label = next_dict[key]return c_label

由于时间有限，只能分析到这里，实现结果和实例验证有待进一步讨论。

人之所以能，是相信能。

相关文章：

你感兴趣的文章：

标签云：