Data Mining in Practice: Mining Coauthors from DBLP (Python + Hadoop)

Task Description:

The purpose of this post is to find authors in the DBLP database who frequently write together, to get familiar with a classic frequent-itemset mining algorithm (FP-Growth), and to improve and optimize it. The experiment code is written in Python and was run both locally (on Windows 8) and on a Hadoop cluster (resources were limited, so the cluster is three virtual-machine nodes). (All of the code for this post can be downloaded from https://github.com/findmyway/DBLP-Coauthor.)

Task Breakdown:

1. Extract author information from the DBLP dataset
2. Build an author-ID index and encode the file
3. Analyze the scale of the data
4. Build the FP-Tree and extract frequent itemsets from it
5. Analyze the frequent-itemset mining results
6. Assess the feasibility of a parallel FP-Growth algorithm
7. Implement FP-Growth on the Hadoop platform

Extracting Author Information from the DBLP Dataset

First, download the DBLP dataset from the official site http://dblp.uni-trier.de/xml/. Only dblp.xml.gz is needed; decompressing it yields a dblp.xml file of more than 1 GB, so the file is fairly large. Opening it in vim shows that all of the author information lives inside the following elements: 'article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', 'mastersthesis', 'www'.

Here Python's built-in XML parser is used to parse the file (note that the SAX interface is used). The source is below. The core idea: when the parser enters an author element it sets flag = 1 so that the element's character data is written to the output file, and sets flag = 0 on leaving it; whenever one of the paper elements listed above ends, a new line is started, so each line of the resulting authors.txt holds the authors of one paper.

getAuthors.py

import codecs
from xml.sax import handler, make_parser

paper_tag = ('article', 'inproceedings', 'proceedings', 'book',
             'incollection', 'phdthesis', 'mastersthesis', 'www')

class mHandler(handler.ContentHandler):
    def __init__(self, result):
        self.result = result
        self.flag = 0

    def startDocument(self):
        print 'Document Start'

    def endDocument(self):
        print 'Document End'

    def startElement(self, name, attrs):
        if name == 'author':
            self.flag = 1

    def endElement(self, name):
        if name == 'author':
            self.result.write(',')
            self.flag = 0
        if name in paper_tag:
            self.result.write('\r\n')

    def characters(self, chrs):
        if self.flag:
            self.result.write(chrs)

def parserDblpXml(source, result):
    handler = mHandler(result)
    parser = make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)

if __name__ == '__main__':
    source = codecs.open('dblp.xml', 'r', 'utf-8')
    result = codecs.open('authors.txt', 'w', 'utf-8')
    parserDblpXml(source, result)
    result.close()
    source.close()

Building the Author-ID Index

Read the authors.txt file produced in step 1, assign each distinct author name a numeric ID in order of first appearance, store the index in authors_index.txt, and write the ID-encoded coauthor lists to authors_encoded.txt.

encoded.py

import codecs

source = codecs.open('authors.txt', 'r', 'utf-8')
result = codecs.open('authors_encoded.txt', 'w', 'utf-8')
index = codecs.open('authors_index.txt', 'w', 'utf-8')
index_dic = {}
name_id = 0
## build an index_dic, key -> authorName, value -> [id, count]
for line in source:
    name_list = line.split(',')
    for name in name_list:
        if not (name == '\r\n'):
            if name in index_dic:
                index_dic[name][1] += 1
            else:
                index_dic[name] = [name_id, 1]
                index.write(name + u'\r\n')
                name_id += 1
            result.write(str(index_dic[name][0]) + u',')
    result.write('\r\n')

source.close()
result.close()
index.close()

(Pay attention to the encoding here, otherwise you will run into UnicodeError, which is a pain to track down.)

Analyzing the Scale of the Data

Next, look at how many papers each author in the DBLP dataset has published: count how many authors published exactly one paper, how many published two, and so on up to n papers, and plot the result:

Zooming in on the support range from 40 to 200:

The plot shows that at a support of 40 the number of authors is close to 1000, and every further increase of 20 in support roughly halves the number of authors. To keep the computation manageable, the support threshold for the first run should not be too small; at the same time, to avoid ending up with too few results, a first threshold between 40 and 60 is reasonable (40 is used in the experiments below).

view_data.py

from matplotlib.font_manager import FontProperties
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
import codecs
import matplotlib.pyplot as plt
import numpy as np

data = codecs.open('authors_encoded.txt', 'r', 'utf-8')
word_counts = {}
maxCounts = 0
for line in data:
    line = line.split(',')
    for word in line[0:-1]:
        word_counts[word] = word_counts.get(word, 0) + 1
        if word_counts[word] > maxCounts:
            maxCounts = word_counts[word]
            maxKey = word

xMax = maxCounts
data.close()
bins = {}
for k, v in word_counts.iteritems():
    bins[v] = bins.get(v, 0) + 1
y = []
for i in range(40, 200):
    y.append(bins.get(i, 0))
plt.plot(y, '-')
plt.grid()
plt.yticks(range(0, 1000, 100))
plt.xticks(range(0, 160, 20), range(40, 200, 20))
plt.xlabel(u'支持度', fontproperties=font)
plt.ylabel(u'对应支持度下的作者个数', fontproperties=font)
plt.title(u'作者数量与支持度之间的对应关系', fontproperties=font)
plt.show()

Building the FP-Tree and Extracting Frequent Itemsets

The theory behind FP-Growth is not expanded here. Its core idea has two steps: first scan the database to build the FP-Tree, then recursively build conditional pattern trees from it and walk upward to collect the frequent itemsets. The core code below is borrowed from Machine Learning in Action (it is written really well and worth studying in depth).

final.py

class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None
        self.parent = parentNode  # needs to be updated
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

def createTree(dataSet, minSup=1):  # create FP-tree from dataset but don't mine
    freqDic = {}
    # go over dataSet twice
    for trans in dataSet:  # first pass counts frequency of occurrence
        for item in trans:
            freqDic[item] = freqDic.get(item, 0) + dataSet[trans]
    headerTable = {k: v for (k, v) in freqDic.iteritems() if v >= minSup}
    if len(headerTable) == 0:
        return None, None  # if no items meet min support --> get out
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]  # reformat headerTable to use node links
    # print 'headerTable: ', headerTable
    retTree = treeNode('Null Set', 1, None)  # create tree
    for tranSet, count in dataSet.items():  # go through dataset 2nd time
        localD = {}
        for item in tranSet:  # put transaction items in order
            if headerTable.get(item, 0):
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)  # populate tree with ordered freq itemset
    return retTree, headerTable  # return tree and header table

def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:  # check if orderedItems[0] in retTree.children
        inTree.children[items[0]].inc(count)  # increment count
    else:  # add items[0] to inTree.children
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] == None:  # update header table
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:  # call updateTree() with remaining ordered items
        updateTree(items[1::], inTree.children[items[0]], headerTable, count)

def updateHeader(nodeToTest, targetNode):  # this version does not use recursion
    while nodeToTest.nodeLink != None:  # do not use recursion to traverse a linked list!
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

def ascendTree(leafNode, prefixPath):  # ascends from leaf node to root
    if leafNode.parent != None:
        prefixPath.append(leafNode.name)
        ascendTree(leafNode.parent, prefixPath)

def findPrefixPath(basePat, treeNode):  # treeNode comes from header table
    condPats = {}
    while treeNode != None:
        prefixPath = []
        ascendTree(treeNode, prefixPath)
        if len(prefixPath) > 1:
            condPats[frozenset(prefixPath[1:])] = treeNode.count
        treeNode = treeNode.nodeLink
    return condPats

def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]  # (sort header table)
    for basePat in bigL:  # start from bottom of header table
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        # print 'finalFrequent Item: ', newFreqSet  # append to set
        if len(newFreqSet) > 1:
            freqItemList[frozenset(newFreqSet)] = headerTable[basePat][0]
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        myCondTree, myHead = createTree(condPattBases, minSup)
        # print 'head from conditional tree: ', myHead
        if myHead != None:  # 3. mine cond. FP-tree
            # print 'conditional tree for: ', newFreqSet
            # myCondTree.disp(1)
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)

def loadSimpDat(inFile):
    dataSet = {}
    for line in inFile:
        line = line.strip().split(',')
        dataLine = [word for word in line if word.isdigit()]
        dataSet[frozenset(dataLine)] = dataSet.get(frozenset(dataLine), 0) + 1
    return dataSet

if __name__ == "__main__":
    minSup = 100
    print "Reading Source File ... Wait..."
    with open('authors_encoded.txt', 'r') as f:
        dataSet = loadSimpDat(f)
    print "Constructing FP-tree ... Wait..."
    myFPtree, myHeaderTab = createTree(dataSet, minSup)
    print "Mining frequent items ... Wait..."
    myFreqList = {}
    mineTree(myFPtree, myHeaderTab, minSup, set([]), myFreqList)
    print "Totally %d frequent itemsets found ! " % len(myFreqList)
    print "Constructing authors_index... Wait..."

    maxCoauthors = 0
    for freqAuthors in myFreqList.keys():
        if len(freqAuthors) > maxCoauthors:
            maxCoauthors = len(freqAuthors)
    print "the max num of coauthors is %d " % (maxCoauthors)

    with open('authors_index.txt', 'r') as authorsIndex:
        i = 0
        authorsDic = {}
        for name in authorsIndex:
            name = name.strip()
            authorsDic[i] = name
            i = i + 1
    print "Writing result into result.txt... Wait..."
    with open('result4.txt', 'w') as result2:
        with open('result3.txt', 'w') as result:
            result.write("%25s\t%25s\t%15s\t%10s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\n" \
                         % ('authorA', 'authorB', 'authorC', 'Sup(A,B,C)', 'Sup(A)', 'Sup(B)', 'Sup(C)',
                            'Con(A)', 'Con(B)', 'Con(C)', 'MinCon', 'MaxCon'))
            result2.write("%25s\t%25s\t%15s\t%10s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\t%6s\n" \
                          % ('authorA', 'authorB', 'authorC', 'Sup(A,B,C)', 'Sup(A)', 'Sup(B)', 'Sup(C)',
                             'Con(A)', 'Con(B)', 'Con(C)', 'MinCon', 'MaxCon'))
            resultList = sorted(myFreqList.items(), key=lambda p: p[1], reverse=True)
            for itemSet, support in resultList:
                itemList = list(itemSet)
                A = itemList[0]
                authorA = authorsDic.get(int(A), '0')
                B = itemList[1]
                authorB = authorsDic.get(int(B), '0')
                SupAB_C = int(support)
                SupA = int(myHeaderTab.get(A, [0])[0])
                SupB = int(myHeaderTab.get(B, [0])[0])
                ConA = float(SupAB_C) / float(SupA)
                ConB = float(SupAB_C) / float(SupB)
                (C, authorC, SupC, ConC) = ('', '', 0.0, 0.0)

                if len(itemList) == 3:
                    C = itemList[2]
                    authorC = authorsDic.get(int(C), '0')
                    SupC = int(myHeaderTab.get(C, [0])[0])
                    ConC = float(SupAB_C) / float(SupC)
                    MinCon = min([ConA, ConB, ConC])
                    MaxCon = max([ConA, ConB, ConC])
                elif len(itemList) == 2:
                    MinCon = min([ConA, ConB])
                    MaxCon = max([ConA, ConB])

                if MinCon < 0.4 or MaxCon < 0.5 or (MinCon + MaxCon) / 2 < 0.5:
                    continue
                result.write("%25s\t%25s\t%15s\t%10.0f\t%6.0f\t%6.0f\t%6.0f\t%6.4f\t%6.4f\t%6.4f\t%6.4f\t%6.4f\n" \
                             % (authorA, authorB, authorC, SupAB_C,
                                SupA, SupB, SupC, ConA, ConB, ConC, MinCon, MaxCon))
                result2.write("%25s\t%25s\t%15s\t%10.0f\t%6.0f\t%6.0f\t%6.0f\t%6.4f\t%6.4f\t%6.4f\t%6.4f\t%6.4f\n" \
                              % (A, B, C, SupAB_C, SupA, SupB, SupC,
                                 ConA, ConB, ConC, MinCon, MaxCon))
    print "Finished !"

Analyzing the Frequent-Itemset Mining Results

With a frequency threshold of 40, the number of results turned out to be very large, more than 2000 itemsets in total. To make the analysis easier, the threshold was raised further to 100, which leaves 111 records. Sorted by the coauthors' joint support, part of the result looks like this:

Explanation of the Output

Counting the number of coauthors in the itemsets that satisfy the support threshold shows that groups of authors who frequently work together contain at most three people, so the output file has columns authorA, authorB and authorC (when there are only two coauthors, authorC is empty and its support and confidence are 0). Sup(A,B,C) is the number of papers A, B and C wrote together; Sup(A), Sup(B) and Sup(C) are the total numbers of papers written by A, B and C respectively; Con(A), Con(B) and Con(C) are the confidences of A, B and C (the number of joint papers divided by that author's total number of papers). MinCon and MaxCon are the minimum and maximum of Con(A), Con(B) and Con(C) (note that when authorC is empty, its confidence is excluded from the min/max).
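As a quick worked example with invented numbers (they are not taken from the actual results): if A and B co-authored 50 papers, while A wrote 80 papers in total and B wrote 200, the columns would be computed as in this small sketch:

# Hypothetical numbers for illustration only.
SupAB = 50.0   # papers written by A and B together
SupA = 80.0    # total papers by A
SupB = 200.0   # total papers by B
ConA = SupAB / SupA            # 0.625
ConB = SupAB / SupB            # 0.25
MinCon, MaxCon = min(ConA, ConB), max(ConA, ConB)   # 0.25 and 0.625

Such a pair would later be dropped by the confidence filter discussed below, since its MinCon is under the 0.3 (and eventually 0.4) threshold.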

Analysis of the Output

The results above have not been post-processed in any way. A first look at them reveals the following:

1. Among the coauthor groups that satisfy the support threshold, most consist of two people, but a small number of three-person groups also collaborate frequently;

2. Since the focus here is on how strongly authors collaborate, the effect of lift on the results can be ignored;

3. The coauthor relationship is symmetric: how much A collaborates with B is exactly how much B collaborates with A, so confidence can be used directly;

4. After sorting by support, some authors have a rather low confidence, so a confidence threshold has to be introduced. To avoid unbalanced confidences (for example, A collaborates with B frequently, but those papers are only a small fraction of B's output), the conditions MinCon >= 0.3 and MaxCon >= 0.5 are added, together with the balance condition (MinCon + MaxCon) / 2 >= 0.5. The filtered output looks like this:

Inspecting the filtered file shows that the number of records drops to 82, and that very few records have a MinCon between 0.3 and 0.4, so the MinCon threshold can be raised to 0.4 (which is the value used in final.py above).

Visualization:

Here the relationship between an author and his or her coauthors is drawn as a graph, using the top-ranked author Irith Pomeranz as an example.


viewRelatio.py

# -*- coding: utf-8 -*-
import itertools
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import codecs
from matplotlib.font_manager import FontProperties

font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc")

def createEdge(nodeX):
    with codecs.open('authors.txt', 'r', 'utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            if line[-1] == '':
                line.remove('')
            if nodeX in line and len(line) > 1:
                line.remove(nodeX)
                for author in line:
                    yield (author, nodeX)

def makeFreqDic():
    print "Creating FreqDic..."
    with codecs.open('authors.txt', 'r', 'utf-8') as f:
        freqDic = {}
        for line in f:
            line = line.strip().split(',')
            if line[-1] == '':
                line.remove('')
            for author in line:
                freqDic[author] = freqDic.get(author, 0) + 1
        return freqDic

def main(freqDic, nodeX):
    G = nx.Graph()
    print "Adding edge..."
    for A, B in createEdge(nodeX):
        edgeDic = G.get_edge_data(A, B, default={'weight': 0})
        G.add_edge(A, B, weight=edgeDic['weight'] + 1)
    nodes = G.nodes()
    nodes.remove(nodeX)
    shells = [[nodeX], nodes]
    pos = nx.shell_layout(G, shells)
    print "Drawing nodes..."
    nodeSize = [10 * freqDic[n] for n, dic in G.nodes_iter(data=True)]
    nodeColors = np.random.rand(len(nodeSize))
    nx.draw_networkx_nodes(G, pos, node_size=nodeSize, node_color=nodeColors, alpha=0.7)
    print "Drawing edges..."
    edgeWidth = [edata['weight'] / 2 for u, v, edata in G.edges(data=True)]
    edgeColor = np.random.rand(G.number_of_edges())
    nx.draw_networkx_edges(G, pos, width=edgeWidth, edge_color=edgeColor, alpha=0.35)
    print "Adding label..."
    select_labels = {n: n for n, d in G.nodes_iter(data=True) if freqDic[n] >= 80}
    select_labels[nodeX] = nodeX
    nx.draw_networkx_labels(G, pos, labels=select_labels, font_size=8, alpha=0.3)
    title = str(nodeX) + u"与其合作者之间的关系网络"
    plt.title(title, size=15, fontproperties=font)
    plt.text(0.5, 0.94, u"# 节点大小对应该作者发表文章总次数",
             horizontalalignment='center',
             size=10, color='r', verticalalignment='center',
             transform=plt.gca().transAxes,
             fontproperties=font)
    plt.text(0.5, 0.97, u"# 节点之间连线粗细对应该两个作者一起发表文章总次数",
             horizontalalignment='center',
             size=10, color='r', verticalalignment='center',
             transform=plt.gca().transAxes,
             fontproperties=font)
    plt.axis('off')
    fileName = str(nodeX) + ".png"
    plt.savefig(fileName, transparent=True, dpi=500)
    plt.show()

if __name__ == '__main__':
    freqDic = makeFreqDic()
    nodeX = u'Irith Pomeranz'
    main(freqDic, nodeX)

Feasibility of a Parallel FP-Growth Algorithm

For the earliest work on parallel FP-Growth, see the paper:

Parallel_Frequent_Pattern_Mining.pdf

Its core idea can be illustrated by the figure above.

The parallel FP-growth algorithm can be decomposed into two MapReduce passes.

The First MapReduce Pass

The first MapReduce pass is simply a WordCount: it scans the whole dataset and counts how often each item occurs. Concretely, the Map phase emits key-value pairs of the form <word, 1>, and the Reduce phase sums up the total count for each word.
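The original post links to a separate write-up for this pass. As a rough sketch of what the streaming WordCount amounts to (the file names mapper1.py and reducer1.py and the details below are my own assumptions, not the original code), the two scripts could look like this:

#!/usr/bin/env python
# mapper1.py (illustrative sketch): emit <authorID, 1> for every encoded author
# in a line of authors_encoded.txt read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split(','):
        if word.isdigit():
            print "%s\t%d" % (word, 1)

#!/usr/bin/env python
# reducer1.py (illustrative sketch): Hadoop Streaming sorts the map output by key,
# so all counts for one author ID arrive consecutively and can be summed with groupby.
from itertools import groupby
from operator import itemgetter
import sys

def readMapOutput(stream):
    for line in stream:
        line = line.strip()
        if line:
            yield line.split('\t')

for word, group in groupby(readMapOutput(sys.stdin), itemgetter(0)):
    total = sum(int(count) for _, count in group)
    # The "id<TAB>count" format matches what mapper2.py below reads from its sortedList file.
    print "%s\t%d" % (word, total)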

The Second MapReduce Pass

The second MapReduce pass is the core of the parallel FP-growth algorithm.

Referring to the figure above:

Map phase:

1. Read the word frequencies produced by the first MapReduce pass into a dictionary;

2. Read the records from the database one by one, sort each record according to that dictionary, and drop the items that do not satisfy the support threshold;

3. For every element of the sorted record, emit its conditional pattern items (to see why, revisit how FP-growth works); a toy walk-through is given right after this list.
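To make step 3 concrete, here is a toy walk-through with invented author IDs and counts, mirroring the logic of mapper2.py shown further below: the items of one transaction are sorted by ascending global frequency, and each item is emitted together with the more frequent items that follow it, which form its conditional pattern.

# Illustrative only: IDs and counts are invented.
freqDic = {7: 500, 23: 300, 41: 150}      # global counts from the first pass
minSup = 100
transaction = ['41', '7', '23']           # one line of authors_encoded.txt

names = {name: freqDic[int(name)] for name in transaction
         if name.isdigit() and freqDic.get(int(name), 0) >= minSup}
conPatItems = [name for name, count in sorted(names.iteritems(), key=lambda p: p[1])]
for i in range(len(conPatItems) - 1):
    print "%s\t%s" % (conPatItems[i], conPatItems[i + 1::])
# prints:
# 41    ['23', '7']
# 23    ['7']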

Reduce phase:

1. Collect the conditional pattern items of each element and count how often every item appears in them;

2. Filter those counts against the support threshold;

3. Generate all subsets of each conditional pattern; these subsets are the final result (this step is not in the figure above; I added it myself).

Implementing FP-Growth on the Hadoop Platform

Once the algorithm above is understood, the implementation is fairly straightforward.

For the first MapReduce pass, see: www.tianjun.ml/essays/19

The Map phase of the second pass looks like this:

mapper2.py

#!/usr/bin/env python
import sys

def creatDic():
    freqDic = {}
    with open('sortedList', 'r') as sortedList:
        for line in sortedList:
            line = line.strip().split('\t')
            freqDic[int(line[0])] = int(line[1])
    return freqDic

def read_input(inFile):
    for line in inFile:
        yield line.split(',')

def main(freqDic, minSup):
    data = read_input(sys.stdin)
    for names in data:
        names = {name: freqDic[int(name)] for name in names \
                 if name.isdigit() \
                 and freqDic.get(int(name), 0) >= minSup}
        lenth = len(names)
        if lenth >= 2:
            conPatItems = [name for name, value in \
                           sorted(names.iteritems(), \
                                  key=lambda p: p[1])]
            for i in range(lenth - 1):
                print "%s\t%s" % (conPatItems[i], conPatItems[i + 1::])
        else:
            continue

if __name__ == '__main__':
    support = 100
    dic = creatDic()
    main(dic, support)

The Reduce phase of the second pass looks like this:

reducer2.py

#!/usr/bin/env python
from itertools import groupby
from operator import itemgetter
import sys

def readMapOutput(file):
    for line in file:
        yield line.strip().split('\t')

def main(minSup):
    data = readMapOutput(sys.stdin)
    for currentName, group in groupby(data, itemgetter(0)):
        localDic = {}
        try:
            for currentName, conPatItems in group:
                conPatItems = conPatItems.strip().strip('[').strip(']')
                # print "%s\t%s" % (currentName, conPatItems)
                itemList = conPatItems.split(',')
                for item in itemList:
                    item = item.strip().strip("'")
                    item = int(item)
                    localDic[item] = localDic.get(item, 0) + 1
            resultDic = {k: v for k, v in localDic.iteritems() \
                         if v >= minSup}
            # Here we just print out 2-coauthors
            if len(resultDic) >= 1:
                print "%s\t%s" % (currentName, resultDic.items())
        except:
            print "%s\t%s" % ("inner err", "sorry!")
            pass

if __name__ == "__main__":
    support = 100
    main(support)

Summary:

This experiment mined coauthors from the DBLP dataset both locally and on a Hadoop cluster. Because the cluster ran on virtual machines, a fair comparison between single-machine and distributed running times was not possible; however, the analysis in Parallel_Frequent_Pattern_Mining.pdf shows that on large datasets the distributed FP-growth algorithm is far more efficient than the conventional one.

On Debugging

MapReduce jobs are considerably harder to debug than ordinary programs. One suggestion is to first set the number of reduce tasks to 0 and check whether the map output looks as expected, or to print the grouped map output directly inside the Reduce phase for further analysis.
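One way to act on this advice without touching the cluster is to run the second pass locally and inspect the intermediate map output. The snippet below is only a sketch, assuming mapper2.py, reducer2.py, sortedList and authors_encoded.txt all sit in the current directory; it is not part of the original post.

# Local simulation of the second pass for debugging.
import subprocess

# 1. Map: feed the encoded transactions through mapper2.py and keep its raw output.
with open('authors_encoded.txt') as src, open('map_out.txt', 'w') as out:
    subprocess.check_call(['python', 'mapper2.py'], stdin=src, stdout=out)

# 2. Shuffle: Hadoop sorts the map output by key before the reduce phase; emulate it.
with open('map_out.txt') as f:
    lines = sorted(f, key=lambda line: line.split('\t', 1)[0])
with open('map_out_sorted.txt', 'w') as f:
    f.writelines(lines)

# 3. Reduce: reducer2.py now sees the same grouped input it would get on the cluster.
with open('map_out_sorted.txt') as src, open('reduce_out.txt', 'w') as out:
    subprocess.check_call(['python', 'reducer2.py'], stdin=src, stdout=out)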

A rather painful problem came up while writing the program: the distributed output differed from the single-machine output! After a long hunt, the cause turned out to be that the distributed version does not mine the complete set of frequent itemsets: if a <key, value> pair has value <authorA, authorB, authorC>, its subsets <authorA, authorB>, <authorA, authorC> and <authorB, authorC> are never emitted, so the distributed run produces fewer results than the single-machine run.
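One way to patch this (an assumed fix, not code from the original post) is a small post-processing step that expands every mined itemset into its subsets before writing the final result, using the Apriori property; the helper expandItemset below is hypothetical.

from itertools import combinations

def expandItemset(itemset, support):
    # Every subset of a frequent itemset is itself frequent, with support at least
    # as high as the full itemset (the Apriori property). Yielding all subsets of
    # size >= 2 recovers the itemsets that the single-machine run reports but the
    # distributed run misses; the attached support is only a lower bound for them.
    items = sorted(itemset)
    for size in range(2, len(items) + 1):
        for subset in combinations(items, size):
            yield subset, support

# For example, expandItemset(('authorA', 'authorB', 'authorC'), 120) yields the three
# pairs and the full triple, each tagged with the lower bound 120. Duplicates produced
# under different keys would still need to be merged, keeping the largest support seen.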

And finally, I cannot help complaining one more time about encoding and string handling...
