Project 1-3: Link Analysis, Part on Link Statistics

Our group is currently working on source-code analysis and link statistics in parallel. Progress reports on both will be published later.

This post describes how to parse the link relationships between pages so that they can be used for statistical analysis.

In one sentence: life is short, I use Python.

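The basic approach is to walk the web pages under the mirror directory, extract the link addresses with a regular expression, and write out the link relationships. The resulting file can serve as input to the next program, which counts each page's out-degree and in-degree and computes PR (PageRank) values.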
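The extraction step can be tried on a small string before running it over the whole mirror. The following is a minimal sketch, not the full script: the regular expression is the one used in the script in this post, while `sample_html`, `page_links`, and the two filter rules shown are simplified, made-up stand-ins for illustration.

```python
import re

# Same pattern as the script in this post: "href=", an optional opening
# quote, the URL, and an optional backreference to that opening quote.
HREF_RE = re.compile(r'''href=('|")?([^\s'"><()]+)(\1?)''')

# Hypothetical page fragment, made up for illustration.
sample_html = '''
<a href="news/index.htm">News</a>
<a href='mailto:someone@example.com'>Mail</a>
<a href=about.asp>About</a>
<link href="style.css" rel="stylesheet">
'''


def page_links(html):
    """Return hrefs that look like pages (simplified filter: mail and CSS only)."""
    links = []
    for open_quote, url, close_quote in HREF_RE.findall(html):
        if open_quote != close_quote:
            continue  # unbalanced quotes, probably a false match
        if url.startswith('mailto:') or url.endswith('.css'):
            continue  # not a page
        links.append(url)
    return links


links = page_links(sample_html)
print(links)
```

Quoted and unquoted attribute values are both matched; comparing the opening and closing quote groups discards matches where the quotes do not pair up.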

The source code follows:

    # coding: utf-8

    import os
    import re

    rootdir = '/home/xxx/workspace/heritrix/jobs/ccer-20100930010817713/mirror/www.ccer.pku.edu.cn'
    HOST = 'www.ccer.pku.edu.cn'

    # Links that do not point to pages: mail, scripts, and static resources.
    SKIP_PREFIXES = ('mailto:', 'javascript')
    SKIP_SUFFIXES = ('.css', '.jpg', '.bmp', '.jpeg', '.ico', '.gif', '.pdf',
                     '.ppt', '.doc', '.xls', '.pptx', '.docx', '.xlsx', '.zip',
                     '.png')

    # "href=", an optional opening quote, the URL, and the matching closing quote.
    HREF_RE = re.compile(r'''href=('|")?([^\s'"><()]+)(\1?)''')

    dotfile = open('links.data', 'w', 4096000)

    count = 0
    urllist = []    # index -> URL
    urlindex = {}   # URL -> index, so lookups do not scan the list


    def append2list(url):
        """Return the index of url, assigning a new index on first sight."""
        if url not in urlindex:
            urlindex[url] = len(urllist)
            urllist.append(url)
        return urlindex[url]


    def extract(dirr, name):
        """Write one "source target" index pair for each page link in dirr/name."""
        global count
        base = 'http://' + dirr[dirr.find(HOST):] + '/'
        curindex = append2list(base + name)

        # errors='replace' is an assumption: the mirrored pages may not
        # decode cleanly in the default encoding.
        with open(dirr + '/' + name, 'r', errors='replace') as f:
            hrefs = HREF_RE.findall(f.read())

        for openquote, realref, closequote in hrefs:
            if (openquote != closequote
                    or realref in ('#', './')
                    or realref.startswith(SKIP_PREFIXES)
                    or realref.endswith(SKIP_SUFFIXES)):
                continue
            if not realref.startswith('http'):  # relative link
                if '.asp?' in realref:
                    # match the file name on disk
                    realref = realref.replace('.asp?', '', 1) + '.asp'
                realref = base + realref
            refindex = append2list(realref)
            dotfile.write('%d %d\n' % (curindex, refindex))
            count += 1
            if count % 10000 == 0:
                print(count)


    # Visit every .asp/.htm/.html file under the mirror directory.
    for dirr, _dirs, files in os.walk(rootdir):
        for name in files:
            if (os.path.splitext(name)[1] in ('.asp', '.htm', '.html')
                    and os.path.isfile(dirr + '/' + name)):
                extract(dirr, name)

    dotfile.close()

    # Line n (counting from 0) of linkindex.txt is the URL with index n.
    with open('linkindex.txt', 'w', 4096000) as urlfile:
        for url in urllist:
            urlfile.write(url + '\n')
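Each line of `links.data` holds a source index and a target index separated by a space, so counting out-degree and in-degree in the next program is straightforward. The sketch below is only one possible way to do it. The `degrees` function and the sample data are hypothetical and are not part of the original project.

```python
import io
from collections import Counter


def degrees(lines):
    """Count out-degree and in-degree from lines of "source target" indices."""
    out_deg, in_deg = Counter(), Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = int(parts[0]), int(parts[1])
        out_deg[src] += 1
        in_deg[dst] += 1
    return out_deg, in_deg


# Made-up sample in the same format as links.data.
sample = io.StringIO("0 1\n0 2\n1 2\n2 0\n")
out_deg, in_deg = degrees(sample)
print(dict(out_deg))
print(dict(in_deg))
```

With the real output, `open('links.data')` can be passed in place of `sample`. The script writes one line per link occurrence, so a page that links to the same target twice is counted twice here. Collecting the pairs into a set first would count distinct links only.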
