Cleaning HTML with Python re - Programming

Code

import re

# Patterns for the tags to strip, applied in the original order.
# Raw strings keep \b (word boundary) and \? from being read as string escapes.
TAG_PATTERNS = [
    r'<\bp\b[^>]*>',           # opening <p ...> tags
    r'</?SPAN[^>]*>',          # <span> and </span>
    r'</?o:p>',                # <o:p> and </o:p>
    r'</?FONT[^>]*>',          # <font> and </font>
    r'</?\bB\b[^>]*>',         # <b> and </b>, but not <br> or <body>
    r'<\?[^>]*>',              # tags that start with <?, such as <?xml ...>
    r'</?st1:[^>]*>',          # any tag in the st1: namespace
    r'</?\bchsdate\b[^>]*>',   # <chsdate> and </chsdate>
    r'<\bbr\b[^>]*>',          # <br> line breaks
    r'</?\bchmetcnv\b[^>]*>',  # <chmetcnv> and </chmetcnv>
    # The start of this last pattern (the tag name) is missing in the source;
    # as written it only removes '>' and any ']' characters right before it.
    r']*?>.*?',
]

def formatHtml(html):
    """Strip the tags listed above from an HTML string, keeping the text."""
    for pattern in TAG_PATTERNS:
        # re.DOTALL only matters for the last pattern, the only one using '.'.
        regular = re.compile(pattern, re.IGNORECASE | re.DOTALL)
        html = regular.sub('', html)
    return html
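To see what this kind of cleanup does in practice, here is a minimal, self-contained sketch that applies a subset of the patterns to a Word-style fragment. The helper name strip_word_markup and the sample string are assumptions made for illustration; they are not part of the original code.

```python
import re

# A hypothetical subset of the patterns used by formatHtml, as raw strings.
PATTERNS = [
    r'</?SPAN[^>]*>',
    r'</?o:p>',
    r'</?FONT[^>]*>',
    r'</?\bB\b[^>]*>',
    r'</?st1:[^>]*>',
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

def strip_word_markup(html):
    """Remove the tags matched by PATTERNS, keeping the text between them."""
    for regex in COMPILED:
        html = regex.sub('', html)
    return html

# Illustrative input: the kind of markup Word tends to export.
sample = ('<span lang="EN-US"><font size="3"><b>Hello</b> '
          '<st1:city>Paris</st1:city><o:p></o:p></font></span>')
cleaned = strip_word_markup(sample)
print(cleaned)
```

Only the tags are removed; the text between them is kept. For anything beyond this sort of known, narrow cleanup, an HTML parser is usually the safer tool.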

Notes on using re:

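1. def sub(pattern, repl, string, count=0, flags=0): the fourth positional parameter of re.sub is count, so a flag passed in that position is taken as count. Pass flags by keyword, for example flags=re.IGNORECASE.
2. re.sub(pattern, r'<8888(\g<0>)>', s): in the replacement string, \g<N> refers to a captured group. \g<0> is the entire match and \g<1> is the first group.
3. (<div id="play_div"[^>]*>)(.*?)(</div>) is non-greedy; (<div id="play_div"[^>]*>)(.*)(</div>) is greedy.

Common special characters in regular expressions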

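^ matches the start of the string.
$ matches the end of the string.
\b matches a word boundary.
\d matches any digit.
\D matches any non-digit character.
x? matches an optional x (that is, x zero or one time).
x* matches x zero or more times.
x+ matches x one or more times.
x{n,m} matches x at least n and at most m times.
(a|b|c) matches either a, or b, or c.
(x) normally denotes a remembered (capturing) group. Its value can be read with the groups() method of the match object returned by re.search.
[^>] matches any single character except >.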
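The special characters listed above can be tried out with a short sketch. The sample strings and variable names are made up for illustration.

```python
import re

# ^ and $ anchor to the start and end; \d is a digit, \D a non-digit.
only_digits = re.search(r'^\d+$', '2024') is not None
mixed = re.search(r'^\d+$', 'year 2024') is not None
non_digits = re.findall(r'\D+', 'a1b22c')

# \b is a word boundary, so the 'cat' inside 'concat' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# x? and x{n,m} are quantifiers on the preceding item.
optional_u = re.findall(r'colou?r', 'color colour')
too_many = re.fullmatch(r'a{2,3}', 'aaaa') is None

# (x) remembers what it matched; groups() returns all remembered groups.
m = re.search(r'(\d+)-(\d+)', 'pages 10-25')
groups = m.groups()

# [^>] matches any single character except '>'.
tag = re.search(r'<[^>]*>', '<p class="x">text</p>').group(0)

print(only_digits, mixed, non_digits, whole_words)
print(optional_u, too_many, groups, tag)
```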

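Returning to the three notes on re.sub, a small sketch can confirm each of them. The sample strings are illustrative assumptions. The count pitfall is written with an explicit count= keyword so that it is clear what a positional re.IGNORECASE is actually read as.

```python
import re

html = '<B>one</B> <b>two</b>'

# Note 1: re.sub(pattern, repl, string, re.IGNORECASE) puts the flag in the
# count slot. re.IGNORECASE has the integer value 2, so that call means
# count=2 with no flags, and matching stays case-sensitive.
wrong = re.sub(r'</?b>', '', html, count=int(re.IGNORECASE))
right = re.sub(r'</?b>', '', html, flags=re.IGNORECASE)

# Note 2: in the replacement, \g<0> is the whole match, \g<1> the first group.
whole = re.sub(r'\d+', r'<8888(\g<0>)>', 'id 42')
first = re.sub(r'(\w+)@(\w+)', r'\g<1> at \g<2>', 'user@host')

# Note 3: .*? stops at the first </div>, .* runs on to the last one.
page = '<div id="play_div" class="x">A</div><div>B</div>'
lazy = re.search(r'(<div id="play_div"[^>]*>)(.*?)(</div>)', page).group(2)
greedy = re.search(r'(<div id="play_div"[^>]*>)(.*)(</div>)', page).group(2)

print(wrong, '|', right)
print(whole, '|', first)
print(lazy, '|', greedy)
```

In the first case the uppercase tags survive because no flag was applied, which is the symptom to look for when this mistake happens.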
Original post: python re清理html. Thanks to the original author for sharing.
