Python html5lib Skipped Elements

I’ve been working on some interesting Python stuff at Mozilla, and one recent task called for rendering a page, then finding elements with a URL attribute value (like img[src] or a[href]) and ensuring they become absolute URLs. One problem I encountered when using html5lib was that LINK and IMG elements were being skipped when I tokenized the HTML. After browsing through the html5lib source code, I found a variable called voidElements which included both LINK and IMG:

voidElements = frozenset((
    "base",
    "command",
    "event-source",
    "link",
    "meta",
    "hr",
    "br",
    "img",
    "embed",
    "param",
    "area",
    "col",
    "input",
    "source"
))

When I commented out those two elements, they were found on the next run of my routine, meaning their presence in the set was causing my problem. Here’s how I skirted the issue:

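# Assumes html5lib's constants module is imported under this alias
from html5lib import constants as html5lib_constants

# Build a mutable copy, drop the two tags, then freeze it again
new_void_set = set()
for item in html5lib_constants.voidElements:
    new_void_set.add(item)
new_void_set.remove('link')
new_void_set.remove('img')
html5lib_constants.voidElements = frozenset(new_void_set)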

Since voidElements is a frozenset, I couldn’t simply remove LINK and IMG, so I needed to create a new frozenset without those elements. Let me know if there’s a more Pythonic way of creating this frozenset. In any event, delving into the deep recesses of html5lib paid off and I accomplished the goal!
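
One possibly more Pythonic option, offered as a sketch and not the definitive answer: set difference on a frozenset returns a new frozenset, so the copy-and-remove loop can collapse into a single expression. The snippet uses a local stand-in named void_elements (a hypothetical name, not part of html5lib) so it runs without html5lib installed; in real code the result would be assigned back to html5lib_constants.voidElements, as in the workaround above.

```python
# Local stand-in for html5lib's voidElements, copied from the listing above.
# The name void_elements is an assumption made for this example.
void_elements = frozenset((
    "base", "command", "event-source", "link", "meta", "hr", "br",
    "img", "embed", "param", "area", "col", "input", "source",
))

# frozenset - set produces a new frozenset; the original is left untouched.
trimmed = void_elements - {"link", "img"}

print(type(trimmed).__name__)
print(sorted(trimmed))
```

Unlike remove(), the difference operator does not raise KeyError if one of the names is already missing from the set, which may or may not be the behavior you want.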
