
Sun, 08 Jan 2012

Parsing HTML in Python

I've been having (mis)adventures learning about Python's various options for parsing HTML.

Up until now, I've avoided doing any HTML parsing in my RSS reader FeedMe. I use regular expressions to find the places where content starts and ends, to screen out things like advertising, and to rewrite links. Using regexps on HTML is generally considered a no-no, but it didn't seem worth parsing the whole document just for those modest goals.

But I've long wanted to add support for downloading images, so you could view the downloaded pages with their embedded images if you so chose. That means not only identifying img tags and extracting their src attributes, but also rewriting the img tag afterward to point to the locally stored image. It was time to learn how to parse HTML.

Since I'm forever seeing people flamed on the #python IRC channel for using regexps on HTML, I figured real HTML parsing must be straightforward. A quick web search led me to Python's built-in HTMLParser class. It comes with a nice example for how to use it: define a class that inherits from HTMLParser, then define some functions it can call for things like handle_starttag and handle_endtag; then call self.feed(). Something like this:

from HTMLParser import HTMLParser
import urllib2

class MyFancyHTMLParser(HTMLParser):
  def fetch_url(self, url) :
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    link = response.geturl()
    html = response.read()
    response.close()
    self.feed(html)   # feed() starts the HTMLParser parsing

  def handle_starttag(self, tag, attrs):
    if tag == 'img' :
      # attrs is a list of tuples, (attribute, value)
      srcindex = self.has_attr('src', attrs)
      if srcindex < 0 :
        return   # img with no src attribute? skip it
      src = attrs[srcindex][1]
      # Make relative URLs absolute
      src = self.make_absolute(src)
      attrs[srcindex] = (attrs[srcindex][0], src)

    print '<' + tag
    for attr in attrs :
      print ' ' + attr[0]
      if len(attr) > 1 and isinstance(attr[1], str) :
        # escape any embedded double-quotes so the value stays a valid attribute
        val = attr[1].replace('"', '&quot;')
        print '="' + val + '"'
    print '>'

  def handle_endtag(self, tag):
    # self.outfile and self.encoding are set up elsewhere in the real class
    self.outfile.write('</' + tag.encode(self.encoding) + '>\n')
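By the way, has_attr() and make_absolute() aren't HTMLParser methods -- they're little helpers of my own. For instance, has_attr() just looks up an attribute's index in the attrs list; a minimal version might look like:

  def has_attr(self, attname, attrs) :
    # Return the index of attname in the attrs list, or -1 if it's missing
    for i, attr in enumerate(attrs) :
      if attr[0] == attname :
        return i
    return -1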

Easy, right? Of course there are a lot more details, but the basics are simple.

I coded it up and it didn't take long to get it downloading images and changing img tags to point to them. Woohoo! Whee!

The bad news about HTMLParser

Except ... after using it a few days, I was hitting some weird errors. In particular, this one:
HTMLParser.HTMLParseError: bad end tag: ''

It comes from sites that serve HTML that isn't strictly legal. For instance, stories on Slate.com include Javascript lines like this one inside <script></script> tags:
document.write("<script type='text/javascript' src='whatever'></scr" + "ipt>");

This is technically illegal HTML -- but lots of sites do it, and protesting that fact doesn't help if you're trying to read real-world sites.

Some discussions said setting self.CDATA_CONTENT_ELEMENTS = () would help, but it didn't.

HTMLParser's code is in Python, not C. So I took a look at where the errors are generated, thinking maybe I could override them. It was easy enough to redefine parse_endtag() to make it not throw an error (I had to duplicate some internal strings too). But then I hit another error, so I redefined unknown_decl() and _scan_name(). And then I hit another error. I'm sure you see where this was going. Pretty soon I had over 100 lines of duplicated code, and I was still getting errors and needed to redefine even more functions. This clearly wasn't the way to go.
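For the record, here's roughly the kind of override I mean -- a simplified sketch that just catches the exception and skips ahead. The class name and the skip-past-the-next-'>' fallback are just for illustration; in practice I ended up duplicating internal bits of the method, and it still wasn't enough:

from HTMLParser import HTMLParser, HTMLParseError

class MoreTolerantParser(MyFancyHTMLParser) :
  def parse_endtag(self, i) :
    # Try the normal end-tag parsing; if it raises "bad end tag",
    # skip past the next '>' and keep going instead of dying.
    try :
      return HTMLParser.parse_endtag(self, i)
    except HTMLParseError :
      gtpos = self.rawdata.find('>', i)
      return gtpos + 1 if gtpos >= 0 else len(self.rawdata)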

Using lxml.html

I'd been trying to avoid adding dependencies to additional Python packages, but if you want to parse real-world HTML, you have to. There are two main options: Beautiful Soup and lxml.html. Beautiful Soup is popular for large projects, but the consensus seems to be that lxml.html is more error-tolerant and lighter weight.

Indeed, lxml.html is much more forgiving. But it works differently: you can't handle start and end tags as they stream past, the way you can with HTMLParser. Instead you parse the whole document into an in-memory tree, like this:

  tree = lxml.html.fromstring(html)
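For instance, fetching a page and parsing it might look something like this (a minimal sketch, again using urllib2; the URL is just a placeholder):

import urllib2
import lxml.html

url = 'http://example.com/'        # whatever page you're fetching
html = urllib2.urlopen(url).read()
tree = lxml.html.fromstring(html)
print tree.tag                     # normally 'html' for a full document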

How do you iterate over the tree? lxml.html is a good parser, but it has rather poor documentation, so it took some struggling to figure out what was inside the tree and how to iterate over it.

You can visit every element in the tree with

for e in tree.iter() :
  print e.tag

But that's not terribly useful if you need to know which tags are inside which other tags. Instead, define a function that iterates over the top level elements and calls itself recursively on each child.

The top of the tree is itself an element -- typically the <html> element -- and each element has .tag and .attrib. If there's text inside it (as in a <p> tag), it also has .text. So to make something that works similarly to HTMLParser:

def crawl_tree(tree) :
  handle_starttag(tree.tag, tree.attrib)
  if tree.text :
    handle_data(tree.text)
  for node in tree :
    crawl_tree(node)
  handle_endtag(tree.tag)

But wait -- we're not quite there yet. There are two undocumented cases you need to handle.

First, comment tags are special: their tag attribute, instead of being a string, is <built-in function Comment> so you have to handle that specially and not assume that tag is text that you can print or test against.

Second, what about cases like <p>Here is some <i>italicised</i> text.</p>? In this case, you have the p tag, and its text is "Here is some ". Then the p has a child, the i tag, with text of "italicised". But what about the rest of the string, " text."?

That's called a tail -- and it's the tail of the adjacent i tag it follows, not the parent p tag that contains it. Confusing!
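To see it concretely, here's what lxml.html reports for that exact snippet (a minimal sketch):

import lxml.html

p = lxml.html.fromstring('<p>Here is some <i>italicised</i> text.</p>')
print repr(p.text)     # 'Here is some '
i = p[0]               # the <i> child element
print repr(i.text)     # 'italicised'
print repr(i.tail)     # ' text.' -- the tail lives on the i, not the p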

So our function becomes:

def crawl_tree(tree) :
  if type(tree.tag) is str :
    handle_starttag(tree.tag, tree.attrib)
    if tree.text :
      handle_data(tree.text)
    for node in tree :
      crawl_tree(node)
    handle_endtag(tree.tag)
  if tree.tail :
    handle_data(tree.tail)

See how it works? If it's a comment (tree.tag isn't a string), we'll skip everything -- except the tail. Even a comment might have a tail:
<p>Here is some <!-- this is a comment --> text we want to show.</p>
so even if we're skipping the comment itself, we still need its tail.

I'm sure I'll find other gotchas I've missed, so I'm not releasing this version of feedme until it's had a lot more testing. But it looks like lxml.html is a reliable way to parse real-world pages. It even has a lot of convenience functions like link rewriting that you can use without iterating the tree at all. Definitely worth a look!
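For example, every lxml.html element has link helpers like make_links_absolute() and rewrite_links(), so rewriting img srcs (or any other links) doesn't require walking the tree yourself. A rough sketch -- the cache path and the mapping function here are made up purely for illustration, and html is the page source as before:

import lxml.html

tree = lxml.html.fromstring(html)

# Turn every relative link into an absolute one, given the page's URL:
tree.make_links_absolute('http://example.com/story/')

# Or run every link (href, src, etc.) through your own function:
def to_local(url) :
  # hypothetical mapping from a remote URL to a locally saved copy
  return 'file:///home/user/feeds/cache/' + url.split('/')[-1]
tree.rewrite_links(to_local)

print lxml.html.tostring(tree)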
