Un-unicode: translating web pages to plain ASCII (Shallow Thoughts)

Akkana's Musings on Open Source, Science, and Nature.

Wed, 21 Oct 2009

Un-unicode: translating web pages to plain ASCII

It's not that I'm a dumb provincial American, really!

I mean, okay, I am a dumb provincial American. But not completely. I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are, I even know how to type Spanish characters like ñ and á in email (at least in Ubuntu; I can't seem to make it work in Gentoo).

The real problem is PalmOS -- I've never found any way to create Plucker files for my Palm that display anything beyond the standard ASCII character set. (I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)

So when I use a program like Sitescooper or my new FeedMe RSS reader to read daily news on my Palm, I'm forever seeing lines like this:

the weather phenomenon known as ÅoEl Ni€oÅq is

It's tiresome to try to read stuff like that.

Strangely, I've found no libraries to do this, in any language. There are lots of ways to translate from one character encoding into another -- but no way to degrade from nonASCII characters to the nearest ASCII equivalent. Googling finds lots of people asking for them -- I'm far from the only one who wants this. There are various partial hacks, but nothing ready-to-go.

Oh, well, welcome to the programming world. Time to roll my own. I started from some nice tricks I picked up in the web discussions I found, and ended up with something reasonably compact. Of course, the table of fallback characters will grow.

But my ace in the hole, this time, is that my little function has a way of logging errors. When it sees a character it doesn't recognize, it can log the character code to a file, making it easy to add a translation for that character. That was always the problem with similar hacks I'd attempted to add to mutt or plucker or sitescooper in the past: figuring out each new character and what its intended meaning was, so I could add it to the translation table.

Here it is: ununicode.

Call it like this:

import ununicode

ununicode.toascii(str, errfilename=os.path.join("/path/to/errfile"))

There's also a minimal test script provided (which will also grow with time as I accumulate good samples).

Tags: , , , ,
[ 19:48 Oct 21, 2009    More programming | permalink to this entry ]