Shallow Thoughts : tags : i18n

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 25 Nov 2009

Character Sets and Encodings in Linux, part 2

Continuing the discussion of those funny characters you sometimes see in email or on web pages, today's Linux Planet article discusses how to convert between encodings and handle encoding errors, using Python or the command-line tool recode:

Mastering Characters Sets in Linux (Weird Characters, part 2).
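
As a rough illustration of the kind of conversion and error handling involved (a sketch of standard Python 3 behavior, not code taken from the article), here is what happens when the same string is encoded one way and decoded another:

```python
# Python 3 sketch: how a mismatch between encodings produces "weird characters".
original = "El Ni\u00f1o"                     # "El Niño"

utf8_bytes = original.encode("utf-8")         # ñ becomes two bytes: 0xC3 0xB1
latin1_bytes = original.encode("iso-8859-1")  # ñ becomes one byte: 0xF1

# UTF-8 bytes misread as ISO-8859-1: each byte turns into its own character.
mojibake = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes misread as UTF-8: the lone 0xF1 is not valid UTF-8 here,
# so strict decoding raises UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte instead.
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(repr(mojibake), strict_failed, repr(replaced))
```

The two-characters-for-one pattern in the first case is the usual sign that UTF-8 text is being displayed as a single-byte encoding.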

[ 15:06 Nov 25, 2009    More writing | permalink to this entry | ]

Thu, 12 Nov 2009

Article: Character Sets and Encodings in Linux

or: Why do I See All Those Weird Characters?

Today's Linux Planet article concerns those funny characters you sometimes see in email or on web pages, like when somebody puts “random squiggles’ around a phrase when they probably meant “double quotes”:

Character Sets in Linux or: Why do I See Those Weird Characters?.

Today's article covers only what users need to know. A followup article will discuss character encoding from a programmer's point of view.

[ 16:34 Nov 12, 2009    More writing | permalink to this entry | ]

Wed, 21 Oct 2009

Un-unicode: translating web pages to plain ASCII

It's not that I'm a dumb provincial American, really!

I mean, okay, I am a dumb provincial American. But not completely. I know about Unicode, I know what UTF-8 and ISO-8859-1 and -15 are, I even know how to type Spanish characters like ñ and á in email (at least in Ubuntu; I can't seem to make it work in Gentoo).

The real problem is PalmOS -- I've never found any way to create Plucker files for my Palm that display anything beyond the standard ASCII character set. (I'm not clear whether to blame that on Palm or Plucker. Doesn't matter.)

So when I use a program like Sitescooper or my new FeedMe RSS reader to read daily news on my Palm, I'm forever seeing lines like this:

the weather phenomenon known as ÅoEl Ni€oÅq is

It's tiresome to try to read stuff like that.

Strangely, I've found no libraries that degrade text like this to plain ASCII, in any language. There are lots of ways to translate from one character encoding into another -- but no way to degrade from non-ASCII characters to the nearest ASCII equivalent. Googling finds lots of people asking for one -- I'm far from the only one who wants this. There are various partial hacks, but nothing ready-to-go.

Oh, well, welcome to the programming world. Time to roll my own. I started from some nice tricks I picked up in the web discussions I found, and ended up with something reasonably compact. Of course, the table of fallback characters will grow.

But my ace in the hole, this time, is that my little function has a way of logging errors. When it sees a character it doesn't recognize, it can log the character code to a file, making it easy to add a translation for that character. That was always the problem with similar hacks I'd attempted to add to mutt or plucker or sitescooper in the past: figuring out each new character and what its intended meaning was, so I could add it to the translation table.

Here it is: ununicode.

Call it like this:

import os
import ununicode

# text is the string to convert; unrecognized characters are logged to errfilename
ununicode.toascii(text, errfilename=os.path.join("/path/to/errfile"))

There's also a minimal test script provided (which will also grow with time as I accumulate good samples).
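
For readers who just want the general idea, here is a minimal sketch of the approach described above: a fallback table, plus logging of anything the table doesn't cover. It is not the ununicode source; the name to_ascii_sketch, the contents of FALLBACKS, and the use of unicodedata to strip accents are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical fallback table; a real one would grow as new characters show up.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",     # curly single quotes
    "\u201c": '"', "\u201d": '"',     # curly double quotes
    "\u2013": "-", "\u2014": "--",    # en and em dashes
    "\u2026": "...",                  # ellipsis
}

def to_ascii_sketch(text, errfilename=None):
    """Degrade text to plain ASCII, logging characters with no known fallback."""
    out = []
    unknown = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])
        else:
            # Decompose (e.g. ñ -> n + combining tilde) and keep the ASCII part.
            decomposed = unicodedata.normalize("NFKD", ch)
            stripped = "".join(c for c in decomposed if ord(c) < 128)
            if stripped:
                out.append(stripped)
            else:
                out.append("?")
                unknown.append(ch)
    if unknown and errfilename:
        # Append the code points so translations can be added to the table later.
        with open(errfilename, "a") as errfile:
            for ch in unknown:
                errfile.write("U+%04X\n" % ord(ch))
    return "".join(out)

print(to_ascii_sketch("\u201cEl Ni\u00f1o\u201d"))
```

The error log is the important part of the design: each unknown code point lands in a file, so adding a translation for it later is a one-line change to the table.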

[ 20:48 Oct 21, 2009    More programming | permalink to this entry | ]

Fri, 02 May 2008

Two font mysteries solved

This has been a good week for fonts: two longstanding mysteries solved.

The first concerns the bitstream vera sans mono I've been using as a terminal font in apps like rxvt and xterm. I'd been specifying it in ~/.Xdefaults like this:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-*-*-iso10646-1

The mystery is that I'd noticed that in xterm, the font looked slightly different -- slightly uglier -- than in rxvt (both apps use the same X class name of XTerm). It was hard to put my finger on what was different -- the shape of all the letters looked the same, but it just seemed a little more ragged, and a little less compact, in xterm. I figured it was just a minor difference in their drawing code, or something.

Well, I was fiddling with fonts (trying to get the new-to-me "Inconsolata" font working) and I noticed that iso10646 bit. I didn't know what 10646 was, but shouldn't it be 8859-1 or 8859-15, the codes for the Latin-1 alphabet? After finishing up my Inconsolata experiments, when I set the font back to Vera I changed the line to
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-*-*-iso8859-15
and moved on to other things.

Until the next morning, when I booted up to a surprise: my main terminal window no longer fit on the screen. It seems it had reverted to the other (uglier) version of Vera Sans Mono, which is also very slightly taller, so instead of being a couple of lines shorter than the screen height, it was a couple of lines too tall to fit.

I checked .Xdefaults -- yes, it was still Vera. What was going on? I finally remembered the one thing I had changed: the language setting on the font, from 10646-1 to 8859-15. I changed it back: sure enough, now the font was pretty again and the terminal was short enough to fit.

I fired up xfontsel and did some experimenting. It turned out the difference between the two almost-identical Vera sans mono bold roman fonts is a field xfontsel calls "spc". It can be either 'c' or 'm'. The 'c' version is the pretty, compact font; the 'm' is the uglier, taller one. For some reason, specifying 10646-1 makes "spc" default to 'c', while 8859-15 makes it default to 'm'. But specifying 'c' in the font specifier gets the good version regardless of which language is specified.

So this would work:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-*-*

But then I read up on 10646-1 and it turns out to mean "the whole unicode character set". That sounds like a good idea, so I kept it in my font specifier after all:
XTerm*font: -bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-iso10646-1

(For the moment I still didn't know what spc, c or m meant; read on if you're curious.)
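
Those long font specifiers are easier to reason about once the dash-separated fields have names. Here is a small sketch that splits one into its fourteen fields; parse_xlfd and the lowercase field names are my own labels for illustration, and the field that xfontsel shows as "spc" is called "spacing" here.

```python
# The fourteen fields of an X font specifier, in order.
XLFD_FIELDS = (
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "resolution_x", "resolution_y",
    "spacing", "average_width", "charset_registry", "charset_encoding",
)

def parse_xlfd(name):
    """Split a fully specified X font name into a dict of named fields."""
    if not name.startswith("-"):
        raise ValueError("expected a name starting with '-': %r" % name)
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(XLFD_FIELDS), len(parts)))
    return dict(zip(XLFD_FIELDS, parts))

font = parse_xlfd(
    "-bitstream-bitstream vera sans mono-bold-r-normal-*-12-*-*-*-c-*-iso10646-1"
)
print(font["spacing"], font["charset_registry"], font["charset_encoding"])
```

Seen this way, the two fixes in this entry touch different fields: the spacing field near the end, and the registry/encoding pair in the last two positions.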

The second insight concerned a longstanding mystery of Dave's. He has been complaining for quite a while about the way Ubuntu's modern pango-based apps all refuse to see bitmapped fonts. (It bothered me too, but less so, because the terminal and editor apps I use can see X fonts.)

Dave has an Ubuntu install on one machine that he's been upgrading release after release, which does see his bitmapped fonts. But any fresh Ubuntu installation fails to see the fonts. What was the difference?

We knew about the trick of going into /etc/fonts/conf.d, removing the symbolic link 70-no-bitmaps.conf and replacing it with a link to /etc/fonts/conf.avail/70-yes-bitmaps.conf ... But doing that doesn't actually change anything, and bitmap fonts still don't show up.

The secret turned out to be that you need to run fc-cache -fv after changing the fonts/conf.d links. This apparently never happens on its own -- not on a reboot, not on installing or uninstalling font packages. Somehow it had happened once on Dave's good install, and that's why it worked there but nowhere else.
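
Put together, the whole procedure might look something like this sketch. It assumes the link being removed is named 70-no-bitmaps.conf, the function name enable_bitmap_fonts is made up for illustration, and on a real system it would need to run as root.

```python
import os
import subprocess

def enable_bitmap_fonts(conf_d="/etc/fonts/conf.d",
                        conf_avail="/etc/fonts/conf.avail",
                        rebuild_cache=True):
    """Swap the bitmap-font links in conf_d, then rebuild the font cache."""
    # Assumed name of the link that disables bitmapped fonts.
    no_link = os.path.join(conf_d, "70-no-bitmaps.conf")
    yes_link = os.path.join(conf_d, "70-yes-bitmaps.conf")
    yes_target = os.path.join(conf_avail, "70-yes-bitmaps.conf")

    # Remove the "no bitmaps" link if it is there.
    if os.path.lexists(no_link):
        os.remove(no_link)
    # Add the "yes bitmaps" link unless it already exists.
    if not os.path.lexists(yes_link):
        os.symlink(yes_target, yes_link)
    # The step that is easy to miss: nothing changes until the cache is rebuilt.
    if rebuild_cache:
        subprocess.run(["fc-cache", "-fv"], check=True)

# Typical use, as root: enable_bitmap_fonts()
```

Passing rebuild_cache=False and pointing the two directory arguments at scratch directories makes it possible to try the link handling without touching /etc.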

I'm not sure how anyone is supposed to find out about fc-cache -- there's no man fontconfig, and the /etc/fonts/conf.avail/README offers no clue, just misleadingly says "Fontconfig scans this directory". man fc-cache mentions /usr/share/doc/fontconfig/fontconfig-user.html, which doesn't exist; it turns out on Ubuntu it's actually /usr/share/doc/fontconfig-config/fontconfig-user.html. But wait, that's just an html-ized manual page for fonts-conf, so actually you could just run man fonts-conf ... your guess is as good as mine why the fc-cache man page sends you on a hunt for html files instead.

man fonts-conf is good reading -- it even solves the mystery of that spc parameter. It stands for spacing and can be proportional, dual-width, monospace or charcell. Aha! And there's lots more useful-looking information in that manual page as well.

[ 15:58 May 02, 2008    More linux | permalink to this entry | ]

Sat, 01 Dec 2007

More Tips on International Input

With what I learned last week, I've been able to type accented characters into GTK apps such as xchat, and a few other apps such as emacs. That's nice -- but I was still having trouble reading accented characters in mutt, or writing them in vim to send through mutt (darn terminal apps).

The biggest problem was the terminal. I was using urxvt, but it turns out that urxvt won't let me type any non-ASCII characters. It just ignores my multi-key sequences, or prints a space instead of the character I wanted. I have no idea why, but switching to plain ol' xterm solved that problem. Of course, I had to make sure that I was using a font that supported the characters I wanted (ISO 8859-1 or 8859-15 or something similar), which leaves out my favorite terminal font (Schumacher Clean bold), but Bitstream Vera Sans Mono bold is almost as readable.

Of course, it's important to have your locale variables set appropriately. There are several locale variables:

LC_CTYPE
Which encodings to use for typing and displaying characters.
LC_MESSAGES
Which translations to use, in programs that offer them.
LC_COLLATE
How to sort alphabetically (this one also affects whether ls groups capitalized filenames first).
LC_ALL
Overrides any of the others.
LANG
The default, in case none of the other variables is set.
There are a few others which control very specific features like time, numbers, money, addresses and paper size: type locale to see all of them.

Once I switched to xterm, I was able to set either LANG or LC_CTYPE to either en_US.UTF-8 or en_US.ISO-8859-1. I set LC_COLLATE, and either LANG or LC_MESSAGES, to C, so that I get the default (usually US) translations for programs and so that ls groups all the capitalized files first.
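
The way these variables override each other can be sketched in a few lines. This is only an illustration of the lookup order described above (LC_ALL first, then the specific variable, then LANG); the function name effective_locale is made up, and the sketch ignores extras such as GNU's LANGUAGE variable.

```python
import os

def effective_locale(category, env=None):
    """Return the locale that applies to a category such as "LC_COLLATE"."""
    env = os.environ if env is None else env
    # LC_ALL overrides everything; LANG is only the fallback.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"

settings = {"LC_CTYPE": "en_US.UTF-8", "LC_COLLATE": "C", "LANG": "C"}
print(effective_locale("LC_CTYPE", settings))     # typing and display
print(effective_locale("LC_COLLATE", settings))   # sort order
print(effective_locale("LC_MESSAGES", settings))  # falls back to LANG

# Sorting by character code, as the C locale does, puts capitals first.
print(sorted(["b", "A", "a", "B"]))
```

With settings like these, characters are handled as UTF-8 while sorting and program messages stay in the plain C locale.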

Along the way, I learned about yet another way to type accented characters.

setxkbmap -model pc104 -layout us -variant intl
switches to an international layout, at which point typing certain punctuation (like ' or ~) is assumed to be a prefix key. So instead of typing [Multi] ~ n, I can just type ~ n. The catch: it makes it harder to type quotes or tildes by themselves (you have to type a space after the quote or tilde).

Even faster, the international layout also offers shortcuts to many common characters with the "AltGr" key, which I'd heard about for years but never knew how to enable. AltGr is the right alt key, and typing, say, AltGr followed by n gives an ñ. You can see a full map at Wikipedia (AltGr characters are blue, quote prefixes are red).

To get back to a US non-international layout:

setxkbmap -model pc104 -layout us

Of course, these aren't the only keyboard layouts to choose from -- there are lots, plus you can define your own. And I was going to write a little bit about that, except it turns out they've changed it all around again since I last did that two years ago (don't you love the digital world?). So that will have to wait for another time.

But the place to start exploring is /usr/share/X11/xkb. The file symbols/us contains the definitions for those US keyboards, and I believe it's included via the files in the rules directory, probably rules/base, base.xml and base.lst. From there you're on your own. But the standard layouts probably follow the ones in the Wikipedia article on keyboard layouts.

[ 16:48 Dec 01, 2007    More linux | permalink to this entry | ]

Thu, 22 Nov 2007

Typing accented characters (for ignorant 'murricans)

Happy Thanksgiving, everyone! Today's holiday tip involves how to type international characters.

For the online Spanish class I've been taking, so far I've been able to manage without having to type characters like ñ or á. Usually, if I need one I can find it in one of the class examples, copy it, and paste it wherever I need it. But obviously that would be tedious if I needed to type much.

I hacked up a quickie workaround: a Python script that shows a set of buttons, one for each accented character I'm likely to need. Clicking a button copies that character to the clipboard, so I can now paste via mouse middle-click or ctrl-V. (I'm sure that sounds pathetic to those of you who type accented characters every day, but it's not something most US English speakers need to do. And besides, now I know how to access the X clipboard from Python-GTK -- hooray for learning new things from procrastination projects!)

Anyway, Mikael Magnusson took pity on me and explained in simple language how to use the X "Multi key" to type these characters the right way (well, a right way, anyway). Since all the online instructions I've seen have been rather complicated, here are the simple instructions for any of my fellow US monolingists who'd like to expand their horizons:

First, choose a key for the "Multi key" that you're not using for anything else. A lot of people use one of the Alt or Windows keys, but I use both of those already. What I don't use is the Menu key (that little key down by the right Ctrl key, at least on my keyboard) since not many Linux apps support it anyway.

Find the keycode for that key, by firing up xev and typing the key. For my Menu key, the keycode is 117.

Now type:

xmodmap -e "keycode 117 = Multi_key"

Now you're ready to type a sequence like: [Menu] ~ n to type an n-tilde, [Menu] ' a for an accented a, or [Menu] ? ? for the upside-down question mark, in any app that supports those characters.

Of course, you don't want to type that xmodmap command every time you log in, so to make it permanent, put this in your .Xmodmap (you're on your own for figuring out whether your X environment reads .Xmodmap automatically or whether you need to tell it to run xmodmap .Xmodmap when X starts up):

keycode 117 = Multi_key
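
If you're curious which characters those sequences produce, here is a small Python sketch that reproduces the three examples above. It is not how X does it internally; the tables and the compose function are made up for illustration, and only the sequences mentioned in this entry are covered.

```python
import unicodedata

# Prefix key -> Unicode combining mark (hypothetical table, two entries only).
COMBINING = {
    "~": "\u0303",   # combining tilde
    "'": "\u0301",   # combining acute accent
}

# Sequences that aren't "letter plus accent".
SPECIAL = {
    ("?", "?"): "\u00bf",   # upside-down question mark
}

def compose(first, second):
    """Return the character for a two-key sequence typed after the Multi key."""
    if (first, second) in SPECIAL:
        return SPECIAL[(first, second)]
    mark = COMBINING[first]   # raises KeyError for prefixes not in the table
    # NFC normalization merges the letter and the mark into one character.
    return unicodedata.normalize("NFC", second + mark)

print(compose("~", "n"), compose("'", "a"), compose("?", "?"))
```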

I have one final useful international input tidbit to offer: how to type Unicode characters by number. Hold ctrl+shift+U, then release U but keep holding the other two while you type the character's hexadecimal code. (This may only work in gtk apps.) For instance, try this: hold down ctrl and shift, then type: u 2 6 6 c. Cool, huh? You can use the "gucharmap" program to find other neat sequences (hint: View->By Unicode Block otherwise you'll never find anything).
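
The digits in that example are a hexadecimal code point, so you can check what a sequence will produce before typing it. A quick Python sketch (the helper name from_hex is made up):

```python
import unicodedata

def from_hex(code):
    """Turn a hexadecimal code point such as "266c" into its character."""
    return chr(int(code, 16))

note = from_hex("266c")
print(note, unicodedata.name(note))
```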

Now it's time to check the turkey. Have a good day, everyone!

[ 17:03 Nov 22, 2007    More linux | permalink to this entry | ]