Counting syllables in Python (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 11 Dec 2013

Counting syllables in Python

When I wrote recently about my Dactylic dinosaur doggerel, I glossed over a minor problem with my final poem: the rules of double-dactylic doggerel say that the sixth line (or sometimes the seventh) should be a single double-dactyl word -- something like "paleontologist" or "hexasyllabic'ly". I used "dinosaur orchestra" -- two words, which is cheating.

I don't feel too guilty about that. If you read the post, you may recall that the verse was the result of drifting grumpily through an insomniac morning where I would have preferred to be getting back to sleep. Coming up with anything that scans at all is probably good enough.

Still, it bugged me, not being able to think of a double-dactylic word that related somehow to Parasaurolophus. So I vowed that, later that day when I was up and at the computer, I would attempt to find one and rewrite the poem accordingly.

I thought that would be fairly straightforward. Not so much. I thought there would be some utility I could run that would count syllables for me, then I could run /usr/share/dict/words through it, print out all the 6-syllable words, and find one that fit. Turns out there is no such utility.

But Python has a library for everything, doesn't it?

Some searching turned up PyHyphen, which includes some syllable-counting functions. It apparently uses the hyphenation dictionaries that come with LibreOffice.

There's a Debian package for it, python-pyhyphen -- but it doesn't work. First, it depends on another package, hyphen-en-us, but doesn't have that dependency encoded in the package, even as a suggested or recommended package. But even when you install the hyphenated dictionary, it still doesn't work because it doesn't point to the dictionary in the place it was installed. Looks like that problem was reported almost two years ago, bug 627944: python-pyhyphen: doesn't work out-of-the-box with hyphen-* packages. There's a fix there that involves editing two files, /usr/lib/python2.7/dist-packages/hyphen/config.py and /usr/lib/python2.7/dist-packages/hyphen/__init__.py.

Or you can just give up on Debian and pip install pyhyphen, which is a lot easier.

But once you get it working, you find that it's terrible. It was wrong about almost every word I tried. I hope not too many people are relying on this hyphen-en-us dictionary for important documents. Its results seemed nearly random, and I quickly gave up on it for getting a useful list of words around six syllables.

Just for fun, since my count syllables web search turned up quite a few websites claiming that functionality, I tried entering some of my long test words manually. All of the websites I tried were wrong more than half the time, and often they were off by more than two syllables. I don't mind off-by-ones -- I can look at words claiming 5 and 7 syllables while searching for double dactyls -- but if I have to include 4-syllable words as well, I'll never find what I'm looking for.

That discouraged me from using another Python suggestion I'd seen, the nltk (natural language toolkit) package. I've been looking for an excuse to play with nltk, and some day I will, but for this project I was looking for a quick approximate solution, and the nltk examples I found mostly looked like using it would require a bigger time commitment than I was willing to devote to silly poetry. And if none of the dedicated syllable-counting websites or dictionaries got it right, would a big time investment in nltk pay off?

Anyway, by this time I'd wasted more than an hour poking around various libraries and websites for this silly unimportant problem, and I decided that with that kind of time investment, I could probably do better on my own than the official solutions were giving me. Why not basically just count vowels?

So I whipped up a little script, countsyl, that did just that. I gave it a list of vowels, with a few simple rules. Obviously, you can't just say every vowel is a new syllable -- there are too many double vowels and silent letters and such. But you can't say that any run of multiple vowels together counts as one syllable, because sometimes the vowels do count; and you can't make absolute rules like "'e' at the end of a word is always silent", because sometimes it isn't. So I kept both minimum and maximum syllable counts for each word, and printed both.

And much to my surprise, without much tuning at all my silly little script immediately much better results than the hyphenation dictionary or the dedicated websites.

Alas, although it did give me quite a few hexasyllabic words in /usr/share/dict/words, none of them were useful at all for a program on Parasaurolophus. What I really needed was a musical term (since that's what the poem is about). What about a musical dictionary?

I found a list of musical terms on Wikipedia: Glossary of musical terminology, saved it as a local file, ran a few vim substitutes and turned it into a plain list of words. That did a little better, and gave me some possible ideas: (non?)contrapuntally? (something)harmonically? extemporaneously?

But none of them worked out, and by then I'd run out of steam. I gave up and blogged the poem as originally written, with the cheating two-word phrase "dinosaur orchestra", and vowed to write up how to count words in Python -- which I have now done. Quite noncontrapuntally, and definitely not extemporaneously. But at least I have a useful little script next time I want to get an approximate syllable count.

Tags: , , , , ,
[ 17:51 Dec 11, 2013    More programming | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus