Convert patterns in only some lines to title case (Shallow Thoughts)

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Sun, 18 Dec 2011

Convert patterns in only some lines to title case

A friend had a fun problem: she had some XML files she needed to import into GNUcash, but the program that produced them left names in all-caps and she wanted them more readable. So she'd have a file like this:

<STMTTRN>
   <TRNTYPE>DEBIT
   <DTPOSTED>20111125000000[-5:EST]
   <TRNAMT>-22.71
   <FITID>****

   <NAME>SOME    COMPANY
   <MEMO>SOME COMPANY    ANY TOWN   CA 11-25-11 330346
</STMTTRN>
and wanted to change the NAME and MEMO lines to read Some Company and Any Town. However, the tags, like <NAME>, all had to remain upper case, and presumably so did strings like DEBIT. How do you change just the NAME and MEMO lines from upper case to title case?

The obvious candidate to do string substitutes is sed. But there are several components to the problem.

Addresses

First, how do you ensure the replacement only happens on lines with NAME and MEMO?

sed lets you specify address ranges for just that purpose. If you say sed 's/xxx/yyy/' sed will change all xxx's to yyy; but if you say sed '/NAME/s/xxx/yyy/' then sed will only do that substitution on lines containing NAME.

But we need this to happen on lines that contain either NAME or MEMO. How do you do that? With \|, like this: sed '/\(NAME\|MEMO\)/s/xxx/yyy/'

Converting to title case

Next, how do you convert upper case to lower case? There's a sed command for that: \L. Run sed 's/.*/\L&/' and type some upper and lower case characters, and they'll all be converted to lower-case.

But here we want title case -- we want most of each word converted to lowercase, but the first letter should stay uppercase. That means we need to detect a word and figure out which is the first letter.

In the strings we're considering, a word is a set of letters A through Z with one of the following characteristics:

  1. It's preceded by a space
  2. It's preceded by a close-angle-bracket, >

So the pattern /[ >][A-Z]*/ will match anything we consider a word that might need conversion.

But we need to separate the first letter and the rest of the word, so we can treat them separately. sed's \( \) operators will let us do that. The pattern \([ >][A-Z]\) finds the first letter of a word (including the space or > preceding it), and saves that as its first matched pattern, \1. Then \([A-Z]*\) right after it will save the rest of the word as \2.

So, taking our \L case converter, we can convert to title case like this: sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g

Starting to look long and scary, right? But it's not so bad if you build it up gradually from components. I added a g on the end to tell sed this is a global replace: do the operation on every word it finds in the line, otherwise it will only make the substitution once, on the first word it sees, then quit.

Putting it together

So we know how to seek out specific lines, and how to convert to title case. Put the two together, and you get the final command:

sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'

I ran it on the test input, and it worked just fine.

For more information on sed, a good place to start is the sed regular expressions manual.

Tags: , ,
[ 14:13 Dec 18, 2011    More linux/cmdline | permalink to this entry | ]

Comments via Disqus:

blog comments powered by Disqus