Shallow Thoughts : tags : sed

Akkana's Musings on Open Source Computing and Technology, Science, and Nature.

Wed, 24 Jul 2013

Yet more on that comma-inserting regexp, plus a pattern to filter unprintable characters

One more brief followup on that comma inserting sed pattern and its followup:

$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
20,130,607,215,015

In the second article, I'd mentioned that the hardest part of the exercise was figuring out where we needed backslashes. Devdas (f3ew) asked on Twitter whether I would still need all the backslash escapes even if I put the pattern in a file -- in other words, are the backslashes merely to get the shell to pass special characters unchanged?

A good question, and I suspected the need for some of the backslashes would disappear. So I tried this:

$ echo ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   
$ echo 20130607215015 | sed -f /tmp/commas

And it didn't work. No commas were inserted.

The problem, it turns out, is that my shell, zsh, changed both instances of \b to an ASCII backspace, ^H. Editing the file fixes that, and so does

$ echo -E ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta' >/tmp/commas   

But that only applies to echo: zsh doesn't do the \b -> ^H substitution in the original command, where you pass the string directly as a sed argument.
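Another way to sidestep the echo problem is a quoted here-document, which the shell passes through without interpreting any backslashes. A minimal sketch, assuming GNU sed (for \b and \+) and a temporary file from mktemp in place of /tmp/commas; the script is split onto separate lines here, which sed -f reads as the same three commands:

```shell
# Write the sed script with a quoted here-document: the shell leaves
# every backslash alone, so \b never turns into a backspace.
script=$(mktemp)
cat > "$script" <<'EOF'
:a
s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/
ta
EOF

result=$(echo 20130607215015 | sed -f "$script")
echo "$result"
rm -f "$script"
```

The quotes around EOF are what matter: they tell the shell to treat the body as literal text.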

Okay, with that straightened out, what about Devdas' question?

Surprisingly, it turns out that all the backslashes are still needed. None of them go away when you echo > file, so they weren't there just to get special characters past the shell; and if you edit the file and try removing some of the backslashes, you'll see that the pattern no longer works. I had thought at least some of them, like the ones before the \{ \}, were extraneous, but even those are still needed.
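The reason is that sed defaults to basic regular expressions, where +, ( ) and { } are ordinary characters unless they're escaped. As a point of comparison (this wasn't part of the original exchange), GNU sed's -E option switches to extended regular expressions, and then most of the backslashes can go; \b keeps its backslash either way. A sketch, assuming a sed that supports -E:

```shell
# Same comma-inserting loop, written for extended regular expressions (-E).
# Only the \b word-boundary escapes and the \1 \2 backreferences keep
# their backslashes.
result=$(echo 20130607215015 | sed -E ':a;s/\b([0-9]+)([0-9]{3})\b/\1,\2/;ta')
echo "$result"
```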

Filtering unprintable characters

As long as I'm writing about regular expressions, I learned a nice little tidbit last week. I'm getting an increasing flood of Asian-language spams which my mail ISP doesn't filter out (they use spamassassin, which is pretty useless for this sort of filtering). I wanted a simple pattern I could pass to egrep (via procmail) that would filter out anything with a run of 4 or more unprintable characters in a row. [^[:print:]]{4,} should do it, but it wasn't working.

The problem, it turns out, is the definition of what's printable. Apparently when the default system character set is UTF-8, just about everything is considered printable! So the trick is that you need to set LC_ALL to something more restrictive, like C (which basically means ASCII), before [:print:] becomes useful for language-based filtering. (Thanks to Mikachu for spotting the problem.)

So in a terminal, you can do something like

LC_ALL=C egrep -v '[^[:print:]]' filename
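To see the locale effect in isolation, here's a small sketch. The sample file and its contents are made up for illustration: one plain ASCII line and one line of four CJK characters written out as raw UTF-8 bytes. It uses grep -E, which is the same thing as egrep, and -c to count matching lines rather than print them:

```shell
# Hypothetical sample: one ASCII line, one line of UTF-8 encoded CJK text.
sample=$(mktemp)
printf 'plain ascii line\n' > "$sample"
printf '\344\270\255\346\226\207\344\277\241\346\201\257\n' >> "$sample"

# In the C locale every byte above 127 counts as unprintable, so the
# second line contains a run of well over four such bytes.
flagged=$(LC_ALL=C grep -E -c '[^[:print:]]{4,}' "$sample")
echo "lines flagged: $flagged"
rm -f "$sample"
```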

In procmail it was a little harder; I couldn't figure out any way to change LC_ALL from a procmail recipe; the only solution I came up with was to add this to ~/.procmailrc:

export LC_ALL=C

It does work, though, and has cut the spam load by quite a bit.

[ 19:35 Jul 24, 2013    More linux/cmdline | permalink to this entry | ]

Tue, 09 Jul 2013

Sed: insert commas into numbers, but in a smarter way

A few days ago I wrote about a nifty sed script to insert commas into numbers that I dissected with the help of Dana Jansens.

Once we'd figured it out, though, Dana thought this wasn't really the best solution. For instance, what if you have a file that has some numbers in it, but also has some digits mixed up with letters? Do you really want to insert commas into every string of digits? What if you have some license plates, like abc1234? Maybe it would be better to restrict the change to digits that stand by themselves and are obviously meant to be numbers. How much harder would that be?

More regexp fun! We kicked it around a bit, and came up with a solution:

$ echo abc20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'
abc20,130,607,215,015
$ echo abc20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
abc20130607215015
$ echo 20130607215015 | sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'   
20,130,607,215,015

Breaking that down: \b is any word boundary -- you could also use \< to indicate that it's the start of a word, much like \> was the end of a word.

\([0-9]\+\) is any string of one or more digits, taken as a group. The \( \) part marks it as a group so we'll be able to use it later.

\([0-9]\{3\}\) is a string of exactly three digits: again, we're using \( \) to mark it as our second numbered group.

\b is another word boundary (we could use \>), to indicate that the group of three digits must come at the end of a word, with only whitespace, punctuation or the end of the line following it.

/\1,\2/: once we've matched the pattern -- a word break, one or more digits, three digits and another word break -- we'll replace it with this. \1 stands for the first group we found -- that was the string of one or more digits. \2 stands for the second group, the final trio of digits. And there's a comma in between. We use the same :a; ;ta trick as in the first example to loop around until there are no more triplets to match.

The hardest part of this was figuring out what needed to be escaped with backslashes. The one that really surprised me was the \+. Although * works in sed the same way it does in other programs, matching zero or more repetitions of the preceding pattern, sed uses \+ rather than + for one or more repetitions. It took us some fiddling to find all the places we needed backslashes.
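Here's the smarter pattern run on a line that mixes a license plate with a real number. This is only a sketch: the sample text and the commafy function name are invented for the example, and it assumes GNU sed.

```shell
# Wrap the smarter pattern in a small helper (the name "commafy" is just
# for this example). Needs GNU sed for \b and \+.
commafy() {
    sed ':a;s/\b\([0-9]\+\)\([0-9]\{3\}\)\b/\1,\2/;ta'
}

mixed=$(echo 'plate abc1234 cost 1234567' | commafy)
echo "$mixed"
```

The digits in abc1234 are left alone because there's no word boundary between the c and the 1, while the standalone number picks up its commas.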

[ 21:16 Jul 09, 2013    More linux/cmdline | permalink to this entry | ]

Sun, 07 Jul 2013

Inserting commas into numbers with sed

Carla Schroder's recent article, More Great Linux Awk, Sed, and Bash Tips and Tricks, had a nifty sed command I hadn't seen before to take a long number and insert commas appropriately:

sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt

Or, if you don't have a numbers.txt file, you can do something like

echo 20130607215015 | sed ':a;s/\B[0-9]\{3\}\>/,&/;ta'

(I dropped the -i since that's for doing in-place edits of a file).

Nice! But why does it work? It would be easy enough to insert commas after every third number, but that doesn't work unless the number of digits is a multiple of three. In other words, you don't want 20130607215015 to become 201,306,072,150,15 (note how the last group only has two digits); it has to count in threes from the right if you want to end up with 20,130,607,215,015.

Carla's article didn't explain it, and neither did any of the other sites I found that mentioned this trick.

So, with some help from regexp wizard Dana Jansens (of OpenBox fame), I've broken it down into more easily understood bits.

Labels and loops

The first thing to understand is that this is actually several sed commands. I was familiar with sed's basic substitute command, s/from/to/. But what's the rest of it? The semicolons separate the commands, so the whole sed script is:

:a
s/\B[0-9]\{3\}\>/,&/
ta

What this does is set up a label called a. It tries to do the substitute command, and if the substitute succeeds (if something was changed), then ta tells it to loop back around to label a, the beginning of the script.
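If the semicolons make it hard to see the three commands, the same script can be passed as separate -e expressions, one per command. A sketch of that equivalent form (the variable name is only for the example):

```shell
# One -e per sed command: the label, the substitute, and the conditional jump.
looped=$(echo 20130607215015 | sed -e ':a' -e 's/\B[0-9]\{3\}\>/,&/' -e 'ta')
echo "$looped"
```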

So let's look at that substitute command.

The substitute

Sed's s/from/to/ (like the equivalent command in vim and many other programs) looks for the first instance of the from pattern and replaces it with the to pattern. So we're searching for \B[0-9]\{3\}\> and replacing it with ,&

Clear as mud, right? Well, the to pattern is easy: & stands for whatever text the from pattern just matched, so this just sticks a comma in front of ... something.

The from pattern, \B[0-9]\{3\}\>, is a bit more challenging. Let's break down the various groups:

\B
Matches any position that is not a word boundary.
[0-9]
Matches any digit.
\{3\}
Matches three repetitions of whatever precedes it (in this case, a digit).
\>
Matches a word boundary at the end of a word. This was the hardest part to figure out, because no sed documentation anywhere bothers to mention this pattern. But Dana knew it as a vim pattern, and it turns out it does the same thing in sed even though the docs don't say so.

Okay, put them together, and the whole pattern matches any three digits that are not preceded by a word boundary but which are at the end of a word (i.e. they're followed by a word boundary).
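A quick way to check that reading of the pattern is to feed it numbers of different lengths. In this sketch (the sample values are arbitrary), a bare three-digit number is left alone, because its first digit is preceded by a word boundary and so \B can't match there:

```shell
# 123: the only three-digit run ending the word starts at a word boundary, so no match.
three=$(echo 123 | sed 's/\B[0-9]\{3\}\>/,&/')
# 1234: "234" ends the word and is preceded by a digit, so it gets a comma.
four=$(echo 1234 | sed 's/\B[0-9]\{3\}\>/,&/')
echo "$three $four"
```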

Cool! So in our test number, 20130607215015, this matches the last three digits, 015. It doesn't match any of the other digits because they're not followed by a word end boundary.

So the substitute will insert a comma before the last three digits. Let's test that:

$ echo 20130607215015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607215,015

Sure enough!

How the loop works

So the substitution pattern just adds the last comma. Once the comma is inserted, the ta tells sed to go back to the beginning (label :a) and do it again.

The second time, the comma that was just inserted is now a word boundary, so the pattern matches the three digits before the comma, 215, and inserts another comma before them. Let's make sure:

$ echo 20130607215,015 | sed 's/\B[0-9]\{3\}\>/,&/'
20130607,215,015

So that's how the pattern manages to match triplets from right to left.
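To watch every pass, you can drive the loop from the shell instead of from sed's t command. This is just a tracing sketch (the variable names are made up for it), re-running the single substitute until nothing changes:

```shell
n=20130607215015
while :; do
    next=$(echo "$n" | sed 's/\B[0-9]\{3\}\>/,&/')
    # Stop when the substitute no longer changes anything,
    # which is the point where sed's "ta" would stop looping.
    [ "$next" = "$n" ] && break
    n=$next
    echo "$n"
done
```

Each line it prints has one more comma than the last, working from right to left.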

Dana later commented that this wasn't really the best solution -- what if the string of digits is attached to other characters and isn't really a number? I'll cover that in a separate article in a few days. Update: Here's the smarter pattern, Sed: insert commas into numbers, but in a smarter way.

[ 14:14 Jul 07, 2013    More linux/cmdline | permalink to this entry | ]

Sun, 18 Dec 2011

Convert patterns in only some lines to title case

A friend had a fun problem: she had some XML files she needed to import into GnuCash, but the program that produced them left names in all-caps and she wanted them more readable. So she'd have a file like this:

<STMTTRN>
   <TRNTYPE>DEBIT
   <DTPOSTED>20111125000000[-5:EST]
   <TRNAMT>-22.71
   <FITID>****

   <NAME>SOME    COMPANY
   <MEMO>SOME COMPANY    ANY TOWN   CA 11-25-11 330346
</STMTTRN>
and wanted to change the NAME and MEMO lines to read Some Company and Any Town. However, the tags, like <NAME>, all had to remain upper case, and presumably so did strings like DEBIT. How do you change just the NAME and MEMO lines from upper case to title case?

The obvious candidate to do string substitutes is sed. But there are several components to the problem.

Addresses

First, how do you ensure the replacement only happens on lines with NAME and MEMO?

sed lets you specify address ranges for just that purpose. If you say sed 's/xxx/yyy/' sed will change the first xxx on every line to yyy; but if you say sed '/NAME/s/xxx/yyy/' then sed will only do that substitution on lines containing NAME.

But we need this to happen on lines that contain either NAME or MEMO. How do you do that? With \|, like this: sed '/\(NAME\|MEMO\)/s/xxx/yyy/'
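A small sketch of the address at work, on three made-up lines (assuming GNU sed, since \| is a GNU extension in this kind of pattern):

```shell
# Three made-up lines; only the ones containing NAME or MEMO should change.
addressed=$(printf 'NAME xxx\nOTHER xxx\nMEMO xxx\n' | sed '/\(NAME\|MEMO\)/s/xxx/yyy/')
echo "$addressed"
```

The OTHER line passes through untouched because the address doesn't match it.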

Converting to title case

Next, how do you convert upper case to lower case? There's a sed command for that: \L. Run sed 's/.*/\L&/' and type some upper and lower case characters, and they'll all be converted to lower-case.
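The same thing as a one-shot pipeline rather than interactive typing; a sketch with an arbitrary sample string, assuming GNU sed since \L is a GNU feature:

```shell
# \L in the replacement lowercases everything that follows it (GNU sed).
lowered=$(echo 'Hello WORLD' | sed 's/.*/\L&/')
echo "$lowered"
```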

But here we want title case -- we want most of each word converted to lowercase, but the first letter should stay uppercase. That means we need to detect a word and figure out which is the first letter.

In the strings we're considering, a word is a set of letters A through Z with one of the following characteristics:

  1. It's preceded by a space
  2. It's preceded by a close-angle-bracket, >

So the pattern /[ >][A-Z]*/ will match anything we consider a word that might need conversion.

But we need to separate the first letter and the rest of the word, so we can treat them separately. sed's \( \) operators will let us do that. The pattern \([ >][A-Z]\) finds the first letter of a word (including the space or > preceding it), and saves that as its first matched pattern, \1. Then \([A-Z]*\) right after it will save the rest of the word as \2.

So, taking our \L case converter, we can convert to title case like this: sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'

Starting to look long and scary, right? But it's not so bad if you build it up gradually from components. I added a g on the end to tell sed this is a global replace: do the operation on every word it finds in the line, otherwise it will only make the substitution once, on the first word it sees, then quit.
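To see what the g changes, here's a sketch that runs the substitute both ways on the same made-up input (GNU sed assumed for \L; the variable names are only for the example):

```shell
# Without g: only the first word on the line is converted.
once=$(echo ' ANY TOWN' | sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/')
# With g: every word is converted.
every=$(echo ' ANY TOWN' | sed 's/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g')
echo "[$once] [$every]"
```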

Putting it together

So we know how to seek out specific lines, and how to convert to title case. Put the two together, and you get the final command:

sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g'

I ran it on the test input, and it worked just fine.
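A sketch of such a run, using a trimmed copy of the sample above (the FITID line and the blank line are left out) written to a temporary file. One side effect worth knowing about: the pattern treats the state abbreviation as a word too, so CA comes out as Ca.

```shell
input=$(mktemp)
cat > "$input" <<'EOF'
<STMTTRN>
   <TRNTYPE>DEBIT
   <NAME>SOME    COMPANY
   <MEMO>SOME COMPANY    ANY TOWN   CA 11-25-11 330346
</STMTTRN>
EOF

# Title-case words only on lines containing NAME or MEMO (GNU sed for \| and \L).
titled=$(sed '/\(NAME\|MEMO\)/s/\([ >][A-Z]\)\([A-Z]*\)/\1\L\2/g' "$input")
echo "$titled"
rm -f "$input"
```

The tags themselves survive because a tag name follows a <, which isn't one of the two characters the pattern accepts before a word.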

For more information on sed, a good place to start is the sed regular expressions manual.

[ 14:13 Dec 18, 2011    More linux/cmdline | permalink to this entry | ]