Wednesday, May 7, 2008

Bash Tip: Proving a Negative with grep and diff

I stumbled across an interesting problem a while back. Given a set of data how do you determine correlating data that isn't there?

I was given a file containing a list of names and some other lines that indicated a successful condition. If a name on one line was followed by a success statement, then the condition was successful for the previous line. However, if a name was followed by another name, then the condition had failed for the first name.

Confused already? Let's look at an example and see if that clears things up. Say we have a file, tally.txt, whose contents look something like this:

Doug
Jim
voteSuccess
Diana
voteSuccess
Thomas
voteSuccess
Drew
Elizabeth
Chris
Adrienne
voteSuccess
Nicholas
voteSuccess
Anita
Greg
Jacob
Trudy
voteSuccess
Alex
voteSuccess
Richard
voteSuccess
Donald
Sam
Steve
Bob
Nathan
voteSuccess
Penelope
voteSuccess
Bishop
voteSuccess
Dustin
voteSuccess
Ron
George
Henry
voteSuccess
Arthur
Reggie
voteSuccess


Here is the same file with line numbers added to make things clearer:


1 Doug
2 Jim
3 voteSuccess
4 Diana
5 voteSuccess
6 Thomas
7 voteSuccess
8 Drew
9 Elizabeth
10 Chris
11 Adrienne
12 voteSuccess
13 Nicholas
14 voteSuccess
15 Anita
16 Greg
17 Jacob
18 Trudy
19 voteSuccess
20 Alex
21 voteSuccess
22 Richard
23 voteSuccess
24 Donald
25 Sam
26 Steve
27 Bob
28 Nathan
29 voteSuccess
30 Penelope
31 voteSuccess
32 Bishop
33 voteSuccess
34 Dustin
35 voteSuccess
36 Ron
37 George
38 Henry
39 voteSuccess
40 Arthur
41 Reggie
42 voteSuccess


Lines 3, 5, 7, 12, 14, 19, 21, 23, 29, 31, 33, 35, 39, and 42 all indicate a success condition (voteSuccess). By the conventions of the file, that means that the people on the preceding lines actually had the success (lines 2, 4, 6, 11, 13, 18, 20, 22, 28, 30, 32, 34, 38, and respectively). The problem is that we want to find out who wasn't able to successfully vote. We need to find some way to extract those that voted successfully from the file and only leave those that weren't able to vote.

It should be noted that I simplified this example quite a bit. The success condition string (voteSuccess) could actually be one of a host of things, so it is not just one known string that we can work against, but it is good enough for this exercise.

Of course this whole situation would be a lot easier if the program that created this file placed some sort of indication of failure after the names of the people that didn't have success in voting. Unfortunately, in many instances, we're often stuck with the formats we're given and have to find a way to make them work.

After a little thought, I came up with a psuedo algorithm that I thought might solve the problem:

  1. Correlate all the success conditions with the appropriate people.

  2. Strip these into a file of successful voters sorted alphabetically.

  3. Strip out all the status messages, leaving only voters, and sort them alphabetically into another file of all voters.

  4. Check the difference between the files. As the successful voters will be in both files, only those that failed will be different.


For step one, we'll use a somewhat current version of the grep utility (the one I used was 2.5.1, you can find the version by typing grep -V) to print all the lines of tally.txt that contain voteSuccess and every line above them. The -B switch for grep prints however many lines you want above the string that you were looking for. Typing:

grep -B 1 voteSuccess tally.txt

You'll notice that the -B switch prints -- between contiguous blocks of matches.

Jim
voteSuccess
Diana
voteSuccess
Thomas
voteSuccess
--
Adrienne
voteSuccess
Nicholas
voteSuccess
--
Trudy
voteSuccess
Alex
voteSuccess
Richard
voteSuccess
--
Nathan
voteSuccess
Penelope
voteSuccess
Bishop
voteSuccess
Dustin
voteSuccess
--
Henry
voteSuccess
--
Reggie
voteSuccess

We'll clean those out by piping the output into an invert match of grep.

grep -B 1 voteSuccess tally.txt | grep -v ^[--]

Now we'll clean out the voteSuccess condition statements and sort the output.

grep -B 1 voteSuccess tally.txt | grep -v ^[--] | grep -v voteSuccess | sort

Our output from the first command sequence looks like this:

Adrienne
Alex
Bishop
Diana
Dustin
Henry
Jim
Nathan
Nicholas
Penelope
Reggie
Richard
Thomas
Trudy

Now that we have a list of the successful voters, let's redirect it to the file, successfulvoters.txt, that we'll later use to ferret out the failed voters.

grep -B 1 voteSuccess tally.txt | grep -v ^[--] | grep -v voteSuccess | sort > successfulvoters.txt

Next, we need to pull together a sorted list of all voters. This is pretty easy, all we have to do is an inverted search for the term voteSuccess. The only things left will be the names of all the voters which we can sort and redirect into the file allvoters.txt

grep -v voteSuccess tally.txt | sort > allvoters.txt

Finally, we'll compares the successfulvoters.txt and allvoters.txt files using the diff utility. As diff can be verbose, we'll ask it to output an ed (line editor) script by employing the -e switch. These script instructions will highlight what needs to happen to the successfulvoters.txt file in order to make it look like the allvoters.txt file... which is add back all the failed users.

The failed users that we were trying to figure out how to isolate.

Let's compare files:

diff -e successfulvoters.txt allvoters.txt

That gives us the following:

12a
Ron
Sam
Steve
.
6a
Jacob
.
5a
Elizabeth
George
Greg
.
4a
Donald
Doug
Drew
.
3a
Bob
Chris
.
2a
Anita
Arthur
.


If you look at this output upside down, you can follow along in the successfullvoters.txt file and see where these additions would be added in order to make a complete list of users. If you can't do it mentally, I've flipped the output here:

.
Arthur
Anita
2a
.
Chris
Bob
3a
.
Drew
Doug
Donald
4a
.
Greg
George
Elizabeth
5a
.
Jacob
6a
.
Steve
Sam
Ron
12a

However, we just need the usernames and not the commands for the ed utility. If we get rid of every line that starts with a number (our voter names don't start with numbers) and every line with a period (.), and then run that through sort, we should have an alphabetical list of people that couldn't successfully vote.

To get rid of any line that begins for a number, we'll do an invert search on the output of the diff command. The expression ^[[:digit:]] uses the carat (^) character to denote starting the line and the shorthand expression [[:digit:]] to denote any number. That will just leave us to content with the periods, which we can remove by piping this output into yet another inverted grep search asking to return any line that doesn't contain them, [.]. Then we sort the output to make it more useable.

diff -e successfulvoters.txt allvoters.txt | grep -v ^[[:digit:]] | grep -v [.] | sort

And that's that!

We can put it all together in a quick and dirty bash script that will parse out the failed voters given a file name to process.


#!/bin/bash

# Take the filename from the command line and stuff it into a variable
TALLY=$1

# Find the successful voters
grep -B 1 voteSuccess $TALLY | grep -v ^[--] | grep -v voteSuccess | sort > successfulvoters.txt

# Find all the voters
grep -v voteSuccess $TALLY | sort > allvoters.txt

# Find the difference between the successful voters
# and all the possible voters (ie the failed voters)
diff -e successfulvoters.txt allvoters.txt | grep -v ^[[:digit:]] | grep -v [.] | sort

Does anyone have other ideas how to tackle this problem? While the test case was relatively small, the actual data set contained tens of thousands of entries.

I see a lot of redundancy in this solution with its two passes over the tally file. On the other hand, I had a working solution in about 15 minutes.

I have tested this a couple of times and it looks to work on all my data sets. Perhaps there's a problem that I am not seeing and, if so, please speak up and let me know.

How would you solve this problem?

Do you have any file processing war stories? Tricks of the trade you'd like to share?

Also, I have a couple of texts that I have use to flesh out my understanding of scripting and the bash shell. The first is the O'Reilly book Learning the bash Shell by Cameron Newham. It is a step by step introduction to bash shell concepts and includes a good overview of many standard shell tools and techniques.

I also really like the more general book by Stephen Kochan and Patrick Wood titled Unix Shell Programming (3rd ed). Kochan and Wood write the book to the POSIX standard for shells which should help in writing maintainable and portable scripts, however they also make an effort to point out how each shell differs in its approach. It has its faults, as most books do, but it is solid nonetheless.

No comments: