Thursday, September 25, 2008

Mike MacCana’s VentureCake - The 5 lines that mystified O’Reilly - how to use a spreadsheet in bash

You should read Mike MacCana’s post "The 5 lines that mystified O’Reilly: how to use a spreadsheet in bash". NOTE: I took down the link as it is now parked. Venture Cake appears to be gone.

Given that I just reviewed the Bash Cookbook mentioned in Mike's post, I was really happy to find this elegant solution, which doesn't appear in the book's pages, to the problem of processing comma separated value (CSV) files with a simple bash script.

It isn't an all-purpose solution by any means, but it is a useful starting point if you need to expand on it. Some discussion of its finer points and failings can be found on the Hacker News post that alerted me to it in the first place.

Tuesday, September 23, 2008

Bash Tip: Reverse Sorting Lists Revisited; Reversing a Horizontal List

A while back, I noticed that I was getting a lot of hits to my site from search engines like Google with search terms containing the terms "bash reverse sort list." So I decided to write a post about various ways to reverse a list using bash's tools. If my hit counter is any indicator, it is the most popular post I have written yet with thirty out of the last one hundred hits entering on that page.

Recently, I was messing around with a data problem and realized that I needed to reverse a horizontal list. "No problem," I thought, "I'll just use one of those techniques I talked about before." Unfortunately, I didn't cover that case.

On Linux, this is pretty simple, I found, if you have access to the rev command.

Say we have input in the form of the variable named testingHZsort that looks something like this:


bash $ testingHZsort="a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij"
bash $
bash $ echo ${testingHZsort}
a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij
bash $


Now, because of variable expansion, we can't call rev on the variable directly: rev would treat each element of testingHZsort as a file name and give us a bunch of "No such file or directory" errors. To get around this problem, we just have to expand the variable with echo and then pipe it to the rev command.


bash $ echo ${testingHZsort} | rev
ji iii ii i kjih jih ih h gfed ed d wc wqc c eb db b ba aa a


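Works like a charm, at least for flipping the line. Note that rev reverses every character, so each element comes out spelled backwards too (hijk becomes kjih).

The other problem is that you have to have access to the rev utility, which isn't available on the versions of Solaris that I have access to, so I had to go back to the drawing board.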

On Solaris, I do have access to the sort utility, which works fine on vertical lists and includes a reverse (-r) option. All I have to do is convert my horizontal data into vertical data, execute the reverse sort and then convert the vertical output back to horizontal output. Piping a couple of common utilities together should produce the desired result.

The first thing to do is convert the horizontal list into a vertical list. We can do this using the tr, or translate, utility to convert the spaces (' ') between testingHZsort's elements into newlines ('\n').


bash $ echo ${testingHZsort}
a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij
bash $
bash $ echo ${testingHZsort} | tr ' ' '\n'
a
aa
ab
b
bd
be
c
cqw
cw
d
de
defg
h
hi
hij
hijk
i
ii
iii
ij
bash $


Now we can pipe that output through sort with the reverse (-r) option.


bash $ echo ${testingHZsort} | tr ' ' '\n' | sort -r
ij
iii
ii
i
hijk
hij
hi
h
defg
de
d
cw
cqw
c
be
bd
b
ab
aa
a
bash $


Then we reverse the previous translation and replace all the newlines (\n) with spaces (' ').


bash $ echo ${testingHZsort} | tr ' ' '\n' | sort -r | tr '\n' ' '
ij iii ii i hijk hij hi h defg de d cw cqw c be bd b ab aa a bash $


That leaves us with the little problem of the final newline being converted into a space, which keeps our bash prompt (bash $) from returning to its proper position. We can fix that by appending a final newline to the output with printf.


bash $ echo ${testingHZsort} | tr ' ' '\n' | sort -r | tr '\n' ' '; printf "\n"
ij iii ii i hijk hij hi h defg de d cw cqw c be bd b ab aa a
bash $
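One thing to keep in mind is that sort -r orders the elements; it only looks like a plain reversal here because the sample list was already in ascending order. If all you want is to flip the positions of the elements, whatever their order, a small loop in bash itself is one way to do it. This is a minimal sketch: the function name reverse_words is made up for illustration, and it assumes the elements are separated by whitespace and contain no glob characters.

```shell
#!/bin/bash

# reverse_words: hypothetical helper that reverses the order of the
# whitespace-separated elements passed in its first argument.
reverse_words() {
    local word reversed=""
    # $1 is deliberately unquoted so the shell splits it into words.
    # (Elements containing *, ? or [ would be glob-expanded; this sketch
    # assumes there are none.)
    for word in $1; do
        # Prepend each word; the space is only added once reversed is non-empty
        reversed="${word}${reversed:+ ${reversed}}"
    done
    printf '%s\n' "${reversed}"
}

testingHZsort="a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij"
reverse_words "${testingHZsort}"
```

Because it never calls rev or sort, this should behave the same on Linux and Solaris, and the elements keep their spelling.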


What about horizontal lines of data that are separated by some other delimiter than a space? What about a comma separated list?

That's easy, actually. The only difference from the earlier examples is that, instead of translating spaces into newlines, we translate the new delimiter into newlines.

So, let's say that testingHZsort looks something like this:


bash $ testingHZsort="a,aa,ab,b,bd,be,c,cqw,cw,d,de,defg,h,hi,hij,hijk,i,ii,iii,ij"
bash $ echo ${testingHZsort}
a,aa,ab,b,bd,be,c,cqw,cw,d,de,defg,h,hi,hij,hijk,i,ii,iii,ij
bash $


We'll do the same sorting that we did before, but this time we'll replace all the commas (,) with newlines (\n).


bash $ echo ${testingHZsort} | tr ',' '\n' | sort -r | tr '\n' ','
ij,iii,ii,i,hijk,hij,hi,h,defg,de,d,cw,cqw,c,be,bd,b,ab,aa,a,bash $


We've lost our last newline again and we have an extra comma to deal with.

That's simple to clean up with a sed substitution that will turn only the trailing comma into a newline character. We'll use the end of line positional regex character ($) to do that. So we match a comma at the end of the line (,$) and replace it with a newline (\n).


bash $ echo ${testingHZsort} | tr ',' '\n' | sort -r | tr '\n' ',' | sed -e 's/,$/\n/'
ij,iii,ii,i,hijk,hij,hi,h,defg,de,d,cw,cqw,c,be,bd,b,ab,aa,a
bash $
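A possible variation, if you would rather not clean up after tr at all: the paste utility can join the sorted lines back together and only puts the delimiter between elements, so there is no trailing comma and the final newline survives. As far as I know, treating \n in a sed replacement as a newline is a GNU sed behavior rather than something every sed promises, so this may also travel better. The function name reverse_sort_csv below is just for illustration.

```shell
#!/bin/bash

# reverse_sort_csv: hypothetical helper that reverse sorts a comma separated list.
reverse_sort_csv() {
    # tr splits the list into one element per line, sort -r reverse sorts it,
    # and paste -s joins the lines again with commas between them only.
    printf '%s\n' "$1" | tr ',' '\n' | sort -r | paste -s -d ',' -
}

testingHZsort="a,aa,ab,b,bd,be,c,cqw,cw,d,de,defg,h,hi,hij,hijk,i,ii,iii,ij"
reverse_sort_csv "${testingHZsort}"
```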


That's all there is to it.

If you would like to improve your bash scripting skills you might want to consider picking up a copy of the Bash Cookbook. I highly recommend it. You can read my full review of it, here.

cc photo credit: Man vyi

Friday, August 15, 2008

Book Review: The Bash Cookbook

The Elevator Pitch

The Bash Cookbook is a must own book for anyone that uses Unix and Linux for fun or profit. Bash is a powerful shell environment available in everything from Mac OS X to commercial Unix offerings like Solaris. Being comfortable and productive with this shell is going to make your life a helluva lot easier. The Bash Cookbook serves as a digestible tutor to this powerful shell while maintaining a depth that makes it a valuable reference for solutions to many of the common problems that command line power users face.

The Full Review

I've been a Unix user since my first days studying Computer Science at college in the early nineties. While coming from using MS-DOS in high school and being plunked in front of a terminal with a dollar ($) prompt probably wasn't as disorienting as a move from Windows might have been, it was still pretty confusing. I struggled through the first few years until I took a systems programming class and finally started to understand the big picture of Unix. Still, it wasn't until almost a decade later that I decided to really try to wrap my mind around the Unix command line, and more specifically the bash shell.

As a Unix administrator, I have now been using the shell environment professionally for over five years. Bash is my shell of choice and I use it to do everything from processing various system logs, to running assorted backups, to creating system monitors, to wrapping more complex commands into usable interfaces, to transforming data into more usable formats. To get to that point, I spent a lot of time reading books like Learning the Bash Shell, hanging out on the shell scripting forums at Unix.com and reading various sysadmin blogs. All that is to say that I think I have a good grasp of the Unix/Linux command line in general and the bash shell in particular.

Recently, I had the opportunity to read the Bash Cookbook. Of all the technical books that I read for personal and professional gain, I prefer the formats of both O'Reilly's Hacks series and its Cookbooks for how they cover common problems and solutions in various technical subjects. I find them easy to digest, as both formats generally break large technical topics into bite sized chunks that present problems and solutions in very thorough, but approachable, ways. I find myself flying through these books. After reading a couple of pages that cover a single hack or recipe, I generally feel like I have learned something versus having to slog through twenty or so chapter pages in a typical tech book.

Thankfully, The Bash Cookbook stands up with its predecessors. The authors, Carl Albing, JP Vossen and Cameron Newham (also an author of the aforementioned Learning the Bash Shell) have backgrounds ranging from general technologists and authors to software developers for the Cray supercomputer company. This multifaceted experience set serves them well as they tackle various bash scripting topics from the mundane to the puzzling to the downright arcane. All of this is done with an approachable style and format that first identifies a problem, then offers a generalized solution, and finally follows up with a detailed discussion of the problem and solutions. This approach helps identify both the reasoning behind their solutions and the corner cases that will either further inform your own implementations or warn you that here be dragons.

The Bash Cookbook is divided into nineteen chapters and five appendixes, a few of which (most notably "Appendix D: Revision Control") could have served as full-on chapters by themselves. Topics include getting started with bash on various platforms (chapter 1); dealing with the intricacies of standard input and output redirection (chapters 2 and 3); job control (chapter 4); shell variables and arithmetic (chapters 5 and 6); finding and manipulating data (chapters 7, 8, and 9); working with functions and trapping conditions (chapter 10); manipulating dates and time (chapter 11); wrapping complex tasks (chapter 12); parsing files (chapter 13); writing scripts securely (chapter 14); bash corner cases (chapter 15); customizing the bash environment (chapter 16); common system administration tasks (chapter 17); bash tips to be more productive (chapter 18); and, finally, common traps and workarounds for novice bash scripters (chapter 19). As you can see, there is a wealth of information to be had between the covers of this book.

I found useful information from the beginning chapters (which are often throw away generalized instructions for getting up to speed in most tech books) all the way to the appendices themselves. Some standout recipes from the book include:
  • 3.7 Selecting from a List of Options
  • 5.2 Embedding Documentation in Shell Scripts
  • 5.17 Giving an Error Message for Unset Parameters
  • 5.19 Using Array Variables
  • 7.15 Showing Data As a Quick and Easy Histogram
  • 8.3 Sorting IP Addresses
  • 9.9 Finding Files by Content
  • 10.6 Trapping Interrupts
  • 13.4 Parsing Output into an Array
  • 13.12 Isolating Specific Fields in Data
  • 15.10 Finding My IP Address
  • 15.13 Working Around "argument list too long" Errors
  • 15.15 Sending Email from Your Script
  • 16.4 Change your $PATH Temporarily
  • 17.1 Renaming Many Files
  • 17.8 Capturing File Metadata for Recovery
  • 17.13 Prepending Data to a File
  • 17.16 Finding Lines in One File But Not in the Other
  • 17.17 Keeping the Most Recent N Objects
  • 19.11 Seeing Odd Behavior from printf

Chapter 14 of The Bash Cookbook demands special mention in this review. Titled "Writing Secure Shell Scripts", it opens with a general discussion of the need for writing secure shell scripts and gives a basic template utilizing many of the features that can make the average shell script more secure. The subsequent twenty-three recipes flesh out this template with surprising, but approachable, detail. Of all the subjects in this book, this chapter's topic makes it worth buying and retaining as a go-to reference. After skimming the recipes of chapter fourteen, I found a number of ways to make my scripts better. For instance, much of the "common wisdom" for creating temporary files that I have found around the Internet and in various books is simply wrong and, as chapter fourteen lays out, highly susceptible to race conditions. Some day, out of curiosity, I'd like to survey some open source projects that make use of shell scripts for installation, configuration, and maintenance and see how their handling of temporary files matches up. From the cursory searches that I have made, I am afraid of what the results might show.
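To give a flavor of the temporary file problem, here is a minimal sketch of one commonly recommended approach. It is not taken from the book, and it assumes the mktemp utility is installed, which is not guaranteed on every Unix. The point is that mktemp picks an unpredictable name and creates the file in one step, instead of the script guessing a name like /tmp/myscript.$$ and hoping nobody else created it first. The helper name make_scratch_file and the scratch prefix are made up for illustration.

```shell
#!/bin/bash

# make_scratch_file: hypothetical helper that creates a private temporary
# file and prints its name. Assumes mktemp is available on this system.
make_scratch_file() {
    local tmpfile
    # The XXXXXX suffix is replaced with random characters by mktemp,
    # which creates the file itself rather than just suggesting a name.
    tmpfile=$(mktemp "${TMPDIR:-/tmp}/scratch.XXXXXX") || return 1
    printf '%s\n' "${tmpfile}"
}

if scratch=$(make_scratch_file); then
    echo "working in ${scratch}"
    # ... do the real work with the file here ...
    rm -f "${scratch}"
else
    echo "could not create a temporary file" >&2
fi
```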

The writing in the Cookbook is clear and to the point and incredibly consistent given that it was written by three writers. This is either a testament to the writing team and their ability to assimilate each other's styles or to O'Reilly's editorial staff's ability to tie the whole thing together (or, I assume, both). I particularly enjoyed the in depth discussion that many recipes received. It had the feel of looking over the shoulder of a veteran Unix admin and having the chance to pick his brain about why he was making the choices he was and why he was going about his business in a particular way. That is the book's greatest strength. As someone who has had to pick up Unix and Linux skills largely on his own, I found this approach invaluable. If you aren't surrounded by a Unix culture, it can be hard to pick up some of the more useful, but more complex, tricks of the trade. Think of The Bash Cookbook as your grey beard Unix hacker mentor on a shelf.

The book and its Table of Contents and Index are so comprehensive with regard to the common types of tasks that one generally performs while writing shell scripts that it has become, in the short time that I have had it, my first (and usually last) go-to reference. If I forget how to search for keywords in files across directories, for instance, it just takes a quick scan of the Index to find a very good and working answer. I use this book so much that I am considering buying a second copy to keep at home so I don't have to haul my dog-eared version back and forth to and from work. It is that useful.

Really.

Some Nits to Pick

As with any large project such as a book, there are bound to be a few things that slip through the cracks. The Bash Cookbook is no different. For instance, recipe 6.6 talks about the different ways to check for equality in bash including the use of the single equals (=) or double equals (==) signs. Functionally these two constructs are exactly the same, but using the single equals is more portable as it follows the POSIX standard. That's fine, and very good to know. However, the use of these constructs isn't consistent in the book, which could lead to confusion as the explanation that it really doesn't matter doesn't happen until page sixty four. Even worse, much earlier in the book, recipe 3.7 is an example of the use of these two constructs not even being consistent in a single script, where the variable $directory is checked for equality with the string "Finished" on one line with the double equals construct (==) and on another with the single equals construct (=). From a script maintainability perspective, this inconsistency is a bad idea.
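For anyone who hasn't run into the two spellings, here is a small sketch of my own (not the book's script) showing them side by side. The variable name directory and the string "Finished" come from the description of recipe 3.7 above; the function names are made up for illustration.

```shell
#!/bin/bash

# Single equals inside [ ]: the portable, POSIX-style string comparison.
is_finished_posix() { [ "$1" = "Finished" ]; }

# Double equals inside [[ ]]: a bash spelling of the same comparison.
# The right-hand side is quoted so it is compared literally, not as a pattern.
is_finished_bash() { [[ "$1" == "Finished" ]]; }

for directory in "Finished" "/tmp"; do
    if is_finished_posix "${directory}" && is_finished_bash "${directory}"; then
        echo "${directory}: both tests say we are done"
    else
        echo "${directory}: keep going"
    fi
done
```

Both functions give the same answer for the same input, which is exactly why picking one spelling and sticking with it costs nothing.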

One problem is the seeming omission of a full treatment of arrays in bash. Most people unfamiliar with bash don't even realize that there are simple single dimensional arrays available in the environment, so I was happy to see some recipes that covered this topic. However, some of the more powerful array manipulation techniques, such as the ability to find the number of elements in an array with the simple ${#array_name[@]} construct or the length of an individual array item with the ${#array_name[index number]} construct, which is covered in the discussion of recipe 13.4, "Parsing Output into an Array", are buried in other recipes and hard to find even with the index. This problem could probably be helped if the See Also sections of each recipe pointed to other recipes in the book that dealt with similar subjects. Recipe 5.19, "Using Array Variables," only points to a section of the O'Reilly book Learning the Bash Shell. Other recipes in the book do a fine job of pointing out external sources of information as well as other recipes, so I think this is just a matter of some editorial consistency that would need to be beefed up for the next edition.
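Since those two constructs are easy to mix up, here is a quick sketch of both. The array name fruits and its contents are made up for illustration.

```shell
#!/bin/bash

# A simple one dimensional array (hypothetical sample data)
fruits=(apple banana cherry)

# ${#array[@]} gives the number of elements in the array
count=${#fruits[@]}

# ${#array[index]} gives the length of a single element, here "apple"
first_len=${#fruits[0]}

echo "${count} elements; the first one is ${first_len} characters long"
```

With the sample data above this prints "3 elements; the first one is 5 characters long". Leave out the # and ${fruits[@]} expands to the elements themselves rather than their count.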

The authors make a conscious effort to stick with core bash tools throughout the text. As they note in the "Preface" of the book, Perl is covered elsewhere. Though they do say they are okay using the right tool for the job and sometimes they tell you when it is best to use something else... much better than having the reader beat their heads against a wall in my opinion. This is a book about bash after all and it would be maddening if many solutions switched to other non-bash solutions whenever something wasn't readily able to be solved with the bash tool set. Unfortunately there are times when they overlook common bash tools in favor of other scripting languages like sed and awk. The prime example is recipe 7.10, "Keeping Some Output, Discarding the Rest", where they use awk to solve the problem. Sure, awk works, but I would have preferred if they would have at least mentioned the cut utility in this context if only for comparison's sake. They should have at least linked to recipe 8.4, "Cutting Out Parts of Your Output", and recipe 13.12, "Isolating Specific Fields in Data." Again, I think this is just a matter of editorial consistency and just one of only a few examples where the fact that the book was written by multiple authors becomes mildly apparent.
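To show the kind of comparison I mean, here is a sketch of my own (not the book's recipe) that pulls the second field out of a line both ways. The function names and the sample line are made up for illustration.

```shell
#!/bin/bash

# Keep only the second whitespace-separated field, using awk
second_field_awk() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# Keep only the second space-delimited field, using cut
second_field_cut() {
    printf '%s\n' "$1" | cut -d ' ' -f 2
}

line="alpha beta gamma"
second_field_awk "${line}"
second_field_cut "${line}"
```

Both print beta for this input. The difference worth knowing is that awk treats a run of spaces or tabs as one separator, while cut counts every single delimiter, so they can disagree on data padded with extra spaces.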

Some other minor editorial issues revolve around typos and other minor errata. On page 84, "Thought" in the third Discussion paragraph should be "Though." On page 64, the comment (after the #) at the end of the script states "end of while not finished" which can be confusing as the loop construct is actually an until statement. Page 207 should be "what" instead of "hat" in the first full sentence of the page and similarly "fpllowing" on page 233 should be "following." For a book this long (622 pages), that's not bad at all. There may be others, but they weren't obvious during my reading. For a technical book, in its first edition, I was happy with the overall quality of the material.

My final suggestion is for the inclusion of sample input and outputs for the scripts. Many scripts give these types of examples, which makes it endlessly easier to understand exactly what the scripts are doing, but this isn't consistent throughout the book and I am not sure what the editorial decision was in not including these types of examples for those scripts that don't have them. My personal opinion is that there should be input and output examples for every recipe in the Bash Cookbook. I liken it to one of my favorite cooking guides, Cook's Illustrated, whose pictures often clear up any confusion about preparations for recipes that the text of the recipe may have missed. I think the same holds for sample input and outputs for the tech recipes of the Bash Cookbook. Every recipe, in my opinion, should have these examples even if they are only available from O'Reilly's website.

Conclusion

Nitpicks and suggestions aside, this is a great bash scripting resource and should find a good home on any scripter's bookshelf. It provides enough instruction to help a new-ish user understand the deeper power of bash scripting while having enough breadth and depth to serve as an invaluable resource for the experienced scripting guru.

Book Information


Title: Bash Cookbook
Authors: Carl Albing, JP Vossen & Cameron Newham
Paperback: 622 pages
Publisher: O'Reilly Media, Inc., 1 edition (May 24, 2007)
ISBN: 0596526784

Tuesday, May 20, 2008

Python Tip: Checking to see if your Python installation has support for SSL

I was trying to figure out if my installation of Python was compiled with SSL support and found it to be non-intuitive if you didn't compile Python for yourself.

So, to check if you have SSL support configured with your installation of Python, go to your command prompt and type:


python


and you'll get the Python interactive shell (that will look something like this):


Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>


At the >>> prompt, type import socket:


Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket


Then check for the ssl attribute, by typing hasattr(socket, "ssl") at the >>> prompt and look for a True or False response:


Python 2.5.2 (r252:60911, Apr 21 2008, 11:12:42)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> hasattr(socket, "ssl")
True
>>>


A True response means that SSL support is compiled into your Python installation.

Good luck.

If you want some good books on learning to use Python, I highly recommend Beginning Python: From Novice to Professional by Magnus Hetland and Learning Python by Mark Lutz and David Ascher. I am currently using both books to get up to speed on the Python language and I am really enjoying working with both of them.

Wednesday, May 7, 2008

Bash Tip: Proving a Negative with grep and diff

I stumbled across an interesting problem a while back. Given a set of data how do you determine correlating data that isn't there?

I was given a file containing a list of names and some other lines that indicated a successful condition. If a name on one line was followed by a success statement, then the condition was successful for the previous line. However, if a name was followed by another name, then the condition had failed for the first name.

Confused already? Let's look at an example and see if that clears things up. Say we have a file, tally.txt, whose contents look something like this:

Doug
Jim
voteSuccess
Diana
voteSuccess
Thomas
voteSuccess
Drew
Elizabeth
Chris
Adrienne
voteSuccess
Nicholas
voteSuccess
Anita
Greg
Jacob
Trudy
voteSuccess
Alex
voteSuccess
Richard
voteSuccess
Donald
Sam
Steve
Bob
Nathan
voteSuccess
Penelope
voteSuccess
Bishop
voteSuccess
Dustin
voteSuccess
Ron
George
Henry
voteSuccess
Arthur
Reggie
voteSuccess


Here is the same file with line numbers added to make things clearer:


1 Doug
2 Jim
3 voteSuccess
4 Diana
5 voteSuccess
6 Thomas
7 voteSuccess
8 Drew
9 Elizabeth
10 Chris
11 Adrienne
12 voteSuccess
13 Nicholas
14 voteSuccess
15 Anita
16 Greg
17 Jacob
18 Trudy
19 voteSuccess
20 Alex
21 voteSuccess
22 Richard
23 voteSuccess
24 Donald
25 Sam
26 Steve
27 Bob
28 Nathan
29 voteSuccess
30 Penelope
31 voteSuccess
32 Bishop
33 voteSuccess
34 Dustin
35 voteSuccess
36 Ron
37 George
38 Henry
39 voteSuccess
40 Arthur
41 Reggie
42 voteSuccess


Lines 3, 5, 7, 12, 14, 19, 21, 23, 29, 31, 33, 35, 39, and 42 all indicate a success condition (voteSuccess). By the conventions of the file, that means that the people on the preceding lines actually had the success (lines 2, 4, 6, 11, 13, 18, 20, 22, 28, 30, 32, 34, 38, and 41, respectively). The problem is that we want to find out who wasn't able to successfully vote. We need to find some way to extract those that voted successfully from the file and only leave those that weren't able to vote.

It should be noted that I simplified this example quite a bit. The success condition string (voteSuccess) could actually be one of a host of things, so it is not just one known string that we can work against, but it is good enough for this exercise.

Of course this whole situation would be a lot easier if the program that created this file placed some sort of indication of failure after the names of the people that didn't have success in voting. Unfortunately, in many instances, we're often stuck with the formats we're given and have to find a way to make them work.

After a little thought, I came up with a pseudo-algorithm that I thought might solve the problem:

  1. Correlate all the success conditions with the appropriate people.

  2. Strip these into a file of successful voters sorted alphabetically.

  3. Strip out all the status messages, leaving only voters, and sort them alphabetically into another file of all voters.

  4. Check the difference between the files. As the successful voters will be in both files, only those that failed will be different.


For step one, we'll use a somewhat current version of the grep utility (the one I used was 2.5.1; you can find the version by typing grep -V) to print all the lines of tally.txt that contain voteSuccess along with the line above each of them. The -B switch for grep prints however many lines you want before each matching line. We'll type:

grep -B 1 voteSuccess tally.txt

You'll notice in the output that the -B switch prints a -- separator line between each contiguous block of matches.

Jim
voteSuccess
Diana
voteSuccess
Thomas
voteSuccess
--
Adrienne
voteSuccess
Nicholas
voteSuccess
--
Trudy
voteSuccess
Alex
voteSuccess
Richard
voteSuccess
--
Nathan
voteSuccess
Penelope
voteSuccess
Bishop
voteSuccess
Dustin
voteSuccess
--
Henry
voteSuccess
--
Reggie
voteSuccess

We'll clean those out by piping the output into an inverted match of grep.

grep -B 1 voteSuccess tally.txt | grep -v '^--$'

Now we'll clean out the voteSuccess condition statements and sort the output.

grep -B 1 voteSuccess tally.txt | grep -v '^--$' | grep -v voteSuccess | sort

Our output from the first command sequence looks like this:

Adrienne
Alex
Bishop
Diana
Dustin
Henry
Jim
Nathan
Nicholas
Penelope
Reggie
Richard
Thomas
Trudy

Now that we have a list of the successful voters, let's redirect it to the file, successfulvoters.txt, that we'll later use to ferret out the failed voters.

grep -B 1 voteSuccess tally.txt | grep -v '^--$' | grep -v voteSuccess | sort > successfulvoters.txt

Next, we need to pull together a sorted list of all voters. This is pretty easy: all we have to do is an inverted search for the term voteSuccess. The only things left will be the names of all the voters, which we can sort and redirect into the file allvoters.txt.

grep -v voteSuccess tally.txt | sort > allvoters.txt

Finally, we'll compare the successfulvoters.txt and allvoters.txt files using the diff utility. As diff can be verbose, we'll ask it to output an ed (line editor) script by employing the -e switch. These script instructions will highlight what needs to happen to the successfulvoters.txt file in order to make it look like the allvoters.txt file... which is to add back all the failed users: the very users that we were trying to figure out how to isolate.

Let's compare files:

diff -e successfulvoters.txt allvoters.txt

That gives us the following:

12a
Ron
Sam
Steve
.
6a
Jacob
.
5a
Elizabeth
George
Greg
.
4a
Donald
Doug
Drew
.
3a
Bob
Chris
.
2a
Anita
Arthur
.


If you look at this output upside down, you can follow along in the successfulvoters.txt file and see where these additions would be added in order to make a complete list of users. If you can't do it mentally, I've flipped the output here:

.
Arthur
Anita
2a
.
Chris
Bob
3a
.
Drew
Doug
Donald
4a
.
Greg
George
Elizabeth
5a
.
Jacob
6a
.
Steve
Sam
Ron
12a

However, we just need the usernames and not the commands for the ed utility. If we get rid of every line that starts with a number (our voter names don't start with numbers) and every line with a period (.), and then run that through sort, we should have an alphabetical list of people that couldn't successfully vote.

To get rid of any line that begins with a number, we'll do an inverted search on the output of the diff command. The expression ^[[:digit:]] uses the caret (^) character to denote the start of the line and the character class [[:digit:]] to denote any digit. That just leaves us to contend with the periods, which we can remove by piping this output into yet another inverted grep search that returns only the lines that don't contain them, [.]. Then we sort the output to make it more usable.

diff -e successfulvoters.txt allvoters.txt | grep -v '^[[:digit:]]' | grep -v '[.]' | sort

And that's that!
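If the ed script cleanup feels roundabout, the comm utility is another option worth a look. This is a sketch rather than what I originally ran: comm compares two sorted files line by line, and the -13 switches suppress the lines that appear only in the first file and the lines common to both, leaving just the lines unique to the second file. The function name failed_voters_comm is made up for illustration, and it assumes both files were sorted the same way (same locale) as comm expects.

```shell
#!/bin/bash

# failed_voters_comm: hypothetical helper.
#   $1 = sorted file of successful voters
#   $2 = sorted file of all voters
# Prints the names that appear only in the second file.
failed_voters_comm() {
    # -1 hides lines unique to $1, -3 hides lines found in both files
    comm -13 "$1" "$2"
}

# Only run against the real files if they exist in the current directory
if [ -f successfulvoters.txt ] && [ -f allvoters.txt ]; then
    failed_voters_comm successfulvoters.txt allvoters.txt
fi
```

Because both inputs are already sorted, the output comes out sorted as well, with no ed commands to filter away.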

We can put it all together in a quick and dirty bash script that will parse out the failed voters given a file name to process.


#!/bin/bash

# Take the filename from the command line and stuff it into a variable
TALLY="$1"

# Find the successful voters
grep -B 1 voteSuccess "$TALLY" | grep -v '^--$' | grep -v voteSuccess | sort > successfulvoters.txt

# Find all the voters
grep -v voteSuccess "$TALLY" | sort > allvoters.txt

# Find the difference between the successful voters
# and all the possible voters (i.e. the failed voters)
diff -e successfulvoters.txt allvoters.txt | grep -v '^[[:digit:]]' | grep -v '[.]' | sort

Does anyone have other ideas how to tackle this problem? While the test case was relatively small, the actual data set contained tens of thousands of entries.

I see a lot of redundancy in this solution with its two passes over the tally file. On the other hand, I had a working solution in about 15 minutes.
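For what it's worth, here is one way the redundancy might be removed: a single pass with awk that remembers the previous name and prints it whenever another name (or the end of the file) shows up before a success line. Treat it as a sketch under the same simplification as the rest of this post, namely that voteSuccess is the only status string and every other line is a name. The function name failed_voters_awk is made up for illustration.

```shell
#!/bin/bash

# failed_voters_awk: hypothetical single-pass alternative.
#   $1 = tally file in the format described above
failed_voters_awk() {
    awk '
        # A success line clears the remembered name: that voter succeeded
        $0 == "voteSuccess" { prev = ""; next }
        # A name arriving while another name is still remembered means
        # the remembered voter never got a success line
        prev != "" { print prev }
        # Remember the current name for the next line
        { prev = $0 }
        # A name left over at the end of the file also failed
        END { if (prev != "") print prev }
    ' "$1" | sort
}

# Only run against the real file if it exists in the current directory
if [ -f tally.txt ]; then
    failed_voters_awk tally.txt
fi
```

This reads the tally file once and needs no intermediate files, at the cost of being a little harder to build up step by step at the prompt.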

I have tested this a couple of times and it looks to work on all my data sets. Perhaps there's a problem that I am not seeing and, if so, please speak up and let me know.

How would you solve this problem?

Do you have any file processing war stories? Tricks of the trade you'd like to share?

Also, I have a couple of texts that I have used to flesh out my understanding of scripting and the bash shell. The first is the O'Reilly book Learning the bash Shell by Cameron Newham. It is a step by step introduction to bash shell concepts and includes a good overview of many standard shell tools and techniques.

I also really like the more general book by Stephen Kochan and Patrick Wood titled Unix Shell Programming (3rd ed). Kochan and Wood write the book to the POSIX standard for shells which should help in writing maintainable and portable scripts, however they also make an effort to point out how each shell differs in its approach. It has its faults, as most books do, but it is solid nonetheless.

Bash Tip: Finding a Line and the One Following it

I was recently asked the following question:
I need to identify lines containing a string in a file and extract that line and
the next line from the file. There might be mutiple occurrences in the file.

In this example file I need to scan for "gottoget" and then extract line 3
and 4 as well as lines 6 and 7

Example file:
line 1
line 2
gottoget line 3
want this line as well line 4
line 5
gottoget line 6
ok this one must come with line 7
line 8
line 9
line 10

Hope you can help.


I think this puzzle is easily solved through the use of grep's -A flag.

According to the man page for grep (man grep), the -A flag prints the specified number of lines after each matching line. It sounds like grep -A 1 gottoget examplefile should do the trick. This command will grab each line containing the string we're looking for ("gottoget" in your example) and the first line after that matching line. If we set grep up this way, we get the following:


gottoget line 3
want this line as well line 4
--
gottoget line 6
ok this one must come with line 7


The -- line separates each contiguous block of matches. If you don't want that, the lines are easily removed with another grep filter ( | grep -v '^--$'), which says to show everything but lines that consist of only the -- characters. If you have lines in your data that legitimately consist of just --, you may need to play around a bit to only filter out the unnecessary ones.

Putting it all together, we get:


grep -A 1 gottoget examplefile | grep -v '^--$'


Giving us the cleaned up output of:


gottoget line 3
want this line as well line 4
gottoget line 6
ok this one must come with line 7


And that's it. A simple application of some grep statements provides the answer.
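If your grep doesn't have the -A flag, a short awk program is one possible substitute. This is a sketch, with a made-up function name, that prints each matching line plus the one that follows it and never emits a separator, so there is nothing to filter out afterwards.

```shell
#!/bin/bash

# match_and_next: hypothetical helper.
#   $1 = pattern (an awk regular expression)
#   $2 = file to search
match_and_next() {
    # On a match, arm a counter for two lines (the match and the next one);
    # any line seen while the counter is positive is printed.
    awk -v pat="$1" '
        $0 ~ pat      { remaining = 2 }
        remaining > 0 { print; remaining-- }
    ' "$2"
}

# Only run against the example file if it exists in the current directory
if [ -f examplefile ]; then
    match_and_next gottoget examplefile
fi
```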

I highly recommend Unix Shell Programming (3rd edition) by Stephen Kochan and Patrick Wood if you are interested in improving your understanding of shell scripting. Kochan and Wood do a very thorough job (using plenty of code examples) exploring various aspects of essential shell scripting tools and techniques.

Post a comment if you have a different or better way of handling this puzzle.

Take care.

Tuesday, April 29, 2008

Bash Tip: Reverse Sorting Lists in the Shell

Every once in a while I check my site logs and find a common search phrase in referrals from search engines. Often the visitors appear to leave immediately; presumably because the page they landed on didn't answer their question.

A phrase that has been appearing quite frequently lately is "bash reverse sort list".

I can't tell exactly what they mean by their search query, so we'll take a couple of cracks at it.

My first thought is that they might be looking to reverse the output of the command line tool ls.

Say we have a directory and we see the following files when we run ls:


a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij


To get a simple reverse listing of those files, we should use the -r switch for ls. Typing ls -r in the same directory yields:


ij iii ii i hijk hij hi h defg de d cw cqw c be bd b ab aa a


Having your file list reversed in a horizontal line isn't always useful when you are looking for a vertical list. It just takes a little bit of extra work to turn your list on its side if that's what you need.

First, we'll use the -l switch of ls to show the long listing for the files. Typing ls -lr gives us a reverse listing of our files in a vertical format.


total 0
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 ij
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 iii
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 ii
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 i
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 hijk
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 hij
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 hi
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 h
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 defg
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 de
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 d
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 cw
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 cqw
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 c
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 be
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 bd
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 b
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 ab
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 aa
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 a


That's a bit more verbose than what I expect we're looking for, so we'll need to employ a couple more tools to trim away the fat.

All we want is the last column (8, if you consider the columns delimited by spaces) of information. This is where the cut command comes in handy. It does exactly what the name implies by slicing and dicing data in multiple handy ways.

By default, the cut command treats data as fields separated by tabs. By sending the output of our ls -lr command as input to the cut command, and changing the delimiter to a space with the -d switch, we can filter out all but the 8th column. So far our command looks like this: ls -lr | cut -d" " -f8. Our output looks like this:



ij
iii
ii
i
hijk
hij
hi
h
defg
de
d
cw
cqw
c
be
bd
b
ab
aa
a


Almost perfect. However, you'll notice one small problem at the top of the list: there's an extra blank line. If you look at the original output of ls -lr, it quickly becomes clear where the blank line came from. The total 0 line in the original output had only two fields, total and 0, leaving nothing but a blank when cut went looking for the eighth field.

It's not too difficult a job to clean this up with a little creative application of the grep command. We'll use grep's -v, or inverse match, switch (otherwise known as "show me everything but") with a pattern describing an empty line: a beginning, represented by the caret (^) symbol, an end, represented by the dollar sign ($), and nothing in between. Quoted so the shell leaves it alone, that is -v '^$'.

Putting it all together as ls -lr | cut -d" " -f8 | grep -v '^$' successfully removes the blank line from our vertical reverse sorted list of files.


ij
iii
ii
i
hijk
hij
hi
h
defg
de
d
cw
cqw
c
be
bd
b
ab
aa
a
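
The cut and grep steps can be tried on their own with a short sketch like the one below. It runs against a canned copy of the listing rather than live ls output, because the date format and column padding that ls -l prints vary between systems, so the file name is not guaranteed to land in field 8 on your machine. The variable names are mine.

```shell
# Canned sample of `ls -lr` output (an assumption for illustration), so
# the sketch does not depend on the local ls date format.
listing='total 0
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 ij
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 iii
-rw-r--r-- 1 jjones jjones 0 2007-05-07 21:23 ii'

# Field 8 alone: the "total 0" line has no eighth field, so cut
# prints an empty line for it.
with_blank=$(printf '%s\n' "$listing" | cut -d" " -f8)

# The same pipeline with the empty line filtered out.
names=$(printf '%s\n' "$listing" | cut -d" " -f8 | grep -v '^$')

printf '%s\n' "$names"
```

Comparing with_blank and names shows exactly what the final grep contributes: one fewer line, and nothing else changed.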


Another list you might like to sort is one contained in a file. ls isn't going to help us with this one, but the sort command is here to help.

By default, sort compares entire lines and orders them according to the collating sequence of your current locale. Take an example file (sort.txt) containing the following:


a
b
bd
hij
be
aa
cqw
ab
c
cw
d
de
iii
defg
h
hi
hijk
i
ii
ij


So, running sort against sort.txt results in:


a
aa
ab
b
bd
be
c
cqw
cw
d
de
defg
h
hi
hij
hijk
i
ii
iii
ij


The sort command also offers a reverse sort option through the -r switch. Running sort -r against sort.txt (sort -r sort.txt) results in:


ij
iii
ii
i
hijk
hij
hi
h
defg
de
d
cw
cqw
c
be
bd
b
ab
aa
a
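
Here is a small sketch of the two sort invocations side by side. It uses a shortened version of sort.txt written to a scratch file, and it sets LC_ALL=C so the ordering does not depend on your locale settings; both of those choices, along with the variable names, are my own assumptions for the sake of a repeatable example.

```shell
# Write a shortened, unsorted sample list to a scratch file.
sortfile=$(mktemp)
printf '%s\n' a b bd hij be aa cqw ab c > "$sortfile"

# Plain sort gives ascending order; -r reverses the comparison.
ascending=$(LC_ALL=C sort "$sortfile")
descending=$(LC_ALL=C sort -r "$sortfile")

printf 'ascending:\n%s\n\n' "$ascending"
printf 'descending:\n%s\n' "$descending"
rm -f "$sortfile"
```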


I hope this answers some of the basic questions about reverse sorting lists. For more information check out the manual pages for ls and sort (man ls and man sort).

However, you might just want your list flipped on its head, with no sorting whatsoever. Say you have the list:


a
d
c
b


You want it to look like this:


b
c
d
a


As it turns out, there is a command just for that purpose called tac. Where cat will concatenate the contents of a file to the screen (standard output), tac will do the same after reversing the order of the file's lines.

Take the text of the 1st Amendment to the US Constitution, for example.


Congress shall make no law respecting an establishment of religion,
or prohibiting the free exercise thereof;
or abridging the freedom of speech,
or of the press;
or the right of the people peaceably to assemble,
and to petition the Government for a redress of grievances.


Running tac against these lines completely reverses their order:


and to petition the Government for a redress of grievances.
or the right of the people peaceably to assemble,
or of the press;
or abridging the freedom of speech,
or prohibiting the free exercise thereof;
Congress shall make no law respecting an establishment of religion,


If we had used sort instead, the lines would be put in alphabetical order, which is not the same thing (the exact order depends on your locale settings):


and to petition the Government for a redress of grievances.
Congress shall make no law respecting an establishment of religion,
or abridging the freedom of speech,
or of the press;
or prohibiting the free exercise thereof;
or the right of the people peaceably to assemble,



If your list isn't vertical with items separated by a newline, you can use tac's -s switch, similar to cut's -d switch, to identify a different separator.
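
As a rough sketch of the -s switch, the snippet below reverses a space-separated list with GNU tac. One thing to watch for, as far as I understand tac's behavior: it treats the separator as the end of each item, so the input here deliberately ends with a trailing space. Without it, the last item has no separator of its own and ends up glued to its neighbor in the output. The variable names are my own.

```shell
# A horizontal list in which every item, including the last one,
# is followed by a single space.
horizontal='a aa ab b '

# -s ' ' tells tac to split on spaces instead of newlines.
reversed=$(printf '%s' "$horizontal" | tac -s ' ')

printf '%s\n' "$reversed"
```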

Update 1: A helpful reader pointed out that the ls examples could be a lot smaller with the application of the -1 switch to the ls command. This switch tells the standard ls command to print one file per line. When combined with the reverse, -r, switch, we get a reverse list of files in a vertical as opposed to the standard horizontal layout.

In the end, just typing


ls -1r


will result in this list of files


a aa ab b bd be c cqw cw d de defg h hi hij hijk i ii iii ij


being printed like this


ij
iii
ii
i
hijk
hij
hi
h
defg
de
d
cw
cqw
c
be
bd
b
ab
aa
a
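
To see the -1 and -r switches working together without touching your own files, a throwaway directory does the job. This is a minimal sketch: the directory comes from mktemp -d, only a handful of the file names are created, and LC_ALL=C is set so the ordering is the same regardless of locale. Those details are my assumptions, not part of the reader's tip.

```shell
# Create a scratch directory holding a few empty files.
workdir=$(mktemp -d)
( cd "$workdir" && touch a aa ab b bd be )

# -1 prints one name per line, -r reverses the order.
reverse_listing=$(cd "$workdir" && LC_ALL=C ls -1r)

printf '%s\n' "$reverse_listing"
rm -rf "$workdir"
```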


Update 2: It turns out that I didn't cover how to reverse sort a horizontal line. Since it is a little long, you can check out my solution in this post, Bash Tip: Reverse Sorting Lists Revisted; Reversing a Horizontal List.

--

I hope these tips help everyone out. If you want more resources on shell scripting, I highly recommend Unix Shell Programming (3rd edition) by Stephen Kochan and Patrick Wood. Kochan and Wood do a very thorough job (using plenty of code examples) exploring various aspects of essential shell scripting tools and techniques.