Gurjeet [tech]: May 2008

Run shell commands in parallel

As I said in my previous post, I am getting rid of my Ubuntu VM (Gutsy Gibbon running inside VirtualBox), so I am posting another script that I think will be useful in some situations.

Here's a little background. At the place where I am consulting (Hi5.com), we need to run rsync on a huge directory tree. Since we want this operation to be as fast as possible, the first measure the guys there took was to use the native rsync protocol rather than rsync-over-ssh; that's a great speed booster.

Next, they (actually, Kenny Gorman) devised three scripts that we need to run one after another: the first generates a list of all files in the directory we want to copy, the second splits that list into 4 equal pieces, and the third runs these 4 pieces (batches) in the background, in parallel.
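
To make the static approach described above concrete, here is a rough sketch of it. These are not Kenny's actual scripts (they are not shown in this post); run_batches is a made-up name, the worker command is whatever you pass in, and 'split -n l/4' assumes GNU coreutils (it cuts on line boundaries into 4 pieces of roughly equal size in bytes).

```shell
#!/bin/sh
# Hypothetical helper: split the file list $1 into 4 batches and run the
# given command once per batch, in the background, then wait for all of them.
# The batch file name is passed to the command as its last argument.
run_batches() {
    list=$1; shift
    dir=$(mktemp -d) || return 1

    # GNU split: 4 chunks, never breaking a line in the middle
    split -n l/4 "$list" "$dir/batch."

    for batch in "$dir"/batch.*; do
        "$@" "$batch" &
    done

    # each batch is processed sequentially by its own worker
    wait
    rm -r "$dir"
}
```

With a worker such as 'wc -l', 'run_batches files.txt wc -l' would print the line count of each of the 4 batches. A real worker would be a small function that hands its batch file to rsync.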

The problem with this approach is that some batches finish quickly, because the files they are rsyncing are smaller than the files in the other batches. The result: we start with 4 parallel rsync commands, but somewhere down the line only one or two of them are still running. We lose parallelism quite quickly, and end up waiting for the batch(es) containing large files, which process their files in sequential order.

So, I got to work on parallelizing a bunch of commands that are placed in a file. This script reads lines from its standard input stream (stdin) and executes those lines using the shell. At any time, it runs at most a specified number of commands. As soon as one of the running commands finishes, the script reads the next line from stdin and executes it.
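
For comparison, the same "keep N commands running, start the next one when a slot frees up" behaviour can be sketched with xargs. This is only a minimal sketch and not the script from this post: it assumes GNU xargs (the -d, -r and -P options), run_parallel is a made-up name, and it has no way to change the degree of parallelism while running.

```shell
#!/bin/sh
# Hypothetical helper, assuming GNU xargs.
# Runs every line of stdin as a shell command, at most $1 at a time
# (default 2). xargs starts the next command as soon as one finishes.
#   -d '\n'  treat each input line as exactly one argument
#   -n 1     one line per 'sh -c' invocation
#   -r       do nothing at all on empty input
run_parallel() {
    xargs -r -d '\n' -n 1 -P "${1:-2}" sh -c
}
```

For example, 'run_parallel 4 < tmp.txt' would run the commands listed in tmp.txt, four at a time.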

I have also added the ability to change the degree of parallelism while the script is running. Just create a file named 'degree' in /tmp/parallel.$PID/ and put a number in it, denoting the new degree of parallelism. This is quite useful for tuning the degree of parallelism to your system load.
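
A small helper for doing that might look like the sketch below. set_degree is a made-up name and not part of the script; the pid is the one the script prints at startup. The number is written to a temporary file first and then renamed, on the assumption that this keeps the script from picking up a half-written file.

```shell
#!/bin/sh
# Hypothetical helper: set_degree PID N
# Asks the parallel script running as PID to switch to N parallel commands.
set_degree() {
    dir="/tmp/parallel.$1"

    # the directory exists only while that instance of the script is running
    [ -d "$dir" ] || { echo "no running instance found at $dir" >&2; return 1; }

    # accept digits only
    case $2 in
        ''|*[!0-9]*) echo "degree must be a number" >&2; return 1 ;;
    esac

    # write to a temporary name, then rename it into place
    printf '%s\n' "$2" > "$dir/degree.tmp" && mv "$dir/degree.tmp" "$dir/degree"
}
```

If the script printed 'PARALLEL: pid: 12345' (a placeholder pid), 'set_degree 12345 8' would raise the degree to 8 the next time the script looks for a free slot.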

I have made no special effort to redirect the stdin/stdout/stderr of the commands that are read and executed by this script. So, if you wish to record the progress of this script, or to store away your commands' output, just redirect this script's streams and save them.

An example use of this script is to remove all the files under a directory, in parallel (although it is a very bad example for such a simple task):


find /home/gurjeet/dev/postgres -type f | sed -e 's/\(.*\)/rm "\1"/' > tmp.txt
cat tmp.txt | parallel.sh

Here's the script:


#!/bin/bash
# This script is licensed under GPL 2.0 license.

# This script uses some features specific to the Bash shell
# (look for the C-style 'for (( ... ))' loop).

# get my pid
mypid=$$;

# determine a dir/ where I will keep my running info
MYDIR=/tmp/parallel.$mypid;

# echo my pid for the logs
echo PARALLEL: pid: $mypid;

# remove the directory/file if it is left over from a previous run
if [ -e $MYDIR ] ; then
 rm -r $MYDIR
fi

# make my dir/
mkdir $MYDIR

# determine the degree of parallelization
degree=$1;

# default degree of parallelism, if not specified on command line
if [ "X$degree" = "X" ] ; then
 degree=2;
fi

# echo for logs
echo PARALLEL: Degree of parallelism: $degree;

# read each line from stdin and process it

while IFS= read -r line ;
do

 while true; do

   # re-adjust degree of parallelization communicated through this file
   if [ -f $MYDIR/degree ] ; then
     new_degree=`cat $MYDIR/degree`
     rm $MYDIR/degree
   fi

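   # apply the new degree only if it is a positive number; note that
   # '>' inside [ ] is a redirection, so -gt is needed for the comparison
   if [ "${new_degree:-0}" -gt 0 ] 2>/dev/null ; then
     degree=$new_degree;
   fi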

   # Look for a free slot
   for (( i = 0 ; i < $degree ; ++i )) ; do
     if [ ! -e $MYDIR/parallel.$i ]; then
       break
     fi
   done

   if [ $i -lt $degree ]; then
     break
   fi

   # if can't find any free slot, repeat after a sleep of 1 sec
   sleep 1;

 done

 # occupy this slot
 ( # echo PARALLEL: touching $MYDIR/parallel.$i;
   touch $MYDIR/parallel.$i )

 # perform the task in background, and free the slot when done
 ( echo PARALLEL: $degree $mypid;
   sh -c "$line";
   # echo PARALLEL: removing $MYDIR/parallel.$i;
   rm $MYDIR/parallel.$i ) &
done

# Wait for all child processes to finish
wait;

# echo PARALLEL: removing base dir;
rm -r $MYDIR;
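
As an aside, newer versions of Bash (4.3 and later) have 'wait -n', which returns as soon as any one background job finishes. Under that assumption, the slot files and the one-second polling could be replaced by something like the sketch below. It is only a sketch: run_pool is a made-up name, and it leaves out the ability to change the degree of parallelism while running.

```shell
#!/bin/bash
# Hypothetical variant, assuming Bash 4.3+ for 'wait -n'.
# Reads commands from stdin and keeps at most $1 of them running (default 2).
run_pool() {
    local degree=${1:-2} running=0 line

    while IFS= read -r line; do
        # all slots busy: block until one of the background jobs finishes
        if (( running >= degree )); then
            wait -n
            running=$(( running - 1 ))
        fi

        # run the command in the background; keep it away from our stdin,
        # which still holds the remaining command lines
        sh -c "$line" < /dev/null &
        running=$(( running + 1 ))
    done

    # wait for whatever is still running
    wait
}
```

It would be used the same way as the script: 'run_pool 4 < tmp.txt'.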

Restart Ubuntu's Wireless Network Driver (script)

I have admitted on more than one occasion that I am a Windows fan; yes, even after using Vista! But when I got my new laptop, on which I installed Vista Business on my own, I tried to push myself into using Ubuntu; I'll leave blogging about that experience for some other post (on my RNFs). That was a long time ago (2 months to be precise) and this post is about something else.

I encountered too many network disconnections on Ubuntu. I noticed that the wireless indicator on my laptop would just go off after using Ubuntu for a while. The only workaround I had to bring the connection back was to restart the OS! As I was very committed to using Ubuntu at any cost, I dug around the internet and found some clues. A little while later I developed this script.

This script uses the utility that is installed with the Intel (restricted) wireless driver to check if the driver is still running; if it is not, the script starts it, and if it is already running, the script kills and restarts it. It worked like magic for me for the week that I used Ubuntu after this.


$ cat restart_network_driver.sh

#!/bin/sh
# This code is in public domain, under GPL 2.0 license 
if ipw3945d-2.6.22-14-generic --isrunning; then
  echo killing                              \
  && ipw3945d-2.6.22-14-generic --kill      \
  && echo restarting                        \
  && ipw3945d-2.6.22-14-generic --quiet     \
  && ipw3945d-2.6.22-14-generic --isrunning
else
  echo starting                             \
  && ipw3945d-2.6.22-14-generic --quiet     \
  && ipw3945d-2.6.22-14-generic --isrunning
fi

Here's what I was using:
OS: Ubuntu 7.10 (Gutsy Gibbon)
Laptop: Thinkpad R61i
Wireless: Intel ipw3945d (restricted) driver

and here's how to use it:

sudo ./restart_network_driver.sh

PS: I am posting it now because I am going to give Linux another shot, this time with Hardy Heron, and wanted some place to save this script before I wipe out that partition.

ts: the timestamping script

So I finally got around to implementing one of my ideas (which I don't get to do very often!). The idea was posted here: http://gurjeet-rnf.blogspot.com/2008/05/ts.html

I first thought of implementing it in C, using the time-tested code from the postgres sources. I wanted to write it in C for performance reasons, but then it looked a bit complex to extract PG's code and make it work independently.

So I cooked up a simple shell script that uses the standard 'date' command to get us what we want. Here it is:


$ cat ts.sh

#!/bin/sh
# prefix every line read from stdin with the current date and time
while IFS= read -r line; do
  printf '%s: %s\n' "$(date)" "$line"
done

And here's a sample run, but first the script I used to test:


$ cat del.sh

#!/bin/sh
while true ; do echo gurjeet singh; sleep 1; done

And the sample run:


$ ./del.sh | ./ts.sh

Mon May 26 19:16:56 IST 2008: gurjeet singh
Mon May 26 19:16:58 IST 2008: gurjeet singh
Mon May 26 19:16:59 IST 2008: gurjeet singh
Mon May 26 19:17:00 IST 2008: gurjeet singh
Mon May 26 19:17:01 IST 2008: gurjeet singh
Mon May 26 19:17:02 IST 2008: gurjeet singh
Mon May 26 19:17:03 IST 2008: gurjeet singh
Mon May 26 19:17:04 IST 2008: gurjeet singh
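
The script above starts one 'date' process for every line it reads. If that ever becomes a concern, Bash 4.2 and later can format the current time inside its own printf builtin, which should avoid the extra process. The following is a sketch under that assumption; the function name ts and the timestamp format are my own choices here, not the output format of the 'date' command.

```shell
#!/bin/bash
# Hypothetical variant, assuming Bash 4.2+ for the %(...)T format.
# Prefixes every line read from stdin with the current date and time.
ts() {
    local line
    while IFS= read -r line; do
        # -1 means "now"; the format string uses strftime conventions
        printf '%(%Y-%m-%d %H:%M:%S)T: %s\n' -1 "$line"
    done
}
```

It would be used like the original: './del.sh | ts' after defining the function, or with 'ts' called on the last line of a script file.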

Since I have a soft spot for Windows, and since this shell script cannot be easily used on the Windows platform, I am working on a new binary that will be based on the 'date' command and will work natively on Windows.

My RNFs branched

This is a new branch from my RNF blog. I think the RNF blog is not an appropriate place for technical writing; things were getting too mixed up there!

Techie-mee is a pun on the Mini-Me character from the Austin Powers movies. This is a mini version of my main blog, dedicated to everything technical inside my head.

Update: The techie-mee blog has been renamed to gurjeet-tech, along the lines of gurjeet-rnf, the first blog. I think the new name makes more sense than the old one.