Grep
Using "grep" (a UNIX utility) for Solving Crosswords and Word Puzzles
©2004, Bob Beeman
Updated 2004-12-13 Mo

Grep

Grep is a useful UNIX tool that is used for finding pattern matches in files. It has many interesting uses, but the purpose for which we are going to use it now is finding words that match a particular pattern. Such problems often occur in Mu Alpha Theta contests.

For example, we might want to find a 10-letter crossword puzzle word matching the pattern below, using a dictionary file containing hundreds of thousands of properly spelled English words:

s _ _ o _ l y _ _ d

About Operating Systems

This page assumes you are using Macintosh OS X. However, almost any computer with a UNIX-based operating system should work almost identically.

Windows does not directly support the following procedures, and does not have a "grep" utility. I did a web search and found a program called Windows Grep; there is an article on it at ZDNet UK. To be equivalent to the UNIX grep described here, you will have to use "expert" mode. This still leaves you with the problem of finding a target file to search. I would suggest using the one by John M. Lawler mentioned in the section on web2 below. I have no direct knowledge of Windows Grep, and this is not a recommendation that you use it. I cannot be responsible for problems.

Although many people believe that we are in a "Windows World", this is certainly not true in the world of industrial and high-reliability equipment. Virtually all of this equipment uses some variant of UNIX because the reliability of Windows is simply not high enough for things like telephone exchanges, Internet infrastructure equipment, factory floor automation, transportation, and military systems. UNIX excels in reliability because nearly all of the academic research over the past 30 years has been UNIX based. This in turn is because other Operating Systems (Windows, Mac OS 9 and prior, VAX OS, etc.) are proprietary, and thus not fitting platforms for research. Serious research requires full transparency, rather than secrecy, in the computer hardware and software being used.

One of the reasons for the new-found popularity of the Mac among tech enthusiasts is that it is the only UNIX-based system on which you can run regular mass-market programs (Microsoft Office, WordPerfect, etc.) without complex emulation.

Windows is now the only commonly-used operating system that is not based on UNIX.


Entering the UNIX Environment: Starting up the "Terminal"

UNIX commands and utilities are normally accessible only through a "command line interface". We will be using the normal Graphical User Interface (GUI) of Macintosh OS X only as a means of launching the terminal application. After that, the entire operation takes place in the underlying UNIX (Darwin) environment. If you are using a UNIX OS other than Mac OS X you will either already be in a "command line" interface, or there will be similar procedures to get to one.

Mac users originally didn't have access to anything like this, but since Mac OS X is based on UNIX, the Terminal application is available to let you reach the UNIX (Darwin) layer of the operating system and issue UNIX commands. If you remember "DOS", the command-line-based operating system from Microsoft that preceded Windows, the experience is similar. To start Terminal, navigate to the Terminal application (in the Utilities folder inside Applications) as shown below:

Finder path to Terminal Application

Double-click the "Terminal" application, and it will start up and give you a window that looks something like this:

Initial Terminal window

We have started a new UNIX session (called a "shell"). Notice that the host computer is "Robert-Beemans-Computer" (hardly surprising) and the login ID in use is "bee_hive". On your computer these will have the name of your computer and your login ID rather than mine. The "%" sign followed by a space is called a "prompt", and is the shell's invitation to enter a command. In the following discussion we will ignore prompts, but they will always be there when the computer is ready to accept another command.

The command that we want to enter first is one that takes us to a directory where there is a file containing 234,936 English words, one word per line. The reason we want to go to that directory is that it will minimize the amount of typing we have to do to enter the path to the file in the "grep" command later. To do this we enter a "Change Directory" command, abbreviated "cd" which takes us to a directory by following a path that we name. The file we want is in the following directory:

 /usr/share/dict

The leading "/" character means that the path starts at the top level of the disk. The first directory on the path is "usr". Inside "usr" is another directory called "share", and inside that is yet another directory called "dict". To get there, we type "cd" followed by one or more spaces, followed by the path we want to follow.

 cd /usr/share/dict

When we type the command and hit the enter or return key, the computer responds with another prompt if all went well.

We then enter another command: "ls" (a lower-case "LS"), short for "list", with a parameter "l" (a lower-case "L", not the numeral one) that asks for a "long" listing. In UNIX, a parameter of this kind is entered after the command, separated from it by a space, with a hyphen "-" immediately preceding it.

 ls -l

When we push the return key again, we get a long listing of the files in the directory. All of this looks like the following:

Directory listing in the terminal window
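
The two steps above can also be tried together. This is only a sketch, assuming the dictionary directory is at /usr/share/dict as it is on Mac OS X. The check for the directory is my own addition, so that the commands fail gracefully on a system that keeps its word lists somewhere else:

```shell
# Assumption: word lists live in /usr/share/dict (true on Mac OS X,
# but some systems keep them elsewhere or do not install them at all).
dict_dir=/usr/share/dict

if [ -d "$dict_dir" ]; then
    cd "$dict_dir"   # change to the dictionary directory
    pwd              # confirm where we are now
    ls -l            # long listing of the files found there
else
    echo "No directory named $dict_dir on this system"
fi
```

The "pwd" command ("print working directory") is a handy way to confirm that the "cd" took you where you expected.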

The "web2" File of English Words

Most UNIX systems have a file, usually named "words", that resides in the UNIX file system. This file contains several hundred thousand recognized words in the target language, which we here assume to be English. In OS X, as well as many other versions of UNIX, this file resides at:

 /usr/share/dict/words --> web2

This is in fact exactly where we went with the terminal and the "cd" command above. In Mac OS X the "words" file is actually a link to a file called "web2" so that if you reference "words", you will actually find "web2". This is done for several reasons, including convenience in supporting multiple languages. The "words" file links to the appropriate file for whatever language is set up on the computer, which for English-based BSD systems is "web2".

The "README" file in the same directory has some interesting insights into the origin of the "web2" and "web2a" files, including the fact that web2 is a complete list of all 234,936 words included in Webster's Second International dictionary, which was copyrighted in 1934 and whose copyright (according to Webster's) has now expired. The web2a file contains additional hyphenated words.
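
One way to check these details on your own machine is to look at the link and count the lines in the file. The sketch below assumes the file is at /usr/share/dict/words. The "count_words" helper is a name I made up for this illustration, and the count on your system may differ from the figure quoted in the README:

```shell
# count_words is a hypothetical helper: it prints the number of lines
# (one word per line) in the file named by its first argument.
count_words() {
    # arithmetic expansion strips the padding some versions of wc add
    echo $(( $(wc -l < "$1") ))
}

words_file=/usr/share/dict/words   # assumption: standard location

if [ -e "$words_file" ]; then
    ls -l "$words_file"            # on Mac OS X this shows "words -> web2"
    count_words "$words_file"      # total number of entries
else
    echo "No word list at $words_file"
fi
```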

Because this file is from 1934, you won't find "microprocessor" or "sputnik". To partially remedy this, John M. Lawler, an Associate Professor of Linguistics at the University of Michigan, searched a lot of web pages and sorted the results, obtaining 69,903 words. The file is posted on his web site. This will probably catch most of the words that web2 misses, but searching both files would be safest.
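
Searching both lists at once takes only one extra step. Here is a sketch using two tiny stand-in files, since the name of the downloaded list will depend on how you save it: "old_words.txt" and "new_words.txt" are hypothetical names playing the roles of web2 and Professor Lawler's list. The grep command itself is explained in the next section; here it simply looks for lines containing the letter "k":

```shell
# Stand-in word lists (hypothetical names and contents, for illustration).
workdir=$(mktemp -d)
printf 'radio\nrocket\nsatellite\n' > "$workdir/old_words.txt"
printf 'rocket\nsputnik\nspyware\n' > "$workdir/new_words.txt"

# cat joins the two lists, grep filters them, and sort -u drops
# any word that appears in both files.
found=$(cat "$workdir/old_words.txt" "$workdir/new_words.txt" | grep 'k' | sort -u)
echo "$found"

rm -r "$workdir"
```

With these sample files the result is "rocket" (listed once, although it is in both files) and "sputnik" (found only in the newer list).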

Using grep

You can get a full explanation and description of "grep" and how it works, along with all of its options and commands, by typing
 man grep
at the prompt. This is short for "manual of grep". When the description is long, as it is for "grep", you will get it one screenful at a time (about 22 lines in a standard-size Terminal window). You advance to the next screenful by pressing the space bar, and you can leave the manual at any time by pressing "q". You can similarly check out any other UNIX command by typing
 man name_of_command
You can get a short description of what a command is by typing
 whatis name_of_command
at the prompt followed by the enter key.

For an online version go to the UNIX Manual Pages at the Huntsville, AL Macintosh User's Group.

Grep works when you type "grep" on the command line, followed by a space (press the space bar) and the pattern you are trying to match, followed by a space and the path to and name of the file that you are going to search. The blank spaces are the standard UNIX way to indicate separate command line elements. You activate the command by pressing the "return" or "enter" key.
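
As a first try, here is a sketch of that command form run against a small made-up file rather than the real dictionary ("sample_words.txt" is a hypothetical name, and its five words are my own choice):

```shell
# A tiny stand-in for the dictionary file (hypothetical name and contents).
workdir=$(mktemp -d)
printf 'ash\nbash\ncrash\nwasher\nwish\n' > "$workdir/sample_words.txt"

# grep, a space, the pattern, a space, the file to search
matches=$(grep ash "$workdir/sample_words.txt")
echo "$matches"

rm -r "$workdir"
```

This prints ash, bash, crash and washer, but not wish. A plain pattern matches anywhere in a line, which is why we will need a few special characters to pin down exactly which letters go where.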

grep Commands

Again, we are using only the simplest capabilities of grep, so we need only the word "grep" followed by one or more spaces, followed by a pattern to match, followed by the path to a file (our "web2" file) that we are going to search.

grep pattern indicators include:

. (a period)
This means "Match any character".
^ (a caret or "shift 6")
This means the next character in the pattern must be at the beginning of a line.
$ (dollar sign or "shift 4")
This means the previous character in the pattern must be at the end of a line.
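
To see each of the three indicators on its own, the sketch below tries them one at a time on a small made-up list (the file name and the five words in it are assumptions chosen for the demonstration):

```shell
# Small made-up word list (hypothetical file, for illustration only).
workdir=$(mktemp -d)
list="$workdir/sample_words.txt"
printf 'cat\ncoat\ncot\nscat\ntic\n' > "$list"

# Single quotes keep the shell from touching the pattern characters.
dot_hits=$(grep 'c.t' "$list")    # c, any one character, then t
start_hits=$(grep '^c' "$list")   # lines that begin with c
end_hits=$(grep 'c$' "$list")     # lines that end with c

printf 'Pattern c.t:\n%s\n' "$dot_hits"
printf 'Pattern ^c:\n%s\n' "$start_hits"
printf 'Pattern c$:\n%s\n' "$end_hits"

rm -r "$workdir"
```

With this list, "c.t" finds cat, cot and scat (but not coat, which has two letters between the "c" and the "t"), "^c" finds cat, coat and cot, and "c$" finds only tic.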

Other than that, we just enter the explicit letters that we are looking for. For example the following search:

 grep ^.ash$ web2

would match all of the following:

bash cash dash fash gash hash lash mash nash pash rash sash tash wash

but not:
  crash Because the "ash" starts in the third position, not the second, and because the word is 5 letters long, not 4. We allowed only 4-letter words when we specified both the beginning of the word (with a "^") and the end of the word (with a "$").
  washer Because the word is longer than 4 characters.
We are using the beginning-of-line and end-of-line markers for grep rather than the beginning-of-word "\<" and end-of-word "\>" markers because the words are arranged one per line, so the beginning and end of each line are also the beginning and end of each word.
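
To see what the two line markers contribute, the sketch below runs the pattern with and without them against a small made-up file (the file name and its contents are assumptions for the demonstration, not part of web2):

```shell
# Small made-up word list (hypothetical file, for illustration only).
workdir=$(mktemp -d)
list="$workdir/sample_words.txt"
printf 'cash\ncrash\ndash\nwash\nwasher\n' > "$list"

loose=$(grep '.ash' "$list")     # no markers: matches anywhere in the line
exact=$(grep '^.ash$' "$list")   # both markers: exactly four letters

printf 'Without markers:\n%s\n' "$loose"
printf 'With ^ and $:\n%s\n' "$exact"

rm -r "$workdir"
```

Without the markers all five words match, because each contains some character followed by "ash". With both markers only cash, dash and wash are left.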

Let's Do It!

So now let's look at the problem originally posed at the beginning of this page. We will try to match:

s _ _ o _ l y _ _ d

The required pattern for this is:

  ^s..o.ly..d$

Indicating that it is 10 letters long, that there must be an "s" in the first position, an "o" in the fourth position, an "l" in the sixth position, a "y" in the seventh position, and a "d" in the tenth position, which must be the last letter of the 10-letter word.
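
If you solve a lot of puzzles, the translation from blanks to periods can be left to the computer. The following is just one possible sketch: "puzzle_to_regex" is a name I invented for this example, and it assumes the puzzle is written with underscores for the unknown letters, as on this page:

```shell
# puzzle_to_regex is a hypothetical helper: it turns a crossword-style
# pattern such as "s _ _ o _ l y _ _ d" into a grep pattern.
puzzle_to_regex() {
    # remove the spaces, turn each underscore into a period,
    # then add the beginning-of-line and end-of-line markers
    printf '^%s$\n' "$(printf '%s' "$1" | tr -d ' ' | tr '_' '.')"
}

pattern=$(puzzle_to_regex 's _ _ o _ l y _ _ d')
echo "$pattern"

# The pattern length, less the two markers, is the word length.
echo "$(( ${#pattern} - 2 )) letters"
```

This prints the pattern ^s..o.ly..d$ followed by "10 letters", which is a quick check that no blank was dropped along the way.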

After this pattern come one or more spaces, followed by the name of the file to be searched, which is "web2", the file containing the 234,936 English words.

Our command line is thus:

  grep ^s..o.ly..d$ web2

Here is a picture of the terminal window just an instant before we press the return key to start grep searching the web2 file for the indicated pattern:

Terminal window just before searching

And here is the terminal window a moment after we press the return key, revealing the one word found in the "web2" file that matches the given pattern:

Terminal window with search completed

"Schoolyard" is the word we were seeking. Simple, eh?

When you are through using UNIX, type "exit" at the prompt and then quit the Terminal application, or just quit the Terminal application like any other Mac program.

Assessment

No course or tutorial is complete without an assessment.

To receive my assessment of how well you understood this tutorial, search for the following pattern using grep in web2:

     _ a _ _ _ f _ c _ n _ _ _   (13 letters)

Summary

You can do a lot of interesting things with just a little knowledge of UNIX, which is one of the many reasons why it is so popular in the tech community.

Happy grep-ing!


This page is copyrighted "freeware"
©2004, Bob Beeman
www.bee-man.us
That means that although it is copyrighted, it is intended for you to use for education or entertainment. You may use it yourself, copy and redistribute it, or even put it on your own website. I ask only that you not make any changes. If you reuse any of the code, make sure to list me as one of your sources.

My only reward for writing this is the 15 milliseconds of fame I receive from having my name here. Don't deprive me of that.

You can copy this page by simply doing a "Save As" in your browser and putting it somewhere on your hard drive (or your web site). If you stop there the background will be gone. To preserve the background, copy the following file into this same folder, without changing its name, by again using your browser's "Save As". The next time you refresh the page, the background should be restored:

www_bee-man_us_background.gif

I make NO guarantee of any kind.
This page may contain serious errors.
Use this page entirely at your own risk!