Basic Text Analysis with Command Line Tools in Linux

Introduction

In the Linux and Unix operating systems, everything is treated as a file. Whenever possible, those files are stored as human- and machine-readable text files. As a result, Linux contains a large number of tools that are specialized for working with texts. Here we will use a few of these tools to explore a textual source.

Downloading a text

Our first task is to obtain a sample text to analyze. We will be working with a nineteenth-century book from the Internet Archive: Jane Andrews, The Stories Mother Nature Told Her Children (1888, 1894). Since this text is part of the Project Gutenberg collection, it was typed in by humans, rather than being scanned and OCRed by machine. This greatly reduces the number of textual errors we expect to find in it. To download the file, we will use the wget command, which needs a URL. We don’t want to give the program the URL that we use to read the file in our browser, because if we do, the file that we download will have HTML markup tags in it. Instead, we want the raw text file, which is located at

http://archive.org/download/thestoriesmother05792gut/stmtn10.txt

First we download the file with wget, then we use the ls command (list directory contents) to make sure that we have a local copy.

wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt

ls

Our first view of the text

The Linux file command allows us to confirm that we have downloaded a text file. When we type

file stmtn10.txt

the computer responds with

stmtn10.txt: C source, ASCII text, with CRLF line terminators

The output of the file command confirms that this is an ASCII text (which we expect), guesses that it is some code in the C programming language (which is incorrect) and tells us that the ends of the lines in the file are coded with both a carriage return and a line feed. This is standard for Windows computers. Linux and OS X expect the ends of lines in an ASCII text file to be coded only with a line feed. If we want to move text files between operating systems, this is one thing we have to pay attention to. Later we will learn one method to convert the line endings from CRLF to LF, but for now we can leave the file as it is.
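To see line endings directly, we can make the otherwise invisible characters visible. The following is a minimal sketch that builds its own two-line sample file (the temporary file and its contents are made up for illustration, not taken from the book) and then counts and displays the carriage returns.

```shell
# Build a tiny sample with Windows-style CRLF line endings.
# The sample text and temporary file are hypothetical.
sample=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$sample"

# Keep only the carriage returns (-c complements the set, -d deletes),
# then count the bytes that remain.
cr_count=$(tr -cd '\r' < "$sample" | wc -c | tr -d ' ')
echo "carriage returns: $cr_count"

# od -c prints every character, so \r and \n show up explicitly.
od -c "$sample" | head -n 3

rm -f "$sample"
```

A file with LF-only line endings would report zero carriage returns.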

[UPDATE 2014. The file command no longer mistakenly identifies the file as C code.]

The head and tail commands show us the first few and last few lines of the file respectively.

head stmtn10.txt

The Project Gutenberg EBook of The Stories Mother Nature Told Her Children
by Jane Andrews
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.

tail stmtn10.txt

[Portions of this eBook's header and trailer may be reprinted only
when distributed free of all fees.  Copyright (C) 2001, 2002 by
Michael S. Hart.  Project Gutenberg is a TradeMark and may not be
used in any sales of Project Gutenberg eBooks or other materials be
they hardware or software or any other related product without
express permission.]
*END THE SMALL PRINT! FOR PUBLIC DOMAIN EBOOKS*Ver.02/11/02*END*
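By default head and tail show ten lines each. Both accept the -n option to ask for a different number. A small sketch on a made-up six-line sample file (the file and its contents are assumptions for illustration):

```shell
# Create a hypothetical six-line sample file.
sample=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 > "$sample"

# Show only the first two lines and only the last line.
first_two=$(head -n 2 "$sample")
last_one=$(tail -n 1 "$sample")

echo "$first_two"
echo "$last_one"

rm -f "$sample"
```

The same options should work on our book, for example head -n 40 stmtn10.txt to see more of the header.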

As we can see, the Project Gutenberg text includes some material in the header and footer which we will probably want to remove so we can analyze the source itself. Before modifying files, it is usually a good idea to make a copy of the original. We can do this with the cp command, then use the ls command to make sure we now have two copies of the file.

cp stmtn10.txt stmtn10-backup.txt

ls
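Listing the directory shows that the backup exists, but not that it matches the original. One way to check, sketched here on a throwaway file in a temporary directory (the file names are hypothetical), is the cmp command, which compares two files byte by byte.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'some text\n' > "$workdir/original.txt"

# Make the backup copy.
cp "$workdir/original.txt" "$workdir/original-backup.txt"

# cmp -s prints nothing and simply succeeds if the files are identical.
if cmp -s "$workdir/original.txt" "$workdir/original-backup.txt"; then
    backup_ok=yes
else
    backup_ok=no
fi
echo "backup identical: $backup_ok"

rm -rf "$workdir"
```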

In order to have a look at the whole file, we can use the less command. Once we run the following statement, we will be able to use the arrow keys to move up and down in the file one line at a time (or the j and k keys); the page up and page down keys to jump by pages (or the f and b keys); and the forward slash key to search for something (try typing /giant, for example, and then press the n key to see the next match). Press the q key to exit from viewing the file with less.

less -N stmtn10.txt

Trimming the header and footer

In the above case, we used the option -N to tell the less command that we wanted it to include line numbers at the beginning of each line. (Try running the less command without that option to see the difference.) Using the line numbers, we can see that the Project Gutenberg header runs from Line 1 to Line 40 inclusive, and that the footer runs from Line 2206 to Line 2525 inclusive. To create a copy of the text that has the header and footer removed, we can use the Linux stream editor sed. We have to start with the footer, because if we removed the header first it would change the line numbering for the rest of the file.

sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt

This command tells sed to delete all of the material from line 2206 to line 2525 inclusive and to write the result to a file called stmtn10-nofooter.txt. You can use less to confirm that this new file still contains the Project Gutenberg header but not the footer. We can now trim the header from this file to create another version with no header or footer. We will call this file stmtn10-trimmed.txt. Use less to confirm that it looks the way it should. While you are using less to view a file, you can use the g key to jump to the top of the file and Shift-G to jump to the bottom.

sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt

Use the ls command to confirm that you now have four files, stmtn10-backup.txt, stmtn10-nofooter.txt, stmtn10-trimmed.txt and stmtn10.txt.
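The order of the two steps matters only because each step writes a new file with new line numbering. Within a single sed invocation, line addresses refer to the lines of the input, so both ranges can be deleted in one pass. This is an alternative to the two-step approach, sketched on a made-up ten-line sample where lines 1-2 stand in for the header and lines 9-10 for the footer.

```shell
# Hypothetical sample: the numbers 1 to 10, one per line.
sample=$(mktemp)
seq 1 10 > "$sample"

# Delete the "header" (lines 1-2) and the "footer" (lines 9-10) in one pass.
# Both ranges are interpreted against the original input line numbers.
trimmed=$(sed -e '1,2d' -e '9,10d' "$sample")
echo "$trimmed"

rm -f "$sample"
```

For our book, the equivalent single command would presumably be sed -e '1,40d' -e '2206,2525d' stmtn10.txt, with the output redirected to a new file.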

A few basic statistics

We can use the wc command to find out how many lines (-l option) and how many characters (-m) our file has. Running the following shows us that the answer is 2165 lines and 121038 characters.

wc -l stmtn10-trimmed.txt

wc -m stmtn10-trimmed.txt
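The wc command can also count words with the -w option. The sketch below runs all three counts on a small made-up file (the file and its contents are assumptions for illustration). Reading the file through < makes wc print only the number, without the file name.

```shell
# Hypothetical two-line sample file.
sample=$(mktemp)
printf 'one two\nthree\n' > "$sample"

# Some versions of wc pad the number with spaces, so strip them with tr.
lines=$(wc -l < "$sample" | tr -d ' ')
words=$(wc -w < "$sample" | tr -d ' ')
chars=$(wc -m < "$sample" | tr -d ' ')

echo "$lines lines, $words words, $chars characters"

rm -f "$sample"
```

Note that the character count includes the newline at the end of each line.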

Finding patterns

Linux has a very powerful pattern-matching command called grep, which we will use frequently. At its most basic, grep returns lines in a file which match a pattern. The command below shows us lines which contain the word giant. The -n option asks grep to include line numbers. Note that this pattern is case sensitive, and will not match Giant.

grep -n "giant" stmtn10-trimmed.txt

Do you believe in giants? No, do you say? Well, listen to my story,
to admit that to do it needed a giant's strength, and so they deserve
giants think of doing. We have not long to wait before we shall see, and

What if we wanted to find both capitalized and lowercase versions of the word? In the following command, we tell grep that we want to use an extended set of possible patterns (the -E option) and that it should show us line numbers (the -n option). The pattern itself says to match something that starts either with a capital G or a lowercase g, followed by lowercase iant.

grep -E -n "(G|g)iant" stmtn10-trimmed.txt
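There are other ways to express the same idea. A bracket expression such as [Gg] matches either letter without needing -E, the -i option ignores case entirely, and the -c option reports the number of matching lines instead of printing them. A small sketch on a made-up four-line sample (not text from the book):

```shell
# Hypothetical sample with different capitalizations.
sample=$(mktemp)
printf 'Giant steps\nthe giant slept\nno match here\nGIANTS roam\n' > "$sample"

# Count lines matching Giant or giant, written two equivalent ways.
either_case=$(grep -c -E "(G|g)iant" "$sample")
bracket=$(grep -c "[Gg]iant" "$sample")

# Count lines matching giant in any mixture of upper and lower case.
any_case=$(grep -c -i "giant" "$sample")

echo "Giant or giant: $either_case, bracket form: $bracket, any case: $any_case"

rm -f "$sample"
```

The -i version also matches the all-capitals line, which the other two patterns miss.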

Creating a standardized version of the text

When we are analyzing the words in a text, it is usually convenient to create a standardized version that eliminates whitespace and punctuation and converts all characters to lowercase. We will use the tr command to translate and delete characters of our trimmed text, to create a standardized version. First we delete all punctuation, using the -d option and a special pattern which matches punctuation characters. Note that in this case the tr command requires that we use the redirection operators to specify both the input file (<) and the output file (>). You can use the less command to confirm that the punctuation has been removed.

tr -d '[:punct:]' < stmtn10-trimmed.txt > stmtn10-nopunct.txt

The next step is to use tr to convert all characters to lowercase. Once again, use the less command to confirm that the changes have been made.

tr '[:upper:]' '[:lower:]' < stmtn10-nopunct.txt > stmtn10-lowercase.txt

Finally, we will use the tr command to convert all of the Windows CRLF line endings to the LF line endings that characterize Linux and OS X files. If we don’t do this, the spurious carriage return characters will interfere with our frequency counts.

tr -d '\r' < stmtn10-lowercase.txt > stmtn10-lowercaself.txt
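It helps to see what these three steps do to a single line. The sketch below applies them to one made-up sentence (the sentence and file names are hypothetical). The character classes are quoted so that the shell passes them to tr unchanged.

```shell
# Work in a temporary directory with a one-line CRLF sample.
workdir=$(mktemp -d)
printf "The Giant's well-known story.\r\n" > "$workdir/raw.txt"

# Step 1: delete punctuation (this includes apostrophes and hyphens).
tr -d '[:punct:]' < "$workdir/raw.txt" > "$workdir/nopunct.txt"

# Step 2: convert uppercase letters to lowercase.
tr '[:upper:]' '[:lower:]' < "$workdir/nopunct.txt" > "$workdir/lower.txt"

# Step 3: delete the carriage returns left over from the CRLF endings.
tr -d '\r' < "$workdir/lower.txt" > "$workdir/lf.txt"

result=$(cat "$workdir/lf.txt")
echo "$result"

rm -rf "$workdir"
```

Deleting punctuation has a side effect: the possessive giant's becomes giants and well-known becomes wellknown. Such forms are counted together with, or as, different words in the frequency counts.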

Counting word frequencies

The first step in counting word frequencies is to use the tr command to translate each blank space into an end-of-line character (or newline, represented by \n). This gives us a file where each word is on its own line. Confirm this using the less or head command on stmtn10-oneword.txt.

tr ' ' '\n' < stmtn10-lowercaself.txt > stmtn10-oneword.txt

The next step is to sort that file so the words are in alphabetical order, and so that if a given word appears a number of times, these are listed one after another. Once again, use the less command to look at the resulting file. Note that there are many blank lines at the beginning of this file, but if you page down you start to see the words: a lot of copies of a, followed by one copy of abashed, one of ability, and so on.

sort stmtn10-oneword.txt > stmtn10-onewordsort.txt

Now we use the uniq command with the -c option to count the number of repetitions of each line. This will give us a file where the words are listed alphabetically, each preceded by its frequency. We use the head command to look at the first few lines of our word frequency file.

uniq -c stmtn10-onewordsort.txt > stmtn10-wordfreq.txt

head stmtn10-wordfreq.txt

    358
      1 1861
      1 1865
      1 1888
      1 1894
    426 a
      1 abashed
      1 ability
      4 able
     44 about
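The frequency file is in alphabetical order, but we often want the most common words first. One possible approach, which goes beyond the steps above, is to sort the frequency file numerically (-n) in reverse order (-r). The sketch below uses a four-line made-up frequency file in the same layout as the uniq -c output (the file names are hypothetical).

```shell
# Build a small frequency file: a right-aligned count followed by a word.
workdir=$(mktemp -d)
printf '%7d %s\n' 426 a 1 abashed 4 able 44 about > "$workdir/wordfreq.txt"

# Sort by the leading number, largest first.
sort -rn "$workdir/wordfreq.txt" > "$workdir/byfreq.txt"

# Look at the two most frequent entries.
head -n 2 "$workdir/byfreq.txt"
most_common=$(head -n 1 "$workdir/byfreq.txt")

rm -rf "$workdir"
```

Applied to stmtn10-wordfreq.txt, the same sort command should bring the count of empty lines and the most frequent words to the top.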

Pipelines

When using the tr command, we saw that it is possible to tell a Linux command where it is getting its input from and where it is sending its output to. It is also possible to arrange commands in a pipeline so that the output of one stage feeds into the input of the next. To do this, we use the pipe operator (|). For example, we can create a pipeline to go from our lowercase file (with Linux LF endings) to word frequencies directly, as shown below. This way we don’t create a bunch of intermediate files if we don’t want to. You can use the less command to confirm that stmtn10-wordfreq.txt and stmtn10-wordfreq2.txt look the same.

tr ' ' '\n' < stmtn10-lowercaself.txt | sort | uniq -c > stmtn10-wordfreq2.txt
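A pipeline can also cover the whole process, from text with punctuation and CRLF endings to word counts. The sketch below does this on a two-line made-up sample (not text from the book). It makes one change that is an assumption on our part rather than part of the steps above: the -s option of tr squeezes runs of newlines into one, so blank lines and doubled spaces do not produce an entry for the empty string.

```shell
# Hypothetical sample with punctuation, mixed case, CRLF endings
# and a doubled space.
sample=$(mktemp)
printf 'The giant saw the Giant.\r\nThe  end.\r\n' > "$sample"

# Delete punctuation and carriage returns, lowercase everything,
# put one word on each line (squeezing repeats), then sort and count.
counts=$(tr -d '[:punct:]\r' < "$sample" |
    tr '[:upper:]' '[:lower:]' |
    tr -s ' ' '\n' |
    sort |
    uniq -c)

echo "$counts"

rm -f "$sample"
```

Because of the -s option, the result differs from stmtn10-wordfreq.txt in one respect: there is no first line counting empty entries.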

When we use less to look at one of our word frequency files, we can search for a particular term with the forward slash. Trying /giant, for example, shows us that there are sixteen instances of the word giants in our text. Spend some time exploring the original text and the word frequency file with less.